[Basic Theory] CME295 Lecture 6


syveany 2026. 5. 21. 00:29

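[Strengths of a vanilla LLM]

- Good at imitation and idea generation

- Good at code generation and debugging

[Weaknesses of a vanilla LLM]

- Weak at logical reasoning <- today's topic

- Knowledge does not get updated <- this and the items below are covered in later lectures

- Cannot execute anything directly

- Hard to evaluate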

1. Reasoning models

2. Scaling with RL

Example benchmarks:

Coding: HumanEval, CodeForces, SWE-bench

Math: AIME, GSM8K
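What these benchmarks have in common is that the answer can be checked automatically, so the check can double as an RL reward. Below is a minimal sketch for the math case. The function names and the "last number in the output is the final answer" rule are my own assumptions for illustration, not something from the lecture.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number that appears in the text, or None if there is none."""
    # Drop thousands separators so "1,000" is read as "1000".
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def math_reward(model_output: str, reference_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final number matches the reference, else 0.0."""
    predicted = extract_final_number(model_output)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(reference_answer) else 0.0
```

For coding benchmarks the same idea applies, with "run the test cases" in place of the number comparison.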

3. GRPO

Group Relative Policy Optimization

GRPO vs. PPO

similarities: ratio, clipping

differences: KL penalty, advantage estimation
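To make the comparison concrete: as far as I understand, PPO gets its advantage from a learned value model, while GRPO samples a group of outputs for the same prompt and normalizes each reward against the group mean and standard deviation. The ratio clipping is shared by both. A minimal sketch follows, where the function names, the `eps` term, and the use of the population standard deviation are my assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / group std, no value model needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped objective shared by PPO and GRPO for one token.

    ratio is pi_new(token) / pi_old(token).
    """
    clipped_ratio = max(1 - clip_eps, min(ratio, 1 + clip_eps))
    # Taking the min keeps the update pessimistic when the ratio drifts too far.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With rewards `[1, 0, 0, 1]` for four sampled outputs, the correct ones get an advantage of about +1 and the wrong ones about -1. If every output in the group gets the same reward, all advantages are 0 and the group contributes no learning signal.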

Phenomenon: output length keeps increasing during training

Fixes

DAPO: normalize over the total number of tokens

Dr. GRPO: remove length normalization altogether
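One way to see the difference between the three is to look at how much weight a single token gets in the loss, depending on how long its sequence is. The sketch below is my own simplification (hypothetical function name and scheme labels; any constant scaling an actual implementation might apply is ignored). With per-sequence averaging, tokens in a long output are down-weighted, so a long wrong answer is penalized less per token, which is one explanation I have seen for the length growth.

```python
from typing import List


def token_weights(lengths: List[int], scheme: str) -> List[float]:
    """Weight given to each token of sequence i in the group loss.

    lengths: number of tokens in each sampled output of one group.
    scheme:  "grpo"    -> average inside each sequence, then over the group
             "dapo"    -> average over all tokens in the group
             "dr_grpo" -> no division by length, only by group size
    """
    group_size = len(lengths)
    total_tokens = sum(lengths)
    if scheme == "grpo":
        return [1 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        return [1 / total_tokens for _ in lengths]
    if scheme == "dr_grpo":
        return [1 / group_size for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


# A group with one short (2-token) and one long (8-token) output.
print(token_weights([2, 8], "grpo"))     # [0.25, 0.0625]
print(token_weights([2, 8], "dapo"))     # [0.1, 0.1]
print(token_weights([2, 8], "dr_grpo"))  # [0.5, 0.5]
```

Only the first scheme gives the long output's tokens a smaller weight than the short output's tokens.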

4. Applications

R1-Zero

Pro: able to reason without SFT

Con: it does reason, but there are formatting and readability issues, because it pays no attention to human-friendly presentation

[Step1]

[Step2] Small-scale SFT with reasoning data

Train a little on the reasoning examples available at this point

[Step3] GRPO with reasoning data. RL is run again

Reward = formatting + accuracy + language consistency
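A rough sketch of how such a three-part reward could be put together. Everything here is an assumption for illustration: the `<think>...</think>` layout as the required format, exact string match for accuracy, the share of ASCII words as a stand-in for "reasoning stays in one language", and equal weights for the three terms.

```python
import re

# Assumed output layout: <think>reasoning</think> followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed <think>...</think> answer layout."""
    return 1.0 if THINK_PATTERN.match(output.strip()) else 0.0


def accuracy_reward(output: str, reference: str) -> float:
    """1.0 if the text after </think> exactly matches the reference answer."""
    match = THINK_PATTERN.match(output.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def language_consistency_reward(output: str) -> float:
    """Fraction of reasoning words that are pure ASCII (crude proxy for English-only)."""
    match = THINK_PATTERN.match(output.strip())
    words = (match.group(1) if match else output).split()
    if not words:
        return 0.0
    return sum(word.isascii() for word in words) / len(words)


def total_reward(output: str, reference: str) -> float:
    """Unweighted sum of the three terms (equal weights are an assumption)."""
    return (
        format_reward(output)
        + accuracy_reward(output, reference)
        + language_consistency_reward(output)
    )
```

An output that mixes languages inside the reasoning still gets the format and accuracy terms, but loses part of the third term.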

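Up to this point the model already does reasonably well. However, it solves math/coding well but may still fall short at general conversation and explanation

So Step4 feeds in two kinds of data (reasoning 600k, general 200k)

The reasoning data is built with rejection sampling: the model generates data by itself, and only the good samples are picked for retraining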
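The rejection sampling idea can be sketched as a simple filter loop. This is only an illustration: `generate` and `is_correct` are hypothetical callables standing in for the model and for whatever check decides that a sample is "good", and the toy generator just replays canned answers.

```python
import itertools
from typing import Callable, Dict, List


def rejection_sample(
    prompt: str,
    reference: str,
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    num_samples: int = 8,
) -> List[Dict[str, str]]:
    """Sample several responses and keep only the ones that pass the check."""
    kept = []
    for _ in range(num_samples):
        candidate = generate(prompt)
        if is_correct(candidate, reference):
            # Accepted samples become (prompt, response) pairs for the next SFT round.
            kept.append({"prompt": prompt, "response": candidate})
    return kept


def make_toy_generator(canned_outputs: List[str]) -> Callable[[str], str]:
    """Stand-in for a model: cycles through fixed outputs, ignoring the prompt."""
    outputs = itertools.cycle(canned_outputs)
    return lambda prompt: next(outputs)


def exact_match(candidate: str, reference: str) -> bool:
    """Simplest possible check: the response equals the reference answer."""
    return candidate.strip() == reference.strip()
```

For example, with a toy generator that replays `["4", "5", "4", "22"]` and the reference `"4"`, four samples leave two accepted pairs.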

[Step4] Large-scale SFT with reasoning + non-reasoning data

[Step5] GRPO with reasoning + non-reasoning data

reasoning reward: formatting + accuracy

non-reasoning reward: helpfulness + harmlessness

---

1. In reasoning benchmarks, why is verification especially important for coding tasks?

A. Coding outputs are usually very short
B. Generated code can be automatically tested using test cases
C. Coding tasks never require reasoning
D. Human evaluation is impossible

Answer: B

2. Which statement best describes the role of verifiable rewards in reasoning RL?

A. They rely entirely on subjective human judgment
B. They provide automatically checkable signals for correctness
C. They remove the need for policy optimization
D. They only evaluate formatting quality

Answer: B

3. Why did the lecture emphasize that human-written reasoning chains may be limiting?

A. Humans cannot solve reasoning problems
B. Human-written CoTs are too computationally expensive
C. Models may discover reasoning strategies beyond human-written examples
D. Human-written data cannot be tokenized properly

Answer: C
