[Basic Theory] CME295 Lecture 6

syveany 2026. 5. 21. 00:29

[Strengths of vanilla LLMs]

- Good at imitation and idea generation

- Good at code generation and debugging

 

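[Weaknesses of vanilla LLMs]

- Weak at logical reasoning <- today's topic

- Knowledge is not updated after training <- from here on, covered in later lectures

- Cannot execute anything directly

- Hard to evaluate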

1. Reasoning models

 

 

2. Scaling with RL

e.g.

Coding: HumanEval, CodeForces, SWE-bench

Math: AIME, GSM8K

 

3. GRPO

Group Relative Policy Optimization

 

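GRPO vs. PPO

similarities: probability ratio, clipping

differences: KL penalty (GRPO adds it to the loss directly), advantage estimation (GRPO scores each sample relative to the other samples in its group instead of using a learned value model)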
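The shared and differing pieces can be sketched in a few lines. This is a minimal illustration rather than the lecture's code: the function names, the `clip_eps` default of 0.2, and the toy rewards are assumptions, and the KL penalty is left out for brevity.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against the other
    samples drawn for the same prompt (no learned value model)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped objective for one token, the part PPO and GRPO share."""
    ratio = math.exp(logp_new - logp_old)  # probability ratio
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Four sampled answers to one prompt, rewarded 1 if correct and 0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage, purely from comparison within the group.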

 

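Increasing output length phenomenon: responses keep getting longer as training goes on

Solutions

DAPO: normalize over all tokens in the group instead of per response

Dr. GRPO: remove length normalization altogether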
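To see why the normalization choice matters, it helps to compare the weight each token receives in the loss under the three schemes. The sketch below is an assumption-laden illustration: `token_weights` is a hypothetical helper, and dividing by a fixed `max_len` is one way to realize "no length normalization" for Dr. GRPO.

```python
def token_weights(lengths, scheme, max_len=None):
    """Return the per-token loss weight for each response in a group.

    lengths: number of tokens in each sampled response.
    scheme:  "grpo", "dapo", or "dr_grpo" (labels chosen for this sketch).
    """
    group_size = len(lengths)
    if scheme == "grpo":
        # Mean over each response, then mean over the group:
        # tokens in long responses get a smaller weight.
        return [1.0 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        # One mean over every token in the group: equal weight per token.
        total_tokens = sum(lengths)
        return [1.0 / total_tokens for _ in lengths]
    if scheme == "dr_grpo":
        # Constant divisor that does not depend on the response length.
        return [1.0 / (group_size * max_len) for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


lengths = [2, 8]  # one short and one long response
for scheme in ("grpo", "dapo", "dr_grpo"):
    print(scheme, token_weights(lengths, scheme, max_len=8))
```

Under the per-response mean, each token of the long response counts four times less than a token of the short one, so a long wrong answer is penalized less per token. The other two schemes give every token the same weight.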

4. Applications

 

R1-Zero

Pros: Reasoning emerges without SFT

Cons: It reasons, but has formatting and readability issues, because nothing in training pushes it toward human-friendly output

 

R1

[Step1] 

[Step2] Small-scale SFT with reasoning data

    Train briefly on a small set of existing reasoning examples

[Step3] GRPO with reasoning data. Run RL again

    Reward = formatting + accuracy + language consistency
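A rule-based reward of this shape could look like the sketch below. Everything specific here is an assumption for illustration: the `<think>...</think>` layout check, exact-match accuracy, the ASCII-word proxy for language consistency, and the unweighted sum are not taken from the lecture.

```python
import re

# Assumed output layout: reasoning inside <think> tags, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the assumed layout, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the text after the reasoning block equals the reference."""
    match = THINK_PATTERN.match(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def language_consistency_reward(completion):
    """Crude proxy: fraction of reasoning words that are pure ASCII,
    standing in for 'written in the target language'."""
    match = THINK_PATTERN.match(completion.strip())
    if not match:
        return 0.0
    words = match.group(1).split()
    if not words:
        return 0.0
    return sum(1 for w in words if w.isascii()) / len(words)


def total_reward(completion, reference):
    """Unweighted sum of the three terms (weights are an assumption)."""
    return (
        format_reward(completion)
        + accuracy_reward(completion, reference)
        + language_consistency_reward(completion)
    )


print(total_reward("<think>2 plus 2 is 4</think> 4", "4"))
```

A completion that mixes languages inside the reasoning block scores lower on the third term even when the final answer is right.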

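By this point the model is already fairly good. However, it solves math and coding problems well but can still fall short on general conversation and explanation

So Step 4 feeds in two kinds of data (reasoning 600k, general 200k)

The reasoning data is built with rejection sampling: the model generates data itself, and only the good samples are kept for retraining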
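Rejection sampling can be sketched as a generate-then-filter loop. The helper below is hypothetical: `generate` and `is_correct` stand in for the model and the checker, and keeping only the first accepted sample per prompt is an assumption made for brevity.

```python
def rejection_sample(prompts, generate, is_correct, num_samples=4):
    """Build an SFT dataset from the model's own outputs.

    prompts:    list of (prompt, reference_answer) pairs.
    generate:   callable(prompt) -> candidate completion (the model).
    is_correct: callable(candidate, reference) -> bool (the filter).
    """
    kept = []
    for prompt, reference in prompts:
        candidates = [generate(prompt) for _ in range(num_samples)]
        accepted = [c for c in candidates if is_correct(c, reference)]
        if accepted:
            # Keep one good sample; prompts with no good sample are dropped.
            kept.append((prompt, accepted[0]))
    return kept


# Toy stand-in for a model: cycles through fixed answers, some of them wrong.
toy_answers = iter(["5", "4", "4", "7"])
dataset = rejection_sample(
    [("2+2", "4")],
    generate=lambda prompt: next(toy_answers),
    is_correct=lambda candidate, reference: candidate == reference,
)
print(dataset)
```

The kept pairs are what would go into the next SFT round.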

[Step4] Large-scale SFT with reasoning + non-reasoning data

[Step5] GRPO with reasoning + non-reasoning data

    reasoning reward: formatting + accuracy

    non-reasoning reward: helpfulness + harmlessness

 

 

---

 

1. In reasoning benchmarks, why is verification especially important for coding tasks?

A. Coding outputs are usually very short
B. Generated code can be automatically tested using test cases
C. Coding tasks never require reasoning
D. Human evaluation is impossible

Answer: B

 

2. Which statement best describes the role of verifiable rewards in reasoning RL?

A. They rely entirely on subjective human judgment
B. They provide automatically checkable signals for correctness
C. They remove the need for policy optimization
D. They only evaluate formatting quality

Answer: B

 

3. Why did the lecture emphasize that human-written reasoning chains may be limiting?

A. Humans cannot solve reasoning problems
B. Human-written CoTs are too computationally expensive
C. Models may discover reasoning strategies beyond human-written examples
D. Human-written data cannot be tokenized properly

Answer: C

 

 

 

 

--