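[Vanilla LLM strengths]
- Good at imitation and idea generation
- Good at code generation and debugging
[Vanilla LLM weaknesses]
- Weak at logical reasoning <- today's topic
- Knowledge cannot be updated <- from here on, covered in later lectures
- Cannot execute actions directly
- Hard to evaluate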
1. Reasoning models
2. Scaling with RL
Benchmarks with automatically checkable answers, e.g.
Coding: HumanEval, CodeForces, SWE-bench
Math: AIME, GSM8K
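What makes these benchmarks usable as RL signals is that correctness can be checked by a program. Below is a minimal sketch of two such verifiable rewards. The function names, the "last number in the text" answer-extraction rule, and the test-case format are assumptions made for illustration, not the graders the benchmarks actually use.

```python
import re


def math_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the last number in the completion equals the reference answer."""
    # Assumption: the final number mentioned in the text is the model's answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


def code_reward(candidate_src: str, func_name: str, test_cases) -> float:
    """Return the fraction of (args, expected) test cases the generated function passes."""
    if not test_cases:
        return 0.0
    namespace = {}
    try:
        # WARNING: exec on model output is unsafe; a real setup would use a sandbox.
        exec(candidate_src, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(test_cases)
```

For example, `code_reward("def add(a, b):\n    return a + b", "add", [((1, 2), 3)])` scores a correct solution without any human judgment involved.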
3. GRPO
Group Relative Policy Optimization
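GRPO vs. PPO
similarities: probability ratio, clipping
differences: KL penalty (GRPO adds it directly to the loss instead of folding it into the reward), advantage estimation (GRPO has no value model; each response's advantage is computed relative to the other responses sampled for the same prompt)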
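The two ingredients named above can be sketched in a few lines: the group-relative advantage that replaces PPO's value-model estimate, and the clipped ratio term that GRPO shares with PPO. This is a scalar toy version under stated assumptions: the helper names are hypothetical, the population standard deviation is used, and `clip_eps=0.2` is just a common default, not a value taken from the lecture.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    # Taking the min makes the objective pessimistic about large policy changes
    return min(ratio * advantage, clipped_ratio * advantage)
```

With rewards `[1, 0, 0, 1]` for four sampled responses, the correct ones get an advantage near +1 and the wrong ones near -1. If all responses receive the same reward, every advantage is 0 and the group contributes no learning signal.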
Problem: output length keeps increasing during training
Fixes
DAPO: normalize the loss over all tokens in the group (token-level) instead of per response

Dr. GRPO: remove length normalization entirely
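The three aggregation schemes differ only in what the summed per-token losses are divided by, which the sketch below makes concrete. It assumes `token_losses` is a list of per-token loss lists, one per sampled response. Dividing by a constant `max_len` in the Dr. GRPO variant is an assumption about how "no length normalization" is usually implemented, and all function names are hypothetical.

```python
def grpo_aggregate(token_losses):
    """Sequence-level: average within each response, then across responses."""
    per_response = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_response) / len(per_response)


def dapo_aggregate(token_losses):
    """Token-level: sum over all tokens, divide by the total token count in the group."""
    total_tokens = sum(len(seq) for seq in token_losses)
    return sum(sum(seq) for seq in token_losses) / total_tokens


def dr_grpo_aggregate(token_losses, max_len):
    """No per-response length term: divide by a constant instead (assumed: max_len)."""
    return sum(sum(seq) for seq in token_losses) / (len(token_losses) * max_len)


# One short response (2 tokens) and one long response (4 tokens)
losses = [[1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
print(grpo_aggregate(losses))        # 0.75
print(dapo_aggregate(losses))        # 4.0 / 6 tokens
print(dr_grpo_aggregate(losses, 4))  # 0.5
```

Under the sequence-level mean, every token of a long response is scaled by 1/length, so a long wrong answer is penalized less per token than a short wrong one. This is one commonly cited explanation for the length growth noted above.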

4. Applications
R1-Zero
Pros: Can reason without SFT
Cons: It reasons, but has formatting and readability issues, because it pays no attention to human-friendly presentation
R1
[Step1]
[Step2] Small-scale SFT with reasoning data
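Train briefly on the reasoning examples already available
[Step3] GRPO with reasoning data. Run RL again
Reward = formatting + accuracy + language consistency
After this stage the model is already fairly good. But while it solves math/coding well, it may still be weak at general conversation and explanation
So step 4 uses two kinds of data (reasoning 600k, general 200k)
The reasoning data is built with rejection sampling: the model generates its own data, and only the good samples are kept for retraining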
[Step4] Large-scale SFT with reasoning + non-reasoning data
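Rejection sampling for the reasoning part of this SFT data can be sketched as a generate-then-filter loop. Everything here is a toy under stated assumptions: `generate` stands in for sampling from the RL-trained checkpoint, `is_correct` stands in for whatever verifier is used, and `make_toy_generator` is a hypothetical deterministic stub so the example runs without a model.

```python
import itertools


def rejection_sample(prompts, generate, is_correct, num_samples=4):
    """Sample several responses per prompt and keep only the ones that pass the check."""
    kept = []
    for prompt in prompts:
        for _ in range(num_samples):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append({"prompt": prompt, "response": response})
    return kept


def make_toy_generator(offsets):
    """Hypothetical stand-in for a model: answers 'a+b' prompts with cycling errors."""
    cycle = itertools.cycle(offsets)

    def generate(prompt):
        a, b = map(int, prompt.split("+"))
        return str(a + b + next(cycle))

    return generate


def sum_is_correct(prompt, response):
    """Toy verifier: the response must equal the true sum."""
    a, b = map(int, prompt.split("+"))
    return response.strip() == str(a + b)
```

Calling `rejection_sample(["1+2", "3+4"], make_toy_generator([0, 1]), sum_is_correct)` keeps only the half of the samples with offset 0. The kept pairs would then serve as SFT examples.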
[Step5] GRPO with reasoning + non-reasoning data
reasoning reward: formatting + accuracy
non-reasoning reward: helpfulness + harmlessness
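The rule-based reasoning reward (formatting + accuracy, plus language consistency in step 3) can be sketched as a sum of simple checks. The `<think>...</think>` layout, the ASCII-word heuristic for language consistency, the exact-match accuracy check, and the equal weighting are all assumptions made for illustration. The helpfulness and harmlessness terms are not shown, since they typically come from learned reward models rather than rules.

```python
import re

# Assumed output layout: reasoning inside <think> tags, then the final answer
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the assumed <think>...</think> answer layout."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the text after the reasoning block exactly matches the reference."""
    match = THINK_PATTERN.match(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def language_consistency_reward(completion):
    """Fraction of reasoning words that are ASCII (crude proxy for 'stays in English')."""
    match = THINK_PATTERN.match(completion.strip())
    words = (match.group(1) if match else completion).split()
    if not words:
        return 0.0
    return sum(word.isascii() for word in words) / len(words)


def reasoning_reward(completion, reference):
    """Unweighted sum of the three terms (the weighting is an assumption)."""
    return (
        format_reward(completion)
        + accuracy_reward(completion, reference)
        + language_consistency_reward(completion)
    )
```

A completion such as `"<think>2 plus 2 is 4</think> 4"` with reference `"4"` earns all three terms, while the same answer with reasoning that mixes languages loses part of the language consistency term.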
---
1. In reasoning benchmarks, why is verification especially important for coding tasks?
A. Coding outputs are usually very short
B. Generated code can be automatically tested using test cases
C. Coding tasks never require reasoning
D. Human evaluation is impossible
Answer: B
2. Which statement best describes the role of verifiable rewards in reasoning RL?
A. They rely entirely on subjective human judgment
B. They provide automatically checkable signals for correctness
C. They remove the need for policy optimization
D. They only evaluate formatting quality
Answer: B
3. Why did the lecture emphasize that human-written reasoning chains may be limiting?
A. Humans cannot solve reasoning problems
B. Human-written CoTs are too computationally expensive
C. Models may discover reasoning strategies beyond human-written examples
D. Human-written data cannot be tokenized properly
Answer: C