MarkTechPost, 29 days ago

From SFT to DPO and GRPO: An LLM Post-Training Tutorial with TRL

Key Summary

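This tutorial walks through the full process of post-training a large language model (LLM) with the TRL library ecosystem, with code at every step. Starting from a lightweight base model, it applies four core techniques in sequence (SFT, reward modeling (RM), DPO, and GRPO) to build up a model alignment pipeline. It relies on efficient methods such as LoRA so that the exercises can run on constrained hardware such as a Google Colab T4 GPU.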

Translated Text


In this tutorial, we walk through a complete, hands-on process for post-training large language models with the TRL (Transformer Reinforcement Learning) library ecosystem. Starting from a lightweight base model, we progressively apply four core techniques: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). We also use efficient methods such as LoRA so that training is feasible on limited hardware such as Google Colab's T4 GPU. Step by step, we build intuition for how modern alignment pipelines work, from teaching a model how to respond to shaping its behavior with preferences and verifiable rewards.


    import subprocess, sys

    # Install the training stack quietly, upgrading to at least the pinned versions.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", "-q", "-U",
        "torchao>=0.16", "trl>=0.20", "transformers>=4.45",
        "datasets", "peft>=0.13", "accelerate", "bitsandbytes",
    ])

    # Drop cached torchao/peft modules so the freshly installed versions are imported.
    import sys as _sys
    for _m in [m for m in list(_sys.modules) if m.startswith(("torchao", "peft"))]:
        _sys.modules.pop(_m, None)

    # If torchao cannot be imported, register a stub module that only reports a version.
    try:
        import torchao
    except Exception:
        import types
        _fake = types.ModuleType("torchao")
        _fake.__version__ = "0.16.1"
        _sys.modules["torchao"] = _fake

    import os, re, gc, torch, warnings
    warnings.filterwarnings("ignore")
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["WANDB_DISABLED"] = "true"
    os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

    from datasets import load_dataset, Dataset
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import LoraConfig

    print(f"torch={torch.__version__} cuda={torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} "
              f"({torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB)")

    MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    BF16_OK = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

    # Shared LoRA settings: rank-8 adapters on the attention projections only.
    LORA_CFG = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    def cleanup():
        """Release VRAM between training stages (Colab T4 is tight)."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def chat_generate(model, tokenizer, prompt, max_new_tokens=120):
        """Helper: format as chat, generate, decode just the assistant turn."""
        msgs = [{"role": "user", "content": prompt}]
        ids = tokenizer.apply_chat_template(
            msgs, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
        with torch.no_grad():
            out = model.generate(
                ids, max_new_tokens=max_new_tokens,
                do_sample=True, temperature=0.7, top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Slice off the prompt tokens so only the newly generated reply is decoded.
        return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

We install and configure the full training stack, keeping TRL, Transformers, and PEFT compatible with one another. We set environment variables, check the GPU, and define reusable components such as the LoRA configuration and helper functions. We also prepare utilities for memory cleanup and chat-style generation that all later stages rely on.


    print("\n" + "="*72 + "\nPART 1 — Supervised Fine-Tuning (SFT)\n" + "="*72)
    from trl import SFTTrainer, SFTConfig

    # Use a 300-row slice of the conversational dataset to keep the run short.
    sft_ds = load_dataset("trl-lib/Capybara", split="train[:300]")
    print(f"SFT dataset rows: {len(sft_ds)}")
    print(f"Example messages: {sft_ds[0]['messages'][:1]}")

    sft_args = SFTConfig(
        output_dir="./sft_out",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="no",
        bf16=BF16_OK, fp16=not BF16_OK,
        max_length=768,
        gradient_checkpointing=True,
        report_to="none",
    )

    sft_trainer = SFTTrainer(
        model=MODEL_NAME,
        args=sft_args,
        train_dataset=sft_ds,
        peft_config=LORA_CFG,
    )

Original Text (English)
In this tutorial, we walk through a complete, hands-on journey of post-training large language models using the TRL (Transformer Reinforcement Learning) library ecosystem. We start from a lightweight base model and progressively apply four key techniques: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). We also use efficient methods like LoRA to make training feasible even on limited hardware, such as Google Colab's T4 GPU. As we move step by step, we build intuition for how modern alignment pipelines work, from teaching models how to respond to shaping their behavior using preferences and verifiable rewards.

    import subprocess, sys

    # Install the training stack quietly, upgrading to at least the pinned versions.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", "-q", "-U",
        "torchao>=0.16", "trl>=0.20", "transformers>=4.45",
        "datasets", "peft>=0.13", "accelerate", "bitsandbytes",
    ])

    # Drop cached torchao/peft modules so the freshly installed versions are imported.
    import sys as _sys
    for _m in [m for m in list(_sys.modules) if m.startswith(("torchao", "peft"))]:
        _sys.modules.pop(_m, None)

    # If torchao cannot be imported, register a stub module that only reports a version.
    try:
        import torchao
    except Exception:
        import types
        _fake = types.ModuleType("torchao")
        _fake.__version__ = "0.16.1"
        _sys.modules["torchao"] = _fake

    import os, re, gc, torch, warnings
    warnings.filterwarnings("ignore")
    os.environ["TOKENIZERS_PARALLELISM"] = "false"
    os.environ["WANDB_DISABLED"] = "true"
    os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

    from datasets import load_dataset, Dataset
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from peft import LoraConfig

    print(f"torch={torch.__version__} cuda={torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)} "
              f"({torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB)")

    MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    BF16_OK = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

    # Shared LoRA settings: rank-8 adapters on the attention projections only.
    LORA_CFG = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )

    def cleanup():
        """Release VRAM between training stages (Colab T4 is tight)."""
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    def chat_generate(model, tokenizer, prompt, max_new_tokens=120):
        """Helper: format as chat, generate, decode just the assistant turn."""
        msgs = [{"role": "user", "content": prompt}]
        ids = tokenizer.apply_chat_template(
            msgs, return_tensors="pt", add_generation_prompt=True
        ).to(model.device)
        with torch.no_grad():
            out = model.generate(
                ids, max_new_tokens=max_new_tokens,
                do_sample=True, temperature=0.7, top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
            )
        # Slice off the prompt tokens so only the newly generated reply is decoded.
        return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)

We install and configure the full training stack, keeping TRL, Transformers, and PEFT compatible with one another. We set up environment variables and GPU checks, and define reusable components such as the LoRA configuration and helper functions. We also prepare utility functions for memory cleanup and chat-style generation to support all later stages.
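To give a feel for why the LoRA configuration above fits on a small GPU, here is a rough sketch that counts the adapter weights it would add. The layer shapes (hidden size 896, key/value projection width 128, 24 layers) are assumptions about Qwen2.5-0.5B that the tutorial does not state, and `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(rank, shapes):
    """Each adapted (out, in) weight gains A (rank x in) and B (out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Assumed Qwen2.5-0.5B attention shapes (not given in the tutorial).
HIDDEN = 896   # model width
KV_DIM = 128   # width of the grouped key/value projections
LAYERS = 24    # number of transformer blocks

per_layer = [
    (HIDDEN, HIDDEN),  # q_proj
    (KV_DIM, HIDDEN),  # k_proj
    (KV_DIM, HIDDEN),  # v_proj
    (HIDDEN, HIDDEN),  # o_proj
]

total = LAYERS * lora_param_count(8, per_layer)
print(f"LoRA parameters at r=8: {total:,}")
```

Under these assumptions the adapters come to roughly one million weights, a small fraction of the half-billion frozen base parameters, which is what keeps optimizer state and gradients small.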
    print("\n" + "="*72 + "\nPART 1 — Supervised Fine-Tuning (SFT)\n" + "="*72)
    from trl import SFTTrainer, SFTConfig

    # Use a 300-row slice of the conversational dataset to keep the run short.
    sft_ds = load_dataset("trl-lib/Capybara", split="train[:300]")
    print(f"SFT dataset rows: {len(sft_ds)}")
    print(f"Example messages: {sft_ds[0]['messages'][:1]}")

    sft_args = SFTConfig(
        output_dir="./sft_out",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        logging_steps=10,
        save_strategy="no",
        bf16=BF16_OK, fp16=not BF16_OK,
        max_length=768,
        gradient_checkpointing=True,
        report_to="none",
    )

    sft_trainer = SFTTrainer(
        model=MODEL_NAME,
        args=sft_args,
        train_dataset=sft_ds,
        peft_config=LORA_CFG,
    )
    sft_trainer.train()

    print("\n[SFT inference]")
    print("Q: Explain the bias-variance tradeoff in two sentences.")
    print("A:", chat_generate(sft_trainer.model, sft_trainer.processing_class,
                              "Explain the bias-variance tradeoff in two sentences."))

    # Save the adapter, then free the trainer before the next stage.
    sft_trainer.save_model("./sft_out/final")
    del sft_trainer; cleanup()

We begin with supervised fine-tuning: we load a conversational dataset and configure the SFT trainer. We train the model to imitate high-quality responses, using LoRA for efficient adaptation on limited hardware. We then run inference to confirm that the model produces instruction-style outputs.
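The "imitate high-quality responses" objective is ordinary next-token prediction. As a minimal sketch, assuming we already have the log-probability the model assigned to each reference token, the loss is the mean negative log-likelihood over the positions that count. The function name `sft_loss` and the mask convention are illustrative; which positions are masked in practice depends on the trainer configuration.

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    picked = [-lp for lp, keep in zip(token_logprobs, loss_mask) if keep]
    if not picked:
        raise ValueError("loss_mask excludes every position")
    return sum(picked) / len(picked)

# Toy sequence: the model gave the reference tokens probabilities 0.5, 0.25, 0.125.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.125)]
mask = [1, 1, 0]  # last position is excluded (e.g. padding)

loss = sft_loss(logprobs, mask)
print(f"loss = {loss:.4f}")
```

Raising the probability of the reference tokens lowers this value, which is all the SFT stage optimizes.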
    print("\n" + "="*72 + "\nPART 2 — Reward Modeling\n" + "="*72)
    from trl import RewardTrainer, RewardConfig

    rm_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
    print(f"RM dataset rows: {len(rm_ds)} keys: {list(rm_ds[0].keys())}")

    rm_args = RewardConfig(
        output_dir="./rm_out",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        learning_rate=1e-4,
        logging_steps=10,
        save_strategy="no",
        bf16=BF16_OK, fp16=not BF16_OK,
        max_length=512,
        gradient_checkpointing=True,
        report_to="none",
    )

    # Same LoRA shape as before, but for a sequence-classification (scalar score) head.
    rm_lora = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="SEQ_CLS",
    )

    rm_trainer = RewardTrainer(
        model=MODEL_NAME,
        args=rm_args,
        train_dataset=rm_ds,
        peft_config=rm_lora,
    )
    rm_trainer.train()
    del rm_trainer; cleanup()

We move to reward modeling, where we train a model to score responses from pairwise preference data. We configure a sequence-classification setup and train on chosen vs. rejected pairs. This stage learns a reward signal that can guide alignment in later methods.
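Training on chosen vs. rejected pairs is commonly done with a Bradley-Terry style pairwise loss, which pushes the score of the chosen response above the score of the rejected one. The sketch below shows that loss for a single pair of scalar scores; `pairwise_reward_loss` is a hypothetical name, and the exact loss options of a given trainer version should be checked against its documentation.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp(-margin).
    return -margin + math.log1p(math.exp(margin))

print(pairwise_reward_loss(2.0, 0.0))  # chosen ranked higher: small loss
print(pairwise_reward_loss(0.0, 2.0))  # chosen ranked lower: large loss
print(pairwise_reward_loss(1.0, 1.0))  # tie: log(2)
```

Only the difference between the two scores matters, so the reward model learns a relative ranking rather than an absolute scale.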
    print("\n" + "="*72 + "\nPART 3 — Direct Preference Optimization (DPO)\n" + "="*72)
    from trl import DPOTrainer, DPOConfig

    dpo_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")

    dpo_args = DPOConfig(
        output_dir="./dpo_out",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=5e-6,
        logging_steps=10,
        save_strategy="no",
        bf16=BF16_OK, fp16=not BF16_OK,
        max_length=512,
        max_prompt_length=256,
        beta=0.1,
        gradient_checkpointing=True,
        report_to="none",
    )

    dpo_trainer = DPOTrainer(
        model=MODEL_NAME,
        args=dpo_args,
        train_dataset=dpo_ds,
        peft_config=LORA_CFG,
    )
    dpo_trainer.train()
    del dpo_trainer; cleanup()

We implement Direct Preference Optimization to optimize the model directly on preference data, without a separate reward model. We use a low learning rate and control how far the policy may drift from the reference model with the beta parameter. The model is trained to align its outputs with the preferred responses.

    print("\n" + "="*72 + "\nPART 4 — GRPO with verifiable math rewards\n" + "="*72)
    from trl import GRPOTrainer, GRPOConfig
    import random
    random.seed(0)

    def make_math_problem():
        a, b = random.randint(1, 50), random.randint(1, 50)
        op = random.choice(["+", "-", "*"])
        expr = f"{a} {op} {b}"
        return {
            "prompt": f"Solve this and end your reply with only the final number.\n{expr} =",
            # eval is acceptable here only because expr is generated locally from integers.
            "answer": str(eval(expr)),
        }

    grpo_ds = Dataset.from_list([make_math_problem() for _ in range(200)])
    print(f"GRPO dataset rows: {len(grpo_ds)}")
    print(f"Example: {grpo_ds[0]}")

    def correctness_reward(completions, **kwargs):
        """+1 if the last number in the completion matches the gold answer."""
        answers = kwargs["answer"]
        rewards = []
        for c, gold in zip(completions, answers):
            nums = re.findall(r"-?\d+", c)
            rewards.append(1.0 if nums and nums[-1] == gold else 0.0)
        return rewards

    def brevity_reward(completions, **kwargs):
        """Small bonus for short answers — discourages rambling."""
        return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]

    grpo_args = GRPOConfig(
        output_dir="./grpo_out",
        learning_rate=1e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=2,
        num_gener
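The "group relative" part of GRPO can be sketched without any training code: several completions are sampled for the same prompt, each is scored, and every score is compared against the mean of its own group. The sketch below combines a correctness score and a brevity bonus like the two reward functions above and then normalizes within the group. The helper names, the use of the population standard deviation, and the `eps` value are assumptions for illustration, not the library's exact implementation.

```python
import re
import statistics

def total_reward(completion, gold):
    """Correctness (0 or 1) plus a brevity bonus of at most 0.2."""
    nums = re.findall(r"-?\d+", completion)
    correct = 1.0 if nums and nums[-1] == gold else 0.0
    brevity = max(0.0, 1.0 - len(completion) / 200) * 0.2
    return correct + brevity

def group_advantages(rewards, eps=1e-4):
    """Center each reward on the group mean and scale by the group spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical completions sampled for the prompt "12 + 30 =".
group = [
    "12 + 30 = 42",
    "The answer is 41",
    "42",
    "I think it is 42, but let me double-check: 12 + 30 = 43",
]
rewards = [total_reward(c, "42") for c in group]
advantages = group_advantages(rewards)

for c, r, a in zip(group, rewards, advantages):
    print(f"{r:.3f}  {a:+.3f}  {c!r}")
```

Completions that beat their siblings get a positive advantage and are reinforced; when every completion in a group scores the same, all advantages are zero and that prompt contributes no learning signal.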