Reinforcement Learning (RL) trains an agent to maximize long-term reward through interaction and feedback.
Core idea
- The model (agent) takes actions in an environment.
- The environment returns rewards.
- The agent updates its policy to increase expected cumulative reward.
Key concepts
- State: current situation seen by the agent.
- Action: decision made by the agent.
- Reward: scalar feedback signal.
- Policy: strategy that maps states to actions.
- Return: cumulative (often discounted) reward over time.
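The concepts above can be made concrete with a toy sketch. Everything here is a hypothetical illustration: a stateless two-armed bandit environment (so "state" degenerates to nothing), an epsilon-greedy value-based policy, and invented names throughout.

```python
import random

# Hypothetical toy environment: a 2-armed bandit with fixed reward means.
ARM_REWARD_MEANS = [0.2, 0.8]  # action 1 is better in the long run

def step(action):
    """Environment returns a noisy scalar reward for the chosen action."""
    return ARM_REWARD_MEANS[action] + random.gauss(0, 0.1)

# Policy: estimated value per action; act greedily with some exploration.
values = [0.0, 0.0]
counts = [0, 0]

random.seed(0)
total_return = 0.0
for _ in range(500):
    # choose action (epsilon-greedy policy)
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: values[a])
    reward = step(action)           # receive reward
    total_return += reward          # return: accumulated reward
    counts[action] += 1             # update policy (incremental mean estimate)
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the better arm's estimated value approaches its true mean
```

The agent starts greedy on arm 0, but occasional exploration reveals arm 1's higher reward, after which the policy switches to it.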
Objective
- Learn a policy that maximizes expected long-term return, not just immediate reward.
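One standard way to formalize "long-term return" (an assumption here; the note itself does not fix a formula) is the discounted sum of future rewards, G = r0 + gamma*r1 + gamma^2*r2 + ...:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted sum of rewards; gamma < 1 weights near-term reward more."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A policy that sacrifices immediate reward can still win on return:
greedy = [1.0, 0.0, 0.0, 0.0]
patient = [0.0, 0.0, 0.0, 5.0]
print(discounted_return(greedy))   # 1.0
print(discounted_return(patient))  # 5 * 0.9**3 = 3.645
```

This is why the objective says "expected long-term return, not just immediate reward": the patient reward sequence beats the greedy one despite paying nothing up front.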
Generic loop
observe state
-> choose action (policy)
-> receive reward and next state
-> update policy
-> repeat
RL in modern language-model pipelines
- RL is often used after SFT (Supervised Fine-Tuning).
- Common setup:
- model generates outputs,
- a reward signal evaluates output quality,
- policy optimization updates model behavior.
- This family of methods is commonly used to align generation behavior with desired preferences.
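A heavily simplified sketch of that generate / score / update cycle. All names here are hypothetical, the "model" is a two-token softmax policy, and the reward function is a stand-in; real pipelines use a learned reward model and PPO-style optimizers on a full network.

```python
import math
import random

random.seed(0)
VOCAB = ["good", "bad"]          # toy one-token "outputs"
logits = [0.0, 0.0]              # the "policy": preferences over outputs

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def reward(token):
    """Stand-in for a reward signal that evaluates output quality."""
    return 1.0 if token == "good" else -1.0

lr = 0.1
for _ in range(200):
    probs = softmax(logits)
    i = random.choices(range(len(VOCAB)), weights=probs)[0]  # model generates an output
    r = reward(VOCAB[i])                                     # reward evaluates it
    # REINFORCE-style update: grad of log pi(i) w.r.t. logits is one_hot(i) - probs
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * r * grad                           # policy optimization step

print(softmax(logits))  # probability mass shifts toward "good"
```

The key point the sketch shows: no labeled target is ever provided; the policy shifts toward "good" purely because sampled outputs that score well get reinforced.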
Difference from SFT
- SFT (Supervised Fine-Tuning) learns from fixed labeled targets.
- RL learns from reward signals and sequential decision effects.
- SFT is usually simpler and more stable; RL can optimize objectives not easily expressed as labels.
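The difference in objectives can be put side by side. This is a hypothetical sketch: SFT has a fixed labeled target and can take a cross-entropy loss on it, while RL only sees a scalar reward for whatever was sampled.

```python
import math

probs = {"yes": 0.7, "no": 0.3}  # model's distribution over two outputs

# SFT: a fixed labeled target exists, so loss is cross-entropy against it.
label = "yes"
sft_loss = -math.log(probs[label])

# RL: no target token, only a scalar reward for the sampled output.
sampled, reward = "no", -1.0     # e.g. a preference signal disliked the sample
rl_loss = -reward * math.log(probs[sampled])  # REINFORCE-style surrogate loss

print(sft_loss, rl_loss)
```

Minimizing the SFT loss pulls probability onto the given label; minimizing the RL surrogate pushes probability away from the negatively rewarded sample, with no label ever specified.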
Benefits
- Can optimize non-differentiable or preference-like goals.
- Supports iterative behavior improvement from feedback.
Limitations
- Less stable than supervised training and sensitive to reward design.
- Reward hacking risk if reward is poorly specified.
- Higher implementation and tuning complexity.
In this project context
- For MNLI classification with a DistilBERT model, SFT (Supervised Fine-Tuning) is the primary training paradigm.
- RL is mainly conceptual here unless a reward-driven objective is introduced.