Reinforcement Learning (RL) trains an agent to maximize long-term reward by interacting with an environment and learning from feedback.

Core idea

  • The model (agent) takes actions in an environment.
  • The environment returns rewards.
  • The agent updates its policy to increase expected cumulative reward.

Key concepts

  • State: the agent's current view of the environment.
  • Action: a decision made by the agent.
  • Reward: a scalar feedback signal from the environment.
  • Policy: the strategy that maps states to actions.
  • Return: cumulative (often discounted) reward over time.
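The return is usually computed with a discount factor gamma that down-weights future rewards. A minimal sketch (the function name and values are illustrative, not from any library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1.0 with gamma=0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Folding from the last reward backward avoids recomputing powers of gamma at each step.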

Objective

  • Learn a policy that maximizes expected long-term return, not just immediate reward.

Generic loop

observe state
  -> choose action (policy)
  -> receive reward and next state
  -> update policy
  -> repeat
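The loop above can be sketched as runnable code. This is a hypothetical toy setup: the Corridor environment, the tabular Q-values, and all hyperparameters are invented for illustration, with Q-learning standing in for the generic "update policy" step.

```python
import random

class Corridor:
    """Invented toy environment: states 0..4; reaching state 4 gives reward 1 and ends."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):  # action: 0 = left, 1 = right
        self.s = max(0, min(4, self.s + (1 if action == 1 else -1)))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    env = Corridor()
    q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}  # tabular action values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # observe state -> choose action (epsilon-greedy policy, random tie-break)
            if random.random() < eps or q[(s, 0)] == q[(s, 1)]:
                a = random.choice((0, 1))
            else:
                a = 0 if q[(s, 0)] > q[(s, 1)] else 1
            # receive reward and next state
            s2, r, done = env.step(a)
            # update policy (Q-learning target: reward plus discounted best next value)
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])
            s = s2
    return q

q = train()
# After training, the greedy policy prefers "right" in every non-terminal state.
print(all(q[(s, 1)] > q[(s, 0)] for s in range(4)))
```

Breaking ties randomly matters here: with all-zero initial values, a deterministic tie-break could lock the agent into one action before any reward is observed.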

RL in modern language-model pipelines

  • RL is often used after SFT (Supervised Fine-Tuning).
  • Common setup:
    • model generates outputs,
    • a reward signal evaluates output quality,
    • policy optimization updates model behavior.
  • Methods in this family are commonly used to align generation behavior with desired preferences; RLHF (Reinforcement Learning from Human Feedback) is the best-known example.
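This setup can be sketched with a toy stand-in for each component. Everything here is invented for illustration: the candidate responses, the reward table, and the two-logit "model" replace real token-level generation and a learned reward model. The update is plain REINFORCE with a baseline.

```python
import math
import random

# Toy stand-ins (invented): two candidate outputs and a fixed reward table
# acting as the "reward model" that evaluates output quality.
RESPONSES = ["helpful answer", "unhelpful answer"]
REWARD = {"helpful answer": 1.0, "unhelpful answer": 0.0}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=500, lr=0.1, seed=0):
    random.seed(seed)
    logits = [0.0, 0.0]  # the entire "model": one logit per candidate output
    for _ in range(steps):
        probs = softmax(logits)
        # model generates an output by sampling from its policy
        i = random.choices(range(2), weights=probs)[0]
        # reward signal evaluates output quality
        r = REWARD[RESPONSES[i]]
        # baseline (expected reward) reduces gradient variance
        baseline = sum(p * REWARD[RESPONSES[j]] for j, p in enumerate(probs))
        adv = r - baseline
        # policy optimization: REINFORCE, grad log pi(i) = one_hot(i) - probs
        for j in range(2):
            logits[j] += lr * adv * ((1.0 if j == i else 0.0) - probs[j])
    return softmax(logits)

probs = train()
print(probs[0])  # probability of the higher-reward response, near 1 after training
```

Real pipelines differ in scale and algorithm (e.g., per-token credit assignment, KL penalties against the SFT model), but the generate/score/update cycle is the same shape.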

Difference from SFT

  • SFT (Supervised Fine-Tuning) learns from fixed labeled targets.
  • RL learns from reward signals and from the downstream effects of sequential decisions.
  • SFT is usually simpler and more stable; RL can optimize objectives not easily expressed as labels.
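The contrast can be made concrete with toy numbers (all values invented): the SFT loss is defined by a fixed labeled target, while the RL objective is defined by rewards on sampled outputs and needs no label at all.

```python
import math

# Model's distribution over three candidate outputs (invented numbers).
probs = [0.7, 0.2, 0.1]

# SFT: cross-entropy against a fixed label — output 0 is the "correct" target.
sft_loss = -math.log(probs[0])

# RL: expected reward under the policy — each output just gets a score.
rewards = [1.0, 0.2, 0.0]
rl_objective = sum(p * r for p, r in zip(probs, rewards))

print(round(sft_loss, 3), round(rl_objective, 3))  # 0.357 0.74
```

The reward vector can encode anything scoreable (preference rankings, pass/fail checks), which is exactly the kind of objective that has no natural labeled target.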

Benefits

  • Can optimize non-differentiable or preference-like goals.
  • Supports iterative behavior improvement from feedback.

Limitations

  • Less stable than supervised training and sensitive to reward design.
  • Reward hacking risk if reward is poorly specified.
  • Higher implementation and tuning complexity.

In this project context