Reinforcement Learning (RL) trains an agent to maximize long-term reward by interacting with an environment and learning from feedback.

Core idea

  • The model (agent) takes actions in an environment.
  • The environment returns rewards.
  • The agent updates its policy to increase expected cumulative reward.

Key concepts

  • State: the agent's current view of the environment.
  • Action: a decision made by the agent.
  • Reward: a scalar feedback signal from the environment.
  • Policy: the strategy that maps states to actions.
  • Return: cumulative (often discounted) reward over time.
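The return is usually computed with a discount factor gamma that down-weights future rewards. A minimal sketch (the function name and values are illustrative, not from any library):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1.0 with gamma=0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Folding from the last reward backward avoids recomputing powers of gamma at each step.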

Objective

  • Learn a policy that maximizes expected long-term return, not just immediate reward.

Generic loop

observe state
  -> choose action (policy)
  -> receive reward and next state
  -> update policy
  -> repeat
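The loop above can be sketched as runnable code. This is a hypothetical toy setup: the Corridor environment, the tabular Q-values, and all hyperparameters are invented for illustration, with Q-learning standing in for the generic "update policy" step.

```python
import random

class Corridor:
    """Invented toy environment: states 0..4; reaching state 4 gives reward 1 and ends."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):  # action: 0 = left, 1 = right
        self.s = max(0, min(4, self.s + (1 if action == 1 else -1)))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    random.seed(seed)
    env = Corridor()
    q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}  # tabular action values
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # observe state -> choose action (epsilon-greedy policy, random tie-break)
            if random.random() < eps or q[(s, 0)] == q[(s, 1)]:
                a = random.choice((0, 1))
            else:
                a = 0 if q[(s, 0)] > q[(s, 1)] else 1
            # receive reward and next state
            s2, r, done = env.step(a)
            # update policy (Q-learning target: reward plus discounted best next value)
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)])
            s = s2
    return q

q = train()
# After training, the greedy policy prefers "right" in every non-terminal state.
print(all(q[(s, 1)] > q[(s, 0)] for s in range(4)))
```

Breaking ties randomly matters here: with all-zero initial values, a deterministic tie-break could lock the agent into one action before any reward is observed.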

RL in modern language-model pipelines

  • RL is often used after SFT (Supervised Fine-Tuning).
  • Common setup:
    • model generates outputs,
    • a reward signal evaluates output quality,
    • policy optimization updates model behavior.
  • Methods in this family are commonly used to align generation behavior with desired preferences; RLHF (Reinforcement Learning from Human Feedback) is the best-known example.
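This setup can be sketched with a toy stand-in for each component. Everything here is invented for illustration: the candidate responses, the reward table, and the two-logit "model" replace real token-level generation and a learned reward model. The update is plain REINFORCE with a baseline.

```python
import math
import random

# Toy stand-ins (invented): two candidate outputs and a fixed reward table
# acting as the "reward model" that evaluates output quality.
RESPONSES = ["helpful answer", "unhelpful answer"]
REWARD = {"helpful answer": 1.0, "unhelpful answer": 0.0}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train(steps=500, lr=0.1, seed=0):
    random.seed(seed)
    logits = [0.0, 0.0]  # the entire "model": one logit per candidate output
    for _ in range(steps):
        probs = softmax(logits)
        # model generates an output by sampling from its policy
        i = random.choices(range(2), weights=probs)[0]
        # reward signal evaluates output quality
        r = REWARD[RESPONSES[i]]
        # baseline (expected reward) reduces gradient variance
        baseline = sum(p * REWARD[RESPONSES[j]] for j, p in enumerate(probs))
        adv = r - baseline
        # policy optimization: REINFORCE, grad log pi(i) = one_hot(i) - probs
        for j in range(2):
            logits[j] += lr * adv * ((1.0 if j == i else 0.0) - probs[j])
    return softmax(logits)

probs = train()
print(probs[0])  # probability of the higher-reward response, near 1 after training
```

Real pipelines differ in scale and algorithm (e.g., per-token credit assignment, KL penalties against the SFT model), but the generate/score/update cycle is the same shape.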

Difference from SFT

  • SFT (Supervised Fine-Tuning) learns from fixed labeled targets.
  • RL learns from reward signals and from the downstream effects of sequential decisions.
  • SFT is usually simpler and more stable; RL can optimize objectives not easily expressed as labels.
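The contrast can be made concrete with toy numbers (all values invented): the SFT loss is defined by a fixed labeled target, while the RL objective is defined by rewards on sampled outputs and needs no label at all.

```python
import math

# Model's distribution over three candidate outputs (invented numbers).
probs = [0.7, 0.2, 0.1]

# SFT: cross-entropy against a fixed label — output 0 is the "correct" target.
sft_loss = -math.log(probs[0])

# RL: expected reward under the policy — each output just gets a score.
rewards = [1.0, 0.2, 0.0]
rl_objective = sum(p * r for p, r in zip(probs, rewards))

print(round(sft_loss, 3), round(rl_objective, 3))  # 0.357 0.74
```

The reward vector can encode anything scoreable (preference rankings, pass/fail checks), which is exactly the kind of objective that has no natural labeled target.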

Benefits

  • Can optimize non-differentiable or preference-like goals.
  • Supports iterative behavior improvement from feedback.

Limitations

  • Less stable than supervised training and sensitive to reward design.
  • Reward hacking risk if reward is poorly specified.
  • Higher implementation and tuning complexity.

In this project context