Phase 1 · AI Basics
Phase 2 · Language Models
Phase 3 · NLP Tasks
Phase 4 · Training Paradigms
Phase 5 · Forge in Practice
Phase 6 · Go Deeper

Learning Path: From Zero to Forge

No prior AI knowledge needed. Follow the phases in order — each one unlocks the next.

End goal: understand what Forge does and why every decision was made.

Transformer

Architecture that reads text using attention: every word looks at every other word at the same time.

All modern LLMs are built on this architecture, introduced in 2017.
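The "every word looks at every other word" idea can be sketched as scaled dot-product attention. This is a minimal pure-Python illustration, not Forge's implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V: lists of d-dimensional vectors, one per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # score this token against EVERY token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # weights sum to 1
        # each output is a weighted mix of all value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three toy token vectors: every output row mixes information from all three.
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = attention(toks, toks, toks)
```

Real Transformers run many of these heads in parallel and add feed-forward layers on top, but the all-pairs mixing above is the core idea.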


BERT

Encoder-only Transformer pretrained on massive amounts of text.

pretrained = already learned general language understanding
fine-tuning = adapt it for your specific task


DistilBERT

Smaller, faster version of BERT.
Retains ~97% of BERT's quality while being 40% smaller and 60% faster.

Created via knowledge distillation — a student model that learned from BERT as its teacher.


Classification

Predict a label given text input.

Example — MNLI:

  • Premise: A man is eating pizza
  • Hypothesis: Someone is eating food
  • Label: entailment
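At inference time, a classifier emits one raw score (logit) per label; softmax turns them into probabilities and argmax picks the label. A sketch with made-up logits (the label order is model-specific):

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]  # order is model-specific

def predict(logits):
    """Turn raw classifier scores into (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    i = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[i], probs[i]

# Hypothetical logits for (premise="A man is eating pizza",
#                          hypothesis="Someone is eating food")
label, conf = predict([3.1, 0.2, -1.4])
```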

MNLI (Multi-Genre Natural Language Inference)

Translation

Convert text from one language to another.

Forge: English → French (Europarl dataset)

Model reads an EN sentence and generates the FR equivalent token by token.
Metric: BLEU score.
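Full BLEU combines clipped 1- to 4-gram precisions with a brevity penalty; its 1-gram building block is easy to sketch (a simplification, not the scorer Forge uses):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped (modified) unigram precision — the 1-gram part of BLEU.
    Clipping stops a word repeated in the candidate from being
    credited more times than it appears in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(1, sum(cand.values()))

perfect = unigram_precision("le chat est sur le tapis",
                            "le chat est sur le tapis")
padded = unigram_precision("le le le", "le chat")  # repetition gets clipped
```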

OCR / Document AI

Extract structured text from images of documents.

Forge: scan a PDF page → markdown text.

Requires a Vision-Language Model (VLM) — processes both images and text as input.

SFT — Fine-Tuning

Take a pretrained model. Train it on labelled examples for your task.

Loss = cross-entropy: how wrong is the predicted label?

Simplest approach. Always start here.
SFT (Supervised Fine-Tuning)
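The cross-entropy loss mentioned above is just the negative log-probability the model assigns to the correct label. A minimal sketch:

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy: -log of the probability given to the true label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]  # equals -log softmax(logits)[target]

# Confident and correct -> low loss; confident and wrong -> high loss.
good = cross_entropy([4.0, 0.0, 0.0], target=0)
bad = cross_entropy([4.0, 0.0, 0.0], target=1)
```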

KD — Knowledge Distillation

Train a small student by learning from a large teacher.

Student learns not just right answers, but the teacher's uncertainty (soft probabilities).

Goal: small model ≈ teacher quality.
KD (Knowledge Distillation)
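A common way to combine the two signals — hard labels plus the teacher's soft probabilities — is the mix shown below. This is a generic response-based KD sketch (the `alpha` weighting matches the config semantics later in this doc; the exact formula Forge uses may differ):

```python
import math

def softmax_T(logits, T):
    """Temperature-softened probabilities: higher T -> softer distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """alpha=0.0 -> pure SFT (cross-entropy only); alpha>0 mixes in distillation."""
    # hard-label cross-entropy
    p_s = softmax_T(student_logits, 1.0)
    ce = -math.log(p_s[target])
    # KL(teacher || student) on temperature-softened distributions
    p_t = softmax_T(teacher_logits, T)
    q_s = softmax_T(student_logits, T)
    kl = sum(t * math.log(t / q) for t, q in zip(p_t, q_s))
    return (1 - alpha) * ce + alpha * (T * T) * kl

sft_only = kd_loss([2.0, 0.5, -1.0], [2.2, 0.4, -0.9], target=0, alpha=0.0)
with_kd = kd_loss([2.0, 0.5, -1.0], [2.2, 0.4, -0.9], target=0, alpha=0.5)
```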

RL — Reinforcement Learning

Instead of fixed labels, the model learns from a reward signal.

Generate output → score it → update toward higher scores (GRPO).

Used for OCR quality and translation fluency.
RL (Reinforcement Learning)
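GRPO's "score it" step works on a group of sampled outputs for the same prompt: each reward is normalized against its own group, so above-average samples get a positive advantage. A sketch of that normalization (the full algorithm then updates the policy toward positive-advantage samples):

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its group.
    Samples better than the group average get positive advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled outputs for one prompt, scored by a reward function.
advs = group_advantages([0.9, 0.2, 0.5, 0.4])
```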

Environment + First Run

Activate env, run a smoke test:

python forge.py \
  --config configs/classification/mnli/distilbert/00_sft.yaml \
  --max_steps 10

00. Environment Setup

YAML Config System

Every experiment = one YAML file.

task: what dataset/problem to use
model.name: which model
alpha: 0.0 → SFT (no distillation)
alpha: 0.5 → KD active
rl: block → enables RL mode

Configs inherit from parent files.
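Putting the keys above together, an experiment file might look like this. Only `task`, `model.name`, `alpha`, and `rl` come from this doc; everything else is illustrative, not Forge's actual schema:

```yaml
# Hypothetical experiment config — field names beyond task/model.name/alpha/rl
# are assumptions for illustration.
task: mnli                       # what dataset/problem to use
model:
  name: distilbert-base-uncased  # which model
alpha: 0.5                       # 0.0 = pure SFT, 0.5 = KD active
# rl:                            # uncomment the rl: block to enable RL mode
#   algorithm: grpo
```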

Reading Outputs

After a run, check:

output.log — what happened step by step
run_metadata.json — run info
training_state.json — training progress
checkpoint-N/ — saved model weights
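One thing to watch with `checkpoint-N/` directories: pick the latest by the step *number*, not alphabetically (`checkpoint-100` sorts before `checkpoint-50` as a string). A hedged helper sketch — the directory naming comes from this doc, the function itself is hypothetical:

```python
import os
import re
import tempfile

def latest_checkpoint(run_dir):
    """Return the checkpoint-N/ directory with the highest step number,
    or None if there are no checkpoints yet."""
    ckpts = [d for d in os.listdir(run_dir)
             if re.fullmatch(r"checkpoint-\d+", d)]
    if not ckpts:
        return None
    return max(ckpts, key=lambda d: int(d.split("-")[1]))

# Simulate a run directory with a few saved checkpoints.
run_dir = tempfile.mkdtemp()
for n in (10, 100, 50):
    os.makedirs(os.path.join(run_dir, f"checkpoint-{n}"))
best = latest_checkpoint(run_dir)
```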

01. First Run Explained

KD Loss Strategies

9 pluggable distillation loss functions:

· Response-based — compare output logits
· Feature-based — compare hidden states
· Attention-based — compare attention weights

FAR factorial study: which combination works best per task?

Workbench Analysis

After training, analyse what the model actually learned:

Linear Probing — which layer learns task info?
CKA — does student align with teacher?
Disagreement — where does student diverge from teacher?

SLURM / HPC Scale

For multi-hour/day training runs on a cluster:

sbatch run_training.sh configs/...

Auto-saves checkpoint on timeout.
Auto-resumes from where it left off.
No babysitting needed.
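The "auto-save on timeout" pattern usually works by having SLURM deliver a signal shortly before the time limit and trapping it in the training process. A generic sketch — the `USR1@120` request is an assumption about cluster setup, not necessarily what `run_training.sh` does:

```python
import signal

checkpoint_requested = False

def on_timeout(signum, frame):
    """Flag that we should checkpoint at the next safe point in the loop."""
    global checkpoint_requested
    checkpoint_requested = True

# SLURM can be asked to send SIGUSR1 before the time limit, e.g. with
# `#SBATCH --signal=USR1@120` (120s of warning) — adjust to your cluster.
signal.signal(signal.SIGUSR1, on_timeout)

# Simulate the warning signal arriving mid-training:
signal.raise_signal(signal.SIGUSR1)
# ...the training loop would now save checkpoint-N/ and exit cleanly.
```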

What is AI / ML?

Teach computers to find patterns in data — instead of writing rules by hand.

model = a function: input → output
training = adjusting the model to be more accurate

Neural Networks

Layers of numbers (neurons) that transform an input step by step.

Each layer learns a different feature of the data.

Deep learning = networks with many layers.

Key Vocab

weights — numbers inside the model that get adjusted
loss — how wrong the model is (lower = better)
epoch — one full pass over all training data
step — one weight update
inference — using a trained model to predict
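All of the vocab above fits in a ten-line training loop. A toy example fitting `y = w * x` by gradient descent (not how real networks are trained at scale, but every term appears once):

```python
# Toy data where the true weight is 2.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0          # weight: the number inside the model that gets adjusted
lr = 0.05        # learning rate: size of each adjustment

for epoch in range(50):            # epoch: one full pass over the data
    for x, y in data:
        pred = w * x               # inference on one example
        loss = (pred - y) ** 2     # loss: how wrong (squared error)
        grad = 2 * (pred - y) * x  # gradient of the loss w.r.t. w
        w -= lr * grad             # step: one weight update
```

After training, `w` lands very close to 2.0 — the model has "found the pattern" instead of being told the rule.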