Learning Path: From Zero to Forge
No prior AI knowledge needed. Follow the phases in order — each one unlocks the next.
End goal: understand what Forge does and why every decision was made.
Transformer
Architecture that reads text using attention: every word looks at every other word at the same time.
All modern LLMs are built on this (introduced in 2017).
BERT
Encoder-only Transformer pretrained on massive amounts of text.
pretrained = already learned general language understanding
fine-tuning = adapt it for your specific task
DistilBERT
Smaller, faster version of BERT.
Retains ~97% of BERT's performance while being 40% smaller and 60% faster.
Created via knowledge distillation — a student model that learned from BERT as its teacher.
Classification
Predict a label given text input.
Example — MNLI:
- Premise: A man is eating pizza
- Hypothesis: Someone is eating food
- Label: entailment
→ MNLI
Translation
Convert text from one language to another.
Forge: English → French (Europarl dataset)
Model reads an EN sentence and generates the FR equivalent token by token.
Metric: BLEU score.
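BLEU scores n-gram overlap between the model output and a reference translation. A minimal sketch of one ingredient, clipped unigram precision (real BLEU also combines higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: one ingredient of BLEU."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Each candidate word is credited only up to its count in the reference,
    # so repeating a correct word does not inflate the score.
    matched = sum(min(c, ref[w]) for w, c in cand.items())
    return matched / max(1, sum(cand.values()))
```

For example, "the the the" against reference "the cat sat" scores 1/3, not 1.0, because "the" is clipped to its single reference occurrence.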
OCR / Document AI
Extract structured text from images of documents.
Forge: scan a PDF page → markdown text.
Requires a Vision-Language Model (VLM) — processes both images and text as input.
SFT — Fine-Tuning
Take a pretrained model. Train it on labelled examples for your task.
Loss = cross-entropy: how wrong is the predicted label?
Simplest approach. Always start here.
→ SFT (Supervised Fine-Tuning)
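The cross-entropy loss above can be written in a few lines of plain Python (a minimal sketch, no framework; the logit values are illustrative):

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target])

# 3 MNLI labels: 0=entailment, 1=neutral, 2=contradiction
logits = [2.0, 0.5, -1.0]               # raw model scores
loss = cross_entropy(logits, target=0)  # low loss: the model favours "entailment"
```

The loss is small when the model puts high probability on the right label and grows the more confidently wrong it is.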
KD — Knowledge Distillation
Train a small student by learning from a large teacher.
Student learns not just right answers, but the teacher's uncertainty (soft probabilities).
Goal: small model ≈ teacher quality.
→ KD (Knowledge Distillation)
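The "soft probabilities" idea can be sketched as a KL divergence between temperature-softened teacher and student distributions (a minimal sketch; the common T² loss scaling is omitted for brevity):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T spreads probability mass."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    return [e / sum(exps) for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions: zero when they match."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

With T > 1 the teacher's near-misses keep visible probability, so the student learns which wrong answers the teacher considers plausible, not just the argmax.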
RL — Reinforcement Learning
Instead of fixed labels, the model learns from a reward signal.
Generate output → score it → update toward higher scores (GRPO).
Used for OCR quality and translation fluency.
→ RL (Reinforcement Learning)
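The "generate → score → update" loop in GRPO scores a group of sampled outputs and standardizes each reward within the group; a minimal sketch of that advantage step (illustrative reward values):

```python
def group_advantages(rewards):
    """GRPO-style advantages: each reward standardized within its group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        return [0.0] * n  # all outputs scored equally: no learning signal
    return [(r - mean) / std for r in rewards]

# e.g. 4 sampled translations scored by a fluency reward
adv = group_advantages([0.2, 0.8, 0.5, 0.5])
```

Outputs above the group mean get positive advantage (pushed up), those below get negative (pushed down), with no separate value model needed.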
Environment + First Run
Activate env, run a smoke test:
python forge.py \
  --config configs/classification/mnli/distilbert/00_sft.yaml \
  --max_steps 10
YAML Config System
Every experiment = one YAML file.
task: what dataset/problem to use
model.name: which model
alpha: 0.0 → SFT (no distillation)
alpha: 0.5 → KD active
rl: block → enables RL mode
Configs inherit from parent files.
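Putting the keys above together, a config might look like this (an illustrative sketch: only `task`, `model.name`, and `alpha` are named above; the nesting and values are assumptions, not Forge's actual schema):

```yaml
# Illustrative only; check a real file under configs/ for the exact schema.
task: mnli
model:
  name: distilbert-base-uncased
alpha: 0.5   # 0.0 = plain SFT (no distillation), 0.5 = KD active
```

One experiment, one file: diffing two YAMLs shows exactly what changed between runs.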
Reading Outputs
After a run, check:
output.log — what happened step by step
run_metadata.json — run info
training_state.json — training progress
checkpoint-N/ — saved model weights
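Assuming the `checkpoint-N/` naming above, a small helper (hypothetical, not part of Forge) can locate the most recent checkpoint in a run directory:

```python
from pathlib import Path

def latest_checkpoint(run_dir):
    """Return the checkpoint-N directory with the highest step N, or None."""
    ckpts = [p for p in Path(run_dir).glob("checkpoint-*") if p.is_dir()]
    if not ckpts:
        return None
    # Sort numerically on N, not lexically: checkpoint-200 > checkpoint-30.
    return max(ckpts, key=lambda p: int(p.name.split("-")[1]))
```

Sorting on the integer suffix matters: a plain string sort would rank `checkpoint-30` above `checkpoint-200`.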
KD Loss Strategies
9 pluggable distillation loss functions, grouped into three families:
· Response-based — compare output logits
· Feature-based — compare hidden states
· Attention-based — compare attention weights
FAR factorial study: which combination works best per task?
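A feature-based loss, for instance, can be as simple as a mean-squared error between matched student and teacher hidden states (a minimal sketch; Forge's actual loss implementations are not shown here):

```python
def feature_loss(student_hidden, teacher_hidden):
    """Feature-based KD: MSE between matched hidden-state vectors."""
    n = len(student_hidden)
    return sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hidden)) / n
```

Response-based and attention-based losses follow the same pattern, just compared at a different point in the network (output logits, attention weights).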
Workbench Analysis
After training, analyse what the model actually learned:
Linear Probing — which layer learns task info?
CKA — does student align with teacher?
Disagreement — where does student diverge from teacher?
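CKA itself is short enough to sketch; this is the linear variant, computed on column-centered representation matrices where each row is one example (pure-Python sketch, no NumPy):

```python
def center(X):
    """Subtract each column's mean from an n-by-d matrix (list of rows)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def _fro2(A, B):
    """||A^T B||_F^2 for matrices with the same number of rows."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(A[k][i] * B[k][j] for k in range(len(A)))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: 1.0 means the representations align perfectly."""
    X, Y = center(X), center(Y)
    return _fro2(X, Y) / (_fro2(X, X) ** 0.5 * _fro2(Y, Y) ** 0.5)
```

CKA is invariant to rescaling, so a student layer that is a scaled copy of a teacher layer still scores 1.0.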
SLURM / HPC Scale
For multi-hour/day training runs on a cluster:
sbatch run_training.sh configs/...
Auto-saves checkpoint on timeout.
Auto-resumes from where it left off.
No babysitting needed.
What is AI / ML?
Teach computers to find patterns in data — instead of writing rules by hand.
model = a function: input → output
training = adjusting the model to be more accurate
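Both definitions fit in a few lines: a one-weight model trained by gradient descent (toy example, illustrative data):

```python
def train(data, lr=0.1, steps=100):
    """Fit y ~ w * x by gradient descent on squared error."""
    w = 0.0  # the model's single weight
    for _ in range(steps):
        for x, y in data:
            pred = w * x                 # model: input -> output
            grad = 2 * (pred - y) * x    # d(loss)/dw for loss = (pred - y)^2
            w -= lr * grad               # adjust the weight to reduce loss
    return w

w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])  # learns w close to 2
```

No rule "multiply by 2" was ever written; the pattern was found from the data. Real models do exactly this with billions of weights.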
Neural Networks
Layers of numbers (neurons) that transform an input step by step.
Each layer learns a different feature of the data.
Deep learning = networks with many layers.
Key Vocab
weights — numbers inside the model that get adjusted
loss — how wrong the model is (lower = better)
epoch — one full pass over all training data
step — one weight update
inference — using a trained model to predict
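The vocab connects through simple arithmetic (illustrative numbers):

```python
import math

n_examples = 10_000  # training set size (illustrative)
batch_size = 32      # examples consumed per step

# One step = one weight update on one batch;
# one epoch = enough steps to see every example once.
steps_per_epoch = math.ceil(n_examples / batch_size)
```

So with these numbers, 3 epochs means 3 × 313 = 939 weight updates.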