Forge Learning Plan
This document is the recommended learning path for this repo.
Goal:
- Understand what Forge does
- Learn the minimum ML / model / infra concepts needed for this codebase
- Get to a first small PR without getting lost
Recommended Order
1. Run the project first
- Finish environment setup
- Run `python test_setup.py`
- Run a smoke test with `--max_steps 10`
- Confirm you know:
- where a run starts
- where outputs are written
- why the run stops
Current example run:

`python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10`

Current output directory:

`outputs/classification/mnli/distilbert/00_sft/20260310-0138Z`
2. Build the high-level Forge mental model
- Learn the high-level pipeline:
config -> dataset -> tokenizer/processor -> model(s) -> trainer -> checkpoints/logs -> evaluation -> outputs
- Understand that Forge is a training framework around Hugging Face models
- Understand that the config chooses:
- task
- model
- training mode
Important:
- Do not read line-by-line yet
- First understand the flow
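The flow above can be sketched as a toy script. Every function here is a hypothetical stand-in (the real logic lives in `forge.py` and `src/`); the point is only to make the stage order concrete:

```python
# Toy sketch of the Forge pipeline stages. All function bodies are
# stand-ins, shown only to make the config -> data -> model -> trainer
# -> outputs flow concrete.

def load_config(path):
    # Stand-in for YAML parsing plus `inherit` resolution.
    return {"task": "mnli", "model": "distilbert-base-uncased", "mode": "sft"}

def load_data(config):
    # Stand-in for dataset loading and tokenization (src/data/datasets.py).
    return [("A premise.", "A hypothesis.", 0)]

def train(config, data, max_steps):
    # Stand-in for the training loop (src/training/trainer.py):
    # run optimizer steps, then record the last checkpoint.
    return {"global_step": max_steps, "checkpoint": f"checkpoint-{max_steps}"}

def run(config_path, max_steps):
    config = load_config(config_path)
    data = load_data(config)
    return train(config, data, max_steps)

state = run("configs/classification/mnli/distilbert/00_sft.yaml", max_steps=10)
print(state["checkpoint"])  # -> checkpoint-10
```

Holding this skeleton in mind makes it much easier to place each real source file in a later phase.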
3. Learn the core terminology
- `run`
- `step`
- `epoch`
- `batch`
- `checkpoint`
- `resume`
- `task`
- `dataset`
- `model`
- `trainer`
- `pretrained model`
- `fine-tuning`
Minimum definitions to keep in mind:
- `run` = one experiment execution
- `step` = one optimizer update
- `epoch` = one full pass over the training set
- `task` = what problem you are solving
- `model` = the neural network being trained
- `training mode` = SFT / KD / RL
4. Learn the three most important separations
- `task` vs `model`
- `model` vs `training mode`
- `checkpoint` vs `final model`
For this repo, this separation is critical:
- `MNLI` = task / dataset
- `DistilBERT` = model
- `SFT` = training mode
5. Learn the current concrete example first
Start with this config:
`configs/classification/mnli/distilbert/00_sft.yaml`
Checklist:
- I know this is an `MNLI` task
- I know this uses `DistilBERT`
- I know this run is CE-only SFT
- I know `alpha = 0.0` means no active distillation
- I know `--max_steps 10` overrode full training
What this config means in plain English:
- Start from a pretrained DistilBERT model
- Fine-tune it on the MNLI classification task
- Predict the relationship between a premise and a hypothesis
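The "CE-only" reading of `alpha = 0.0` follows from the usual knowledge-distillation loss blend. A minimal sketch, assuming Forge combines losses the common way (the actual formula lives in `src/training/trainer.py`):

```python
# Hedged sketch: assumes the common KD blend
#   loss = alpha * distill_loss + (1 - alpha) * ce_loss
# Check src/training/trainer.py for Forge's actual combination.

def combined_loss(ce_loss: float, distill_loss: float, alpha: float) -> float:
    return alpha * distill_loss + (1.0 - alpha) * ce_loss

# With alpha = 0.0 the teacher term vanishes: pure cross-entropy SFT.
print(combined_loss(ce_loss=0.7, distill_loss=0.3, alpha=0.0))  # -> 0.7
```

This is also why a `teacher` entry can sit in the config without affecting this run: at `alpha = 0.0` its loss term contributes nothing.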
What To Learn First
A. Hugging Face basics
- What the Hugging Face Model Hub is
- What the Hugging Face Datasets Hub is
- What `transformers` provides
- What `datasets` provides
- How `from_pretrained(...)` works
Concrete examples from this repo:
- `distilbert-base-uncased`
- `bert-base-uncased`
- `glue / mnli`
B. Minimum ML concepts
- Supervised learning
- Classification
- Training loop:
- forward pass
- loss
- backward pass
- optimizer step
- Cross-entropy
- Accuracy / F1 / MCC
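The training-loop bullets above can be made concrete with a toy example. This is plain-Python logistic regression, not Forge code; it only shows the forward / loss / backward / step cycle:

```python
import math

# Minimal supervised training loop (forward -> loss -> backward -> step)
# on toy 1-feature logistic regression, so no ML libraries are needed.
# Forge's trainer runs the same shape of loop, just much bigger.

data = [(x / 10.0, 1 if x > 5 else 0) for x in range(11)]  # (feature, label)
w, b, lr = 0.0, 0.0, 0.5

for epoch in range(200):                 # one epoch = one full pass over data
    for x, y in data:                    # here, batch size = 1
        logit = w * x + b                # forward pass
        p = 1.0 / (1.0 + math.exp(-logit))
        p = min(max(p, 1e-12), 1 - 1e-12)  # clamp for numerical safety
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))  # cross-entropy
        grad = p - y                     # backward pass: d(loss)/d(logit)
        w -= lr * grad * x               # optimizer step (plain SGD)
        b -= lr * grad

accuracy = sum((w * x + b > 0) == (y == 1) for x, y in data) / len(data)
print(f"accuracy = {accuracy:.2f}")
```

If you can narrate each line of this loop, you know enough to read the real trainer for flow.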
Do not learn everything first.
Only learn enough to explain what this run is doing.
C. Model concepts used by this repo
- What a Transformer model is
- What an encoder model is
- Why BERT / DistilBERT are encoder models
- What tokenization does
- What model inputs and outputs look like
For the current run, you should be able to explain:
- input = premise + hypothesis
- output = 3-way classification logits
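A sketch of that input/output contract, assuming BERT-style `[CLS]`/`[SEP]` packing (the convention DistilBERT follows; the real tokenizer produces token ids, not strings, and these sentences and logits are invented):

```python
# Sketch of the paired input an encoder classifier sees.
# [CLS]/[SEP] markers are the BERT-style convention; real runs go
# through the Hugging Face tokenizer.

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

model_input = f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

# The model's output is one logit per MNLI label; argmax picks the prediction.
labels = ["entailment", "neutral", "contradiction"]
example_logits = [2.1, 0.3, -1.5]        # illustrative numbers only
prediction = labels[example_logits.index(max(example_logits))]
print(model_input)
print(prediction)  # -> entailment
```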
D. Forge-specific usage
- YAML-driven configuration
- Config inheritance via `inherit`
- `--config` vs `--resume`
- Output directory structure
- Why `run_metadata.json`, `training_state.json`, and `output.log` matter
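A hypothetical sketch of `inherit`-style merging. Forge's real loader may differ; the common implementation is a recursive dict merge where child values override the parent:

```python
# Hypothetical sketch of `inherit`-style config merging; Forge's real
# loader may differ, but YAML inheritance is usually a recursive
# override: child wins, nested sections merge.

def merge(parent: dict, child: dict) -> dict:
    out = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)   # merge nested sections
        else:
            out[key] = value                    # child overrides parent
    return out

base = {"model": {"name": "distilbert-base-uncased", "max_length": 128},
        "training": {"alpha": 0.5}}
child = {"training": {"alpha": 0.0}}            # CE-only override

print(merge(base, child))
```

Note how the child only states what it changes; everything else flows in from the base config.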
Learning Path By File
Phase 1: Files to understand first
- `README.md`
- `configs/classification/mnli/distilbert/00_sft.yaml`
- `configs/classification/mnli/distilbert/base.yaml`
- `configs/classification/mnli/base.yaml`
- `learn/forge-high-level.canvas`
Goal:
- understand the example run at a high level
Phase 2: Main runtime path
- `forge.py`
- `src/data/datasets.py`
- `src/tasks/registry.py`
- `src/models/loader.py`
Goal:
- understand where config, dataset, and model come from
Phase 3: Training behavior
- `src/training/trainer.py`
- `src/evaluation/evaluate.py`
Goal:
- understand SFT first
- understand KD second
Phase 4: Infra behavior
- `src/training/callbacks.py`
- `src/utils/checkpoint.py`
- `scripts/slurm/run_training.sh`
Goal:
- understand how long runs survive and resume
Phase 5: Later topics
- `src/training/rewards.py`
- RL configs
- `workbench/`
Important:
- RL comes later
- Workbench comes later
- Do not block on GRPO right now
Questions You Should Be Able To Answer
- Where do the model and dataset come from?
- What key fields in `00_sft.yaml` control behavior?
- What is the difference between task, model, and training mode?
- What major stages happen in one Forge run?
- What is the difference between `--config` and `--resume`?
- What files appear in the output directory, and why?
- Why did the first run stop after 10 steps?
- Why was this run SFT and not KD?
Minimal Practical Tasks
Task 1: Run environment check
- Run: `python test_setup.py`

Task 2: Run a 10-step smoke test
- Run: `python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10`

Task 3: Inspect the first run outputs
- Open `run_metadata.json`
- Open `output.log`
- Open `training_state.json`
- Open `checkpoint-10/`
- Explain what each file means
Current important files:
- `outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/run_metadata.json`
- `outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/training_state.json`
- `outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/output.log`
- `outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/checkpoint-10/`
Task 4: Explain the config in plain English
- Explain what `inherit: base.yaml` does
- Explain what `model.name` means
- Explain why `teacher` is present but unused
- Explain why `alpha = 0.0` means CE-only
- Explain what `max_length = 128` means
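To tie those fields together, here is a hypothetical illustration of what such a config could look like. This is not the real file; field placement and any names beyond those mentioned in this plan are invented, so open `00_sft.yaml` for ground truth:

```yaml
# Hypothetical illustration only; the real 00_sft.yaml is authoritative.
inherit: base.yaml            # pull defaults from the sibling base config

model:
  name: distilbert-base-uncased
  max_length: 128             # truncate/pad premise+hypothesis pairs

teacher:
  name: bert-base-uncased     # present but inactive in this run

training:
  alpha: 0.0                  # 0.0 => pure cross-entropy, no distillation
```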
Task 5: Explain the MNLI task
- Explain what `premise` is
- Explain what `hypothesis` is
- Explain the 3 labels:
- entailment
- neutral
- contradiction
- Explain what the model is predicting
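An illustrative MNLI-style record, using the standard GLUE label ids (0 = entailment, 1 = neutral, 2 = contradiction); the sentences themselves are invented:

```python
# An illustrative MNLI-style example using the standard GLUE label ids.
label_names = ["entailment", "neutral", "contradiction"]

example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A musician is performing.",
    "label": 0,  # entailment: the hypothesis follows from the premise
}

print(f"{example['premise']!r} -> {label_names[example['label']]}")
```

The model's job is exactly this mapping: read the pair, emit three logits, and let argmax choose one of the three labels.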
Suggested Order For The Next Few Sessions
Session 1
- Run the project
- Find the output directory
- Understand what a `run` is
Session 2
- Understand:
- task
- dataset
- model
- training mode
- Explain `MNLI + DistilBERT + SFT`
Session 3
- Read `forge.py` for flow only
- Do not try to understand every function
- Identify:
- where config is loaded
- where dataset is loaded
- where model is loaded
- where training starts
Session 4
- Read `src/data/datasets.py`
- Read `src/tasks/registry.py`
- Understand how MNLI is loaded and tokenized
Session 5
- Read `src/models/loader.py`
- Understand what `from_pretrained(...)` means
- Understand why DistilBERT is used as the student
Session 6
- Read checkpoint / resume logic
- Understand:
- `training_state.json`
- `latest_checkpoint`
- `checkpoint-*`
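A sketch of the resume flow those files enable. The field names and JSON schema here are assumptions for illustration, not Forge's actual format:

```python
import json
import os
import tempfile

# Sketch of a resume flow: read training_state.json to find the latest
# checkpoint. The "latest_checkpoint" field is an assumed schema, not
# necessarily Forge's; see src/utils/checkpoint.py for the real logic.

def latest_checkpoint(run_dir: str):
    state_path = os.path.join(run_dir, "training_state.json")
    if not os.path.exists(state_path):
        return None                       # fresh run: nothing to resume
    with open(state_path) as f:
        state = json.load(f)
    return os.path.join(run_dir, state["latest_checkpoint"])

# Simulate a run directory that stopped at step 10.
run_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(run_dir, "checkpoint-10"))
with open(os.path.join(run_dir, "training_state.json"), "w") as f:
    json.dump({"latest_checkpoint": "checkpoint-10", "global_step": 10}, f)

print(latest_checkpoint(run_dir))  # ends with .../checkpoint-10
```

The key idea survives any schema difference: the state file is tiny and cheap to rewrite every step, while checkpoints are heavy and written occasionally, so long runs can crash and pick up where they left off.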
PR-Readiness Checklist
- I can explain the full run flow in plain language
- I can explain the difference between task, model, and training mode
- I can reproduce a short run and locate its artifacts
- I can identify where a config value affects runtime behavior
- I can describe one safe, small change
- I can explain how I would validate that change
Notes
- Forge is centered on model training, not only LLMs
- The current learning priority is:
- SFT first
- KD second
- RL later
- Do not learn all of ML before reading this repo
- Learn concepts only when they unblock understanding of the current code path