Forge Learning Plan

This document is the recommended learning path for this repo.

Goals:

  • Understand what Forge does
  • Learn the minimum ML / model / infra concepts needed for this codebase
  • Get to a first small PR without getting lost

1. Run the project first

  • Finish environment setup
  • Run python test_setup.py
  • Run a smoke test with --max_steps 10
  • Confirm you know:
    • where a run starts
    • where outputs are written
    • why the run stops

Current example run:

  • python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10

Current output directory:

  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z

2. Build the high-level Forge mental model

  • Learn the high-level pipeline:
    • config -> dataset -> tokenizer/processor -> model(s) -> trainer -> checkpoints/logs -> evaluation -> outputs
  • Understand that Forge is a training framework around Hugging Face models
  • Understand that the config chooses:
    • task
    • model
    • training mode
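
The pipeline above can be sketched as ordinary function calls. Everything below is a hypothetical stand-in written to show the stage order only; none of it is the actual Forge API.

```python
# A runnable toy sketch of the Forge stage order. Every function and key
# here is an illustrative stand-in, not the real Forge code.
STAGES = []  # records which stage ran, in order

def load_config(path):
    STAGES.append("config")
    return {"task": "mnli", "model": "distilbert-base-uncased", "mode": "sft"}

def load_dataset_for(config):
    STAGES.append("dataset")
    return [("A premise.", "A hypothesis.", 0)]  # (premise, hypothesis, label)

def load_tokenizer_for(config):
    STAGES.append("tokenizer")
    return lambda p, h: f"[CLS] {p} [SEP] {h} [SEP]"

def load_model_for(config):
    STAGES.append("model")
    return lambda text: [0.1, 0.2, 0.3]  # fake 3-way logits

def run_forge(config_path, max_steps=None):
    config = load_config(config_path)
    dataset = load_dataset_for(config)
    tokenize = load_tokenizer_for(config)
    model = load_model_for(config)
    STAGES.append("trainer")     # checkpoints and logs get written here
    STAGES.append("evaluation")  # metrics computed after training
    STAGES.append("outputs")     # everything lands under outputs/
    return model(tokenize(*dataset[0][:2]))

run_forge("00_sft.yaml", max_steps=10)
```

The point is the order, not the bodies: config decides everything downstream, and training sits between model loading and evaluation.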

Important:

  • Do not read line-by-line yet
  • First understand the flow

3. Learn the core terminology

  • run
  • step
  • epoch
  • batch
  • checkpoint
  • resume
  • task
  • dataset
  • model
  • trainer
  • pretrained model
  • fine-tuning

Minimum definitions to keep in mind:

  • run = one experiment execution
  • step = one optimizer update
  • epoch = one full pass over the training set
  • task = what problem you are solving
  • model = the neural network being trained
  • training mode = SFT (supervised fine-tuning) / KD (knowledge distillation) / RL (reinforcement learning)
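
The arithmetic linking batch, step, and epoch is worth pinning down once. A small sketch (MNLI's training split has 392,702 examples; the batch size and epoch count below are invented for illustration):

```python
import math

# MNLI's training split has 392,702 examples. Batch size and epoch count
# are made-up illustration values, not this repo's actual config.
num_examples = 392_702
batch_size = 32        # examples consumed per optimizer step
num_epochs = 3         # full passes over the training set

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 12272 optimizer updates per epoch
print(total_steps)      # 36816 updates for the full run
# A smoke test with --max_steps 10 stops after just 10 of these updates.
```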

4. Learn the three most important separations

  • task vs model
  • model vs training mode
  • checkpoint vs final model

For this repo, keeping these separations straight is critical:

  • MNLI = task / dataset
  • DistilBERT = model
  • SFT = training mode

5. Learn the current concrete example first

Start with this config:

  • configs/classification/mnli/distilbert/00_sft.yaml

Checklist:

  • I know this is an MNLI task
  • I know this uses DistilBERT
  • I know this run is cross-entropy-only (CE-only) SFT
  • I know alpha = 0.0 means no active distillation
  • I know --max_steps 10 stopped the run early instead of letting it train to completion

What this config means in plain English:

  • Start from a pretrained DistilBERT model
  • Fine-tune it on the MNLI classification task
  • Predict the relationship between a premise and a hypothesis

What To Learn First

A. Hugging Face basics

  • What the Hugging Face Model Hub is
  • What the Hugging Face Datasets Hub is
  • What transformers provides
  • What datasets provides
  • How from_pretrained(...) works

Concrete examples from this repo:

  • distilbert-base-uncased
  • bert-base-uncased
  • glue / mnli

B. Minimum ML concepts

  • Supervised learning
  • Classification
  • Training loop:
    • forward pass
    • loss
    • backward pass
    • optimizer step
  • Cross-entropy
  • Accuracy / F1 / MCC

Do not learn everything first.
Only learn enough to explain what this run is doing.
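
Cross-entropy is the only loss this run uses, so it is worth seeing once without a framework. A plain-Python sketch of softmax plus cross-entropy over 3-way logits:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_label):
    # Negative log of the probability assigned to the correct class.
    return -math.log(softmax(logits)[true_label])

# 3-way MNLI-style logits: the model leans strongly toward class 0.
logits = [2.0, 0.5, -1.0]
low = cross_entropy(logits, 0)   # small loss: confident and correct
high = cross_entropy(logits, 2)  # large loss: confident and wrong
assert low < high
```

Frameworks fuse softmax and the log into one numerically stable op, but the quantity being minimized is exactly this.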

C. Model concepts used by this repo

  • What a Transformer model is
  • What an encoder model is
  • Why BERT / DistilBERT are encoder models
  • What tokenization does
  • What model inputs and outputs look like

For the current run, you should be able to explain:

  • input = premise + hypothesis
  • output = 3-way classification logits
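
Concretely, a BERT-style encoder packs the pair into one sequence and the classification head returns one logit per label. A sketch with fabricated logits (the real tokenizer produces token ids, not a string, but the pair layout is the same):

```python
# How a premise/hypothesis pair is presented to a BERT-style encoder, and
# what comes back. The logits are made up for illustration.
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# BERT-style tokenizers join a sentence pair into one sequence:
model_input = f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

# The classification head returns one logit per label (3-way for MNLI):
logits = [3.1, -0.4, -1.7]                  # fabricated values
labels = ["entailment", "neutral", "contradiction"]
prediction = labels[logits.index(max(logits))]
print(prediction)  # entailment
```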

D. Forge-specific usage

  • YAML-driven configuration
  • Config inheritance via inherit
  • --config vs --resume
  • Output directory structure
  • Why run_metadata.json, training_state.json, and output.log matter
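
Conceptually, inherit means "start from the parent config, then let the child override". A sketch of that merge; the function and the config keys below are illustrative stand-ins, not Forge's actual implementation or the real contents of its YAML files:

```python
def deep_merge(base, override):
    # Child values win; nested dicts are merged key by key.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative stand-ins for base.yaml and 00_sft.yaml contents:
base = {"model": {"name": "distilbert-base-uncased", "max_length": 128},
        "training": {"alpha": 0.5, "epochs": 3}}
child = {"training": {"alpha": 0.0}}          # CE-only override

config = deep_merge(base, child)
print(config["training"]["alpha"])       # 0.0 (child wins)
print(config["model"]["max_length"])     # 128 (inherited from base)
```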

Learning Path By File

Phase 1: Files to understand first

  • README.md
  • configs/classification/mnli/distilbert/00_sft.yaml
  • configs/classification/mnli/distilbert/base.yaml
  • configs/classification/mnli/base.yaml
  • learn/forge-high-level.canvas

Goal:

  • understand the example run at a high level

Phase 2: Main runtime path

  • forge.py
  • src/data/datasets.py
  • src/tasks/registry.py
  • src/models/loader.py

Goal:

  • understand where config, dataset, and model come from

Phase 3: Training behavior

  • src/training/trainer.py
  • src/evaluation/evaluate.py

Goal:

  • understand SFT first
  • understand KD second

Phase 4: Infra behavior

  • src/training/callbacks.py
  • src/utils/checkpoint.py
  • scripts/slurm/run_training.sh

Goal:

  • understand how long-running jobs survive interruptions and resume

Phase 5: Later topics

  • src/training/rewards.py
  • RL configs
  • workbench/

Important:

  • RL comes later
  • Workbench comes later
  • Do not block on GRPO right now

Questions You Should Be Able To Answer

  • Where do the model and dataset come from?
  • What key fields in 00_sft.yaml control behavior?
  • What is the difference between task, model, and training mode?
  • What major stages happen in one Forge run?
  • What is the difference between --config and --resume?
  • What files appear in the output directory, and why?
  • Why did the first run stop after 10 steps?
  • Why was this run SFT and not KD?

Minimal Practical Tasks

Task 1: Run environment check

  • Run:
python test_setup.py

Task 2: Run a 10-step smoke test

  • Run:
python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10

Task 3: Inspect the first run outputs

  • Open run_metadata.json
  • Open output.log
  • Open training_state.json
  • Open checkpoint-10/
  • Explain what each file means

Current important files:

  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/run_metadata.json
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/training_state.json
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/output.log
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/checkpoint-10/

Task 4: Explain the config in plain English

  • Explain what inherit: base.yaml does
  • Explain what model.name means
  • Explain why teacher is present but unused
  • Explain why alpha = 0.0 means CE-only
  • Explain what max_length = 128 means
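
Why alpha = 0.0 means CE-only: knowledge distillation commonly blends the plain cross-entropy loss with a teacher-matching (KD) loss. Below is that generic blend as a sketch; check src/training/trainer.py for the exact formulation Forge uses.

```python
def combined_loss(ce_loss, kd_loss, alpha):
    # A common knowledge-distillation blend; not necessarily Forge's
    # exact formula.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

ce, kd = 0.7, 1.3   # made-up loss values

assert combined_loss(ce, kd, alpha=0.0) == ce   # alpha = 0.0 -> pure CE (this run)
print(combined_loss(ce, kd, alpha=0.5))         # 1.0 -> even CE/KD mix
```

With alpha pinned to 0.0 the teacher's term is multiplied away, which is why the teacher can be present in the config yet unused.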

Task 5: Explain the MNLI task

  • Explain what premise is
  • Explain what hypothesis is
  • Explain the 3 labels:
    • entailment
    • neutral
    • contradiction
  • Explain what the model is predicting
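
One way to internalize the three labels is a fixed premise with three hypotheses, one per label. The pairs below are invented; the label ids follow the glue/mnli convention.

```python
# glue/mnli label ids and their meanings, with invented example pairs:
labels = {0: "entailment", 1: "neutral", 2: "contradiction"}

examples = [
    ("The dog is sleeping on the couch.", "An animal is resting.", 0),
    ("The dog is sleeping on the couch.", "The dog is brown.", 1),
    ("The dog is sleeping on the couch.", "The dog is running outside.", 2),
]

for premise, hypothesis, label_id in examples:
    print(f"{labels[label_id]:13s} premise={premise!r} hypothesis={hypothesis!r}")
```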

Suggested Order For The Next Few Sessions

Session 1

  • Run the project
  • Find the output directory
  • Understand what a run is

Session 2

  • Understand:
    • task
    • dataset
    • model
    • training mode
  • Explain MNLI + DistilBERT + SFT

Session 3

  • Read forge.py for flow only
  • Do not try to understand every function
  • Identify:
    • where config is loaded
    • where dataset is loaded
    • where model is loaded
    • where training starts

Session 4

  • Read src/data/datasets.py
  • Read src/tasks/registry.py
  • Understand how MNLI is loaded and tokenized

Session 5

  • Read src/models/loader.py
  • Understand what from_pretrained(...) means
  • Understand why DistilBERT is used as the student

Session 6

  • Read checkpoint / resume logic
  • Understand:
    • training_state.json
    • latest_checkpoint
    • checkpoint-*

PR-Readiness Checklist

  • I can explain the full run flow in plain language
  • I can explain the difference between task, model, and training mode
  • I can reproduce a short run and locate its artifacts
  • I can identify where a config value affects runtime behavior
  • I can describe one safe, small change
  • I can explain how I would validate that change

Notes

  • Forge is centered on model training, not only LLMs
  • The current learning priority is:
    • SFT first
    • KD second
    • RL later
  • Do not learn all of ML before reading this repo
  • Learn concepts only when they unblock understanding of the current code path