Forge Learning Plan

This document is the recommended learning path for this repo.

Goals:

  • Understand what Forge does
  • Learn the minimum ML / model / infra concepts needed for this codebase
  • Get to a first small PR without getting lost

1. Run the project first

  • Finish environment setup
  • Run python test_setup.py
  • Run a smoke test with --max_steps 10
  • Confirm you know:
    • where a run starts
    • where outputs are written
    • why the run stops

Current example run:

  • python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10

Current output directory:

  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z

2. Build the high-level Forge mental model

  • Learn the high-level pipeline:
    • config -> dataset -> tokenizer/processor -> model(s) -> trainer -> checkpoints/logs -> evaluation -> outputs
  • Understand that Forge is a training framework around Hugging Face models
  • Understand that the config chooses:
    • task
    • model
    • training mode
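
The pipeline above can be sketched as ordinary function calls. Everything below is a hypothetical stand-in written to show the stage order only; none of it is the actual Forge API.

```python
# A runnable toy sketch of the Forge stage order. Every function and key
# here is an illustrative stand-in, not the real Forge code.
STAGES = []  # records which stage ran, in order

def load_config(path):
    STAGES.append("config")
    return {"task": "mnli", "model": "distilbert-base-uncased", "mode": "sft"}

def load_dataset_for(config):
    STAGES.append("dataset")
    return [("A premise.", "A hypothesis.", 0)]  # (premise, hypothesis, label)

def load_tokenizer_for(config):
    STAGES.append("tokenizer")
    return lambda p, h: f"[CLS] {p} [SEP] {h} [SEP]"

def load_model_for(config):
    STAGES.append("model")
    return lambda text: [0.1, 0.2, 0.3]  # fake 3-way logits

def run_forge(config_path, max_steps=None):
    config = load_config(config_path)
    dataset = load_dataset_for(config)
    tokenize = load_tokenizer_for(config)
    model = load_model_for(config)
    STAGES.append("trainer")     # checkpoints and logs get written here
    STAGES.append("evaluation")  # metrics computed after training
    STAGES.append("outputs")     # everything lands under outputs/
    return model(tokenize(*dataset[0][:2]))

run_forge("00_sft.yaml", max_steps=10)
```

The point is the order, not the bodies: config decides everything downstream, and training sits between model loading and evaluation.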

Important:

  • Do not read line-by-line yet
  • First understand the flow

3. Learn the core terminology

  • run
  • step
  • epoch
  • batch
  • checkpoint
  • resume
  • task
  • dataset
  • model
  • trainer
  • pretrained model
  • fine-tuning

Minimum definitions to keep in mind:

  • run = one experiment execution
  • step = one optimizer update
  • epoch = one full pass over the training set
  • task = what problem you are solving
  • model = the neural network being trained
  • training mode = SFT (supervised fine-tuning) / KD (knowledge distillation) / RL (reinforcement learning)
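
The arithmetic linking batch, step, and epoch is worth pinning down once. A small sketch (MNLI's training split has 392,702 examples; the batch size and epoch count below are invented for illustration):

```python
import math

# MNLI's training split has 392,702 examples. Batch size and epoch count
# are made-up illustration values, not this repo's actual config.
num_examples = 392_702
batch_size = 32        # examples consumed per optimizer step
num_epochs = 3         # full passes over the training set

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 12272 optimizer updates per epoch
print(total_steps)      # 36816 updates for the full run
# A smoke test with --max_steps 10 stops after just 10 of these updates.
```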

4. Learn the three most important separations

  • task vs model
  • model vs training mode
  • checkpoint vs final model

For this repo, keeping these separations straight is critical:

  • MNLI = task / dataset
  • DistilBERT = model
  • SFT = training mode

5. Learn the current concrete example first

Start with this config:

  • configs/classification/mnli/distilbert/00_sft.yaml

Checklist:

  • I know this is an MNLI task
  • I know this uses DistilBERT
  • I know this run is cross-entropy-only (CE-only) SFT
  • I know alpha = 0.0 means no active distillation
  • I know --max_steps 10 stopped the run early instead of letting it train to completion

What this config means in plain English:

  • Start from a pretrained DistilBERT model
  • Fine-tune it on the MNLI classification task
  • Predict the relationship between a premise and a hypothesis

What To Learn First

A. Hugging Face basics

  • What the Hugging Face Model Hub is
  • What the Hugging Face Datasets Hub is
  • What transformers provides
  • What datasets provides
  • How from_pretrained(...) works

Concrete examples from this repo:

  • distilbert-base-uncased
  • bert-base-uncased
  • glue / mnli

B. Minimum ML concepts

  • Supervised learning
  • Classification
  • Training loop:
    • forward pass
    • loss
    • backward pass
    • optimizer step
  • Cross-entropy
  • Accuracy / F1 / MCC

Do not learn everything first.
Only learn enough to explain what this run is doing.
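
Cross-entropy is the only loss this run uses, so it is worth seeing once without a framework. A plain-Python sketch of softmax plus cross-entropy over 3-way logits:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_label):
    # Negative log of the probability assigned to the correct class.
    return -math.log(softmax(logits)[true_label])

# 3-way MNLI-style logits: the model leans strongly toward class 0.
logits = [2.0, 0.5, -1.0]
low = cross_entropy(logits, 0)   # small loss: confident and correct
high = cross_entropy(logits, 2)  # large loss: confident and wrong
assert low < high
```

Frameworks fuse softmax and the log into one numerically stable op, but the quantity being minimized is exactly this.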

C. Model concepts used by this repo

  • What a Transformer model is
  • What an encoder model is
  • Why BERT / DistilBERT are encoder models
  • What tokenization does
  • What model inputs and outputs look like

For the current run, you should be able to explain:

  • input = premise + hypothesis
  • output = 3-way classification logits
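
Concretely, a BERT-style encoder packs the pair into one sequence and the classification head returns one logit per label. A sketch with fabricated logits (the real tokenizer produces token ids, not a string, but the pair layout is the same):

```python
# How a premise/hypothesis pair is presented to a BERT-style encoder, and
# what comes back. The logits are made up for illustration.
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# BERT-style tokenizers join a sentence pair into one sequence:
model_input = f"[CLS] {premise} [SEP] {hypothesis} [SEP]"

# The classification head returns one logit per label (3-way for MNLI):
logits = [3.1, -0.4, -1.7]                  # fabricated values
labels = ["entailment", "neutral", "contradiction"]
prediction = labels[logits.index(max(logits))]
print(prediction)  # entailment
```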

D. Forge-specific usage

  • YAML-driven configuration
  • Config inheritance via inherit
  • --config vs --resume
  • Output directory structure
  • Why run_metadata.json, training_state.json, and output.log matter
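
Conceptually, inherit means "start from the parent config, then let the child override". A sketch of that merge; the function and the config keys below are illustrative stand-ins, not Forge's actual implementation or the real contents of its YAML files:

```python
def deep_merge(base, override):
    # Child values win; nested dicts are merged key by key.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative stand-ins for base.yaml and 00_sft.yaml contents:
base = {"model": {"name": "distilbert-base-uncased", "max_length": 128},
        "training": {"alpha": 0.5, "epochs": 3}}
child = {"training": {"alpha": 0.0}}          # CE-only override

config = deep_merge(base, child)
print(config["training"]["alpha"])       # 0.0 (child wins)
print(config["model"]["max_length"])     # 128 (inherited from base)
```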

Learning Path By File

Phase 1: Files to understand first

  • README.md
  • configs/classification/mnli/distilbert/00_sft.yaml
  • configs/classification/mnli/distilbert/base.yaml
  • configs/classification/mnli/base.yaml
  • learn/forge-high-level.canvas

Goal:

  • understand the example run at a high level

Phase 2: Main runtime path

  • forge.py
  • src/data/datasets.py
  • src/tasks/registry.py
  • src/models/loader.py

Goal:

  • understand where config, dataset, and model come from

Phase 3: Training behavior

  • src/training/trainer.py
  • src/evaluation/evaluate.py

Goal:

  • understand SFT first
  • understand KD second

Phase 4: Infra behavior

  • src/training/callbacks.py
  • src/utils/checkpoint.py
  • scripts/slurm/run_training.sh

Goal:

  • understand how long-running jobs survive interruptions and resume

Phase 5: Later topics

  • src/training/rewards.py
  • RL configs
  • workbench/

Important:

  • RL comes later
  • Workbench comes later
  • Do not block on GRPO right now

Questions You Should Be Able To Answer

  • Where do the model and dataset come from?
  • What key fields in 00_sft.yaml control behavior?
  • What is the difference between task, model, and training mode?
  • What major stages happen in one Forge run?
  • What is the difference between --config and --resume?
  • What files appear in the output directory, and why?
  • Why did the first run stop after 10 steps?
  • Why was this run SFT and not KD?

Minimal Practical Tasks

Task 1: Run environment check

  • Run:
python test_setup.py

Task 2: Run a 10-step smoke test

  • Run:
python forge.py --config configs/classification/mnli/distilbert/00_sft.yaml --max_steps 10

Task 3: Inspect the first run outputs

  • Open run_metadata.json
  • Open output.log
  • Open training_state.json
  • Open checkpoint-10/
  • Explain what each file means

Current important files:

  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/run_metadata.json
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/training_state.json
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/output.log
  • outputs/classification/mnli/distilbert/00_sft/20260310-0138Z/checkpoint-10/

Task 4: Explain the config in plain English

  • Explain what inherit: base.yaml does
  • Explain what model.name means
  • Explain why teacher is present but unused
  • Explain why alpha = 0.0 means CE-only
  • Explain what max_length = 128 means
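
Why alpha = 0.0 means CE-only: knowledge distillation commonly blends the plain cross-entropy loss with a teacher-matching (KD) loss. Below is that generic blend as a sketch; check src/training/trainer.py for the exact formulation Forge uses.

```python
def combined_loss(ce_loss, kd_loss, alpha):
    # A common knowledge-distillation blend; not necessarily Forge's
    # exact formula.
    return alpha * kd_loss + (1.0 - alpha) * ce_loss

ce, kd = 0.7, 1.3   # made-up loss values

assert combined_loss(ce, kd, alpha=0.0) == ce   # alpha = 0.0 -> pure CE (this run)
print(combined_loss(ce, kd, alpha=0.5))         # 1.0 -> even CE/KD mix
```

With alpha pinned to 0.0 the teacher's term is multiplied away, which is why the teacher can be present in the config yet unused.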

Task 5: Explain the MNLI task

  • Explain what premise is
  • Explain what hypothesis is
  • Explain the 3 labels:
    • entailment
    • neutral
    • contradiction
  • Explain what the model is predicting
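
One way to internalize the three labels is a fixed premise with three hypotheses, one per label. The pairs below are invented; the label ids follow the glue/mnli convention.

```python
# glue/mnli label ids and their meanings, with invented example pairs:
labels = {0: "entailment", 1: "neutral", 2: "contradiction"}

examples = [
    ("The dog is sleeping on the couch.", "An animal is resting.", 0),
    ("The dog is sleeping on the couch.", "The dog is brown.", 1),
    ("The dog is sleeping on the couch.", "The dog is running outside.", 2),
]

for premise, hypothesis, label_id in examples:
    print(f"{labels[label_id]:13s} premise={premise!r} hypothesis={hypothesis!r}")
```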

Suggested Order For The Next Few Sessions

Session 1

  • Run the project
  • Find the output directory
  • Understand what a run is

Session 2

  • Understand:
    • task
    • dataset
    • model
    • training mode
  • Explain MNLI + DistilBERT + SFT

Session 3

  • Read forge.py for flow only
  • Do not try to understand every function
  • Identify:
    • where config is loaded
    • where dataset is loaded
    • where model is loaded
    • where training starts

Session 4

  • Read src/data/datasets.py
  • Read src/tasks/registry.py
  • Understand how MNLI is loaded and tokenized

Session 5

  • Read src/models/loader.py
  • Understand what from_pretrained(...) means
  • Understand why DistilBERT is used as the student

Session 6

  • Read checkpoint / resume logic
  • Understand:
    • training_state.json
    • latest_checkpoint
    • checkpoint-*

PR-Readiness Checklist

  • I can explain the full run flow in plain language
  • I can explain the difference between task, model, and training mode
  • I can reproduce a short run and locate its artifacts
  • I can identify where a config value affects runtime behavior
  • I can describe one safe, small change
  • I can explain how I would validate that change

Notes

  • Forge is centered on model training, not only LLMs
  • The current learning priority is:
    • SFT first
    • KD second
    • RL later
  • Do not learn all of ML before reading this repo
  • Learn concepts only when they unblock understanding of the current code path