Knowledge Distillation (KD) transfers behavior from a larger teacher model to a smaller student model.
Core idea
- Teacher model produces informative outputs (soft targets).
- Student model learns from both:
  - hard labels (ground truth), and
  - teacher signals (soft probabilities or logits).
- Goal: keep most quality while reducing model size and latency.
Why KD works
- Hard labels only say which class is correct.
- Teacher distributions also reveal which classes the teacher finds similar or easily confused.
- This richer signal helps the student generalize better than training on hard labels alone.
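A tiny numeric sketch of this point, using hypothetical three-class teacher logits (the class names and numbers are illustrative, not from the text):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car].
teacher_logits = [4.0, 3.0, -2.0]

hard_label = [1, 0, 0]             # one-hot: only says "cat"
soft = softmax(teacher_logits, T=2.0)

# The soft targets additionally reveal that "dog" is far more
# plausible than "car" for this input -- information the one-hot
# label throws away.
print([round(p, 3) for p in soft])
```

The hard label alone would give the student an identical training signal for a clear cat photo and a borderline cat/dog photo; the soft distribution distinguishes them.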
Typical objective
- Combine two losses:
  - a supervised loss on the ground truth,
  - a distillation loss matching the teacher's output distribution.
- A weighted sum balances task fidelity and teacher imitation.
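The weighted objective above can be sketched in plain Python. This is a minimal sketch: the weight `alpha`, the temperature `T`, and the `T**2` scaling are conventional choices from common KD practice, not specified by the notes.

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_idx):
    """Supervised loss: negative log-probability of the true class."""
    return -math.log(probs[target_idx])

def kl_div(p, q):
    """KL(p || q): how far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kd_loss(student_logits, teacher_logits, target_idx, alpha=0.5, T=2.0):
    # Task-fidelity term: hard-label cross-entropy at T = 1.
    hard = cross_entropy(softmax(student_logits), target_idx)
    # Teacher-imitation term: match the softened teacher distribution.
    # The T**2 factor is the usual correction that keeps gradient
    # magnitudes comparable across temperatures.
    soft = kl_div(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    return alpha * hard + (1 - alpha) * soft
```

Usage: `kd_loss([2.0, 1.0, -1.0], [4.0, 3.0, -2.0], target_idx=0)` returns the blended loss; `alpha` trades off task fidelity against teacher imitation.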
Generic workflow
train/choose teacher model
-> run teacher on training data
-> collect soft targets
-> train student with hard + soft targets
-> deploy student model
In this project context
- DistilBERT is a student model distilled from a larger BERT-style teacher.
- KD is the reason DistilBERT stays efficient while preserving strong language understanding.
- For downstream tasks such as MNLI, Supervised Fine-Tuning (SFT) can be applied after distillation.
Benefits
- Smaller model size.
- Faster inference.
- Lower compute and memory cost.
- Better deployability on constrained hardware.
Limitations
- Student still has a quality ceiling below the teacher.
- Distillation setup adds training complexity.
- Poor teacher quality limits student quality.
When to use
- Inference speed and resource efficiency matter.
- You already have a strong teacher model.
- You need a practical production model without full-size cost.