Knowledge Distillation (KD) transfers behavior from a larger teacher model to a smaller student model.

Core idea

  • Teacher model produces informative outputs (soft targets).
  • Student model learns from both:
    • hard labels (ground truth), and
    • teacher signals (soft probabilities or logits).
  • Goal: retain most of the teacher's quality while reducing model size and latency.

Why KD works

  • Hard labels only identify the correct class.
  • Teacher output distributions also encode which classes the teacher finds similar or easily confused (sometimes called "dark knowledge").
  • This richer signal helps student models generalize better than training on hard labels alone.
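A minimal sketch of why soft targets carry extra signal. The logits and class names below are hypothetical; the point is that a temperature-scaled softmax (temperature T > 1 flattens the distribution) exposes which wrong classes the teacher considers plausible, information a one-hot hard label discards.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car]:
logits = [5.0, 3.0, -2.0]
hard_like = softmax(logits, temperature=1.0)   # sharp: mass on "cat"
soft = softmax(logits, temperature=4.0)        # softened: "dog" visibly plausible
```

At T = 1 nearly all probability sits on the top class; at T = 4 the near-miss class ("dog") receives noticeably more mass than the unrelated one ("car"), which is exactly the similarity structure the student can learn from.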

Typical objective

  • Combine two losses:
    • supervised loss on ground truth,
    • distillation loss matching teacher output distribution.
  • A weighted sum balances task fidelity and teacher imitation.
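The weighted sum above can be sketched as follows. This is a stdlib-only illustration, not a project implementation: the function name, the weight `alpha`, and the temperature `T` are hypothetical, and the soft term follows the common convention of scaling the KL divergence by T² so its gradient magnitude stays comparable across temperatures.

```python
import math

def log_softmax(logits, T=1.0):
    """Log of temperature-scaled softmax, computed stably via log-sum-exp."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]

def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, T=2.0):
    """alpha * hard cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Supervised term: cross-entropy against the ground-truth class.
    hard = -log_softmax(student_logits)[true_class]
    # Distillation term: KL divergence between softened distributions.
    t_log = log_softmax(teacher_logits, T)
    s_log = log_softmax(student_logits, T)
    kl = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_log, s_log))
    soft = (T * T) * kl
    return alpha * hard + (1 - alpha) * soft
```

Note the sanity check built into the formulation: when student and teacher logits agree, the KL term vanishes and only the supervised term remains.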

Generic workflow

train/choose teacher model
  -> run teacher on training data
  -> collect soft targets
  -> train student with hard + soft targets
  -> deploy student model
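The "run teacher, collect soft targets" steps above are often done offline, so the teacher never runs during student training. A minimal sketch, assuming a teacher callable that maps an input to logits (the dataset, teacher, and helper names here are stand-ins, not project code):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def collect_soft_targets(teacher, dataset, T=2.0):
    """Run the teacher once over the training set and cache its softened
    output distributions (offline distillation)."""
    return [softmax(teacher(x), T) for x in dataset]

# Hypothetical stand-ins: two 2-feature inputs, a "teacher" emitting 3 logits.
dataset = [[0.1, 0.2], [0.3, 0.4]]
teacher = lambda x: [sum(x), x[0] - x[1], 0.0]
soft_targets = collect_soft_targets(teacher, dataset)
```

The cached `soft_targets` are then paired with the hard labels in the student's training loop, which is where the combined loss from the previous section is applied.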

In this project context

Benefits

  • Smaller model size.
  • Faster inference.
  • Lower compute and memory cost.
  • Better deployability on constrained hardware.

Limitations

  • The student's quality typically remains capped below the teacher's.
  • Distillation setup adds training complexity.
  • Poor teacher quality limits student quality.

When to use

  • Inference speed and resource efficiency matter.
  • You already have a strong teacher model.
  • You need a practical production model without full-size cost.