Knowledge Distillation (KD) transfers behavior from a larger teacher model to a smaller student model.
Core idea
- Teacher model produces informative outputs (soft targets).
- Student model learns from both:
  - hard labels (ground truth), and
  - teacher signals (soft probabilities or logits).
- Goal: keep most quality while reducing model size and latency.
Why KD works
- Hard labels only say which class is correct.
- Teacher distributions also reveal which classes the teacher finds similar or easily confused.
- This richer signal helps the student generalize better than training on hard labels alone.
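A tiny numeric sketch of this point, using hypothetical three-class teacher logits (the class names and numbers are illustrative, not from the text):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car].
teacher_logits = [4.0, 3.0, -2.0]

hard_label = [1, 0, 0]             # one-hot: only says "cat"
soft = softmax(teacher_logits, T=2.0)

# The soft targets additionally reveal that "dog" is far more
# plausible than "car" for this input -- information the one-hot
# label throws away.
print([round(p, 3) for p in soft])
```

The hard label alone would give the student an identical training signal for a clear cat photo and a borderline cat/dog photo; the soft distribution distinguishes them.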
Typical objective
- Combine two losses:
  - a supervised loss on the ground truth,
  - a distillation loss matching the teacher's output distribution.
- A weighted sum balances task fidelity and teacher imitation.
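The weighted objective above can be sketched in plain Python. This is a minimal sketch: the weight `alpha`, the temperature `T`, and the `T**2` scaling are conventional choices from common KD practice, not specified by the notes.

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_idx):
    """Supervised loss: negative log-probability of the true class."""
    return -math.log(probs[target_idx])

def kl_div(p, q):
    """KL(p || q): how far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kd_loss(student_logits, teacher_logits, target_idx, alpha=0.5, T=2.0):
    # Task-fidelity term: hard-label cross-entropy at T = 1.
    hard = cross_entropy(softmax(student_logits), target_idx)
    # Teacher-imitation term: match the softened teacher distribution.
    # The T**2 factor is the usual correction that keeps gradient
    # magnitudes comparable across temperatures.
    soft = kl_div(softmax(teacher_logits, T), softmax(student_logits, T)) * T * T
    return alpha * hard + (1 - alpha) * soft
```

Usage: `kd_loss([2.0, 1.0, -1.0], [4.0, 3.0, -2.0], target_idx=0)` returns the blended loss; `alpha` trades off task fidelity against teacher imitation.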
Generic workflow
train/choose teacher model
-> run teacher on training data
-> collect soft targets
-> train student with hard + soft targets
-> deploy student model
In this project context
- DistilBERT is a student model distilled from a larger BERT-style teacher.
- KD is the reason DistilBERT stays efficient while preserving strong language understanding.
- For downstream tasks such as MNLI, Supervised Fine-Tuning (SFT) can be applied after distillation.
Benefits
- Smaller model size.
- Faster inference.
- Lower compute and memory cost.
- Better deployability on constrained hardware.
Limitations
- Student still has a quality ceiling below the teacher.
- Distillation setup adds training complexity.
- Poor teacher quality limits student quality.
When to use
- Inference speed and resource efficiency matter.
- You already have a strong teacher model.
- You need a practical production model without full-size cost.