DistilBERT is a compressed, encoder-only Transformer for text understanding, produced by distilling BERT.

Core idea

DistilBERT is trained with knowledge distillation: a compact student network learns to reproduce the behavior of a larger, pretrained teacher (BERT), retaining most of its accuracy at a fraction of the size.

DistilBERT vs BERT

  • BERT: the teacher model (larger pretrained baseline).
  • DistilBERT: the student model (smaller compressed version).
  • During training, the student learns to match the teacher's output distributions (soft labels), not just the hard training labels.
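The soft-label objective above can be sketched as a temperature-scaled cross-entropy between teacher and student distributions. This is a minimal illustration, not DistilBERT's full training loss (the paper combines this term with a masked-language-modeling loss and a cosine embedding loss); the temperature value and toy logits are assumptions.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T softens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the softened teacher distribution (soft targets)
    and the softened student distribution. The T**2 factor keeps gradient
    magnitudes comparable across temperatures."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = [math.log(p) for p in softmax(student_logits, T)]
    return -(T ** 2) * sum(p * lp for p, lp in zip(p_teacher, log_p_student))

# Toy check: a student that mimics the teacher incurs a lower loss
teacher = [4.0, 1.0, -2.0]
close_student = [3.8, 1.1, -1.9]
far_student = [-2.0, 1.0, 4.0]
print(distillation_loss(close_student, teacher) <
      distillation_loss(far_student, teacher))
```

Minimizing this term pushes the student's full output distribution toward the teacher's, which carries more information than the single correct label alone.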

Why use DistilBERT

  • Faster inference.
  • Lower compute cost.
  • Smaller model size (roughly 40% fewer parameters than BERT-base).
  • Practical deployment on constrained hardware.
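The size claim above can be made concrete with the commonly cited approximate parameter counts (these round figures are assumptions, not exact counts):

```python
# Approximate, commonly cited parameter counts (assumed round figures)
BERT_BASE_PARAMS = 110_000_000    # 12 encoder layers
DISTILBERT_PARAMS = 66_000_000    # 6 encoder layers

size_ratio = DISTILBERT_PARAMS / BERT_BASE_PARAMS
print(f"DistilBERT keeps ~{size_ratio:.0%} of BERT-base's parameters")
# i.e. roughly a 40% reduction in model size
```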

Architecture position

  • It is built from the same Transformer encoder layers as BERT, but with half as many (6 instead of BERT-base's 12).
  • It remains a text-understanding model that produces contextual token embeddings, not a decoder-based text generator.
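The encoder position described above can be illustrated with the core operation of an encoder layer: bidirectional self-attention. This is a deliberately simplified single-head sketch (no multi-head split, masking, residuals, or layer norm; the toy dimensions are assumptions). The key point is that there is no causal mask, so every token attends to every other token, and the output is one contextual embedding per input token rather than generated text.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, wq, wk, wv):
    """Single-head bidirectional self-attention, the core op of a
    Transformer encoder layer. No causal mask: each token sees the
    full sequence, which is what makes this an encoder, not a decoder."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v     # one contextual embedding per input token

seq_len, d = 4, 8                          # toy sizes (assumptions)
x = rng.normal(size=(seq_len, d))          # stand-in token embeddings
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)   # same sequence length in and out: no generation step
```

An encoder stack like this ends in task heads (classification, token tagging, question answering) rather than in an autoregressive decoding loop.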