DistilBERT is a distilled, encoder-only Transformer model for text understanding.
Core idea
- DistilBERT is a compressed version of the BERT model.
- It is trained with knowledge distillation (KD), where a small student model learns from a large teacher model.
- Goal: retain most of BERT's language-understanding performance at lower cost; the original paper reports roughly 97% of BERT's performance with 40% fewer parameters and about 60% faster inference.
DistilBERT vs BERT
- BERT: the teacher model (the larger baseline).
- DistilBERT: the student model (the smaller, compressed version).
- During training, the student learns to imitate the teacher's output distributions, not just the hard labels.
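The imitation above is usually implemented with a temperature-softened cross-entropy between teacher and student outputs. This is a minimal pure-Python sketch of that loss, not DistilBERT's actual training code (which also adds a masked-language-modeling loss and a cosine embedding loss); the function names and example logits are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative preferences among wrong classes ("dark knowledge").
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy of the student's softened distribution against the
    # teacher's (equivalent to KL divergence up to a constant in the student).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]
matching = distillation_loss(teacher, [4.0, 1.0, 0.2])
mismatch = distillation_loss(teacher, [0.2, 1.0, 4.0])
```

A student whose logits match the teacher's incurs a lower loss than one that disagrees, which is exactly the pressure that makes the student imitate the teacher.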
Why use DistilBERT
- faster inference
- lower compute cost
- smaller model size
- practical deployment on constrained hardware
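The size advantage can be sanity-checked with a back-of-the-envelope parameter count. This sketch uses standard BERT-base dimensions and ignores biases, layer norms, and the pooler, so the totals are rough estimates, not exact published counts:

```python
def transformer_encoder_params(vocab=30522, max_pos=512, hidden=768,
                               layers=12, ffn=3072):
    # Rough estimate of encoder parameters, ignoring biases and layer norms.
    embeddings = (vocab + max_pos + 2) * hidden   # token + position + segment tables
    attention = 4 * hidden * hidden               # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn               # up- and down-projection
    return embeddings + layers * (attention + feed_forward)

bert_base = transformer_encoder_params(layers=12)   # ~109M, close to the ~110M cited
distilbert = transformer_encoder_params(layers=6)   # DistilBERT halves the layer count
```

Halving the layers removes half of the per-layer weights but none of the embedding table, which is why DistilBERT lands around 60% of BERT-base's size (~66M vs ~110M parameters) rather than exactly half.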
Architecture position
- It is still built from Transformer encoder layers (6 layers instead of BERT-base's 12, with the same hidden size).
- It remains a text-understanding model, not a decoder-based generator: it produces contextual representations of the input rather than generating text token by token.
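To make the encoder-only point concrete, here is a toy sketch of how such a model is typically used: the encoder yields one vector per token, which is pooled into a sentence vector and fed to a small classification head. The vectors and weights below are made-up placeholders, not real DistilBERT outputs:

```python
import math

def mean_pool(token_vectors):
    # An encoder emits one contextual vector per input token; averaging them
    # is one common way to get a single sentence representation.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def classify(sentence_vector, weights, bias=0.0):
    # Toy linear head with a sigmoid: encoder-only models attach heads like
    # this for classification instead of generating an answer as text.
    score = sum(w * x for w, x in zip(weights, sentence_vector)) + bias
    return 1.0 / (1.0 + math.exp(-score))

tokens = [[1.0, 0.0], [0.0, 1.0]]        # pretend per-token encoder outputs
prob = classify(mean_pool(tokens), [2.0, 2.0])
```

Contrast this with a decoder-based generator, which would instead emit output tokens one at a time; the encoder-only design is what makes DistilBERT a fit for classification, tagging, and retrieval rather than open-ended generation.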