Topic 30: Everything You Need to Know about Knowledge Distillation

This is one of the hottest topics thanks to DeepSeek. Learn with us: the core idea, its types, scaling laws, real-world cases and useful resources to dive deeper

In the previous episode, we discussed Hugging Face's "Smol" family of models and their effective strategy for training small LMs through high-quality dataset mixing. Today we want to go further in exploring training techniques for smaller models, making it the perfect time to discuss knowledge distillation (KD). Proposed a decade ago, this method has continued to evolve. For example, DeepSeek's advancements, particularly the effective distillation of DeepSeek-R1, have recently brought a wave of attention to this approach.

So, what is the key idea behind knowledge distillation? It enables the transfer of knowledge from a larger model, called the teacher, to a smaller one, called the student. This process allows smaller models to inherit the strong capabilities of larger ones, avoiding the need to train from scratch and making powerful models more accessible. Let's explore how knowledge distillation has evolved over time, the different types of distillation that exist today, the key factors to consider for effective model distillation, and useful resources to master it.

In today's episode, we will cover:

  • When did knowledge distillation appear as a technique?

  • A detailed explanation of knowledge distillation

  • Types of knowledge distillation

  • Improved algorithms

  • Distillation scaling laws

  • Benefits

  • Not without limitations

  • Real-world effective use cases (why OpenAI got mad at DeepSeek)

  • Conclusion

  • Sources and further reading: explore the references used to write this article and dive deeper with all the links provided in this section

When did knowledge distillation appear as a technique?

The ideas behind knowledge distillation (KD) date back to 2006, when Bucilă, Caruana, and Niculescu-Mizil showed in their work "Model Compression" that an ensemble of models could be compressed into a single smaller model without much loss in accuracy. They demonstrated that a cumbersome model (like an ensemble) could be effectively replaced by a lean model that was easier to deploy.

Later, in 2015, Geoffrey Hinton, Oriol Vinyals, and Jeff Dean coined the term "distillation" in their paper "Distilling the Knowledge in a Neural Network". The term referred to the process of transferring knowledge from a large, complex AI model or ensemble to a smaller, faster model, called the distilled model. Instead of training the smaller model only on the correct answers, the researchers proposed also giving it the probability distribution produced by the large model. This helps the smaller model learn not just what the right answer is, but also how confident the big model is about each option. This training concept is closely connected to the softmax function, so let's explore more precisely how this all works at the core.
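To make this concrete, here is a minimal sketch of Hinton-style distillation in PyTorch, assuming a simple classification setting: the student is trained on a weighted mix of ordinary cross-entropy against the ground-truth labels and a KL-divergence term that pulls its temperature-softened outputs toward the teacher's. The temperature T and mixing weight alpha below are illustrative choices, not values prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD loss: soft-target KL term + hard-label cross-entropy.
    T and alpha are illustrative hyperparameters, not prescribed values."""
    # Soften both distributions with temperature T, then measure how far the
    # student's distribution is from the teacher's (kl_div expects
    # log-probabilities for the input and probabilities for the target).
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: a batch of 8 examples over 10 classes; the teacher is frozen,
# so its logits would come from a forward pass with no gradient tracking.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The T² factor compensates for the 1/T² scaling that the temperature introduces into the gradients of the soft-target term, keeping the two loss components on a comparable scale, as noted in the original paper.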

Image Credit: "Knowledge Distillation: A Survey" paper

A detailed explanation of knowledge distillation

First, we need to clarify what softmax is.
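As a quick preview, the toy Python snippet below (made-up logits, illustrative temperature) shows what softmax does and why distillation divides the logits by a temperature before applying it: a higher temperature flattens the distribution, so the probabilities the teacher assigns to the "wrong" classes become visible enough for the student to learn from.

```python
import torch
import torch.nn.functional as F

# Toy logits from some classifier over three classes (values for illustration).
logits = torch.tensor([4.0, 1.0, 0.2])

# Plain softmax turns logits into a probability distribution that sums to 1.
print(F.softmax(logits, dim=-1))      # ≈ [0.933, 0.046, 0.021] -> very peaked

# Temperature-scaled softmax (T > 1) softens the distribution, exposing the
# relative likelihoods of the non-top classes ("dark knowledge").
T = 4.0
print(F.softmax(logits / T, dim=-1))  # ≈ [0.538, 0.254, 0.208] -> much flatter
```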
