Topic 30: Everything You Need to Know about Knowledge Distillation
This is one of the hottest topics thanks to DeepSeek. Learn with us: the core idea, its types, scaling laws, real-world cases and useful resources to dive deeper
In the previous episode, we discussed Hugging Face's "Smol" family of models and their effective strategy for training small LMs through high-quality dataset mixing. Today we want to go further in exploring training techniques for smaller models, making it the perfect time to discuss knowledge distillation (KD). Proposed a decade ago, this method has continued to evolve. For example, DeepSeek's advancements, particularly the effective distillation of DeepSeek-R1, have recently brought a wave of attention to this approach.
So, what is the key idea behind knowledge distillation? It enables the transfer of knowledge from a larger model, called the teacher, to a smaller one, called the student. This process allows smaller models to inherit the strong capabilities of larger ones, avoiding the need to train from scratch and making powerful models more accessible. Let's explore how knowledge distillation has evolved over time, the different types of distillation that exist today, the key factors to consider for effective model distillation, and useful resources to master it.
In today's episode, we will cover:
When did knowledge distillation appear as a technique?
A detailed explanation of knowledge distillation
Types of knowledge distillation
Improved algorithms
Distillation scaling laws
Benefits
Not without limitations
Real-world effective use cases (why OpenAI got mad at DeepSeek)
Conclusion
Sources and further reading: explore the references used to write this article and dive deeper with all the links provided in this section
When did knowledge distillation appear as a technique?
The ideas behind knowledge distillation (KD) date back to 2006, when Bucilă, Caruana, and Niculescu-Mizil, in their work "Model Compression," showed that an ensemble of models could be compressed into a single smaller model without much loss in accuracy. They demonstrated that a cumbersome model (like an ensemble) could be effectively replaced by a lean model that was easier to deploy.
Later, in 2015, Geoffrey Hinton, Oriol Vinyals, and Jeff Dean coined the term "distillation" in their "Distilling the Knowledge in a Neural Network" paper. The term refers to the process of transferring knowledge from a large, complex model or ensemble to a smaller, faster model, called the distilled model. Instead of training the smaller model only on the correct answers, the researchers proposed giving it the probability distribution produced by the large model. This helps the smaller model learn not just what the right answer is, but also how confident the big model is about each option. This training concept is closely connected to the softmax function, so let's explore more precisely how it all works at the core.
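To make the idea concrete, here is a minimal sketch of this soft-target training in PyTorch. It is an illustration only: the function name distillation_loss, the temperature T, and the weighting factor alpha are our assumptions, not code from the paper or from DeepSeek. It simply combines a temperature-softened teacher/student match with the usual cross-entropy on hard labels.

```python
# Minimal sketch of Hinton-style knowledge distillation (illustrative assumptions,
# not the original authors' code).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target loss (teacher vs. student) with hard-label cross-entropy."""
    # Soften both distributions with temperature T, then match them with KL divergence.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy on the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: a batch of 4 examples with 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The T * T factor follows Hinton et al.'s observation that gradients from the soft targets scale as 1/T², so multiplying by T² keeps their magnitude comparable to the hard-label term as the temperature changes.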

Image Credit: "Knowledge Distillation: A Survey" paper
A detailed explanation of knowledge distillation
First, we need to clarify what softmax is.
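As a quick primer (a small illustrative sketch, not taken from the article), softmax turns a vector of raw scores, or logits, into a probability distribution, and dividing the logits by a temperature T > 1 softens that distribution. This is exactly how the teacher's "soft targets" above are produced.

```python
# Illustrative NumPy helper (an assumption for this sketch, not the article's code).
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [5.0, 2.0, 0.5]
print(softmax(logits))         # sharp distribution: roughly [0.94, 0.05, 0.01]
print(softmax(logits, T=4.0))  # softer: the probabilities move closer together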