Papers
arxiv:2602.01395

Rethinking Selective Knowledge Distillation

Published on Feb 1
· Submitted by
Almog Tavor
on Feb 3
Authors:
,

Abstract

Selective knowledge distillation in autoregressive language models using student-entropy-guided position selection improves accuracy and efficiency while reducing memory and storage requirements.

AI-generated summary

Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.

Community

Paper author Paper submitter

Huggingface needs to add a down vote button. Become Human first and cut ties with occupation and zionism than your work can be regarded. Researchers are human first and institutes corps, orgs , or whatever that violates human rights specifically speak about Palestine case here are disregarded and ignored.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.01395 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.01395 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.