Exploring Adaptive Inter-Sample Relationship in Data-Free Knowledge Distillation

In recent years, concerns such as privacy protection and the cost of large-scale data transmission have increasingly rendered training data inaccessible. Knowledge Distillation (KD) is a method for training a lightweight model (the student) to learn from a large pre-trained model (the teacher), but traditional KD requires access to the original training data, which is impractical in these scenarios. Data-Free Knowledge Distillation (DFKD) methods were proposed to address this problem. This paper proposes a new DFKD method, Adaptive Data-Free Knowledge Distillation (AdaDFKD), which aims to overcome two limitations of existing DFKD methods: static target distributions and purely instance-level distribution learning. AdaDFKD establishes and exploits relationships between pseudo-samples so that the generation process adapts to the student model, thereby mitigating both limitations.

Research Background

In practical applications involving privacy protection or limited data transmission, the original training data are often inaccessible, rendering traditional KD methods inapplicable. DFKD emerged to address this issue: instead of real data, it optimizes a generative model to produce pseudo-samples and uses those pseudo-samples to train the student model. However, existing DFKD methods typically rely on a static target distribution and learn only instance-level distributions, which makes them overly dependent on the pre-trained teacher model and undermines their robustness.

Research Objectives

The purpose of this study is to propose a new DFKD method that optimizes the pseudo-sample objective in both the generation and training phases. By adopting a dynamic, adaptive target distribution, it improves how well the generated pseudo-samples fit the student model, ultimately improving the performance and robustness of DFKD.

Research Origin

The authors of this paper are Jingru Li, Sheng Zhou, Liangcheng Li, Haishuai Wang, Jiajun Bu, and Zhi Yu, all from the College of Computer Science and Technology at Zhejiang University. This paper was published in the journal Neural Networks.

Research Content

Research Process

The overall research process includes two main phases: a generation phase and a training phase. In the generation phase, a pseudo-sample generation module produces pseudo-samples and shapes the distribution of their representations. In the training phase, the generated pseudo-samples are used to optimize the weights of the student model. A minimal code sketch of both phases follows the list below.

  1. Generation Phase:

    • Pseudo-samples are generated by a generator.
    • A Relationship Refinement Module (R2M) is defined to optimize the pseudo-sample generation process.
    • A progressive conditional distribution over negative samples is learned, and the log-likelihood of similarities between pseudo-samples is maximized.
  2. Training Phase:

    • The generated pseudo-samples are used to train the student model.
    • During training, the student model distills the knowledge stored in the teacher model's pre-trained weights.
    • The alignment between the student model and the teacher model is strengthened by adaptively adjusting the relationships between pseudo-samples, ultimately improving the distillation effect.
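
To make the two phases concrete, below is a minimal PyTorch-style sketch of one DFKD iteration. It is an illustration under stated assumptions, not the authors' implementation: the exact R2M objective is simplified to an InfoNCE-style relational loss, and names such as relational_loss, dfkd_step, latent_dim, and the assumption that teacher and student return (logits, features) pairs are hypothetical.

```python
# Minimal PyTorch-style sketch of one DFKD iteration in the spirit of
# AdaDFKD. Illustrative only: the exact R2M objective, architectures,
# and hyperparameters are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def relational_loss(f_t, f_s, tau=0.1):
    """InfoNCE-style log-likelihood over pseudo-sample relationships:
    each sample's own teacher/student feature pair is the positive;
    pairs with other samples in the batch act as negatives."""
    zt = F.normalize(f_t, dim=1)
    zs = F.normalize(f_s, dim=1)
    logits = zs @ zt.t() / tau                   # (B, B) pairwise similarities
    labels = torch.arange(zs.size(0), device=zs.device)
    return F.cross_entropy(logits, labels)       # = -E[log p(positive | batch)]

def dfkd_step(generator, teacher, student, g_opt, s_opt,
              batch_size=128, latent_dim=100, kd_tau=4.0):
    """One generation-phase update followed by one training-phase update.
    Assumes teacher(x) and student(x) each return (logits, features) and
    that the teacher's weights are frozen."""
    teacher.eval()

    # ---- Generation phase: refine pseudo-samples ----
    z = torch.randn(batch_size, latent_dim)
    x = generator(z)
    t_logits, t_feat = teacher(x)
    s_logits, s_feat = student(x)
    pseudo_y = t_logits.argmax(dim=1)
    g_loss = (F.cross_entropy(t_logits, pseudo_y)   # push confident teacher outputs
              + relational_loss(t_feat, s_feat))    # R2M-like relational term
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # ---- Training phase: distill the student on fresh pseudo-samples ----
    with torch.no_grad():
        x = generator(torch.randn(batch_size, latent_dim))
        t_logits, _ = teacher(x)
    s_logits, _ = student(x)
    kd_loss = F.kl_div(F.log_softmax(s_logits / kd_tau, dim=1),
                       F.softmax(t_logits / kd_tau, dim=1),
                       reduction='batchmean') * kd_tau ** 2
    s_opt.zero_grad(); kd_loss.backward(); s_opt.step()
    return g_loss.item(), kd_loss.item()
```

The design point mirrored here is that the relational term ties the generator's objective to the student's current representations, so the target distribution adapts as the student improves rather than staying static.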

Main Results

  1. Summary of Experimental Results:
    • Across multiple benchmark datasets, teacher-student model pairs, and evaluation metrics, AdaDFKD outperformed existing state-of-the-art DFKD methods.
    • By generating pseudo-samples that range from “easy to distinguish” to “difficult to distinguish,” AdaDFKD effectively improved the quality of pseudo-samples and gradually optimized the target distribution to better adapt to the student model.
    • The R2M module enhanced the similarity between pseudo-samples, further stabilizing the knowledge transfer between models.
    • The study systematically explored ideas from contrastive learning and unsupervised representation learning and applied them to the design and optimization of DFKD.

Representative results are summarized in the following table (all accuracies in %):

| Teacher | Student | Compression Ratio | Vanilla Teacher | Vanilla Student | DAFL | ZSKT | ADI | DFQ | CMI | PRE-DFKD | CuDFKD | AdaDFKD (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet34 | ResNet18 | 1.90 | 95.70 | 94.23 | 92.22 | 91.60 | 93.26 | 94.61 | 94.84 | 91.65 | 95.28 | 95.32 |
| WRN40x2 | WRN40x1 | 3.98 | 94.87 | 91.21 | 84.22 | 86.07 | 87.18 | 91.69 | 92.78 | 86.68 | 93.18 | 93.38 |
  2. Robustness Testing:
    • In scenarios involving "noisy" teacher models, AdaDFKD showed minimal performance degradation, indicating strong robustness to teacher noise.
    • In experiments with teachers trained on varying proportions of random labels, AdaDFKD still displayed strong decoupling and mode-transfer capabilities.

Conclusions

  1. Scientific Value:

    • This study proposes a new DFKD method that addresses the limitations of static target distributions and dependence on instance-level distributions in existing DFKD methods, thereby improving the efficiency and robustness of DFKD methods.
    • By introducing a dynamic relationship term, the study shows that the objectives of both the generation and training phases can be unified as maximizing the mutual information between the teacher's and student's representation distributions, a result supported both theoretically and experimentally (a standard form of this bound is sketched after this list).
  2. Application Value:

    • This method provides a more robust and adaptable solution for DFKD in practical applications requiring privacy protection and large-scale transmission.
    • The application of curriculum learning and contrastive learning ideas to DFKD offers a new perspective and approach for practical applications.
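
For reference, the mutual-information claim above is commonly made rigorous through a contrastive (InfoNCE) lower bound of the following standard form (van den Oord et al., 2018). This is the generic bound, not necessarily the paper's exact formulation; the symbols f_T, f_S, sim, tau, and N are illustrative, as defined in the comments.

```latex
% Standard InfoNCE lower bound on mutual information (van den Oord et al., 2018).
% f_T^{(i)}, f_S^{(i)}: teacher/student representations of pseudo-sample i;
% sim(.,.): a similarity function; tau: a temperature; N: the batch size.
I\big(f_T; f_S\big) \;\ge\; \log N \;+\;
\mathbb{E}\left[
  \log \frac{\exp\!\big(\mathrm{sim}(f_T^{(i)}, f_S^{(i)})/\tau\big)}
            {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(f_T^{(i)}, f_S^{(j)})/\tau\big)}
\right]
```

Maximizing the log-likelihood term on the right therefore tightens a lower bound on the teacher-student mutual information, which is the sense in which a relational objective aligns the two models.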

Highlights

  1. Important Findings:

    • This method surpasses existing state-of-the-art methods across multiple benchmarks and model pairs, demonstrating its superiority and innovation.
    • The proposed Relationship Refinement Module improves the quality of pseudo-samples during the generation and training phases, effectively enhancing the knowledge distillation effect.
  2. Novelty of the Method:

    • AdaDFKD dynamically learns targets to generate pseudo-samples from "easy to distinguish" to "difficult to distinguish," enabling the student model to adapt gradually throughout the learning process (see the schedule sketch after this list).
    • The R2M module is novel, incorporating relationship learning ideas from contrastive learning and unsupervised representation learning into DFKD, realizing effective knowledge transfer in both theory and practice.
  3. Uniqueness:

    • This method not only provides a new DFKD framework but also proposes new optimization strategies for existing DFKD methods, which may have a profound impact on future DFKD research and applications.
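
As a concrete, purely hypothetical illustration of the easy-to-hard progression mentioned above, one simple way to realize such a schedule is to anneal the temperature of the relational loss so that inter-sample similarities sharpen over training. The function below is a sketch under that assumption; tau_schedule, its bounds, and the linear form are not taken from the paper.

```python
# Hypothetical easy-to-hard schedule for the relational objective.
# A high temperature keeps similarities soft (pseudo-samples are easy
# to distinguish); annealing it sharpens the similarity distribution
# so relationships become progressively harder. The linear form and
# bounds are assumptions, not the paper's recipe.
def tau_schedule(step, total_steps, tau_start=1.0, tau_end=0.05):
    frac = min(step / max(total_steps, 1), 1.0)
    return tau_start + frac * (tau_end - tau_start)

# Usage with the relational loss sketched earlier:
#   tau = tau_schedule(step, total_steps)
#   loss = relational_loss(t_feat, s_feat, tau=tau)
```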

Additional Information

This paper also surveys related work in contrastive learning and unsupervised representation learning and effectively applies those ideas to the optimization of DFKD, enriching both the theoretical framework and the experimental validation of the study.

Through this study, the authors demonstrate a more efficient and robust DFKD method, offering a valuable reference and source of inspiration for future research in this area.