Deepfake-Adapter: Dual-Level Adapter for Deepfake Detection


Research Background and Problem

With the rapid development of deep generative models, hyper-realistic facial images and videos that can deceive the human eye are now easy to produce. When such technology is maliciously abused, it can cause serious misinformation problems in politics, entertainment, and society. This threat is known as “Deepfake.” To address this security issue, various deepfake detection methods have been proposed, and they perform well when the training and test data share the same manipulation types and quality. However, their performance degrades significantly on unseen or low-quality forged samples. This is mainly because most existing deepfake detection methods focus solely on low-level forgery features such as local textures, blending boundaries, or frequency information, while ignoring the role of high-level semantic information.

High-level semantic information plays a crucial role in deepfake detection. For example, certain face manipulation methods alter generic high-level semantic features like the style and shape of real faces, which are robust to variations in low-level features and thus can serve as important clues for detecting forgeries. Additionally, large-scale pre-trained Vision Transformers (ViTs) have demonstrated remarkable generalization capabilities in computer vision tasks, providing rich semantic representations that offer new possibilities for deepfake detection.

Against this background, the authors propose Deepfake-Adapter, a novel parameter-efficient tuning method that aims to achieve more generalized deepfake detection by integrating high-level semantic information from large-scale pre-trained ViTs with low-level forgery features.


Paper Source

This paper, titled “Deepfake-Adapter: Dual-Level Adapter for Deepfake Detection,” was co-authored by Rui Shao, Tianxing Wu, Liqiang Nie, and Ziwei Liu. The authors are affiliated with the School of Computer Science and Technology at Harbin Institute of Technology (Shenzhen) and the S-Lab at Nanyang Technological University in Singapore. The paper was accepted on September 30, 2024, and published in the prestigious journal International Journal of Computer Vision (IJCV), with the DOI: 10.1007/s11263-024-02274-6.


Research Details

a) Research Workflow

1. Method Overview

The proposed Deepfake-Adapter is a dual-level adapter architecture comprising Globally-Aware Bottleneck Adapters (GBA) and Locally-Aware Spatial Adapters (LSA). The core idea is to leverage high-level semantic information from large-scale pre-trained ViTs and extract global and local low-level forgery features through GBA and LSA modules, enabling efficient deepfake detection.

2. Specific Process

The research is divided into the following steps:

(1) Freezing and Adapting Pre-trained ViT
  • Object and Scale: The study uses a pre-trained ViT-Base model (85.8M parameters) and freezes its backbone.
  • Processing Method: GBA modules are inserted into the MLP layer of each transformer block, and one LSA module is added per stage.
  • Experimental Design: The ViT is divided into three stages, each containing four blocks, with adapter modules introduced at each stage.
(2) Design and Function of GBA Modules
  • Object and Scale: A total of 12 GBA modules are inserted into the 12 MLP layers of the ViT.
  • Processing Method: GBA adopts a bottleneck structure, consisting of a down-projection linear layer, a ReLU activation, and an up-projection linear layer, with a learnable scale factor to adjust the importance of global low-level features.
  • Experimental Design: GBA modules primarily capture global low-level forgery features, such as blending boundaries.
(3) Design and Function of LSA Modules
  • Object and Scale: LSA modules consist of a head part (LSA-H) and an interaction part (LSA-I), totaling three LSAs.
  • Processing Method:
    • Head Part (LSA-H): Uses convolutional operations to extract local low-level forgery features from input images and projects them into a unified dimension.
    • Interaction Part (LSA-I): Employs Multi-Head Cross-Attention (MHCA) mechanisms to enable interaction between LSA features and ViT features.
  • Experimental Design: LSA modules mainly capture local low-level forgery features, such as local textures.
(4) Training and Testing
  • Object and Scale: Experiments were conducted on multiple public datasets, including FaceForensics++ (FF++), Celeb-DF, Deepfake Detection Challenge (DFDC), and DeeperForensics-1.0.
  • Processing Method: The model was trained on the FF++ dataset and tested across other datasets.
  • Experimental Design: End-to-end training was performed using cross-entropy loss and the SGD optimizer.
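The GBA bottleneck in step (2) can be sketched as a minimal NumPy forward pass. This is an illustrative sketch only: the bottleneck width, the zero initialization of the up-projection, and the scale value are assumptions for demonstration, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gba_forward(x, W_down, W_up, scale):
    """Globally-Aware Bottleneck Adapter sketch: down-project, ReLU,
    up-project, then add the scaled result back to the frozen ViT
    features as a residual."""
    h = np.maximum(x @ W_down, 0.0)   # down-projection + ReLU
    return x + scale * (h @ W_up)     # up-projection, scaled residual

d, r = 768, 64                        # ViT-Base hidden size; bottleneck width r is an assumption
x = rng.standard_normal((197, d))     # 196 patch tokens + [CLS]
W_down = rng.standard_normal((d, r)) * 0.02
W_up = np.zeros((r, d))               # zero-init keeps the adapter an identity mapping at the start of training
out = gba_forward(x, W_down, W_up, scale=0.5)
```

Because only `W_down`, `W_up`, and the scale are trainable while the ViT backbone stays frozen, each such adapter adds only a small number of parameters per block.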

3. Novel Methods and Algorithms

  • GBA and LSA Modules: The design of these two modules is the core innovation of this paper, responsible for extracting global and local low-level forgery features.
  • Dual-Level Adapter Architecture: By organically integrating high-level semantic information with low-level forgery features, more generalized forgery representations are achieved.
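The cross-attention at the heart of the LSA interaction part can likewise be sketched in NumPy. This is a single-head simplification of the Multi-Head Cross-Attention described above; the token counts and identity projection weights are placeholder assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, Wq, Wk, Wv):
    """Single-head cross-attention sketch: queries from one feature stream
    attend to keys/values from the other. In LSA-I this lets the local
    conv features (from LSA-H) and the frozen ViT tokens interact."""
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

rng = np.random.default_rng(0)
d = 768
vit_tokens = rng.standard_normal((197, d))   # frozen ViT tokens (196 patches + [CLS])
lsa_tokens = rng.standard_normal((196, d))   # conv features from LSA-H, projected to dimension d
Wq = Wk = Wv = np.eye(d)                     # identity weights, purely for the sketch
updated_lsa = cross_attention(lsa_tokens, vit_tokens, Wq, Wk, Wv)
```

Running the same operation with the roles of the two streams swapped would inject local low-level cues back into the ViT tokens, which is the bidirectional interaction the dual-level design relies on.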

b) Key Research Results

1. Intra-Dataset Evaluation

  • Experimental Setup: Tests were conducted on the C23 (high-quality) and C40 (low-quality) versions of the FF++ dataset.
  • Results:
    • On the C23 version, Deepfake-Adapter achieved near-saturated performance (>99% AUC) for most forgery types.
    • On the C40 version, Deepfake-Adapter achieved a 1%-2% AUC improvement in Deepfakes (DF), FaceSwap (FS), and Face2Face (F2F) forgery types.
  • Analysis: These results indicate that Deepfake-Adapter performs well not only in high-quality forgery detection but also maintains robustness in low-quality forgery detection.

2. Cross-Manipulation Evaluation

  • Experimental Setup: Cross-manipulation tests were conducted among different forgery types within the FF++ dataset.
  • Results:
    • Deepfake-Adapter achieved an average AUC improvement of 5%-6% in cross-manipulation evaluations.
    • In cross-manipulation tests for the Face2Face (F2F) forgery type, Deepfake-Adapter achieved the best average generalization performance.
  • Analysis: These results validate the generalization capability of Deepfake-Adapter on unseen forgery types.

3. Cross-Dataset Evaluation

  • Experimental Setup: The model was trained on the FF++ dataset and tested on the Celeb-DF and DFDC datasets.
  • Results:
    • Deepfake-Adapter achieved 71.74% and 72.66% AUC on the Celeb-DF and DFDC datasets, respectively, outperforming the current best method, Recce, by about 3%.
  • Analysis: These results demonstrate that Deepfake-Adapter’s generalization ability across different datasets significantly surpasses existing methods.

4. Robustness Against Low-Level Perturbations

  • Experimental Setup: Tests were conducted under seven unseen low-level perturbations (e.g., saturation, contrast, noise).
  • Results:
    • Deepfake-Adapter achieved the best or second-best performance under most perturbation conditions.
  • Analysis: These results further prove the robustness of Deepfake-Adapter against unseen low-level perturbations.

c) Research Conclusions and Value

Conclusions

This paper proposes Deepfake-Adapter, a novel parameter-efficient tuning method that achieves more generalized deepfake detection by combining high-level semantic information from large-scale pre-trained ViTs with low-level forgery features.

Scientific Value

  • Theoretical Contribution: This is the first work to introduce adapter techniques into the field of deepfake detection, offering new insights for future research.
  • Methodological Innovation: A novel dual-level adapter architecture effectively integrates global and local forgery features.

Application Value

  • Practical Applications: Deepfake-Adapter excels in cross-dataset and cross-manipulation evaluations, making it suitable for real-world deepfake detection scenarios.
  • Social Significance: It helps combat the misuse of deepfake technology, protecting the public from misinformation.

d) Research Highlights

  1. Key Findings: High-level semantic information plays a crucial role in deepfake detection.
  2. Problem Resolution: Addresses the poor generalization of existing methods on unseen or low-quality forged samples.
  3. Methodological Innovation: Proposes a novel dual-level adapter architecture for parameter-efficient tuning.
  4. Experimental Design: Comprehensive quantitative and qualitative experiments were conducted on multiple public datasets to validate the method’s effectiveness.

e) Other Valuable Information

The paper also explores the impact of different pre-training weights, ViT architectures, and adapter configurations on model performance, further validating the compatibility and robustness of Deepfake-Adapter. Additionally, the authors used Grad-CAM visualization to reveal the model’s decision-making mechanism, highlighting its focus on forged regions.


Summary

“Deepfake-Adapter: Dual-Level Adapter for Deepfake Detection” is a research paper of significant scientific and practical value. By proposing a dual-level adapter architecture, the authors successfully address the generalization issues in deepfake detection, paving the way for future research directions.