Robust Sequential Deepfake Detection
Academic Background
With the rapid development of deep generative models (such as GANs), generating photorealistic facial images has become increasingly easy. However, the misuse of this technology has raised serious security concerns, particularly with the rise of Deepfake technology, which can produce forged images that are almost indistinguishable from real ones and can be used to spread misinformation, fabricate news, and serve other malicious purposes. To address this issue, researchers have proposed various deepfake detection methods, but existing methods focus primarily on detecting single-step facial manipulations. With the proliferation of user-friendly facial editing applications, people can now apply sequential, multi-step manipulations to facial images. Countering this new threat requires the ability to detect the entire sequence of facial manipulations, which is crucial both for flagging deepfake media and for subsequently recovering the original facial images.
Based on this observation, this paper introduces a new research problem—Sequential Deepfake Detection (Seq-Deepfake). Unlike existing deepfake detection tasks that only require binary classification (real/fake), sequential deepfake detection demands the correct prediction of a sequence of facial manipulation operations. To support large-scale research, the authors constructed the first sequential deepfake dataset, which includes facial images manipulated through multi-step operations along with corresponding sequence annotations.
Paper Source
This paper is co-authored by Rui Shao, Tianxing Wu, and Ziwei Liu, affiliated with Harbin Institute of Technology (Shenzhen) and Nanyang Technological University, Singapore. The paper was accepted by Springer Nature’s International Journal of Computer Vision on December 1, 2024, and officially published in 2025.
Research Process and Experimental Design
1. Construction of the Sequential Deepfake Dataset
To support research on sequential deepfake detection, the authors constructed a large-scale sequential deepfake dataset (Seq-Deepfake Dataset). This dataset is based on two representative facial manipulation techniques: Facial Components Manipulation and Facial Attributes Manipulation. Unlike existing deepfake datasets that only provide binary labels, the proposed dataset includes annotations for manipulation sequences of varying lengths.
Facial Components Manipulation: Using the StyleMapGAN model, facial components (e.g., eyes, nose) from a reference image are transplanted onto the original image to generate manipulated images. Each manipulation step corresponds to a specific facial component operation.
Facial Attributes Manipulation: Using the Fine-grained Facial Editing method, facial attributes (e.g., age, smile intensity) are progressively modified to generate manipulated images. Each manipulation step corresponds to a specific facial attribute operation.
The final dataset contains over 85,000 manipulated facial images, covering manipulation sequences ranging from 1 to 5 steps.
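The exact annotation schema is not reproduced in this summary, but a manipulation sequence can be pictured as an ordered list of operation labels padded to the maximum length of five steps. The sketch below is a hypothetical encoding: the component vocabulary, the sample record, and the `encode` helper are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical vocabulary of facial-component operations (illustrative only).
COMPONENTS = ["nose", "eye", "eyebrow", "lip", "hair"]

# A sample pairs a manipulated image with its ordered operation sequence.
sample = {"image": "images/000123.jpg", "sequence": ["eye", "nose"]}

def encode(sequence, vocab, max_len=5, pad_id=0):
    """Map operation labels to integer ids, padded to a fixed length.

    Id 0 is reserved for "no manipulation", so real operations start at 1.
    """
    ids = [vocab.index(op) + 1 for op in sequence]
    return ids + [pad_id] * (max_len - len(ids))

print(encode(sample["sequence"], COMPONENTS))  # [2, 1, 0, 0, 0]
```

Fixed-length padding like this lets a single model handle sequences of any length from 1 to 5 steps, with the pad id marking "no further manipulation".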
2. Design of the Sequential Deepfake Detection Model
The authors proposed a Transformer-based sequential deepfake detection model—SeqFakeFormer. This model frames the sequential deepfake detection task as an Image-to-Sequence task, similar to image captioning. The core idea of SeqFakeFormer is to detect manipulation sequences by extracting spatial relation features from the image and modeling the sequential relations among these features.
Spatial Relation Extraction: First, a convolutional neural network (CNN) is used to extract feature maps from the input image. Then, self-attention mechanisms are employed to extract spatial relations from these feature maps, capturing spatial traces of manipulations.
Sequential Relation Modeling: Through cross-attention mechanisms, the extracted spatial relation features are aligned with the manipulation sequence annotations, modeling the sequential relations of manipulations. To enhance the effectiveness of cross-attention, the authors designed a Spatially Enhanced Cross-Attention (SECA) module, which enriches sequence information by learning a spatial weight map.
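The two stages above can be pictured with a toy NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the SECA module is paraphrased here as a learned per-location bias added to the cross-attention logits, which is one simple way to "enrich sequence information with a spatial weight map".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def seca_cross_attention(queries, features, spatial_logits):
    """Cross-attention whose logits are enriched by a spatial weight map.

    queries:        (T, d)  decoder states for the manipulation sequence
    features:       (HW, d) flattened CNN feature map of the face image
    spatial_logits: (HW,)   hypothetical learned spatial weight map (SECA)
    """
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)  # (T, HW) standard attention logits
    attn = softmax(scores + spatial_logits)     # SECA: bias attention toward manipulated regions
    return attn @ features                      # (T, d) spatially grounded sequence features

T, HW, d = 3, 16, 8
rng = np.random.default_rng(0)
out = seca_cross_attention(rng.normal(size=(T, d)),
                           rng.normal(size=(HW, d)),
                           rng.normal(size=HW))
print(out.shape)  # (3, 8)
```

In the full model, each attended output would feed an autoregressive decoder step that predicts the next manipulation token, exactly as in image captioning.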
3. Enhancing Robustness in Sequential Deepfake Detection
To simulate real-world deepfake data distributions, the authors applied various perturbations (e.g., color distortion, noise, compression) to the original sequential deepfake dataset, creating a more challenging dataset—Seq-Deepfake-P. To address this more challenging scenario, the authors proposed an enhanced model—SeqFakeFormer++. This model introduces Image-Sequence Contrastive Learning (ISC) and Image-Sequence Matching (ISM) modules to further strengthen cross-modal reasoning between images and sequences, enabling more robust sequential deepfake detection under perturbations.
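The ISC objective is not spelled out in this summary; image-sequence contrastive learning of this kind is commonly a symmetric InfoNCE loss over matched (image, sequence) embedding pairs in a batch, which the following NumPy sketch illustrates. The function name, temperature, and loss form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def isc_loss(img_emb, seq_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of img_emb matches row i of seq_emb."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    seq = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    logits = img @ seq.T / temperature           # (N, N) cross-modal similarities
    idx = np.arange(len(logits))
    p_img2seq = softmax(logits, axis=1)[idx, idx]  # prob. of the true sequence per image
    p_seq2img = softmax(logits, axis=0)[idx, idx]  # prob. of the true image per sequence
    return -0.5 * (np.log(p_img2seq) + np.log(p_seq2img)).mean()
```

Pulling matched image and sequence embeddings together (and pushing mismatched pairs apart) is what strengthens cross-modal reasoning when perturbations corrupt low-level manipulation traces.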
Key Results
1. Performance of Sequential Deepfake Detection
On the Seq-Deepfake dataset, SeqFakeFormer and SeqFakeFormer++ performed strongly on both facial component and facial attribute manipulation detection. Compared to existing multi-label classification methods, SeqFakeFormer achieved significant improvements in both fixed accuracy (Fixed-Acc) and adaptive accuracy (Adaptive-Acc). Its lead was especially pronounced in adaptive accuracy, indicating a stronger capability for detecting manipulation sequences of varying lengths.
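The two metrics are not formally defined in this summary. One plausible reading, sketched below, is that Fixed-Acc pads every sequence to the maximum length with a no-manipulation token and scores per position, while Adaptive-Acc scores only over the actual sequence lengths; both helpers are illustrative and may differ from the paper's exact definitions.

```python
PAD = "no-manip"  # hypothetical padding token for "no manipulation"

def fixed_acc(pred, gt, max_len=5):
    """Per-position accuracy after padding both sequences to max_len."""
    p = pred + [PAD] * (max_len - len(pred))
    g = gt + [PAD] * (max_len - len(gt))
    return sum(a == b for a, b in zip(p, g)) / max_len

def adaptive_acc(pred, gt):
    """Per-position accuracy over the longer actual length, so agreeing
    padding positions never inflate the score."""
    n = max(len(pred), len(gt), 1)
    return sum(a == b for a, b in zip(pred, gt)) / n

# A missed step hurts Adaptive-Acc far more than Fixed-Acc:
print(fixed_acc(["eye"], ["eye", "nose"]))     # 0.8
print(adaptive_acc(["eye"], ["eye", "nose"]))  # 0.5
```

Under this reading, Adaptive-Acc is the harder metric because short sequences offer no "free" padding matches, which is consistent with the paper reporting its largest gains there.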
2. Robustness Testing
On the Seq-Deepfake-P dataset, SeqFakeFormer++ exhibited stronger robustness when faced with various perturbations. Compared to SeqFakeFormer, SeqFakeFormer++ showed improvements in both fixed and adaptive accuracy, especially in the task of facial components manipulation.
3. Failure Case Analysis
Although SeqFakeFormer and SeqFakeFormer++ performed well in most cases, failures still occurred in certain extreme scenarios. For example, the model might incorrectly predict the type, order, or length of manipulations. These failure cases highlight the ongoing challenges in sequential deepfake detection, particularly when dealing with hyper-realistic facial images and subtle manipulation traces.
Conclusion and Significance
This paper introduces a new research problem—sequential deepfake detection—and constructs the first large-scale sequential deepfake dataset. By framing sequential deepfake detection as an image-to-sequence task, the authors proposed the SeqFakeFormer model and further enhanced its robustness through the introduction of spatially enhanced cross-attention and cross-modal reasoning modules. Experimental results demonstrate that SeqFakeFormer and SeqFakeFormer++ have significant advantages in detecting sequential deepfakes, particularly in the face of real-world perturbations.
This research not only expands the scope of deepfake detection but also provides new directions for future research. By detecting sequential manipulation operations, the paper also offers the potential to recover original facial images, further enhancing the application value of deepfake detection.
Research Highlights
- Novel Research Problem: This paper is the first to propose the problem of sequential deepfake detection, expanding the scope of deepfake detection research.
- Large-Scale Dataset: The authors constructed the first dataset containing multi-step manipulation operations, providing detailed annotations for manipulation sequences.
- Innovative Model Design: The proposed SeqFakeFormer and SeqFakeFormer++ models significantly improve the performance and robustness of sequential deepfake detection through spatially enhanced cross-attention and cross-modal reasoning modules.
- Broad Application Prospects: By detecting sequential manipulation operations, this research also offers the potential to recover original facial images, with wide-ranging applications.
Future Research Directions
Although this paper makes significant progress in sequential deepfake detection, many questions remain open: for example, how to further improve the model's robustness under extreme manipulations, and how to extend sequential deepfake detection to broader multimodal media manipulation detection tasks. Future research can build on this work to address the increasingly complex challenges posed by deepfake technology.