Prototype-Based Sample-Weighted Distillation Unified Framework Adapted to Missing Modality Sentiment Analysis

Research Background

Sentiment analysis is a significant field in Natural Language Processing (NLP). With the development of social media platforms, people increasingly express their emotions through short video clips, leading to rapid growth in multimodal data. In practice, however, modalities are often missing due to issues such as audio loss, camera occlusion, or speech transcription errors, which makes sentiment analysis with missing modalities a critical and challenging problem. Because multimodal data are heterogeneous, optimizing a single objective across all modalities in a multimodal network often causes optimization imbalance, especially when modalities are missing. Existing research largely overlooks this imbalance in the missing-modality setting.

Research Origin

Figure: Structure of the Prototype-Based Sample Weighted Distillation Unified Framework

This paper was co-authored by Zhang Yujian, Liu Fanger, Zhuang Xuqiang, Hou Ying, and Zhang Yuling from the School of Information Science and Engineering at Shandong Normal University. The paper was published in the “Neural Networks” journal on May 20, 2024.

Research Process

1. Overview of the Research Process

To address the aforementioned issues, this paper proposes a Prototype-Based Sample Weighted Distillation unified framework (PSWD) and applies it to missing modality sentiment analysis. Specifically, PSWD employs an efficient Transformer-based cross-modal layered recurrent fusion module for feature fusion, and integrates a sample weighted distillation strategy with a prototype regularization network to tackle missing modalities and optimization imbalance. The framework comprises the following modules: feature encoder, invariant feature encoder, cross-modal layered recurrent fusion module, sentiment classifier, and prototype-based regularization network.

2. Detailed Process and Experimental Design

a. Feature Encoder Module

The feature encoder module provides an independent encoder for each modality (audio, visual, and text). For the audio and visual modalities, LSTM networks followed by max-pooling layers extract utterance-level features, while a TextCNN extracts features from the text modality.
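
The per-modality encoders described above can be sketched as follows in PyTorch. This is a minimal illustration, not the paper's implementation: all dimensions, filter counts, and kernel sizes are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class AudioVisualEncoder(nn.Module):
    """LSTM over frame-level features, then temporal max-pooling
    to produce a single utterance-level vector."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)

    def forward(self, x):               # x: (batch, time, in_dim)
        h, _ = self.lstm(x)             # (batch, time, hid_dim)
        return h.max(dim=1).values      # max-pool over time -> (batch, hid_dim)

class TextCNNEncoder(nn.Module):
    """TextCNN: parallel 1-D convolutions over word embeddings, max-pooled."""
    def __init__(self, emb_dim, n_filters=32, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])

    def forward(self, x):               # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)           # Conv1d expects (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)  # (batch, n_filters * len(kernel_sizes))

# Illustrative shapes: 74-dim audio frames, 300-dim word embeddings.
audio_feat = AudioVisualEncoder(74, 64)(torch.randn(2, 50, 74))
text_feat = TextCNNEncoder(300)(torch.randn(2, 20, 300))
```

Each encoder collapses a variable-length sequence into one fixed-size utterance-level feature vector per modality.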

b. Invariant Feature Encoder Module

The invariant feature encoder module consists of a fully connected layer, an activation function, and dropout layers. It maps modality-specific features into a shared subspace under a Central Moment Discrepancy (CMD) constraint in order to extract modality-invariant features.
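
The CMD constraint penalizes differences between the central moments of two feature distributions. Below is a minimal sketch of a CMD-style loss; the number of moments and the normalization are assumptions and may differ from the paper's exact formulation.

```python
import torch

def cmd_loss(x, y, n_moments=5):
    """Central Moment Discrepancy between two batches of features
    (rows = samples, columns = feature dimensions). Illustrative sketch."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = (mx - my).norm(p=2)               # first moment: mean difference
    cx, cy = x - mx, y - my                  # centered features
    for k in range(2, n_moments + 1):        # higher-order central moments
        loss = loss + ((cx ** k).mean(dim=0)
                       - (cy ** k).mean(dim=0)).norm(p=2)
    return loss

a = torch.rand(16, 8)
b = torch.rand(16, 8)
loss_same = cmd_loss(a, a.clone())   # identical distributions -> zero
loss_diff = cmd_loss(a, b)
```

Minimizing this loss between each pair of modality projections pulls their distributions toward the same shared subspace.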

c. Cross-Modal Layered Recurrent Fusion Module

This module fuses the invariant features within a hierarchical mutual-attention structure. By preserving the diversity of the invariant features and passing them through the layered structure, it enables effective communication and complementarity across all modalities.
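
One way to picture the mutual-attention fusion is a layer in which each modality queries the other two via cross-attention, applied repeatedly ("recurrently"). The sketch below is a simplification under assumed dimensions; the paper's actual wiring is more elaborate.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One mutual cross-attention step: a modality attends to the
    other modalities, with a residual connection and layer norm."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod, other_mods):
        # query_mod: (batch, 1, dim); other_mods: (batch, n_others, dim)
        fused, _ = self.attn(query_mod, other_mods, other_mods)
        return self.norm(query_mod + fused)   # residual + layer norm

dim = 64
t, a, v = (torch.randn(2, 1, dim) for _ in range(3))  # text, audio, visual
layer = CrossModalFusion(dim)
for _ in range(2):  # "recurrent" passes: refine each modality in turn
    t = layer(t, torch.cat([a, v], dim=1))
    a = layer(a, torch.cat([t, v], dim=1))
    v = layer(v, torch.cat([t, a], dim=1))
joint = torch.cat([t, a, v], dim=2).squeeze(1)  # joint fused representation
```

Each pass lets every modality absorb complementary information from the others before the features are concatenated.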

d. Classifier

The fused features are combined with the modality-specific features to form a joint multimodal representation for sentiment classification. The sentiment classifier consists of multiple fully connected layers that compute the probability distribution over sentiment predictions.
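
A classifier head of this kind can be sketched in a few lines; the layer sizes, dropout rate, and class count below are illustrative assumptions (e.g. four emotion classes, as on IEMOCAP).

```python
import torch
import torch.nn as nn

# Stacked fully connected layers mapping the joint multimodal
# representation to a probability distribution over sentiment classes.
classifier = nn.Sequential(
    nn.Linear(192, 64), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 4),                      # assumed: 4 emotion classes
)
logits = classifier(torch.randn(2, 192))
probs = torch.softmax(logits, dim=1)       # one distribution per sample
```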

e. Prototype Regularization

Prototype regularization builds a non-parametric classifier by introducing classification prototypes for each modality. It measures the distance between each sample and all prototypes to assess each modality's performance, and it accelerates the optimization of weak modalities through adaptive gradient adjustment.
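
The non-parametric, distance-based classification idea can be sketched as follows. The Euclidean distance and the gradient-scaling rule at the end are assumptions for illustration; the paper's exact distance measure and modulation scheme may differ.

```python
import torch

def prototype_logits(feats, prototypes):
    """Non-parametric classifier: score each sample by its negative
    Euclidean distance to each class prototype (closer -> higher logit)."""
    return -torch.cdist(feats, prototypes)   # (batch, n_classes)

feats = torch.randn(8, 16)                   # one modality's features
protos = torch.randn(4, 16)                  # one prototype per class
probs = torch.softmax(prototype_logits(feats, protos), dim=1)

# A per-modality "performance" score (mean probability assigned to the
# true class) can drive adaptive gradient adjustment: the weaker the
# modality, the larger its scaling coefficient.
labels = torch.randint(0, 4, (8,))
score = probs[torch.arange(8), labels].mean()
grad_scale = 1.0 / (score + 1e-6)            # illustrative modulation rule
```

Because the prototypes themselves act as the classifier, no extra parametric head is needed to probe each modality's current discriminative ability.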

Main Results

The paper conducted extensive experiments on two benchmark datasets, IEMOCAP and MSP-IMPROV. The experimental results show that PSWD outperforms the latest baseline methods.

Research Conclusion

The PSWD framework proposed in this paper not only complements full-modality sentiment analysis research but also addresses sentiment analysis with missing modalities. The sample weighted distillation strategy and the prototype regularization network effectively tackle the optimization imbalance problem. The results indicate that the proposed method achieves high robustness and broad adaptability across diverse application scenarios.

Research Highlights

  1. Novelty of the Method: A novel Transformer-based cross-modal layered recurrent fusion method is proposed.
  2. Sample Weighted Distillation: An innovative sample weighted distillation strategy improves model performance in the presence of missing modalities.
  3. Prototype Regularization Network: The prototype network helps adaptively adjust the optimization gradients of each modality.
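
To make the second highlight concrete, sample weighted distillation can be sketched as a per-sample weighted KL divergence between a full-modality teacher's softened predictions and a missing-modality student's. The weighting rule below is an assumption for illustration; the paper derives its weights from the samples themselves.

```python
import torch
import torch.nn.functional as F

def sample_weighted_kd(student_logits, teacher_logits, weights, T=2.0):
    """Knowledge distillation with per-sample weights: KL divergence
    between softened teacher and student predictions, scaled per sample."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # per-sample KL
    return (weights * kl).mean() * T * T   # temperature-squared scaling

student = torch.randn(4, 3)                # missing-modality model outputs
teacher = torch.randn(4, 3)                # full-modality model outputs
weights = torch.softmax(torch.randn(4), dim=0) * 4  # assumed per-sample weights
kd_loss = sample_weighted_kd(student, teacher, weights)
```

Weighting the distillation per sample lets the student concentrate on the examples where imitating the full-modality teacher matters most.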

Important Findings and Their Significance

The PSWD framework exhibits good performance in most missing modality scenarios, indicating high application value in real-world situations where modalities are missing. Moreover, this research is not limited to sentiment analysis and can be extended to other multimodal classification tasks, promising broader applications and promotion in various fields.