Text-Guided Reconstruction Network for Sentiment Analysis with Uncertain Missing Modalities

Application of Text-Guided Reconstruction Network in Multimodal Sentiment Analysis

Academic Background

Multimodal Sentiment Analysis (MSA) is a research field that aims to integrate sentiment expressions from text, visual, and acoustic signals. With the abundance of user-generated online content, MSA demonstrates significant potential for improving emotional understanding and human-computer interaction. However, existing MSA methods face two main problems: 1) the dominant role of the text modality in unaligned multimodal data is underutilized; 2) modalities under uncertain missing conditions are insufficiently explored. These issues limit the accuracy of sentiment prediction, especially in real-world applications, where factors such as background noise, sensor failure, face occlusion, poor lighting, and transcript loss can lead to random modality loss.

To address these issues, researchers have proposed a Text-Guided Reconstruction Network (TGRN), designed to handle uncertain missing modalities in non-aligned sequences. The network enhances the robustness of multimodal sentiment analysis through three primary modules: the Text-Guided Extraction Module (TEM), the Reconstruction Module (RM), and the Text-Guided Fusion Module (TFM).

Paper Source

The paper was co-authored by Piao Shi, Min Hu, Satoshi Nakagawa, Xiangming Zheng, Xuefeng Shi, and Fuji Ren, who are affiliated with Hefei University of Technology, the University of Tokyo, Bozhou University, and the University of Electronic Science and Technology of China. It was accepted for publication in IEEE Transactions on Affective Computing.

Research Workflow

a) Research Workflow

  1. Text-Guided Extraction Module (TEM)
    The TEM comprises Text-Guided Cross Attention Units (TCA) and Self-Attention Units (SA), which capture inter-modal and intra-modal features, respectively. First, each incomplete modality sequence is processed by a 1D temporal convolutional layer, and Position Embedding (PE) is then added to encode the sequence’s temporal order. The SA unit extracts intra-modal features by computing attention among queries, keys, and values, while the TCA unit uses textual features to guide and integrate the visual and auditory modalities (a minimal sketch of both units follows this list).

  2. Reconstruction Module (RM)
    The RM aims to learn semantic information from incomplete data and reconstruct missing modality features. It includes Enhanced Attention Units (EA) and a Three-Way Squeeze-and-Excitation Network (3SENet). The EA unit further explores interactions within each modality, while 3SENet extracts multi-dimensional features through horizontal max pooling, vertical max pooling, and global average pooling, enhancing the expressiveness of the reconstructed features (a 3SENet sketch follows this list).

  3. Text-Guided Fusion Module (TFM)
    The TFM employs a Progressive Modality-Mixing Adaptation Gate (PMAG) to explore dynamic correlations between the nonverbal and verbal modalities, addressing the modality gap. PMAG computes shift vectors for each modality and adjusts the modality representations with these vectors, which ultimately feed the sentiment prediction task (a gate sketch follows this list).
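
To make the extraction step concrete, below is a minimal PyTorch sketch of an SA-style encoder (1D temporal convolution, position embedding, self-attention) and a TCA-style unit. The dimension names, head counts, and the choice of text as the attention query are illustrative assumptions based on the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Per-modality encoder: 1D temporal conv + position embedding + self-attention (SA)."""
    def __init__(self, in_dim: int, d_model: int, max_len: int = 256, n_heads: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        self.pos = nn.Embedding(max_len, d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                                   # x: (batch, seq_len, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)    # temporal convolution
        pos = self.pos(torch.arange(h.size(1), device=h.device))
        h = h + pos                                          # add temporal-order information
        sa, _ = self.self_attn(h, h, h)                      # intra-modal self-attention
        return self.norm(h + sa)

class TextGuidedCrossAttention(nn.Module):
    """TCA-style unit: text features guide a nonverbal (visual or acoustic) modality."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, nonverbal):
        # Text supplies the query; the nonverbal sequence supplies keys/values,
        # so nonverbal content is selected and re-weighted under textual guidance.
        guided, _ = self.cross_attn(query=text, key=nonverbal, value=nonverbal)
        return self.norm(text + guided)
```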
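
The three-way squeeze-and-excitation idea in the RM can be sketched as follows; how the pooled statistics are combined into excitation weights is an assumption for illustration, as the paper's exact wiring is not reproduced here.

```python
import torch
import torch.nn as nn

class ThreeWaySE(nn.Module):
    """3SENet-style block (illustrative): horizontal max pooling, vertical max
    pooling, and global average pooling yield excitation weights that rescale
    the reconstructed modality features."""
    def __init__(self, seq_len: int, d_model: int, reduction: int = 4):
        super().__init__()
        self.feat_gate = nn.Sequential(                  # excitation over feature channels
            nn.Linear(d_model, d_model // reduction), nn.ReLU(),
            nn.Linear(d_model // reduction, d_model), nn.Sigmoid())
        self.time_gate = nn.Sequential(                  # excitation over timesteps
            nn.Linear(seq_len, max(seq_len // reduction, 1)), nn.ReLU(),
            nn.Linear(max(seq_len // reduction, 1), seq_len), nn.Sigmoid())

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        horiz = x.max(dim=2).values                      # horizontal max pool -> (batch, seq_len)
        vert = x.max(dim=1).values                       # vertical max pool   -> (batch, d_model)
        gavg = x.mean(dim=1)                             # global average pool -> (batch, d_model)
        x = x * self.feat_gate(vert + gavg).unsqueeze(1) # feature-wise reweighting
        x = x * self.time_gate(horiz).unsqueeze(2)       # timestep-wise reweighting
        return x
```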
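
Finally, a PMAG-style gate in the spirit of modality adaptation gates might look like the sketch below; the norm-based cap on the shift magnitude and the single-step (rather than progressive) form are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ModalityMixingGate(nn.Module):
    """PMAG-style gate (illustrative): nonverbal features produce a shift vector
    that adjusts the text representation before sentiment prediction."""
    def __init__(self, d_model: int, beta: float = 0.5):
        super().__init__()
        self.gate_v = nn.Linear(2 * d_model, d_model)    # gate from (text, visual)
        self.gate_a = nn.Linear(2 * d_model, d_model)    # gate from (text, acoustic)
        self.proj_v = nn.Linear(d_model, d_model)
        self.proj_a = nn.Linear(d_model, d_model)
        self.beta = beta

    def forward(self, text, visual, acoustic):           # all: (batch, seq_len, d_model)
        g_v = torch.sigmoid(self.gate_v(torch.cat([text, visual], dim=-1)))
        g_a = torch.sigmoid(self.gate_a(torch.cat([text, acoustic], dim=-1)))
        shift = g_v * self.proj_v(visual) + g_a * self.proj_a(acoustic)
        # Cap the shift so the adjustment never overwhelms the verbal representation.
        alpha = torch.clamp(
            self.beta * text.norm(dim=-1, keepdim=True)
            / (shift.norm(dim=-1, keepdim=True) + 1e-6),
            max=1.0)
        return text + alpha * shift
```

In a full model, the gated representation would then be pooled and passed to a regression or classification head to produce the sentiment prediction.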

b) Research Results

  1. Results of the Text-Guided Extraction Module (TEM)
    The TEM module effectively extracts features from text, visual, and auditory modalities through Self-Attention Units (SA) and Text-Guided Cross Attention Units (TCA). Experiments show that the text modality plays a dominant role in emotional expression, and nonverbal modalities are significantly enhanced through TCA units.

  2. Results of the Reconstruction Module (RM)
    The RM module successfully reconstructs missing modality features through EA units and 3SENet modules. Experiments on the CMU-MOSI and CH-SIMS datasets demonstrate that the RM module learns effective semantic information from incomplete data, significantly improving the accuracy of sentiment analysis.

  3. Results of the Text-Guided Fusion Module (TFM)
    The TFM module effectively addresses the modality gap problem through the PMAG module and achieves excellent performance in sentiment prediction tasks. Experimental results indicate that the TGRN model performs well under both complete and uncertain missing modality conditions.

Conclusion and Significance

The TGRN model proposed in this study effectively addresses the issue of uncertain missing modalities in multimodal sentiment analysis through three modules: text-guided extraction, modality reconstruction, and fusion. Experimental results show that TGRN outperforms state-of-the-art methods on the CMU-MOSI and CH-SIMS datasets. The scientific value of the model lies in its innovative use of the text modality to guide the feature expression of nonverbal modalities and its ability to handle missing modality issues through the reconstruction module. Additionally, the TGRN model demonstrates high robustness in practical applications, adapting well to complex real-world scenarios.

Research Highlights

  1. Importance of Text Guidance: This study is the first to propose using the text modality to guide the feature expression of visual and auditory modalities, significantly improving the accuracy of multimodal sentiment analysis.
  2. Innovation in Modality Reconstruction: Through Enhanced Attention Units and 3SENet modules, the RM module effectively reconstructs missing modality features from incomplete data.
  3. Dynamic Modality Fusion: The PMAG module dynamically adjusts modality representations, addressing the modality gap issue and further improving the precision of sentiment predictions.

Other Valuable Information

This study also used t-SNE visualization to show how the features of the different modalities are distributed, further verifying the dominant role of the text modality in multimodal sentiment analysis. Additionally, Bland-Altman plots were used to analyze the impact of each module on the sentiment analysis results, further supporting the superiority of the TGRN model.
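
For readers who want to reproduce this kind of view, here is a minimal scikit-learn/t-SNE sketch; the feature arrays are random stand-ins for embeddings exported from a trained model, not the paper's data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in embeddings for the three modalities (num_samples x feature_dim each);
# replace with features exported from the trained model.
rng = np.random.default_rng(0)
text_feat = rng.normal(size=(200, 64))
visual_feat = rng.normal(size=(200, 64))
acoustic_feat = rng.normal(size=(200, 64))

feats = np.concatenate([text_feat, visual_feat, acoustic_feat], axis=0)
labels = np.array(["text"] * 200 + ["visual"] * 200 + ["acoustic"] * 200)

# Project all modality embeddings into 2D with t-SNE and plot them per modality.
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)
for name in ("text", "visual", "acoustic"):
    mask = labels == name
    plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of modality embeddings")
plt.show()
```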

This research provides a new solution for multimodal sentiment analysis, offering significant theoretical and practical value. Future research can further explore the optimization of model parameters and solutions to dataset class imbalance issues to enhance the model’s performance.