Multimodal Sentiment Analysis with Mutual Information-Based Disentangled Representation Learning
Academic Background
With the rapid development of social media, the volume of user-generated multimedia content (such as tweets and videos) has grown dramatically. These data typically span three modalities: visual (images), acoustic (voice), and text. They carry rich emotional information, and automatically analyzing that information has become an important challenge. Multimodal Sentiment Analysis (MSA) aims to identify underlying emotions and sentiments from these heterogeneous signals. A core challenge in this field is multimodal representation learning, i.e., effectively integrating features from different modalities into a unified representation.
In recent years, researchers have pursued two main lines of work on this problem: one decomposes multimodal features into modality-invariant and modality-specific components, while the other leverages mutual information (MI) to enhance cross-modal fusion. Both approaches have achieved notable success but leave issues unresolved. For example, existing methods often consider only modality-invariant and modality-specific information, neglecting the role of modality-complementary information. Moreover, the disentanglement of multimodal features and the quantitative analysis of their information content have not been thoroughly studied.
Source of the Paper
This paper was co-authored by Hao Sun, Ziwei Niu, Hongyi Wang, Xinyao Yu, Jiaqing Liu, Yen-Wei Chen, and Lanfen Lin. Hao Sun and Ziwei Niu are co-first authors, and Yen-Wei Chen and Lanfen Lin are the corresponding authors. The authors are affiliated with the College of Computer Science and Technology at Zhejiang University and the College of Information Science and Engineering at Ritsumeikan University, Japan. The paper has been accepted by IEEE Transactions on Affective Computing and is expected to be formally published in 2025.
Research Workflow and Details
1. Research Framework
This study proposes a Mutual Information-based Disentangled Multimodal Representation Learning framework (MIMRL), dividing multimodal processing into two phases: feature extraction and fusion.
Feature Extraction Phase
During the feature extraction phase, the research team identified three types of useful information contained in multimodal features:
1. Modality-Invariant Information: Shared among different modalities, pointing to common semantics.
2. Modality-Specific Information: Unique to each modality but still relevant to the final prediction.
3. Modality-Complementary Information: Predictive information that arises only when two or more modalities are combined.
The research team utilized mutual information (MI) and conditional mutual information (CMI) to quantify these types of information and optimized feature extraction by adjusting their proportions.
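As a rough illustration, the three quantities can be expressed with MI and CMI. The formalization below is a plausible reading of the description above for two modalities (text representation z_t, acoustic representation z_a, label y), not necessarily the paper's exact definitions; the same pattern extends to the visual modality.

```latex
% Hypothetical information-theoretic reading of the three information types.
\begin{align}
  I_{\mathrm{inv}}      &\approx I(z_t; z_a)                            && \text{(modality-invariant: what the modalities share)} \\
  I_{\mathrm{spec}}^{t} &\approx I(z_t; y \mid z_a)                     && \text{(modality-specific: what } z_t \text{ alone adds about } y\text{)} \\
  I_{\mathrm{comp}}     &\approx I(z_t, z_a; y) - I(z_t; y) - I(z_a; y) && \text{(modality-complementary: gain only from the combination)}
\end{align}
```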
Fusion Phase
During the fusion phase, the research team promoted multimodal fusion by maximizing the mutual information between each modality’s representation and the fused representation. Additionally, they quantitatively analyzed the contribution of each modality in the fused representation.
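In symbols, this phase can be sketched as follows, assuming z_f denotes the fused representation and z_m the representation of modality m; the exact objective and the contribution measure below are illustrative, not taken from the paper.

```latex
% Fusion objective: the fused representation should retain information from every modality.
\max_{\theta} \; \sum_{m \in \{t,\, a,\, v\}} I(z_m; z_f)

% One plausible way to quantify the contribution of modality m to the fused representation:
c_m = \frac{I(z_m; z_f)}{\sum_{k \in \{t,\, a,\, v\}} I(z_k; z_f)}
```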
2. Experimental Setup and Datasets
The research team conducted experiments on four public datasets: CMU-MOSI, CMU-MOSEI, Hazumi1911, and AVEC2019, covering both sentiment analysis and depression detection tasks.
CMU-MOSI and CMU-MOSEI
These two datasets provide multimodal (text, acoustic, and visual) sentiment analysis data. Labels are real numbers in the interval [-3, 3], indicating sentiment intensity from strongly negative to strongly positive.
Hazumi1911
This dataset introduces physiological signals as a fourth modality for sentiment analysis.
AVEC2019
This dataset is used for depression detection tasks, with labels being real numbers in the interval [0, 24], representing the degree of depression.
3. Experimental Methods
Modality Representation Generation and Fusion
Before fusion, the research team used LSTMs (Long Short-Term Memory networks) to generate representations for the acoustic and visual modalities and BERT to generate text representations. The per-modality features were then fused into a unified representation by a fusion encoder.
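The PyTorch-style sketch below illustrates this kind of pipeline. The module layout, hidden size, feature dimensions, and the concatenation-based fusion encoder are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class ModalityEncoders(nn.Module):
    """Per-modality encoders: BERT for text, LSTMs for acoustic/visual sequences."""

    def __init__(self, acoustic_dim=74, visual_dim=35, hidden_dim=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, hidden_dim)
        self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden_dim, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)

    def forward(self, input_ids, attention_mask, acoustic, visual):
        # Text: project the [CLS] token embedding to the shared hidden size.
        text_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        z_t = self.text_proj(text_out.last_hidden_state[:, 0])
        # Acoustic / visual: use the final LSTM hidden state as the representation.
        _, (h_a, _) = self.acoustic_lstm(acoustic)
        _, (h_v, _) = self.visual_lstm(visual)
        return z_t, h_a[-1], h_v[-1]


class FusionEncoder(nn.Module):
    """A simple concatenation + MLP fusion encoder (an illustrative choice)."""

    def __init__(self, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, z_t, z_a, z_v):
        # Fuse the three unimodal representations into a single vector.
        return self.net(torch.cat([z_t, z_a, z_v], dim=-1))
```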
Information Maximization
During the feature extraction phase, the research team estimated modality-invariant, -specific, and -complementary information using MI and CMI, adjusting their proportions via loss functions. During the fusion phase, they optimized the fusion effect by maximizing the mutual information between each modality and the fused representation.
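The article does not say which MI estimator is used, so the sketch below uses an InfoNCE-style lower bound (a common neural MI estimator) to show how "maximize I(z_m; z_f)" can become a trainable loss; the bilinear critic and in-batch negatives are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BilinearCritic(nn.Module):
    """Scores (modality representation, fused representation) pairs."""

    def __init__(self, dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, z_m, z_f):
        # scores[i, j] = z_m[i]^T W z_f[j]; the diagonal holds the positive pairs.
        return z_m @ self.W @ z_f.t()


def infonce_lower_bound(critic, z_m, z_f):
    """InfoNCE lower bound on I(z_m; z_f) using in-batch negatives.

    Maximizing this value encourages the fused representation to retain
    information from modality m.
    """
    scores = critic(z_m, z_f)                       # (batch, batch)
    labels = torch.arange(z_m.size(0), device=z_m.device)
    return -F.cross_entropy(scores, labels)


# Usage sketch: subtract the bound (i.e., add the cross-entropy) for each modality
# to the task loss, so minimizing the total loss maximizes the MI lower bounds.
# total_loss = task_loss - sum(infonce_lower_bound(critic_m, z_m, z_f) for ...)
```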
4. Experimental Results
The proposed framework achieved state-of-the-art performance across multiple datasets. For example, on CMU-MOSI the MAE (Mean Absolute Error) was 0.687 and the Pearson correlation coefficient was 0.792; on CMU-MOSEI the MAE was 0.513 and the Pearson correlation coefficient was 0.801. The research team also found that different tasks rely on the three types of modality information to varying degrees: in sentiment analysis, modality-specific information from the text modality dominated, whereas in depression detection, modality-complementary information played a larger role.
Conclusion and Significance
This study proposes a mutual information-based disentangled multimodal representation learning framework, combining multimodal disentanglement with mutual information methods for the first time to address key challenges in multimodal representation learning. By quantitatively estimating and optimizing the proportions of modality-invariant, modality-specific, and modality-complementary information, the framework achieves significant performance improvements in multimodal sentiment analysis and depression detection tasks.
Research Highlights
- Innovation: Combines multimodal disentanglement with mutual information methods for the first time, yielding a novel multimodal representation learning framework.
- Quantitative Analysis: Provides theoretical support for multimodal fusion by quantitatively estimating the proportions of modality information using mutual information and conditional mutual information.
- Broad Applicability: Demonstrates the framework’s effectiveness across multiple public datasets, showcasing its versatility in different tasks.
Future Prospects
Despite the significant achievements of this study, the current adjustment of information proportions still relies on manual parameter tuning, limiting its scalability in practical applications. Future research will focus on developing adaptive methods to automatically adjust information proportions, further advancing multimodal representation learning.