Multi-Grained Visual Pivot-Guided Multi-Modal Neural Machine Translation with Text-Aware Cross-Modal Contrastive Disentangling

Academic Background

Multi-Modal Neural Machine Translation (MNMT) aims to incorporate language-independent visual information into the text modality to improve machine translation quality. However, because of the large modality gap between images and text, semantic mismatches between the two are unavoidable. The goal of this work is to use decomposed, multi-scale visual information as a cross-lingual pivot that improves alignment between languages and thereby enhances MNMT performance.

Paper Source

[Figure: The neural network architecture designed in this study]

This paper, authored by Junjun Zhu, Rui Su, and Junjie Ye, comes from the School of Information Engineering and Automation at Kunming University of Science and Technology, the School of Information Science and Engineering at Yunnan University, and the Yunnan Key Laboratory of Artificial Intelligence. It was published in 2024 in the journal Neural Networks.

Research Process

The research work is primarily divided into the following steps:

  1. Proposing a Multi-Scale Vision-Centric Multi-Modal Fusion Strategy: The authors build a framework named “ConVisPiv-MNMT,” which bridges the gap between languages through cross-modal contrastive decoupling. A text-guided stacked cross-modal decoupling module progressively separates the image into two kinds of visual information: visual information related to machine translation (MT) and background information.

  2. Establishing a Text-Guided Cross-Modal Decoupling Strategy: Within the stacked Transformer encoder layers, a text-guided cross-modal decoupling strategy separates the visual features at each layer into text-related and background visual information. Cross-modal gating mechanisms then embed the coarsely decoupled visual information into the text representation layer by layer (a minimal sketch of this step follows the list below).

  3. Designing a Multi-Scale Vision-Guided Transformer Decoder: The two kinds of decoupled visual information serve as visual pivots that bridge the language gap. The decoder consists of three primary components: the target-sentence embedding, a cross-language alignment module, and multi-scale vision-guided enhancement of target-sentence generation (a decoder-side sketch also appears below).
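
To make the encoder side concrete, here is a minimal PyTorch sketch of one text-guided cross-modal decoupling layer with a gating mechanism and a simple contrastive term. All module names, dimensions, and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextGuidedDecouplingLayer(nn.Module):
    """One encoder layer that splits visual features into translation-related
    and background parts, then gates the related part into the text stream.
    Module names, dimensions, and wiring are illustrative assumptions."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Text queries attend to visual keys/values to pick out MT-relevant content.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm_text = nn.LayerNorm(d_model)
        self.norm_vis = nn.LayerNorm(d_model)

    def forward(self, text, vis):
        # text: (B, T_text, D) token features; vis: (B, T_vis, D) region/patch features.
        y, _ = self.self_attn(text, text, text)
        text = self.norm_text(text + y)

        # Text-guided decoupling: the attended visual content is treated as the
        # translation-related part; the residual is treated as background.
        related, _ = self.cross_attn(text, vis, vis)                     # (B, T_text, D)
        background = self.norm_vis(vis - related.mean(dim=1, keepdim=True))

        # Cross-modal gate controls how much visual signal enters the text stream.
        g = torch.sigmoid(self.gate(torch.cat([text, related], dim=-1)))
        fused_text = text + g * related
        return fused_text, related, background


def contrastive_disentangle_loss(related, background, text, tau: float = 0.07):
    """InfoNCE-style term: pull the text toward the related visual part and push
    it away from the background part (sentence-level mean pooling)."""
    t = F.normalize(text.mean(dim=1), dim=-1)
    r = F.normalize(related.mean(dim=1), dim=-1)
    b = F.normalize(background.mean(dim=1), dim=-1)
    pos = torch.exp((t * r).sum(-1) / tau)
    neg = torch.exp((t * b).sum(-1) / tau)
    return -torch.log(pos / (pos + neg)).mean()
```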

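The decoder side can be sketched in the same spirit: target embeddings first attend to the encoded source sentence for cross-language alignment and then to the two decoupled visual streams used as pivots. Again, this layout is assumed for illustration and may differ from the published model.

```python
import torch
import torch.nn as nn


class VisionPivotDecoderLayer(nn.Module):
    """Decoder layer sketch: target embeddings attend to the encoded source
    sentence for cross-language alignment, then to the two decoupled visual
    streams used as pivots. The exact wiring in the paper may differ."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pivot_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, tgt, src, related, background, tgt_mask=None):
        # Masked self-attention over the target-sentence embedding.
        y, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norms[0](tgt + y)

        # Cross-language alignment module: attend to the encoded source text.
        y, _ = self.src_attn(tgt, src, src)
        tgt = self.norms[1](tgt + y)

        # Multi-scale visual pivots: expose both decoupled streams to the target.
        pivots = torch.cat([related, background], dim=1)
        y, _ = self.pivot_attn(tgt, pivots, pivots)
        tgt = self.norms[2](tgt + y)
        return tgt + self.ffn(tgt)
```
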
Research Results

Extensive experiments on four benchmark MNMT datasets demonstrate that the proposed method outperforms other state-of-the-art methods across all test sets. In the experiments:

  • Significant Improvement from Multi-Scale Visual Information Fusion: Layer-by-layer decoupling of the image information yields more precise cross-language alignment and better target-sentence generation, with clear gains in metrics such as BLEU and METEOR on the Multi30k dataset.
  • Effectiveness Confirmed by Comparative Analysis: The comparative analysis shows that the text-guided cross-modal decoupling and the vision-centric multi-modal fusion strategy each bring significant performance gains to MNMT.

Specific experimental results are as follows:

  • On the Multi30k dataset, the proposed method gains roughly 1 to 2.3 points in BLEU and METEOR over other state-of-the-art methods on the English-German and English-French translation tasks.
  • The proposed method also shows superior robustness and generality on domain-specific and multi-domain datasets such as Fashion-MMT, achieving the highest translation scores across multiple language pairs, including English-Chinese, English-German, English-Spanish, and English-French.

Conclusion and Value

This research narrows the semantic gap between languages by introducing a multi-scale vision-centric multi-modal fusion strategy, significantly improving MNMT translation performance. Its scientific value lies in the novel combination of text-guided and visual-information decoupling strategies, which provides a more precise multi-modal fusion framework for machine translation. Its application value lies in handling translation tasks effectively in both single-domain and multi-domain settings, showing strong robustness and broad application prospects.

Research Highlights

  1. Novel Methodology: A novel multi-scale vision-centric multi-modal fusion strategy is proposed, which significantly reduces the semantic gap between languages through text-guided cross-modal contrastive decoupling.
  2. Outstanding Experimental Results: Substantial performance improvements over existing methods are demonstrated on multiple datasets, together with good generality and robustness.
  3. Effectiveness of Visual Information: Experiments demonstrate the value of visual information for improving machine translation, with strong performance even when the quality of the visual input varies.

Other Valuable Information

  1. Robustness Testing of Visual Information in Different Scenarios: The authors ran experiments with visual inputs of varying quality (high-quality, noise-added, unrelated, and blank visual information) to verify the impact of visual information on translation performance. The results show that the proposed method maintains high performance in all tested scenarios, including the high-noise and unrelated-image settings, demonstrating good robustness (a sketch of such perturbations follows this list).

  2. Evaluation of Complexity and Computational Cost: The computational efficiency of the proposed method was assessed with several complexity metrics, including model parameter count, floating-point operations, and GPU utilization. The experiments indicate that, although the method is slightly more expensive to run than the baselines, its clear performance improvement comes without a notable increase in model parameters or loss of efficiency, supporting the method's effectiveness and computational feasibility (a small measurement sketch also appears below).
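
For intuition, the sketch below shows one way such degraded visual inputs could be constructed. The specific perturbations (noise level, how unrelated images are chosen) are assumptions for illustration, not details taken from the paper.

```python
import torch


def perturb_visual_features(vis, mode: str = "clean", noise_std: float = 1.0):
    """Construct degraded visual inputs for robustness probing. The specific
    perturbations (noise level, how 'unrelated' images are picked) are assumed,
    not taken from the paper. vis: (B, T_vis, D) visual features."""
    if mode == "noise":        # noise-added visual information
        return vis + noise_std * torch.randn_like(vis)
    if mode == "unrelated":    # pair each sentence with another sample's image
        return vis[torch.randperm(vis.size(0), device=vis.device)]
    if mode == "blank":        # blank visual information
        return torch.zeros_like(vis)
    return vis                 # 'clean': keep the original high-quality features
```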

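As a rough illustration of how such complexity metrics can be gathered, the snippet below counts parameters and times a forward pass. It is a generic probe, not the authors' measurement protocol; FLOPs and GPU-utilization figures would come from a dedicated profiler.

```python
import time
import torch


def complexity_report(model: torch.nn.Module, example_inputs, device: str = "cuda",
                      warmup: int = 3, runs: int = 20):
    """Rough probe: parameter count plus average forward latency.
    example_inputs is a list of tensors passed positionally to the model."""
    n_params = sum(p.numel() for p in model.parameters())
    model = model.to(device).eval()
    example_inputs = [x.to(device) for x in example_inputs]
    with torch.no_grad():
        for _ in range(warmup):                 # warm-up iterations
            model(*example_inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(*example_inputs)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
    return {"params": n_params,
            "avg_forward_ms": (time.perf_counter() - start) / runs * 1000}
```
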
By innovatively integrating multi-modal information with traditional machine translation methods, this research offers new approaches and ideas for the machine translation field and is expected to further advance its development.