Learning Semantic Consistency for Audio-Visual Zero-Shot Learning

Academic Background

In the field of artificial intelligence, Zero-Shot Learning (ZSL) is an extremely challenging task that aims to recognize unseen classes by leveraging knowledge from seen classes. Audio-Visual Zero-Shot Learning (AVZSL), a branch of ZSL, seeks to classify unseen classes by combining audio and visual information. However, many existing methods tend to focus on learning strong representations while neglecting the semantic consistency between audio and visual modalities as well as the inherent hierarchical structure of the data. This oversight can hinder the model’s ability to effectively classify unseen classes during testing, thereby limiting its performance in real-world applications.

To address this issue, a research team from Guizhou University, Shanghai Jiao Tong University, and Oklahoma State University proposed a novel framework, LSC-AVZSL (Learning Semantic Consistency for Audio-Visual Zero-Shot Learning). By introducing an attention mechanism and hyperbolic space, the framework strengthens cross-modal information interaction and captures the intrinsic hierarchical structure of the data, thereby improving the model's performance.

Source of the Paper

The paper was co-authored by Xiaoyong Li, Jing Yang, Yuling Chen, Wei Zhang, Xiaoli Ruan, Chengjiang Li, and Zhidong Su. It was accepted by the journal Artificial Intelligence Review on April 10, 2025, and published in the same year. The title of the paper is “Learning Semantic Consistency for Audio-Visual Zero-Shot Learning,” and its DOI is 10.1007/s10462-025-11228-4.

Research Process

1. Problem Definition and Research Framework

In audio-visual zero-shot learning, the model is trained on seen classes and must classify samples from unseen classes at test time. The research team proposed the LSC-AVZSL framework, which consists of three main modules: the Hyperbolic Space Module, the Transformer Module, and the Contrastive Loss Module. The hyperbolic space module captures the hierarchical structure of audio-visual data, the transformer module enhances cross-modal information interaction through a multi-head attention mechanism, and the contrastive loss module uses Noise Contrastive Estimation (NCE) to pull features of matched cross-modal pairs closer together.

2. Hyperbolic Space Modeling

Audio-visual data often exhibits hierarchical structure. For example, the VGGSound-GZSLCls dataset contains nine major categories, while the ActivityNet-GZSLCls dataset has at least four levels of hierarchy. To capture these hierarchical relationships, the research team projected the data into hyperbolic space, whose negative curvature represents hierarchies more naturally than Euclidean space. The key operations are hyperbolic projection and logarithmic mapping: hyperbolic projection maps points from Euclidean space onto the Poincaré ball model of hyperbolic space, while logarithmic mapping takes points on the ball back to a tangent (Euclidean) space, locally linearizing them to ease numerical computation and optimization.
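As a concrete illustration, the following is a minimal sketch of the exponential and logarithmic maps at the origin of a Poincaré ball with curvature parameter c. It assumes PyTorch tensors and a base point at the origin; the function names and shapes are illustrative, not the paper's exact implementation.

```python
import torch

def expmap0(v: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball (curvature -c, c > 0).

    Projects Euclidean (tangent-space) vectors onto the ball so that
    hierarchy-aware distances can be measured in hyperbolic space.
    """
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(y: torch.Tensor, c: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Logarithmic map at the origin: pulls points on the ball back to the
    tangent space, locally linearizing them for standard optimizers."""
    sqrt_c = c ** 0.5
    norm = y.norm(dim=-1, keepdim=True).clamp_min(eps)
    # artanh is only defined on (-1, 1); clamp for numerical safety.
    scaled = (sqrt_c * norm).clamp(max=1 - eps)
    return torch.atanh(scaled) * y / (sqrt_c * norm)

# Usage: project (hypothetical) fused audio-visual features into the ball and back.
feats = torch.randn(8, 512)            # stand-in Euclidean embeddings
ball_feats = expmap0(feats, c=1.0)     # points inside the unit Poincare ball
tangent_feats = logmap0(ball_feats)    # approximately recovers the inputs
```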

3. Audio-Visual Fusion Transformer

To learn multimodal representations, the research team designed a multimodal fusion transformer built from standard transformer layers, each containing a multi-head self-attention (MSA) block and a feedforward network (FFN). During training, the model receives audio features, visual features, and their combination as joint inputs, so it learns not only representations of the individual modalities but also the interactions between them.
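A minimal sketch of such a fusion transformer is shown below, assuming one audio and one visual feature vector per clip and PyTorch's built-in encoder layers. The dimensions, two-token scheme, and mean pooling are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Sketch of an audio-visual fusion transformer.

    Audio and visual feature vectors are projected to a shared width,
    treated as a two-token sequence, and mixed by multi-head
    self-attention so each modality can attend to the other.
    """
    def __init__(self, audio_dim=512, visual_dim=512, d_model=256,
                 n_heads=4, n_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, audio, visual):
        tokens = torch.stack(
            [self.audio_proj(audio), self.visual_proj(visual)], dim=1)  # (B, 2, d)
        fused = self.encoder(tokens)                                    # (B, 2, d)
        # Mean-pool the two modality tokens into one joint embedding.
        return fused.mean(dim=1)

# Usage with random stand-in features.
model = FusionTransformer()
joint = model(torch.randn(8, 512), torch.randn(8, 512))  # shape (8, 256)
```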

4. Loss Function Design

The research team proposed a combinatorial contrastive loss that accounts for interactions between different modality combinations. Specifically, it includes text-visual, text-audio, and audio-visual contrastive terms, plus additional contrastive terms for cross-modal information exchange. A hyperbolic alignment loss was introduced to minimize discrepancies between features of different modalities, and reconstruction and regression losses further guide training.
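The sketch below shows a symmetric InfoNCE-style contrastive term of the kind described above, summed over modality pairs. The temperature, equal weighting, and helper names are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def nce_contrastive_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between two modalities.

    Matching pairs (x_i, y_i) are pulled together while mismatched pairs
    in the batch act as negatives, shrinking the cross-modal gap.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def total_contrastive(text, audio, visual):
    # Illustrative combinatorial total over modality pairs.
    return (nce_contrastive_loss(text, visual) +
            nce_contrastive_loss(text, audio) +
            nce_contrastive_loss(audio, visual))
```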

Main Results

1. Datasets and Experimental Results

The research team tested the LSC-AVZSL framework on three benchmark datasets: VGGSound-GZSLCls, UCF-GZSLCls, and ActivityNet-GZSLCls. The experimental results show that LSC-AVZSL achieved state-of-the-art performance on all three datasets. For example, on the UCF-GZSLCls dataset, LSC-AVZSL achieved a Harmonic Mean (HM) of 61.67%, a 5.2% improvement over the next best baseline method, ClipClap-GZSL. On the ActivityNet-GZSLCls dataset, LSC-AVZSL achieved an HM of 30.77%, while ClipClap-GZSL achieved an HM of 27.93%.
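For context, GZSL benchmarks conventionally report the harmonic mean of the seen-class accuracy S and unseen-class accuracy U, which penalizes models that do well on only one of the two. Assuming the paper follows this standard protocol, the metric is:

```latex
\mathrm{HM} = \frac{2\, S\, U}{S + U}
```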

2. Visualization Analysis

Using t-SNE (t-Distributed Stochastic Neighbor Embedding), the research team visualized the distribution of the model's input features and output embeddings. The audio-visual embeddings learned by LSC-AVZSL show clearer inter-class boundaries and more compact intra-class structure, demonstrating the model's effectiveness in capturing semantic consistency and hierarchical structure.
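For readers who want to reproduce this kind of plot, the sketch below applies scikit-learn's t-SNE to a batch of embeddings; the array shapes and labels are random stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical stand-ins for learned audio-visual embeddings and class labels.
embeddings = np.random.randn(500, 256)
labels = np.random.randint(0, 10, size=500)

# Project the high-dimensional embeddings to 2-D for visual inspection.
points = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of audio-visual embeddings (sketch)")
plt.show()
```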

Conclusion and Significance

The LSC-AVZSL framework effectively addresses the issues of semantic inconsistency and insufficient hierarchical structure modeling in audio-visual zero-shot learning by introducing an attention mechanism and hyperbolic space. The framework not only achieved state-of-the-art performance on multiple benchmark datasets but also provided new insights for future multimodal fusion methods. The research team stated that they will continue to explore more efficient multimodal fusion methods and apply them to complex scenarios such as autonomous driving and intelligent surveillance.

Research Highlights

  1. Attention Mechanism: Enhanced information interaction between audio and visual modalities through a multi-head attention mechanism, improving semantic consistency.
  2. Hyperbolic Space: Leveraged hyperbolic space to capture the hierarchical structure of audio-visual data, enhancing the model’s representational capabilities.
  3. Combinatorial Contrastive Loss: Proposed a novel loss function that effectively reduces the distance between features of different modalities.
  4. Experimental Performance: Achieved state-of-the-art performance on multiple benchmark datasets, particularly excelling on the UCF-GZSLCls dataset.

Other Valuable Information

The research team has also made the code and data publicly available on GitHub for further research and validation by other researchers.

Through this research, the LSC-AVZSL framework provides a new solution for the field of audio-visual zero-shot learning and lays a solid foundation for future multimodal fusion research.