Towards Zero-Shot Human–Object Interaction Detection via Vision–Language Integration

Academic Background

Human-Object Interaction (HOI) detection is an important research direction in computer vision that aims to identify interactions between humans and objects in images. Traditional HOI detection methods rely primarily on supervised learning and therefore require large amounts of manually annotated training data, yet they generalize poorly to categories that were never seen during training. Moreover, the diversity and complexity of real-world human-object interactions make it time-consuming and labor-intensive to annotate every possible interaction category by hand.

In recent years, with the rapid development of Vision-Language Models (VLMs), zero-shot learning has become a popular research direction; it aims to enable models to recognize categories that were never seen during training. Against this backdrop, the authors propose a novel framework called “Knowledge Integration to HOI” (KI2HOI), which enhances zero-shot HOI detection by integrating knowledge from vision-language models.

Source of the Paper

This paper is co-authored by Weiying Xue, Qi Liu, Yuxiao Wang, Zhenao Wei, Xiaofen Xing, and Xiangmin Xu, all affiliated with the South China University of Technology. It was published in the journal Neural Networks, Volume 187, 2025, article number 107348.

Research Process

1. Framework Design

The core idea of the KI2HOI framework is to improve zero-shot HOI detection by integrating knowledge from vision-language models. The framework consists of the following main modules (a minimal sketch of how they fit together follows the list):

  • Visual Encoder: Extracts global visual features from images.
  • Verb Feature Learning: Extracts interaction-related features through verb queries.
  • Instance Interactor: Locates human-object pairs and classifies object categories.
  • Interaction Semantic Representation (ISR): Integrates visual and linguistic knowledge to generate interaction representations.
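
To make the division of labor concrete, here is a minimal sketch of how the four modules might be composed in a single forward pass. The class and argument names are illustrative assumptions, not the authors' code, and the tensor shapes are only one plausible convention.

```python
# Hypothetical composition of the four KI2HOI modules described above.
# Module classes, names, and shapes are illustrative assumptions.
import torch.nn as nn

class KI2HOISketch(nn.Module):
    def __init__(self, visual_encoder, verb_decoder, instance_interactor, isr_decoder):
        super().__init__()
        self.visual_encoder = visual_encoder            # global visual features
        self.verb_decoder = verb_decoder                # verb feature learning
        self.instance_interactor = instance_interactor  # human-object pairs + object classes
        self.isr_decoder = isr_decoder                  # interaction semantic representation

    def forward(self, images, verb_queries, instance_queries):
        visual_feats = self.visual_encoder(images)                   # (B, HW, D)
        verb_feats = self.verb_decoder(verb_queries, visual_feats)   # interaction cues
        boxes, obj_logits, inst_feats = self.instance_interactor(instance_queries, visual_feats)
        hoi_repr = self.isr_decoder(verb_feats, inst_feats, visual_feats)
        return boxes, obj_logits, hoi_repr
```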

2. Visual Encoder

The visual encoder is based on the DETR (Detection Transformer) architecture, with ResNet-50 as the backbone network. To strengthen global feature extraction, the authors propose an HO-Pair Encoder consisting of a local encoder and a global context generator, which together capture contextual information in the image.
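
The sketch below illustrates this structure under simple assumptions: a ResNet-50 backbone, a transformer self-attention stack standing in for the local encoder, and a pooled token standing in for the global context generator. It is not the authors' HO-Pair Encoder, only a plausible reading of the description.

```python
# Minimal DETR-style visual encoder sketch: ResNet-50 backbone, local
# self-attention encoder, and a simple global-context branch (an assumption,
# not the paper's exact HO-Pair Encoder).
import torch.nn as nn
import torchvision

class HOPairEncoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        backbone = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_encoder = nn.TransformerEncoder(layer, num_layers)
        self.global_proj = nn.Linear(d_model, d_model)  # stand-in for the global context generator

    def forward(self, images):                            # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))    # (B, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)         # (B, hw, D)
        local = self.local_encoder(tokens)                # local contextual features
        global_ctx = self.global_proj(local.mean(dim=1, keepdim=True))  # (B, 1, D)
        return local + global_ctx                         # broadcast global context to all tokens
```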

3. Verb Feature Learning

The verb feature learning module extracts interaction-related features by letting verb queries interact with the global visual features. Specifically, the authors design a module that combines self-attention and multi-head attention with a Feed-Forward Network (FFN) layer to update the verb queries.
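
One plausible reading of this update step is a standard transformer decoder layer: self-attention among the verb queries, cross-attention from the queries to the visual features, then an FFN. The sketch below assumes that layout; it is not the authors' exact module.

```python
# Illustrative verb-query update layer (an assumed layout, not the paper's code):
# self-attention over queries, cross-attention to visual features, then an FFN.
import torch.nn as nn

class VerbQueryLayerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, dim_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, verb_queries, visual_feats):
        # verb_queries: (B, Nq, D); visual_feats: (B, HW, D)
        q = self.norm1(verb_queries + self.self_attn(verb_queries, verb_queries, verb_queries)[0])
        q = self.norm2(q + self.cross_attn(q, visual_feats, visual_feats)[0])
        return self.norm3(q + self.ffn(q))
```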

4. Interaction Semantic Representation

The interaction semantic representation module generates interaction representations by integrating visual and linguistic knowledge. Specifically, the authors design an interaction representation decoder, which combines visual features and spatial features through a multi-head cross-attention mechanism to enhance interaction representation capabilities.
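
As a rough illustration of that idea, the sketch below lets interaction queries cross-attend to concatenated visual and spatial features, then scores the result against text embeddings of the interaction labels (e.g., produced by a vision-language model). The spatial-feature dimension, projection layers, and names are assumptions for illustration only.

```python
# Sketch of an interaction-representation decoder: cross-attention over visual
# and spatial features, scored against label text embeddings. Dimensions and
# names are illustrative assumptions.
import torch
import torch.nn as nn

class ISRDecoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, text_dim=512):
        super().__init__()
        self.spatial_proj = nn.Linear(8, d_model)   # assumed 8-dim box-pair geometry encoding
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.to_text_space = nn.Linear(d_model, text_dim)

    def forward(self, hoi_queries, visual_feats, spatial_feats, text_embeds):
        # hoi_queries: (B, Nq, D); visual_feats: (B, HW, D)
        # spatial_feats: (B, Np, 8); text_embeds: (C, text_dim) for C interaction classes
        memory = torch.cat([visual_feats, self.spatial_proj(spatial_feats)], dim=1)
        repr_, _ = self.cross_attn(hoi_queries, memory, memory)
        logits = self.to_text_space(repr_) @ text_embeds.t()   # similarity to label text features
        return repr_, logits
```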

5. Training and Inference

During the training phase, the authors use the Hungarian Algorithm to match predictions with ground truths and design various loss functions, including bounding box regression loss and interaction classification loss. During the inference phase, the model generates the final HOI prediction results by integrating the scores of humans, objects, and verbs.
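
The snippet below sketches the two ingredients mentioned here in their generic DETR-style form: bipartite matching via the Hungarian algorithm, and a product-style fusion of human, object, and verb scores at inference. The cost terms, weights, and score fusion rule are common choices used for illustration, not the paper's exact formulation.

```python
# Generic DETR-style Hungarian matching and HOI score fusion (illustrative
# cost terms and weights; not the paper's exact losses or weighting).
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_boxes, pred_logits, gt_boxes, gt_labels,
                      cls_weight=1.0, box_weight=5.0):
    # pred_boxes: (Nq, 4), pred_logits: (Nq, C); gt_boxes: (M, 4), gt_labels: (M,)
    cls_cost = -pred_logits.softmax(-1)[:, gt_labels]             # (Nq, M) classification cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)             # (Nq, M) L1 box distance
    cost = cls_weight * cls_cost + box_weight * box_cost
    row, col = linear_sum_assignment(cost.detach().cpu().numpy()) # optimal one-to-one matching
    return row, col   # matched prediction and ground-truth indices

def hoi_score(human_score, object_score, verb_scores):
    # Final HOI confidence as a product of detection and verb scores (one common choice).
    return human_score * object_score * verb_scores
```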

Main Results

1. Zero-Shot Detection

The authors conducted experiments with various zero-shot settings on the HICO-DET dataset, and the results show that KI2HOI performs exceptionally well on unseen interaction categories, especially on rare categories, significantly outperforming existing methods. For example, under the Rare First Unseen Combination (RF-UC) setting, KI2HOI improved the mean Average Precision (mAP) on unseen categories by 23.26% compared to the best existing method.

2. Fully Supervised Detection

To verify the model’s generalization ability, the authors also conducted fully supervised experiments on the HICO-DET and V-COCO datasets. The results indicate that KI2HOI outperforms existing methods in both full and rare categories, with particularly outstanding performance in rare categories.

3. Robustness Analysis

The authors also investigated the model’s robustness under different data quantities. The results show that even when the training data is reduced to 25%, KI2HOI still significantly outperforms existing methods in rare categories, demonstrating its potential in practical applications.

Conclusion and Significance

The KI2HOI framework significantly improves the performance of zero-shot HOI detection by integrating knowledge from vision-language models. The framework not only performs well in zero-shot settings but also demonstrates strong generalization capabilities in fully supervised settings. Additionally, KI2HOI’s performance in rare categories is particularly notable, providing new insights for addressing the long-tail distribution problem in HOI detection.

Research Highlights

  1. Novel Framework Design: The KI2HOI framework significantly improves zero-shot HOI detection performance by integrating knowledge from vision-language models.
  2. Strong Generalization Capability: KI2HOI not only performs well in zero-shot settings but also demonstrates strong generalization capabilities in fully supervised settings.
  3. Robustness Analysis: Even with reduced training data, KI2HOI still significantly outperforms existing methods in rare categories, demonstrating its potential in practical applications.

Other Valuable Information

This paper offers a new research direction for HOI detection, particularly in its exploration of zero-shot learning and the long-tail distribution problem, and it carries both academic value and practical relevance.