A Mutual Supervision Framework for Referring Expression Segmentation and Generation

Research Background and Problem Statement

In recent years, vision-language interaction has made remarkable progress in artificial intelligence. Among these advances, referring expression segmentation (RES) and referring expression generation (REG) are two core tasks: RES locates a target object in an image and produces its segmentation mask from a natural language description, while REG generates a clear and accurate description for a specified target. Although the two tasks are inherently inverse to each other, they are usually studied separately, with little systematic exploration of how they can mutually enhance each other.

Existing research faces three main issues: 1) the RES task relies heavily on large amounts of annotated data, which is costly to obtain; 2) expressions generated by REG can be ambiguous, making it difficult to locate the target object accurately; 3) although joint training of RES and REG has been explored, it remains unclear how to make the two tasks effectively benefit from each other within a joint learning framework. To address these issues, the authors propose a Transformer-based mutual supervision framework that tackles the above problems through two supervision mechanisms, disambiguation supervision and generation supervision, and significantly improves the performance of both tasks.

Source of the Paper

This paper was co-authored by Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, and Liwei Wang, affiliated with the Chinese University of Hong Kong, the International Digital Economy Academy (IDEA), the Hong Kong University of Science and Technology, and Tsinghua University. It was published in the International Journal of Computer Vision in 2025 (DOI: 10.1007/s11263-024-02325-y).


Research Details and Workflow

a) Research Workflow

1. Framework Overview

The proposed mutual supervision framework includes three main modules:
  • Shared Proposal Extractor: implemented with Mask2Former (Cheng et al., 2022); extracts candidate objects from the input image.
  • Indicated Generation Head: used by the REG task to generate natural language descriptions for target objects.
  • Proposal Selection Head: used by the RES task to select the best-matching object given a language description.
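To make the interaction between these modules concrete, below is a minimal PyTorch-style sketch of how the two heads could share a single proposal extractor. All class names, method names, and tensor shapes here are illustrative assumptions, not the authors' actual code.

```python
import torch
import torch.nn as nn

class MutualSupervisionModel(nn.Module):
    """Sketch of the shared-extractor design; names and interfaces are hypothetical."""

    def __init__(self, extractor, generation_head, selection_head):
        super().__init__()
        self.extractor = extractor              # Mask2Former-like proposal extractor
        self.generation_head = generation_head  # Transformer decoder + indicator module (REG)
        self.selection_head = selection_head    # Transformer decoder over proposals (RES)

    def forward_res(self, image, expression):
        # RES path: score every candidate against the expression, return the best mask.
        proposal_feats, masks = self.extractor(image)   # e.g. 100 candidates per image
        scores = self.selection_head(proposal_feats, expression)
        best = scores.argmax(dim=-1)                    # index of the best-matching proposal
        return masks[torch.arange(masks.size(0)), best], scores

    def forward_reg(self, image, target_index):
        # REG path: mark the target proposal via the indicator and decode a description.
        proposal_feats, _ = self.extractor(image)
        return self.generation_head(proposal_feats, target_index)
```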

2. Specific Workflow

The research is carried out in three steps:

Step 1: End-to-End Joint Training
  • Research Object: Using three public datasets—RefCOCO, RefCOCO+, and RefCOCOG—all derived from MS-COCO (Lin et al., 2014), containing numerous images and their corresponding referring expressions.
  • Processing Method:
    • ResNet-101 is used as the visual backbone network to extract image features.
    • Mask2Former serves as the proposal extractor, generating 100 candidate objects and their segmentation masks.
    • The indicated generation head adopts a Transformer decoder architecture, combined with a novel indicator module to generate language descriptions.
    • The proposal selection head is also based on a Transformer decoder, selecting the best-matching object by calculating the matching scores between language descriptions and candidate objects.
  • Experimental Setup: AdamW optimizer is used with an initial learning rate of 5e-4, batch size of 8, and 90k iterations (see the configuration sketch after this list).
Step 2: Introduction of Disambiguation Supervision
  • Research Object: Same as above.
  • Processing Method:
    • In this phase, the proposal extractor and proposal selection head are frozen, and only the indicated generation head is optimized.
    • Reinforcement learning is introduced, with an “unambiguity reward” designed from the matching scores provided by the proposal selection head.
    • Automatic captioning metrics (such as CIDEr) are combined with this reward to further optimize the generated expressions (see the reward sketch after this list).
  • Experimental Setup: Learning rate reduced to 1e-6, batch size of 4, and 20k iterations.
Step 3: Introduction of Generation Supervision
  • Research Object: MS-COCO instance segmentation data without referring-expression annotations (approximately 87k images).
  • Processing Method:
    • The indicated generation head is used to automatically generate pseudo-expressions, expanding the training data for the RES task.
    • Area-based filtering and data reweighting strategies are adopted to reduce noise (see the filtering sketch after this list).
    • Pseudo-expressions are combined with real annotated data to retrain the entire framework.
  • Experimental Setup: Same as Step 1.
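For Step 1, the reported optimizer settings translate into a few lines of standard PyTorch; the placeholder model below stands in for the full framework, and anything beyond the stated hyperparameters is an assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # placeholder standing in for the full RES/REG framework

# Step 1 settings as reported: AdamW, initial lr 5e-4, batch size 8, 90k iterations.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
MAX_ITERS, BATCH_SIZE = 90_000, 8

# Step 2 reuses AdamW with lr 1e-6, batch size 4, and 20k iterations,
# with the proposal extractor and proposal selection head frozen.
```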
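For Step 2, here is a hedged sketch of how an unambiguity reward could be combined with CIDEr in a self-critical (SCST-style) policy-gradient update. The softmax form of the reward and the mixing weight alpha are assumptions; the paper's reward is built from the frozen selection head's matching scores.

```python
import torch

def unambiguity_reward(match_scores, target_index, cider, alpha=0.5):
    """Reward for one sampled expression (alpha and the softmax form are assumptions).

    match_scores: (num_proposals,) scores from the frozen proposal selection head
    target_index: index of the ground-truth target proposal
    cider:        CIDEr score of the sampled expression against its references
    """
    # Probability that the selection head resolves the expression to the intended target.
    p_target = torch.softmax(match_scores, dim=-1)[target_index]
    return alpha * p_target + (1.0 - alpha) * cider

def scst_loss(sample_log_probs, reward, baseline_reward):
    # Self-critical sequence training (Rennie et al., 2017): reinforce sampled
    # expressions whose reward beats the greedy-decoding baseline. Rewards are
    # computed without gradient flow, since the selection head is frozen.
    advantage = reward - baseline_reward
    return -(advantage * sample_log_probs.sum())
```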
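For Step 3, a minimal sketch of the area-based filtering and reweighting of pseudo-labeled samples; the threshold and weight values are illustrative assumptions, not the paper's.

```python
def filter_and_reweight(samples, min_area_ratio=0.01, pseudo_weight=0.5):
    """Filter pseudo-expression samples by object area and down-weight them.

    Each sample is assumed to be a dict with 'mask_area', 'image_area',
    and 'is_pseudo' keys; threshold and weight values are hypothetical.
    """
    kept = []
    for s in samples:
        if s["mask_area"] / s["image_area"] < min_area_ratio:
            continue  # very small objects tend to yield noisy pseudo-expressions
        s["loss_weight"] = pseudo_weight if s["is_pseudo"] else 1.0  # trust real data more
        kept.append(s)
    return kept
```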

3. New Methods and Algorithms

  • Indicator Module: assigns a positive or negative indicator to each candidate object to guide language generation, ensuring that the generated expression can distinguish the target object from the background (a sketch follows this list).
  • Disambiguation Supervision: Designs a reward function using matching scores provided by the proposal selection head to enhance the unambiguity of generated expressions.
  • Generation Supervision: Expands the data scale for the RES task by automatically generating pseudo-expressions, while adopting filtering and reweighting strategies to improve data quality.
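A minimal sketch of the indicator idea: a learned positive embedding marks the proposal to be described, and a negative embedding marks all others, so the decoder knows which object the expression must single out. The embedding form and the additive scheme are assumptions.

```python
import torch
import torch.nn as nn

class IndicatorModule(nn.Module):
    """Adds a learned positive embedding to the target proposal and a
    negative embedding to the rest (the additive scheme is an assumption)."""

    def __init__(self, dim):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(dim))  # marks the object to describe
        self.neg = nn.Parameter(torch.zeros(dim))  # marks context objects

    def forward(self, proposal_feats, target_index):
        # proposal_feats: (num_proposals, dim)
        is_target = torch.zeros(proposal_feats.size(0), 1)
        is_target[target_index] = 1.0
        indicators = is_target * self.pos + (1.0 - is_target) * self.neg
        return proposal_feats + indicators
```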

b) Main Results

1. Effectiveness of Disambiguation Supervision

  • On the RefCOCO+ test set, disambiguation supervision significantly improved the CIDEr score (from 0.879 to 0.927).
  • Human evaluation results show that the model-generated expressions have higher unambiguity (Top-1 Accuracy increased from 55% to 61%).
  • Qualitative analysis shows that after adding disambiguation supervision, the generated expressions become more detailed and precise. For example, “the second bear from the right” locates the target more accurately than “the bear on the right.”

2. Effectiveness of Generation Supervision

  • On the RefCOCO+ validation set, generation supervision increased the mIoU score from 66.21% to 67.80%.
  • Data filtering and reweighting strategies significantly reduced the impact of noise brought by pseudo-expressions, especially on more challenging datasets (e.g., RefCOCO+).
  • The quality of pseudo-expressions is crucial: simply using category names or early model-generated expressions did not bring performance improvements.
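For reference, mIoU here denotes the mean intersection-over-union between predicted and ground-truth masks, averaged over samples; a minimal sketch of the standard computation, assuming binary NumPy masks (not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred_masks, gt_masks, eps=1e-6):
    """Mean IoU over pairs of binary masks (standard definition)."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / (union + eps))
    return float(np.mean(ious))
```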

3. Overall Performance Comparison

  • In the RES task, the proposed method outperformed existing state-of-the-art methods (such as RefTR and CRIS) in terms of average mIoU scores across all test sets, with an improvement of 5.97%.
  • In the REG task, the proposed method achieved a clear lead in the CIDEr metric, particularly on the most challenging RefCOCO+ TestB split, where the CIDEr score increased from 0.860 to 0.927.

c) Research Conclusions and Value

This study proposes an innovative mutual supervision framework that achieves joint optimization of RES and REG tasks through disambiguation supervision and generation supervision. This framework not only addresses the issue of insufficient data in the RES task but also significantly enhances the unambiguity of REG-generated expressions. The research findings hold significant scientific value in the field of vision-language interaction and demonstrate broad prospects in practical applications such as robot interaction and intelligent image retrieval.


d) Research Highlights

  1. Mutual Supervision Mechanism: The first systematic exploration of how RES and REG can mutually benefit in joint learning.
  2. Indicator Module: Designed a novel indicator module to flexibly guide the language generation process.
  3. Generation Supervision: Expanded the data scale for the RES task by automatically generating pseudo-expressions, significantly improving model performance.
  4. Performance Breakthrough: Set new performance records for RES and REG tasks on multiple public datasets.

e) Other Valuable Information

  • This study also verified the generalization ability of the framework on other datasets (e.g., PhraseCut and ReferItGame).
  • In terms of inference speed, the proposed framework, being a top-down method, is slower per image (261 ms) than bottom-up methods, but it performs especially well in multi-query scenarios, where the proposal extraction can be shared across multiple expressions.

Conclusion

This paper successfully addresses key issues in referring expression segmentation and generation tasks by proposing a Transformer-based mutual supervision framework. Its innovative supervision mechanism and efficient data expansion strategy provide new ideas for research in the field of vision-language interaction, while also laying a solid foundation for practical applications.