GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection
Research Background and Problem Statement
In real-world applications, machine learning models often face changes in the data distribution, such as the emergence of new categories; such inputs are called out-of-distribution (OOD) data. To ensure that models remain reliable on unknown data, OOD detection has become a critical task. However, traditional single-modal supervised learning methods, while performing well on specific tasks, are costly to train and struggle to adapt to diverse application scenarios.
In recent years, zero-shot out-of-distribution detection methods based on CLIP (Contrastive Language–Image Pre-training) have garnered significant attention. CLIP is a multi-modal pre-trained model capable of learning visual features through natural language supervision. Although existing methods like MCM (Maximum Concept Matching) perform well in zero-shot scenarios, they typically assume that input images contain only a single, centered object, ignoring more complex multi-object scenes. In these scenarios, images may simultaneously contain both in-distribution (ID) and out-of-distribution (OOD) objects. Therefore, designing a flexible and efficient detection method that adapts to different types of ID images has become an urgent problem to solve.
Source of the Paper
This paper, titled “GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection”, was co-authored by Atsuyuki Miyai, Qing Yu, Go Irie, and Kiyoharu Aizawa. The authors are affiliated with the University of Tokyo, LY Corporation, and Tokyo University of Science. The paper was accepted on January 6, 2025, and published in the prestigious journal International Journal of Computer Vision, with the DOI 10.1007/s11263-025-02356-z.
Research Details and Workflow
a) Research Workflow and Methodology
1. Overview of the Method
The authors propose a new method called GL-MCM (Global-Local Maximum Concept Matching), which combines global and local vision-language alignments to enhance detection performance. The core idea of GL-MCM is to use CLIP’s local features as auxiliary scores to compensate for the shortcomings of global features in multi-object scenarios.
2. Main Steps
The research is divided into several key steps:
Global Feature Extraction
The global features from CLIP are used as the basis for computing image-text similarity. Specifically, CLIP's image encoder aggregates the feature map into a global feature vector \( x' \) via an attention pooling layer and projects it into the text space.
Local Feature Extraction
The authors introduce local features by projecting the value features from CLIP's last attention layer, yielding vision-language-aligned local features \( x'_i \). These local features retain rich spatial information, capturing the objects in each region of the image.
Local Maximum Concept Matching (L-MCM)
Based on the local features, the authors propose L-MCM, which enhances the separability of local features through softmax scaling:
\[ S_{\text{L-MCM}} = \max_{t,\,i} \frac{e^{\,\mathrm{sim}(x'_i,\, y_t)/\tau}}{\sum_{c \in T_{\text{in}}} e^{\,\mathrm{sim}(x'_i,\, y_c)/\tau}} \]
Here, \( \mathrm{sim}(u_1, u_2) \) denotes cosine similarity and \( \tau \) is the temperature parameter.
Global-Local Maximum Concept Matching (GL-MCM)
GL-MCM combines the global and local scores to form the final detection score:
\[ S_{\text{GL-MCM}} = S_{\text{MCM}} + \lambda\, S_{\text{L-MCM}} \]
Here, \( \lambda \) is a hyperparameter that controls the relative weight of the global and local scores.
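To make the scoring concrete, the following is a minimal PyTorch sketch of the MCM, L-MCM, and GL-MCM scores. It is not the authors' released implementation; the tensor names (`global_feat`, `local_feats`, `text_feats`) and the assumption that all features are already L2-normalized and projected into the shared text space are illustrative.

```python
# Minimal sketch of MCM, L-MCM, and GL-MCM scoring (assumed interface, not the
# authors' reference code). global_feat: (D,), local_feats: (N, D) local features
# from the value projection of CLIP's last attention layer, text_feats: (C, D)
# embeddings of the C in-distribution class prompts. All are L2-normalized tensors.
import torch
import torch.nn.functional as F

def mcm_score(global_feat, text_feats, tau=1.0):
    """Global Maximum Concept Matching: max softmax probability over ID classes."""
    sims = global_feat @ text_feats.T / tau          # (C,) cosine similarity / temperature
    return F.softmax(sims, dim=-1).max().item()

def l_mcm_score(local_feats, text_feats, tau=1.0):
    """Local Maximum Concept Matching: max over both local regions and ID classes."""
    sims = local_feats @ text_feats.T / tau          # (N, C)
    probs = F.softmax(sims, dim=-1)                  # softmax over classes per region
    return probs.max().item()                        # max over regions and classes

def gl_mcm_score(global_feat, local_feats, text_feats, tau=1.0, lam=1.0):
    """GL-MCM: global score plus lambda-weighted local score; higher = more ID-like."""
    return mcm_score(global_feat, text_feats, tau) + lam * l_mcm_score(local_feats, text_feats, tau)
```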
3. Experimental Setup
Experiments were conducted on multiple benchmark datasets, including ImageNet, MS-COCO, and Pascal-VOC. For the zero-shot setting, ViT-B/16 was used as the backbone network; for few-shot settings, CoOp and LoCoOp methods were integrated.
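As a hedged illustration of the zero-shot setup, the snippet below builds the ID text embeddings and a global image embedding with the OpenAI CLIP package and the ViT-B/16 backbone. The prompt template and class list are placeholders, and the local value features used by L-MCM require a small modification to CLIP's attention pooling that is omitted here.

```python
# Sketch of the zero-shot setup (assumed, not the authors' code): ID text embeddings
# and a global image embedding from CLIP ViT-B/16.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

id_classes = ["dog", "cat", "car"]                       # placeholder ID label set
prompts = [f"a photo of a {c}" for c in id_classes]

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)   # L2-normalize

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    global_feat = model.encode_image(image).squeeze(0)
    global_feat = global_feat / global_feat.norm()                    # L2-normalize
```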
b) Key Results
1. ImageNet Benchmark Test
Experimental results show that GL-MCM outperforms MCM in most settings, especially in complex scenarios. For example, on the iNaturalist OOD dataset, GL-MCM reduced FPR95 (the false positive rate at a 95% true positive rate) by 13.7% and improved AUROC (the area under the ROC curve) by 2.8%.
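For reference, FPR95 and AUROC are the standard OOD-detection metrics; a typical way to compute them from arrays of ID and OOD detection scores (treating ID as the positive class) is sketched below, with illustrative variable names.

```python
# Hedged sketch of the standard metric computation, not code from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auroc_fpr95(id_scores, ood_scores):
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = float(fpr[np.argmax(tpr >= 0.95)])   # FPR at the first threshold reaching 95% TPR
    return auroc, fpr95
```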
2. MS-COCO and Pascal-VOC Benchmark Tests
On multi-object datasets, GL-MCM also performs exceptionally well. For instance, in the Pascal-VOC dataset, GL-MCM achieved an average AUROC of 93.81%, significantly higher than MCM’s 88.08%.
3. Parameter Sensitivity Analysis
By adjusting the \( \lambda \) parameter, the authors verified the flexibility of GL-MCM. The experiments show that larger \( \lambda \) values are better suited to detecting images that contain both ID and OOD objects, while smaller \( \lambda \) values are better suited to ID-dominant images.
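Using the `gl_mcm_score` sketch introduced earlier, the effect of \( \lambda \) can be explored as follows; the specific values are arbitrary examples rather than the paper's tuned settings, and the feature tensors are assumed to come from the setup snippet above.

```python
# Illustrative lambda sweep with the gl_mcm_score sketch defined earlier; local_feats
# is assumed to be available, and the lambda values here are arbitrary examples.
for lam in (0.25, 0.5, 1.0, 2.0):
    s = gl_mcm_score(global_feat, local_feats, text_feats, lam=lam)
    print(f"lambda={lam:.2f}  GL-MCM score={s:.4f}")   # higher score => more ID-like
```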
c) Conclusion and Significance
Scientific Value
GL-MCM provides a simple yet effective method to address the limitations of traditional zero-shot out-of-distribution detection methods in multi-object scenarios. It not only improves detection performance but also demonstrates high flexibility, adapting to various application scenarios.
Application Value
The high scalability of GL-MCM allows it to be easily integrated into existing few-shot learning frameworks, further enhancing performance. Additionally, because GL-MCM requires no extra training, it keeps the cost of practical deployment low.
d) Research Highlights
Innovative Method
GL-MCM is the first method to introduce local features into zero-shot out-of-distribution detection, addressing the shortcomings of traditional methods.
Flexibility
By adjusting the \( \lambda \) parameter, users can choose an appropriate detection strategy for their specific needs.
Efficiency
GL-MCM outperforms existing methods in inference speed and GPU memory consumption.
e) Other Valuable Information
The authors also explored the integration of GL-MCM with other localization methods (such as SAN and Grounding DINO), further validating its versatility and efficiency.
Summary
GL-MCM is an innovative and practical zero-shot out-of-distribution detection method that significantly improves detection performance and flexibility by combining global and local features. Its research outcomes not only advance the field of computer vision but also provide crucial technical support for practical applications.