Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight

Research Highlights: Low-Light Activity Recognition Based on Pseudo-Supervision and Adaptive Audio-Visual Fusion

Academic Context

This paper investigates the challenge of recognizing activities under low-light conditions. While existing activity recognition methods perform well in well-lit environments, they often fail on low-light videos. This limitation has two main causes: the scarcity of labeled low-light training data, and the reduced color and contrast of low-light footage, which causes a loss of visual information. Moreover, conventional solutions based on video image enhancement improve visual quality to some extent, but often introduce artifacts such as color distortion and temporal discontinuity between frames, which in turn hurt activity recognition.

Low-light activity recognition is crucial in many application domains, including smart homes, autonomous driving, security monitoring, and wildlife observation. This study therefore proposes a novel method that substantially improves recognition performance in low-light environments by combining pseudo-supervised learning with adaptive audio-visual fusion.

Research Origin

This study was conducted by Yunhua Zhang and Cees G. M. Snoek from the University of Amsterdam and Hazel Doughty from Leiden University. The paper was published in the International Journal of Computer Vision in 2024.

Research Workflow and Methodology

Overview of the Proposed Method

This paper introduces a framework called “Day2Dark” to address the challenges of low-light activity recognition through two major innovations:

  1. Pseudo-Supervised Learning Strategy: Utilizes widely available unlabeled low-light video data to compensate for the lack of labeled data.
  2. Adaptive Audio-Visual Fusion Recognizer: Dynamically adjusts the weights of visual and audio features based on the video’s illumination conditions, enabling more effective fusion of the two modalities.

Research Workflow

1. Pseudo-Supervised Learning in Day2Dark

  • Stage One: Pseudo-Supervised Learning
    The study employs multiple off-the-shelf self-supervised models (e.g., video-text matching and sound-source localization) to generate pseudo-labels for unlabeled low-light videos. These labels are then compressed into more abstract representations by an autoencoder, which reduces the risk of overfitting during training. A minimal sketch of this stage follows the list below.

  • Stage Two: Day2Dark Mix Fine-Tuning
    The proposed Day2Dark-Mix strategy combines labeled daylight videos with unlabeled low-light videos to generate new training samples, letting the model adapt to the low-light distribution while maintaining recognition performance under daylight. The second sketch below illustrates this idea.
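
To make Stage One concrete, here is a minimal PyTorch sketch of the idea: several frozen self-supervised “teacher” models each embed an unlabeled dark clip, and an autoencoder compresses the concatenated embeddings into a compact pseudo-label. The module names, dimensions, and teacher interface below are illustrative assumptions, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class PseudoLabelAutoencoder(nn.Module):
    """Compresses concatenated self-supervised teacher outputs into a
    compact pseudo-label (the bottleneck), reducing the risk of
    overfitting to any single teacher. Dimensions are illustrative."""

    def __init__(self, in_dim=1536, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 512), nn.ReLU(),
            nn.Linear(512, in_dim))

    def forward(self, x):
        z = self.encoder(x)           # compact pseudo-label
        return z, self.decoder(z)     # reconstruction, used for training


@torch.no_grad()
def make_pseudo_labels(dark_clips, teachers, autoencoder):
    """teachers: frozen self-supervised models (e.g. video-text matching,
    sound-source localization), each mapping a batch of clips to vectors."""
    feats = torch.cat([t(dark_clips) for t in teachers], dim=-1)
    pseudo_labels, _ = autoencoder(feats)
    return pseudo_labels
```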
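
And a minimal sketch of Stage Two, assuming a mixup-style pixel-space blend between a labeled daylight clip and an unlabeled low-light clip; the Beta-sampled mixing weight and the decision to keep the daylight label are assumptions for illustration:

```python
import torch

def day2dark_mix(day_clip, day_label, dark_clip, alpha=0.5):
    """Blend a labeled daylight clip with an unlabeled low-light clip so
    the mixed sample lies between the two appearance distributions."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_clip = lam * day_clip + (1.0 - lam) * dark_clip  # pixel-space blend
    # The mixed clip keeps the daylight activity label; purely dark clips
    # are supervised by their pseudo-labels instead.
    return mixed_clip, day_label
```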

2. Adaptive Audio-Visual Fusion Model

  • Visual and Audio Feature Extraction
    Visual features are extracted with a pre-trained visual encoder, while audio features are extracted with a separate single-modality audio encoder.

  • Illumination-Adaptive Module
    A “Darkness Probe” estimates how degraded the visual features of the video are and assigns weights to parallel branches for adaptive adjustment. These weights steer both the visual feature projection layers and the prompt generation used for audio-visual fusion; a sketch follows this list.

  • Audio-Visual Fusion and Classification
    An audio-visual transformer fuses the adjusted visual features, adaptive prompts, and audio features, yielding robust activity recognition (see the fusion sketch below).
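
To illustrate the illumination-adaptive idea, here is a minimal sketch of what a darkness probe and weighted projection branches could look like: a small network scores the clip-level visual features and produces soft weights over parallel projections. The architecture, feature dimension, and branch count are assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class DarknessProbe(nn.Module):
    """Scores how informative the clip's visual features are and converts
    the score into soft weights over parallel branches."""

    def __init__(self, feat_dim=768, num_branches=3):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, num_branches))

    def forward(self, visual_feats):                  # (B, T, D)
        pooled = visual_feats.mean(dim=1)             # clip-level summary
        return self.scorer(pooled).softmax(dim=-1)    # (B, num_branches)


class AdaptiveProjection(nn.Module):
    """Parallel projection layers, each loosely specialized for a different
    illumination level, blended with the probe's weights."""

    def __init__(self, feat_dim=768, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(feat_dim, feat_dim) for _ in range(num_branches))

    def forward(self, visual_feats, weights):         # weights: (B, K)
        outs = torch.stack([b(visual_feats) for b in self.branches], dim=1)
        return (weights[:, :, None, None] * outs).sum(dim=1)  # (B, T, D)
```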
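
And a minimal sketch of the fusion step, assuming a standard transformer encoder over the concatenated token sequence with a [CLS] token for classification; the layer counts, dimensions, and [CLS] design are again illustrative assumptions:

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Concatenates a classification token, adaptive prompt tokens,
    adjusted visual tokens, and audio tokens, fuses them with a
    transformer encoder, and classifies from the [CLS] position."""

    def __init__(self, dim=768, num_classes=100, layers=4, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, visual_tokens, prompt_tokens, audio_tokens):
        batch = visual_tokens.size(0)
        cls = self.cls_token.expand(batch, -1, -1)
        tokens = torch.cat(
            [cls, prompt_tokens, visual_tokens, audio_tokens], dim=1)
        fused = self.encoder(tokens)
        return self.head(fused[:, 0])   # activity logits from [CLS]
```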

Results

Datasets and Experimental Setup

Experiments were conducted on multiple public datasets, including EPIC-Kitchens, Kinetics-Sound, and Charades, which together cover multi-modal videos across diverse scenes and lighting conditions.

Performance Evaluation

The proposed method significantly outperforms existing techniques in low-light activity recognition:

  1. Comparison with Traditional Methods

    • On EPIC-Kitchens, Day2Dark improved recognition accuracy on low-light videos by approximately 7% over baseline models, significantly outperforming image-enhancement and conventional audio-visual fusion methods.
    • On Kinetics-Sound, the method achieved a 5.2% accuracy improvement under low-light conditions.

  2. Adaptiveness Validation

    • The illumination-adaptive module dynamically adjusts the model’s branch weights to varying lighting conditions, enabling effective recognition even in extremely dark environments.

  3. Robustness Validation

    • The method improves not only low-light performance but also robustness under other conditions, such as occlusion and daylight scenes.

Research Significance and Innovations

  1. Scientific Value
    This paper is the first to bring pseudo-supervised learning and illumination-adaptive audio-visual fusion to low-light activity recognition, offering a solution that does not rely on labeled low-light data.
  2. Practical Application Potential
    The method can be applied in multiple domains, such as smart surveillance and autonomous driving, especially in scenarios where collecting large-scale labeled data is challenging.
  3. Technical Innovations
    • Proposed the Day2Dark-Mix strategy, which effectively combines labeled daylight videos and unlabeled low-light videos, enhancing the model’s adaptability.
    • The illumination-adaptive module significantly reduces the impact of visual distribution shifts on model performance.

Future Work

The authors suggest exploring additional task-relevant self-supervised objectives to further improve pseudo-label generation. Moreover, the illumination-adaptive module could be extended to handle distribution shifts in other video conditions, such as weather changes or motion blur.

Conclusion

Through a rigorous research workflow and innovative technical design, this paper provides a novel solution for low-light activity recognition. By combining pseudo-supervised learning with adaptive audio-visual fusion, the Day2Dark method demonstrates its practicality and superiority across multiple datasets, opening new research directions for computer vision under low-light conditions.