A Memory-Assisted Knowledge Transferring Framework with Curriculum Anticipation for Weakly Supervised Online Activity Detection

Research Background and Significance

Weakly supervised online activity detection (WS-OAD) has recently garnered widespread attention as an important topic in high-level video understanding. Its goal is to detect ongoing activities frame by frame in streaming videos using only inexpensive video-level annotations. The task is valuable in many practical applications, including autonomous driving, public surveillance, robotic navigation, and augmented reality.

Although fully supervised methods have made remarkable progress in online activity detection (OAD), their heavy reliance on dense frame-level annotations is costly and prone to labeling noise, limiting scalability. The weakly supervised paradigm addresses this issue, but the combination of the online constraint (no access to future frames) and sparse supervision signals still hinders current methods in both classifying ongoing activities and identifying their start points. How to effectively leverage offline knowledge to enhance the online model is therefore the central problem tackled in this study.

To address these issues, the paper titled “A Memory-Assisted Knowledge Transferring Framework with Curriculum Anticipation for Weakly Supervised Online Activity Detection” proposes a memory-assisted knowledge distillation framework that integrates a curriculum learning strategy for gradually anticipating future semantics, thereby improving online activity detection under weak supervision.


Source of the Paper and Author Background

This paper was jointly authored by Tianshan Liu and Bing-Kun Bao from Nanjing University of Posts and Telecommunications, Kin-Man Lam from The Hong Kong Polytechnic University, and researchers from Peng Cheng Laboratory in Shenzhen. It was published in the International Journal of Computer Vision (DOI: 10.1007/s11263-024-02279-1). The manuscript was received on July 19, 2023, and officially accepted on October 10, 2024.


Research Methodology and Technical Framework

Overall Framework Design

The proposed model is based on a teacher-student architecture:
  1. Teacher Model: operates offline, learning complete contextual information from entire video sequences and storing activity prototypes in an external memory bank.
  2. Student Model: operates online, using only current and historical observations for frame-by-frame predictions, and gradually learning to anticipate unseen future semantics through a curriculum learning strategy.

Framework Features:
  • Memory Assistance: an external memory bank stores long-term activity prototypes learned by the offline model, bridging the gap between the offline and online models.
  • Curriculum Learning: dynamically adjusts the proportion of future states provided to the student model, training it progressively from “easy to hard” to handle the absence of future information.
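To make the offline/online split concrete, the toy PyTorch sketch below contrasts a full-context (bidirectional) teacher with a causal (unidirectional) student. The modules are illustrative stand-ins, not the authors' architecture; the memory bank and curriculum components are covered in the subsections that follow rather than here.

```python
import torch
import torch.nn as nn

D, C = 256, 5   # toy feature dimension and number of activity classes

class ToyTeacher(nn.Module):
    """Stand-in for the offline teacher: a bidirectional RNN that sees the whole video."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D, D, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * D, C)

    def forward(self, x):                       # x: (B, T, D), the full untrimmed video
        h, _ = self.rnn(x)
        return h, self.cls(h)                   # per-frame features and activity scores

class ToyStudent(nn.Module):
    """Stand-in for the online student: a unidirectional RNN, so the prediction at
    frame t depends only on frames 1..t (the online constraint)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D, 2 * D, batch_first=True)
        self.cls = nn.Linear(2 * D, C)

    def forward(self, x):
        h, _ = self.rnn(x)
        return h, self.cls(h)

video = torch.randn(2, 64, D)                   # a batch of two 64-frame clips
teacher, student = ToyTeacher(), ToyStudent()
with torch.no_grad():                           # the teacher is trained offline first
    t_feat, t_score = teacher(video)            # full-context targets for distillation
s_feat, s_score = student(video)                # causal, frame-by-frame predictions
print(t_score.shape, s_score.shape)             # both torch.Size([2, 64, 5])
```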


Technical Details

1. Memory-Assisted Teacher-Student Architecture

The Teacher Model generates activity prediction scores by processing full video sequences and stores long-term activity semantics in a memory bank. The prototypes in the memory bank are associated with input frames via a cosine similarity mechanism, providing contextual information for the student model.
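As an illustration of how such a cosine-similarity read could be implemented, here is a minimal PyTorch sketch; the function name, slot count, and softmax temperature are assumptions for the example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def memory_read(frame_feats: torch.Tensor, memory: torch.Tensor, tau: float = 0.1):
    """Retrieve memory context for each frame by cosine-similarity addressing.

    frame_feats: (T, D) features of the observed frames
    memory:      (K, D) long-term activity prototypes
    Returns the (T, D) retrieved context and the (T, K) addressing weights.
    """
    q = F.normalize(frame_feats, dim=-1)          # unit-norm frame queries
    m = F.normalize(memory, dim=-1)               # unit-norm prototypes
    sim = q @ m.t()                               # (T, K) cosine similarities
    weights = F.softmax(sim / tau, dim=-1)        # soft addressing over slots
    context = weights @ memory                    # (T, D) weighted prototype mix
    return context, weights

frames = torch.randn(8, 256)      # 8 observed frames, 256-d features
memory = torch.randn(20, 256)     # 20 stored activity prototypes
ctx, w = memory_read(frames, memory)
print(ctx.shape, w.shape)         # torch.Size([8, 256]) torch.Size([8, 20])
```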

The Student Model, constrained to current and historical observations, gradually learns to anticipate future semantics through the curriculum learning strategy: actual future frames are introduced at first and are progressively replaced with learnable queries, so that the student ultimately makes accurate predictions at each time step without any future information.
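A minimal sketch of this replacement mechanism follows, assuming the future context is a fixed-length window of frame features and that the nearest future frames are the ones kept as the ratio shrinks; both choices are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FutureCurriculumMixer(nn.Module):
    """Replace a fraction of the future frames seen during training with learnable queries.

    Early in training `keep_ratio` is close to 1 (real future frames are visible);
    it is annealed toward 0 so that, at inference time, the student relies entirely
    on the learnable queries in place of the unseen future.
    """
    def __init__(self, horizon: int, dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, dim) * 0.02)

    def forward(self, future_feats: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        # future_feats: (horizon, D) ground-truth future features (training only)
        horizon = future_feats.size(0)
        n_keep = int(round(keep_ratio * horizon))
        mask = torch.zeros(horizon, 1)
        mask[:n_keep] = 1.0                      # keep the nearest future frames first (assumption)
        return mask * future_feats + (1.0 - mask) * self.queries

mixer = FutureCurriculumMixer(horizon=4, dim=256)
future = torch.randn(4, 256)
mixed_early = mixer(future, keep_ratio=1.0)   # start of training: real future frames
mixed_late = mixer(future, keep_ratio=0.0)    # end of training: learnable queries only
```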


2. Curriculum Learning Strategy

The paper adopts Dynamic Curriculum Learning, which adjusts curriculum difficulty according to the quality of the student's future-context predictions. Initially, the student model is trained with ample future semantic assistance; the proportion of real future information is then gradually reduced and replaced with learnable queries to strengthen the model's inference capability. This adaptive scheduling helps prevent the accumulation of prediction errors.
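The rule below is a hedged sketch of such a quality-driven schedule, shown next to the fixed linear and exponential schedules the paper compares against; the quality measure, threshold, and step size are illustrative placeholders rather than the authors' settings.

```python
def update_keep_ratio(keep_ratio: float,
                      anticipation_quality: float,
                      target_quality: float = 0.7,
                      step: float = 0.05) -> float:
    """Dynamically anneal the fraction of real future frames shown to the student.

    `anticipation_quality` is any bounded score of how well the student currently
    anticipates future semantics (e.g., agreement with the teacher's scores on the
    future frames). When the student is doing well enough, the curriculum gets
    harder (less real future); otherwise the current difficulty is kept.
    """
    if anticipation_quality >= target_quality:
        keep_ratio = max(0.0, keep_ratio - step)   # make the task harder
    return keep_ratio

# Fixed alternatives mentioned in the text, for contrast:
def linear_schedule(epoch: int, total_epochs: int) -> float:
    return max(0.0, 1.0 - epoch / total_epochs)

def exponential_schedule(epoch: int, decay: float = 0.9) -> float:
    return decay ** epoch
```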


3. Knowledge Distillation Mechanism

The paper employs Dual-Level Knowledge Distillation:
  • Representation-Level Distillation: emphasizes frame boundaries so that the student model more accurately mimics the local features learned by the teacher model.
  • Prediction-Level Distillation: guides the student model with frame-level pseudo-labels generated by the teacher model, providing finer-grained supervision.
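The two losses could look roughly like the PyTorch sketch below: a cosine feature-matching loss with optional per-frame weights (to emphasize selected frames such as boundaries) and a standard temperature-scaled KL divergence on the teacher's soft frame-level pseudo-labels. The exact loss forms and weights in the paper may differ.

```python
import torch
import torch.nn.functional as F

def representation_distill(student_feats, teacher_feats, frame_weights=None):
    """Feature-level distillation: pull student frame features toward the teacher's.

    An optional per-frame weight can emphasize selected frames (e.g., those near
    activity boundaries); uniform weighting is used if omitted.
    """
    diff = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)  # (T,)
    if frame_weights is not None:
        diff = diff * frame_weights
    return diff.mean()

def prediction_distill(student_logits, teacher_logits, temperature=2.0):
    """Prediction-level distillation: match the student's frame-level class
    distribution to the teacher's soft pseudo-labels (standard KD with KL divergence)."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

T, D, C = 16, 256, 5
s_feat, t_feat = torch.randn(T, D), torch.randn(T, D)
s_logit, t_logit = torch.randn(T, C), torch.randn(T, C)
loss = representation_distill(s_feat, t_feat) + prediction_distill(s_logit, t_logit)
```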


Innovations and Highlights

  1. Introduction of Memory Bank: Stores long-term activity prototypes to bridge the teacher and student models while supporting stable predictions during inference.
  2. Enhanced Curriculum Learning: Gradually adapts the student model to predict without future information.
  3. Dual-Level Distillation Strategy: Transfers both representation-level and prediction-level knowledge to improve detection capabilities.

Experimental Design and Results

Datasets and Evaluation Metrics

Experiments were conducted on three public datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3, with evaluation metrics including Mean Frame-wise Average Precision (F-AP) and Point-wise Average Precision (P-AP).
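For reference, F-AP is typically computed by treating every frame as an independent sample, computing average precision per class, and averaging over classes. The sketch below follows that common recipe; details such as background handling may differ from the paper's evaluation protocol.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def frame_level_map(scores: np.ndarray, labels: np.ndarray) -> float:
    """Mean frame-wise average precision (F-AP), as commonly computed for OAD:
    per-class AP over all test frames, averaged over classes that occur.

    scores: (N_frames, C) predicted per-frame class scores over the test set
    labels: (N_frames, C) binary per-frame ground truth
    """
    aps = []
    for c in range(labels.shape[1]):
        if labels[:, c].sum() == 0:       # skip classes with no positive frames
            continue
        aps.append(average_precision_score(labels[:, c], scores[:, c]))
    return float(np.mean(aps))

# Toy example: 100 frames, 3 classes
rng = np.random.default_rng(0)
labels = (rng.random((100, 3)) > 0.8).astype(int)
scores = labels * 0.6 + rng.random((100, 3)) * 0.4
print(f"F-AP = {frame_level_map(scores, labels):.3f}")
```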


Results Analysis

1. Overall Performance Comparison
  • On the THUMOS14 dataset, the proposed method achieved 55.6% F-AP under weak supervision, outperforming all baselines. On ActivityNet1.2, it achieved 68.3% F-AP, showing significant improvements.
  • For Activity-Start Detection, the proposed method consistently outperformed existing methods across all evaluated time-difference thresholds, achieving at least a 0.6% improvement in low-threshold scenarios (e.g., 1 second).
2. Role of the Memory Bank

Ablation experiments demonstrated the critical role of the memory bank in storing and recalling activity semantics. Constraining the memory with sparsity regularization further improved detection performance by reducing background interference.

3. Effectiveness of Curriculum Learning

Compared with fixed curriculum strategies (e.g., linear or exponential scheduling), the dynamic curriculum learning strategy yielded better performance, highlighting the importance of adjusting curriculum difficulty based on the quality of future context predictions.

4. Contribution of Dual-Level Knowledge Distillation

Representation-level and prediction-level distillation contributed 2.1% and 10.7% improvements in performance, respectively. Their combination achieved the best detection results.


Visualization and Intuitive Analysis

  • Detection Results Visualization: The proposed method accurately captured activity boundaries and maintained high detection confidence in complex scenarios.
  • Feature Representation Visualization: t-SNE projections showed that memory-augmented features were more compact within each class and retained inter-class relationships.

Conclusions and Future Directions

This paper introduces a memory-assisted knowledge distillation framework enhanced with curriculum learning for weakly supervised online activity detection. Its dynamic curriculum learning strategy and dual-level distillation mechanism provide new insights for the field. Future research could explore extending this framework to more real-world applications and further optimizing computational efficiency.