A Memory-Assisted Knowledge Transferring Framework with Curriculum Anticipation for Weakly Supervised Online Activity Detection

Research Background and Significance

Weakly supervised online activity detection (WS-OAD) has recently garnered widespread attention as an important topic in high-level video understanding. Its goal is to detect ongoing activities frame by frame in streaming videos using only inexpensive video-level annotations. The task is valuable in many practical applications, including autonomous driving, public surveillance, robotic navigation, and augmented reality.

Although fully supervised methods have made remarkable progress in online activity detection (OAD), their heavy reliance on dense frame-level annotations is costly and prone to labeling noise, limiting scalability. The weakly supervised paradigm addresses this issue, but the combination of the online constraint (no access to future frames) and sparse supervision signals still hinders current methods in both classifying ongoing activities and identifying their start points. How to effectively leverage offline knowledge to enhance the online model is therefore the central problem tackled in this study.

To address these issues, the paper titled “A Memory-Assisted Knowledge Transferring Framework with Curriculum Anticipation for Weakly Supervised Online Activity Detection” proposes a memory-assisted knowledge distillation framework that integrates a curriculum learning strategy for gradually anticipating future semantics, thereby improving online activity detection under weak supervision.


Source of the Paper and Author Background

This paper was jointly authored by Tianshan Liu and Bing-Kun Bao from Nanjing University of Posts and Telecommunications, Kin-Man Lam from The Hong Kong Polytechnic University, and researchers from Peng Cheng Laboratory in Shenzhen. It was published in the International Journal of Computer Vision (DOI: 10.1007/s11263-024-02279-1). The manuscript was received on July 19, 2023, and officially accepted on October 10, 2024.


Research Methodology and Technical Framework

Overall Framework Design

The proposed model is based on a teacher-student architecture:
  1. Teacher Model: operates offline, learning complete contextual information from entire video sequences and storing activity prototypes in an external memory bank.
  2. Student Model: operates online, using only current and historical observations for frame-by-frame predictions, and gradually learning to anticipate unseen future semantics through a curriculum learning strategy.

Framework Features:
  • Memory Assistance: an external memory bank stores long-term activity prototypes learned by the offline model, bridging the gap between the offline and online models.
  • Curriculum Learning: dynamically adjusts the proportion of future states provided to the student model, training it progressively from “easy to hard” to handle the absence of future information.
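To make the offline/online split concrete, the toy PyTorch sketch below contrasts a full-context (bidirectional) teacher with a causal (unidirectional) student. The modules are illustrative stand-ins, not the authors' architecture; the memory bank and curriculum components are covered in the subsections that follow rather than here.

```python
import torch
import torch.nn as nn

D, C = 256, 5   # toy feature dimension and number of activity classes

class ToyTeacher(nn.Module):
    """Stand-in for the offline teacher: a bidirectional RNN that sees the whole video."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D, D, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * D, C)

    def forward(self, x):                       # x: (B, T, D), the full untrimmed video
        h, _ = self.rnn(x)
        return h, self.cls(h)                   # per-frame features and activity scores

class ToyStudent(nn.Module):
    """Stand-in for the online student: a unidirectional RNN, so the prediction at
    frame t depends only on frames 1..t (the online constraint)."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D, 2 * D, batch_first=True)
        self.cls = nn.Linear(2 * D, C)

    def forward(self, x):
        h, _ = self.rnn(x)
        return h, self.cls(h)

video = torch.randn(2, 64, D)                   # a batch of two 64-frame clips
teacher, student = ToyTeacher(), ToyStudent()
with torch.no_grad():                           # the teacher is trained offline first
    t_feat, t_score = teacher(video)            # full-context targets for distillation
s_feat, s_score = student(video)                # causal, frame-by-frame predictions
print(t_score.shape, s_score.shape)             # both torch.Size([2, 64, 5])
```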


Technical Details

1. Memory-Assisted Teacher-Student Architecture

The Teacher Model generates activity prediction scores by processing full video sequences and stores long-term activity semantics in a memory bank. The prototypes in the memory bank are associated with input frames via a cosine similarity mechanism, providing contextual information for the student model.
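As an illustration of how such a cosine-similarity read could be implemented, here is a minimal PyTorch sketch; the function name, slot count, and softmax temperature are assumptions for the example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def memory_read(frame_feats: torch.Tensor, memory: torch.Tensor, tau: float = 0.1):
    """Retrieve memory context for each frame by cosine-similarity addressing.

    frame_feats: (T, D) features of the observed frames
    memory:      (K, D) long-term activity prototypes
    Returns the (T, D) retrieved context and the (T, K) addressing weights.
    """
    q = F.normalize(frame_feats, dim=-1)          # unit-norm frame queries
    m = F.normalize(memory, dim=-1)               # unit-norm prototypes
    sim = q @ m.t()                               # (T, K) cosine similarities
    weights = F.softmax(sim / tau, dim=-1)        # soft addressing over slots
    context = weights @ memory                    # (T, D) weighted prototype mix
    return context, weights

frames = torch.randn(8, 256)      # 8 observed frames, 256-d features
memory = torch.randn(20, 256)     # 20 stored activity prototypes
ctx, w = memory_read(frames, memory)
print(ctx.shape, w.shape)         # torch.Size([8, 256]) torch.Size([8, 20])
```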

The Student Model, constrained to current and historical observations, gradually learns to anticipate future semantics through the curriculum learning strategy: actual future frames are introduced at first and are progressively replaced with learnable queries, so that the student ultimately makes accurate predictions at each time step without any future information.
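A minimal sketch of this replacement mechanism follows, assuming the future context is a fixed-length window of frame features and that the nearest future frames are the ones kept as the ratio shrinks; both choices are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FutureCurriculumMixer(nn.Module):
    """Replace a fraction of the future frames seen during training with learnable queries.

    Early in training `keep_ratio` is close to 1 (real future frames are visible);
    it is annealed toward 0 so that, at inference time, the student relies entirely
    on the learnable queries in place of the unseen future.
    """
    def __init__(self, horizon: int, dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, dim) * 0.02)

    def forward(self, future_feats: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        # future_feats: (horizon, D) ground-truth future features (training only)
        horizon = future_feats.size(0)
        n_keep = int(round(keep_ratio * horizon))
        mask = torch.zeros(horizon, 1)
        mask[:n_keep] = 1.0                      # keep the nearest future frames first (assumption)
        return mask * future_feats + (1.0 - mask) * self.queries

mixer = FutureCurriculumMixer(horizon=4, dim=256)
future = torch.randn(4, 256)
mixed_early = mixer(future, keep_ratio=1.0)   # start of training: real future frames
mixed_late = mixer(future, keep_ratio=0.0)    # end of training: learnable queries only
```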


2. Curriculum Learning Strategy

The paper adopts Dynamic Curriculum Learning, which adjusts curriculum difficulty according to the quality of the student's future-context predictions. Initially, the student model is trained with ample future semantic assistance; the proportion of real future information is then gradually reduced and replaced with learnable queries to strengthen the model's inference capability. This adaptive scheduling helps prevent the accumulation of prediction errors.
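The rule below is a hedged sketch of such a quality-driven schedule, shown next to the fixed linear and exponential schedules the paper compares against; the quality measure, threshold, and step size are illustrative placeholders rather than the authors' settings.

```python
def update_keep_ratio(keep_ratio: float,
                      anticipation_quality: float,
                      target_quality: float = 0.7,
                      step: float = 0.05) -> float:
    """Dynamically anneal the fraction of real future frames shown to the student.

    `anticipation_quality` is any bounded score of how well the student currently
    anticipates future semantics (e.g., agreement with the teacher's scores on the
    future frames). When the student is doing well enough, the curriculum gets
    harder (less real future); otherwise the current difficulty is kept.
    """
    if anticipation_quality >= target_quality:
        keep_ratio = max(0.0, keep_ratio - step)   # make the task harder
    return keep_ratio

# Fixed alternatives mentioned in the text, for contrast:
def linear_schedule(epoch: int, total_epochs: int) -> float:
    return max(0.0, 1.0 - epoch / total_epochs)

def exponential_schedule(epoch: int, decay: float = 0.9) -> float:
    return decay ** epoch
```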


3. Knowledge Distillation Mechanism

The paper employs Dual-Level Knowledge Distillation:
  • Representation-Level Distillation: emphasizes frame boundaries so that the student model more accurately mimics the local features learned by the teacher model.
  • Prediction-Level Distillation: guides the student model with frame-level pseudo-labels generated by the teacher model, providing finer-grained supervision.
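The two losses could look roughly like the PyTorch sketch below: a cosine feature-matching loss with optional per-frame weights (to emphasize selected frames such as boundaries) and a standard temperature-scaled KL divergence on the teacher's soft frame-level pseudo-labels. The exact loss forms and weights in the paper may differ.

```python
import torch
import torch.nn.functional as F

def representation_distill(student_feats, teacher_feats, frame_weights=None):
    """Feature-level distillation: pull student frame features toward the teacher's.

    An optional per-frame weight can emphasize selected frames (e.g., those near
    activity boundaries); uniform weighting is used if omitted.
    """
    diff = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)  # (T,)
    if frame_weights is not None:
        diff = diff * frame_weights
    return diff.mean()

def prediction_distill(student_logits, teacher_logits, temperature=2.0):
    """Prediction-level distillation: match the student's frame-level class
    distribution to the teacher's soft pseudo-labels (standard KD with KL divergence)."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

T, D, C = 16, 256, 5
s_feat, t_feat = torch.randn(T, D), torch.randn(T, D)
s_logit, t_logit = torch.randn(T, C), torch.randn(T, C)
loss = representation_distill(s_feat, t_feat) + prediction_distill(s_logit, t_logit)
```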


Innovations and Highlights

  1. Introduction of Memory Bank: Stores long-term activity prototypes to bridge the teacher and student models while supporting stable predictions during inference.
  2. Enhanced Curriculum Learning: Gradually adapts the student model to predict without future information.
  3. Dual-Level Distillation Strategy: Transfers both representation-level and prediction-level knowledge to improve detection capabilities.

Experimental Design and Results

Datasets and Evaluation Metrics

Experiments were conducted on three public datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3, with evaluation metrics including Mean Frame-wise Average Precision (F-AP) and Point-wise Average Precision (P-AP).
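For reference, F-AP is typically computed by treating every frame as an independent sample, computing average precision per class, and averaging over classes. The sketch below follows that common recipe; details such as background handling may differ from the paper's evaluation protocol.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def frame_level_map(scores: np.ndarray, labels: np.ndarray) -> float:
    """Mean frame-wise average precision (F-AP), as commonly computed for OAD:
    per-class AP over all test frames, averaged over classes that occur.

    scores: (N_frames, C) predicted per-frame class scores over the test set
    labels: (N_frames, C) binary per-frame ground truth
    """
    aps = []
    for c in range(labels.shape[1]):
        if labels[:, c].sum() == 0:       # skip classes with no positive frames
            continue
        aps.append(average_precision_score(labels[:, c], scores[:, c]))
    return float(np.mean(aps))

# Toy example: 100 frames, 3 classes
rng = np.random.default_rng(0)
labels = (rng.random((100, 3)) > 0.8).astype(int)
scores = labels * 0.6 + rng.random((100, 3)) * 0.4
print(f"F-AP = {frame_level_map(scores, labels):.3f}")
```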


Results Analysis

1. Overall Performance Comparison
  • On the THUMOS14 dataset, the proposed method achieved 55.6% F-AP under weak supervision, outperforming all baselines. On ActivityNet1.2, it achieved 68.3% F-AP, showing significant improvements.
  • For Activity-Start Detection, the proposed method consistently outperformed existing methods across all evaluated time-difference thresholds, achieving at least a 0.6% improvement in low-threshold scenarios (e.g., 1 second).
2. Role of the Memory Bank

Ablation experiments demonstrated the critical role of the memory bank in storing and recalling activity semantics. Constraining the memory with sparsity regularization further improved detection performance by reducing background interference.

3. Effectiveness of Curriculum Learning

Compared with fixed curriculum strategies (e.g., linear or exponential scheduling), the dynamic curriculum learning strategy yielded better performance, highlighting the importance of adjusting curriculum difficulty based on the quality of future context predictions.

4. Contribution of Dual-Level Knowledge Distillation

Representation-level and prediction-level distillation contributed 2.1% and 10.7% improvements in performance, respectively. Their combination achieved the best detection results.


Visualization and Intuitive Analysis

  • Detection Results Visualization: The proposed method accurately captured activity boundaries and maintained high detection confidence in complex scenarios.
  • Feature Representation Visualization: t-SNE projections showed that memory-augmented features were more compact within each class and retained inter-class relationships.

Conclusions and Future Directions

This paper introduces a memory-assisted knowledge distillation framework enhanced with curriculum learning for weakly supervised online activity detection. Its dynamic curriculum learning strategy and dual-level distillation mechanism provide new insights for the field. Future research could explore extending this framework to more real-world applications and further optimizing computational efficiency.