Continual Learning of Conjugated Visual Representations through Higher-Order Motion Flows: A Study on the CMOSFET Model
Academic Background
In the fields of artificial intelligence and computer vision, continual learning from continuous visual data streams has long been a challenge. Traditional machine learning methods typically rely on the assumption of independent and identically distributed (i.i.d.) data, in which the entire training set is available in advance and remains static during training. However, visual data in the real world arrives as a continuous, non-i.i.d. stream, which poses significant difficulties for model training. Moreover, most existing unsupervised learning methods depend on large-scale offline training datasets, a setting fundamentally different from the way humans and animals learn through continuous interaction with their environment.
To address these issues, researchers Simone Marullo, Matteo Tiezzi, Marco Gori, and Stefano Melacci proposed a novel unsupervised continual learning model called CMOSFET (Continual Motion-based Self-supervised Feature Extractor). The core idea of the model is to guide feature extraction with motion information, enabling online learning from a single video stream. Motion plays a crucial role here: early studies in psychology (e.g., the Gestalt principles) identified motion as a fundamental cue for visual perception. CMOSFET therefore estimates motion flows at multiple levels, from traditional optical flow up to higher-order flows defined over more abstract features, and uses them to guide feature extraction, thereby achieving continual learning of visual representations.
Source of the Paper
This paper was co-authored by Simone Marullo (Department of Information Engineering, University of Florence), Matteo Tiezzi (Italian Institute of Technology), Marco Gori, and Stefano Melacci (Department of Information Engineering and Mathematics, University of Siena). It was published in the journal Neural Networks in 2025. The paper, titled Continual Learning of Conjugated Visual Representations through Higher-Order Motion Flows, explores how to achieve continual learning of visual representations through higher-order motion flows.
Research Process
1. Model Design
The core of the CMOSFET model is a dual-branch neural network architecture, designed for extracting pixel-level features and estimating pixel-level motion flows. The input to the model is a continuous sequence of frames, each with a resolution of W×H. The goal of the model is to progressively extract robust features from the video stream and estimate motion flows at multiple levels of abstraction.
1.1 Multi-Level Feature Flows
The CMOSFET model extracts features and motion flows at multiple levels. The feature extractor at each level produces the feature map f^l_t from the output of the previous level, while the motion-flow estimator at the same level estimates the flow δ^l_t from the features of the current and previous levels. In this way, the model estimates not only traditional low-order optical flow but also higher-order motion flows, which are associated with progressively more abstract features.
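As a concrete illustration, the PyTorch-style sketch below organizes a stack of levels, each with a feature branch (producing f^l_t) and a flow branch (producing δ^l_t). The number of levels, the channel sizes, and the plain convolutional blocks are assumptions made for readability; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Level(nn.Module):
    """One level l: a feature extractor (f^l) and a flow estimator (delta^l), illustrative only."""
    def __init__(self, in_ch, feat_ch):
        super().__init__()
        # Feature branch: maps the previous level's features to this level's features.
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Flow branch: predicts a 2-channel (dx, dy) flow from the features of frames t-1 and t.
        self.flow = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 2, 3, padding=1),
        )

    def forward(self, x_t, x_tm1):
        f_t = self.features(x_t)        # f^l_t
        f_tm1 = self.features(x_tm1)    # f^l_{t-1}
        delta_t = self.flow(torch.cat([f_t, f_tm1], dim=1))  # delta^l_t
        return f_t, f_tm1, delta_t

class MultiLevelModel(nn.Module):
    """Stack of levels: level 1 estimates a low-order (optical-flow-like) motion,
    deeper levels estimate higher-order flows over more abstract features."""
    def __init__(self, channels=(3, 32, 64, 64)):
        super().__init__()
        self.levels = nn.ModuleList(
            [Level(channels[i], channels[i + 1]) for i in range(len(channels) - 1)]
        )

    def forward(self, frame_t, frame_tm1):
        x_t, x_tm1 = frame_t, frame_tm1
        per_level = []
        for level in self.levels:
            x_t, x_tm1, delta = level(x_t, x_tm1)
            per_level.append((x_t, x_tm1, delta))
        return per_level  # [(f^l_t, f^l_{t-1}, delta^l_t) for each level l]
```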
1.2 Conjugation Between Features and Motion
A key innovation of the CMOSFET model lies in the conjugation between features and motion flows. Specifically, the model enforces consistency between features and motion flows through a conjugation loss function (L^l_conj). This loss function consists of three components: (i) consistency between features and motion flows at the current level; (ii) consistency between features at the current level and motion flows at the first level; and (iii) consistency between motion flows at the current level and features at the previous level. This ensures that features and motion flows remain consistent across different levels.
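A common way to turn "consistency between features and a flow" into a differentiable term is feature warping: features at the previous frame, displaced by the estimated flow, should match features at the current frame. The minimal sketch below shows one such building block; the paper's full L^l_conj combines three terms of this kind across levels, and its exact formulation may differ from this illustration.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Warp a feature map (B, C, H, W) with a flow field (B, 2, H, W), flow given in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]   # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]   # displaced y coordinates
    # Normalize coordinates to [-1, 1], as required by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)

def consistency_term(f_t, f_tm1, flow_t):
    """Features at t-1, warped by the flow, should agree with features at t."""
    return F.mse_loss(warp(f_tm1, flow_t), f_t)
```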
2. Self-Supervised Contrastive Learning
To prevent the model from collapsing into trivial solutions (e.g., generating spatially uniform features), CMOSFET introduces a self-supervised contrastive loss function (L^l_self). This loss function determines positive and negative sample pairs based on motion information. Specifically, positive pairs consist of pixels with similar motion patterns, while negative pairs consist of pixels with different motion patterns. This approach enhances the discriminative power of the features through motion information.
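The sketch below shows one plausible instantiation of such a motion-based contrastive term: pixels whose flow vectors are close are treated as positives, all other pairs as negatives, and an InfoNCE-style objective is applied to their normalized features. The distance threshold pos_thr and temperature tau are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def motion_contrastive_loss(feats, flows, tau=0.1, pos_thr=0.5):
    """feats: (N, C) features of N sampled pixels; flows: (N, 2) their motion vectors.
    Pixels with similar motion are treated as positives, all other pairs as negatives."""
    n = feats.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    feats = F.normalize(feats, dim=1)
    sim = (feats @ feats.t()) / tau
    sim = sim.masked_fill(eye, float("-inf"))            # exclude self-similarity
    # Positive mask: pairs of distinct pixels whose flow vectors are close.
    pos = (torch.cdist(flows, flows) < pos_thr) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average the log-probability over each anchor's positive pairs
    # (anchors without positives contribute zero).
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    denom = pos.sum(dim=1).clamp(min=1)
    return -(pos_log_prob.sum(dim=1) / denom).mean()
```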
2.1 Sampling Strategy
Due to the high computational cost of the contrastive loss, CMOSFET employs a sampling strategy based on motion and feature activations. Specifically, the model selects a group of pixels for contrastive learning based on motion information and feature activation levels. This sampling strategy not only reduces computational costs but also ensures that the model focuses on important regions in the video stream.
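A minimal version of such a sampler is sketched below: each pixel receives a score mixing its flow magnitude and feature-activation strength, and pixels are drawn in proportion to that score. The mixing weight alpha and the use of torch.multinomial are illustrative choices rather than the paper's exact procedure; the sampled feature/flow vectors can then be fed to a contrastive term like the one sketched above.

```python
import torch

def sample_pixels(feats, flow, num_samples=256, alpha=0.5):
    """feats: (C, H, W) features; flow: (2, H, W) motion flow.
    Returns flat indices of the sampled pixels and their feature/flow vectors.
    Assumes num_samples <= H * W."""
    motion_score = flow.norm(dim=0).flatten()   # ||delta|| per pixel
    act_score = feats.norm(dim=0).flatten()     # feature-activation strength per pixel
    score = alpha * motion_score / (motion_score.max() + 1e-8) \
        + (1 - alpha) * act_score / (act_score.max() + 1e-8)
    idx = torch.multinomial(score + 1e-8, num_samples, replacement=False)
    c = feats.shape[0]
    return idx, feats.reshape(c, -1)[:, idx].t(), flow.reshape(2, -1)[:, idx].t()
```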
3. Learning Over Time
The CMOSFET model processes each pair of consecutive frames in an online fashion. Temporal stability is obtained by combining a fast learner, whose parameters are updated by gradient descent on each frame pair, with a slow learner, whose parameters are an exponential moving average (EMA) of the fast learner's. This allows the model to keep adapting to the stream while mitigating catastrophic forgetting.
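In code, the slow learner is typically maintained as an exponential moving average of the fast learner's parameters after each online gradient step; the sketch below shows this update, with the momentum value as an illustrative assumption.

```python
import torch

@torch.no_grad()
def ema_update(slow_model, fast_model, momentum=0.999):
    """Slow weights track an exponential moving average of the fast weights."""
    for p_slow, p_fast in zip(slow_model.parameters(), fast_model.parameters()):
        p_slow.mul_(momentum).add_(p_fast, alpha=1.0 - momentum)

# Illustrative per-frame-pair online step: gradient update of the fast learner,
# then an EMA update of the slow learner.
# loss.backward(); optimizer.step(); optimizer.zero_grad(); ema_update(slow, fast)
```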
Main Results
1. Experimental Setup
The CMOSFET model was evaluated on multiple video streams, including synthetic 3D environment videos and real-world videos. The primary goal of the experiments was to assess the model’s feature extraction capabilities through pixel-level classification tasks. Specifically, the model extracted features during the unsupervised learning phase and used these features for classification in the evaluation phase.
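As an illustration of this kind of evaluation, frozen per-pixel features can be assessed with a simple linear readout trained on a small set of labeled pixels; the probe below is an assumption made for concreteness and not necessarily the exact protocol used in the paper.

```python
import torch
import torch.nn as nn

def linear_probe(pixel_feats, pixel_labels, num_classes, epochs=100, lr=1e-2):
    """pixel_feats: (N, C) frozen features of labeled pixels; pixel_labels: (N,) class ids."""
    clf = nn.Linear(pixel_feats.shape[1], num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(clf(pixel_feats), pixel_labels)
        loss.backward()
        opt.step()
    return clf  # apply to every pixel's feature vector to obtain a dense prediction
```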
2. Quantitative Results
The experimental results showed that CMOSFET outperformed existing unsupervised continual learning models on multiple video streams, with particularly clear advantages on real-world videos (e.g., the rat and horse streams). In addition, CMOSFET has a relatively small number of parameters (2.3M), far fewer than its main competitor (17.8M), indicating that it produces more compact yet still discriminative feature representations.
3. Qualitative Results
Through visual analysis, the researchers found that the CMOSFET model accurately estimated motion flows in the videos and generated discriminative feature representations. Particularly in videos with complex backgrounds, CMOSFET effectively separated target objects and performed well in classification tasks.
Conclusions and Significance
The CMOSFET model successfully achieved unsupervised continual learning from a single video stream by introducing multi-level motion flows and self-supervised contrastive learning. The model not only generates discriminative feature representations but also estimates motion flows at multiple levels of abstraction. The experimental results demonstrate that CMOSFET outperforms existing unsupervised continual learning models on multiple video streams and performs well on real-world videos.
Highlights of the Research
- Multi-Level Motion Flows: The CMOSFET model achieves continual learning of visual representations by estimating multi-level motion flows. This innovation allows the model to capture motion information at different levels of abstraction.
- Self-Supervised Contrastive Learning: By introducing a motion-based contrastive loss function, CMOSFET avoids trivial solutions and generates discriminative feature representations.
- Online Learning and Temporal Stability: CMOSFET combines a fast learner and a slow learner to achieve temporal stability in online learning, reducing the issue of catastrophic forgetting.
Future Work
Although the CMOSFET model performs well on multiple video streams, it still has some limitations. For example, the model may struggle with strongly moving backgrounds or static scenes. Future research could explore how to integrate more advanced continual learning strategies to handle longer video streams or more object categories. Additionally, researchers could investigate the application of CMOSFET to other visual tasks, such as object detection and semantic segmentation.
Summary
By combining multi-level motion flows with motion-driven self-supervised contrastive learning, CMOSFET achieves unsupervised continual learning from a single video stream. This research not only provides new insights into continual learning in computer vision but also offers a useful reference for the design of future artificial intelligence systems.