LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models

Academic Background

Tracking dynamic people in cluttered and crowded human-centered environments is a challenging problem in robotics. Intraclass variations such as occlusions, pose deformations, and lighting changes often cause traditional tracking methods to misidentify or lose their targets. Moreover, existing robotic tracking methods typically couple a separate detector with a separate tracker, a design that is inefficient and prone to failure under such variations, particularly when the detector misses a person across consecutive frames.

To address these issues, this paper proposes a novel deep learning architecture based on conditional latent diffusion models, called Latent Diffusion Track (LDTrack). The architecture captures temporal person embeddings that are updated as a person's appearance changes over time, enabling efficient multi-person tracking in complex and crowded environments.
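
To make the mechanism concrete, here is a minimal sketch of how a conditional latent diffusion tracker of this kind could run at inference time. The module names (sfen, itrn), tensor shapes, step count, and return signature are illustrative assumptions, not the authors' released implementation:

```python
# Illustrative sketch only: names, shapes, and signatures are assumptions.
import torch

@torch.no_grad()
def track_frame(sfen, itrn, image, prev_tracks, num_boxes=500, dim=288, steps=4):
    """One frame of diffusion-based tracking.

    sfen: feature extractor (e.g., a CNN backbone plus transformer encoder).
    itrn: refinement network that denoises latent box embeddings conditioned
          on image features and the previous frame's track embeddings.
    """
    feats = sfen(image)                  # person feature embeddings for this frame
    z = torch.randn(1, num_boxes, dim)   # start from pure-noise box embeddings
    for t in reversed(range(steps)):     # reverse diffusion: iterative refinement
        boxes, scores, z = itrn(z, feats, prev_tracks, t)
    return boxes, scores                 # refined person tracks for this frame
```

In this sketch, the previous frame's track embeddings condition every denoising step, which is how temporal appearance information can carry across frames even when a conventional detector would miss the person.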

Paper Source

The paper is co-authored by Angus Fung, Beno Benhabib, and Goldie Nejat from the Autonomous Systems and Biomechatronics Laboratory (ASBLab) at the University of Toronto. It was accepted on December 17, 2024, and published in 2025 in the International Journal of Computer Vision.

Research Process and Results

Research Process

  1. Architecture Design:

    • The LDTrack architecture consists of two subsystems: inference and training. The inference subsystem extracts person feature embeddings from RGB images and generates person trajectories through the Iterative Track Refinement Network (ITRN). The training subsystem converts ground truth bounding boxes into high-dimensional latent space representations using the Latent Feature Encoder Network (LFEN) and generates noisy box embeddings via the Latent Box Diffusion (LBD) module.
  2. Inference Subsystem:

    • Self-Attention Feature Extraction Network (SFEN): Uses ResNet-18 and a Transformer encoder to extract person feature embeddings.
    • Iterative Track Refinement Network (ITRN): Refines noisy box embeddings iteratively through a Transformer decoder to generate person trajectories.
  3. Training Subsystem:

    • Latent Feature Encoder Network (LFEN): Converts ground truth bounding boxes into high-dimensional latent space representations.
    • Latent Box Diffusion (LBD): Generates noisy box embeddings through a Markov-chain-driven forward diffusion process.
    • Iterative Track Refinement Network (ITRN): Recovers person bounding boxes and class predictions from the noisy embeddings through a reverse diffusion process (see the sketch after this list).
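
As a concrete illustration of the training-side components above, the following PyTorch sketch pairs a Markov-chain forward process that noises latent box embeddings (standing in for LBD) with a transformer-decoder denoiser (standing in for ITRN). The 288-dimensional latent space and 500 box embeddings follow the ablation study below; the beta schedule, layer counts, and head sizes are assumptions:

```python
# Minimal sketch under stated assumptions; not the authors' implementation.
import torch
import torch.nn as nn

LATENT_DIM = 288   # latent dimension reported in the ablation study
NUM_BOXES = 500    # number of box embeddings reported in the ablation study
T = 1000           # diffusion timesteps (assumed)

# Forward diffusion (LBD-style): q(z_t | z_0) = N(sqrt(a_bar_t) z_0, (1 - a_bar_t) I)
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule (assumed)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def noise_boxes(z0, t):
    """Noise LFEN-encoded ground-truth box embeddings z0 (B, NUM_BOXES, LATENT_DIM)
    at integer timesteps t (B,) via the Markov-chain forward process."""
    eps = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps, eps

class ToyITRN(nn.Module):
    """Stand-in for the Iterative Track Refinement Network: a transformer
    decoder that refines noisy box embeddings conditioned on image features."""
    def __init__(self, dim=LATENT_DIM, heads=8, layers=6):
        super().__init__()
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)
        self.box_head = nn.Linear(dim, 4)   # bounding-box regression
        self.cls_head = nn.Linear(dim, 2)   # person vs. background

    def forward(self, noisy_boxes, image_feats):
        h = self.decoder(tgt=noisy_boxes, memory=image_feats)
        return self.box_head(h), self.cls_head(h)
```

During training, noise_boxes corrupts the LFEN outputs and the denoiser is supervised to recover the original boxes and class labels; at inference, the same decoder starts from pure noise, as in the earlier sketch.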

Key Results

  1. Tracking Accuracy and Precision:

    • LDTrack performed strongly across multiple datasets, particularly in complex and crowded environments. On the InOutdoor (IOD) dataset, LDTrack achieved a MOTA (Multiple Object Tracking Accuracy; see the definition after this list) of 78.6%, significantly outperforming the comparison methods.
    • On the Kinect Tracking Precision (KTP) dataset, LDTrack achieved a MOTA of 92.7%, a 5-62% improvement over existing methods.
  2. Multi-Object Tracking Comparison:

    • LDTrack also outperformed state-of-the-art multi-object tracking methods on the MOT17 and MOT20 datasets, especially in high-density crowd scenarios.
  3. Ablation Study:

    • The ablation study validated key design choices in LDTrack, including the use of a single timestep for embeddings, 500 box embeddings, and a 288-dimensional latent space.
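
For reference, the MOTA metric quoted above is the standard CLEAR-MOT accuracy score; it aggregates false negatives (FN), false positives (FP), and identity switches (IDSW) over all frames t, normalized by the total number of ground-truth objects (GT):

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

A MOTA of 92.7% on KTP thus means that detection and identity errors together amounted to only 7.3% of the ground-truth annotations.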

Conclusions and Significance

LDTrack introduces conditional latent diffusion models to dynamically update person track embeddings, enabling the architecture to adapt to changes in a person’s appearance over time. The architecture demonstrated superior performance across multiple datasets, particularly in handling intraclass variations such as occlusions, pose deformations, and lighting changes. The success of LDTrack not only highlights the potential of diffusion models in robotic tracking tasks but also provides a new direction for real-time applications.

Research Highlights

  1. Innovation: LDTrack is the first architecture to apply conditional latent diffusion models to robotic dynamic people tracking, effectively addressing intraclass variations.
  2. Efficiency: By integrating detection and tracking into a single framework, LDTrack offers significant advantages in computational efficiency and real-time performance.
  3. Generalizability: LDTrack not only excels in human-centered environments but also generalizes well to multi-object tracking tasks in urban settings.

Future Work

Future research will explore integrating contrastive learning methods (e.g., TIMCLR) with LDTrack to learn person representations that are invariant to intraclass variations. In addition, LDTrack will be tested in real time in real-world environments to validate its performance in practical applications.