Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation
Advances in General Mammal Pose Estimation Research
Research Background and Problem Statement
In computer vision, pose estimation is a fundamental task that locates the keypoints of target objects in images. Human pose estimation has progressed significantly in recent years, but research on animal pose estimation remains in its infancy. Compared with human pose estimation, animal pose estimation faces greater challenges, mainly in the following respects:
- Species Diversity: The appearance and posture differences between different species are substantial. For example, leopards and domestic cats within the feline family exhibit significant distinctions in shape, size, and color.
- Data Scarcity: Existing animal pose datasets are much smaller than human pose datasets. For instance, the largest mammal pose dataset, AP-10k, contains approximately 10,000 images, whereas the COCO dataset comprises over 200,000 annotated images.
- Complexity of Posture Variations: Animals exhibit a wider range of postural changes. For example, when an antelope stands, its nose and eyes are relatively close together, but when it lowers its head to drink, the distance between its nose and front paws shortens significantly.
To address these issues, researchers have proposed various methods, but most studies focus only on optimizing for specific species, lacking generality. Therefore, designing a model capable of adapting to multi-species pose estimation has become an urgent problem to solve.
This paper was authored by Tianyang Xu et al., with authors from the School of Artificial Intelligence and Computer Science at Jiangnan University and the School of Computer Science and Electronic Engineering at the University of Surrey, UK. The paper was received on January 6, 2025, and published in the journal International Journal of Computer Vision.
Research Content and Workflow
a) Research Process and Methods
The core contribution of this study is a novel architecture called Keypoint Interactive Transformer (KIT), designed to learn instance-level structure-supporting dependencies to achieve general mammal pose estimation. Below are the main processes and methods of the research:
1. Data Preprocessing and Feature Extraction
The study conducts experiments based on datasets such as AP-10k, Animal Kingdom, and COCO. Input images are first processed through a high-resolution network (HRNet) to extract keypoint features. HRNet is renowned for its high-resolution representation capabilities, which can capture fine-grained spatial information. Subsequently, the feature maps are adjusted for channel numbers via convolutional layers and flattened into keypoint tokens.
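The channel adjustment and flattening step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single convolutional layer with a 1×1 kernel, a 17-keypoint layout, and a random array standing in for the HRNet feature map. The function name `features_to_keypoint_tokens` and all shapes are hypothetical.

```python
import numpy as np

def features_to_keypoint_tokens(features, weight, bias):
    """Map a (C, H, W) feature map to K tokens of length H*W.

    weight has shape (K, C) and acts as a 1x1 convolution kernel,
    i.e. a per-pixel linear map over channels (an assumption here).
    """
    _, h, w = features.shape
    mapped = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
    # Flatten each of the K channel maps into one keypoint token.
    return mapped.reshape(weight.shape[0], h * w)

rng = np.random.default_rng(0)
features = rng.standard_normal((32, 64, 64))  # stand-in for backbone output
weight = rng.standard_normal((17, 32)) * 0.1  # 17 keypoints (assumed)
bias = np.zeros(17)

tokens = features_to_keypoint_tokens(features, weight, bias)
print(tokens.shape)  # (17, 4096)
```

Each row of `tokens` then represents one keypoint and can be passed to a transformer as a sequence element.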
2. Keypoint Interactive Transformer (KIT)
The KIT module is one of the core innovations of this study, with its main functions including:
- Self-Attention Mechanism: Captures global relationships among keypoints through single-head self-attention while suppressing irrelevant cues.
- Body Part Prompts: Generates body part prompts by clustering keypoint tokens, enhancing the model’s understanding of semantics by incorporating contextual information.
- Hierarchical Interaction: The KIT module is constructed in a stacked manner, with each layer implementing interaction among keypoints through the self-attention mechanism.
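To make the interaction among keypoint tokens and body part prompts concrete, here is a minimal sketch of one single-head self-attention layer. It rests on several assumptions that are not taken from the paper: the prompts are formed by averaging the tokens in fixed, hypothetical groups (the study describes clustering), the prompts are simply prepended to the token sequence, and the projection matrices are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    return attn @ v, attn

def body_part_prompts(tokens, groups):
    """One prompt per group: the mean of its member tokens (an assumption)."""
    return np.stack([tokens[idx].mean(axis=0) for idx in groups])

rng = np.random.default_rng(0)
num_keypoints, dim = 17, 64
tokens = rng.standard_normal((num_keypoints, dim))

# Hypothetical grouping: head, front limbs, hind limbs.
groups = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16]]
prompts = body_part_prompts(tokens, groups)
sequence = np.concatenate([prompts, tokens], axis=0)

w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
out, attn = self_attention(sequence, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (20, 64) (20, 20)
```

Row `i` of `attn` shows how strongly element `i` attends to every prompt and keypoint. Stacking several such layers would correspond to the hierarchical interaction described above.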
3. Loss Function Design
To optimize intermediate feature representations, the study proposes a Generalized Heatmap Regression Loss (GHL). GHL dynamically adjusts the sharpness of intermediate features by applying Laplacian filtering and smoothing to heatmaps, thereby better adapting to the distribution characteristics of different keypoints.
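The exact formulation of GHL is not given here, so the following is only a sketch of the underlying idea: a Laplacian filter can change the sharpness of a heatmap before a regression loss is computed. The 3×3 kernel, the `strength` parameter, the Gaussian target, and the plain mean squared error are all assumptions made for illustration.

```python
import numpy as np

LAPLACIAN_3X3 = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Gaussian heatmap with peak value 1 at column cx, row cy."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def filter2d(image, kernel):
    """Same-size 2D filtering with edge padding (kernel is symmetric here)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(image, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

def sharpen(heatmap, strength):
    """Subtract the scaled Laplacian; a larger strength gives a sharper peak."""
    return heatmap - strength * filter2d(heatmap, LAPLACIAN_3X3)

def heatmap_mse(pred, target):
    return float(np.mean((pred - target) ** 2))

target = gaussian_heatmap(64, 64, cx=32, cy=32, sigma=2.0)
sharp = sharpen(target, strength=0.5)
print(sharp[32, 32] > target[32, 32])  # True: the peak is accentuated
```

In a training loop, a loss such as `heatmap_mse` would compare a predicted intermediate heatmap against a target whose sharpness has been adjusted in this way.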
4. Adaptive Weight Strategy
The study also introduces an adaptive weight strategy to balance the importance of different keypoints. This strategy dynamically adjusts weights based on the prediction error of each keypoint, guiding the model to focus more on hard-to-detect keypoints.
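One simple way to realize error-driven weighting is sketched below. It is an assumption for illustration rather than the paper's rule: a softmax over per-keypoint errors, rescaled so the weights average to 1. The `temperature` parameter and both function names are hypothetical.

```python
import math

def adaptive_keypoint_weights(errors, temperature=1.0):
    """Softmax over per-keypoint errors, rescaled so the mean weight is 1.

    Keypoints with larger error receive larger weight; a smaller
    temperature makes the weighting more aggressive.
    """
    scaled = [e / temperature for e in errors]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [len(errors) * x / total for x in exps]

def weighted_loss(errors, weights):
    """Mean of per-keypoint errors after applying the weights."""
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)

errors = [0.02, 0.05, 0.30]  # e.g. per-keypoint heatmap errors
weights = adaptive_keypoint_weights(errors, temperature=0.1)
print(weights)
print(weighted_loss(errors, weights))
```

In practice the weights would usually be treated as constants during backpropagation, so that the model cannot lower the loss merely by shifting the weights.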
b) Main Results
1. Performance on the AP-10k Dataset
On the AP-10k validation set, the KITPose model outperforms existing state-of-the-art methods. Specifically:
- KITPose-E2C4 achieves 76.6 AP at an input resolution of 256×256, surpassing HRNet-W32 by 2.8 AP.
- At a higher resolution (384×384), KITPose-E2C4 further improves to 77.9 AP. The gain of only 1.3 AP suggests that the model is already robust at the lower resolution.
2. Performance on the Animal Kingdom Dataset
On the more challenging Animal Kingdom dataset, KITPose also performs excellently:
- KITPose-E2C6 equipped with HRNet-W32 achieves 58.8 PCK@0.05, surpassing the baseline model HRNet-W32 (58.5 PCK@0.05).
- KITPose-E2C6 equipped with HRNet-W48 further improves to 59.1 PCK@0.05, proving the model’s effectiveness in cross-species pose estimation.
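For readers unfamiliar with the metric, PCK@0.05 (Percentage of Correct Keypoints) counts a predicted keypoint as correct when it lies within 5% of a reference length from the ground truth. The sketch below assumes the reference length is the longer side of the bounding box and that invisible keypoints are skipped. Normalization conventions differ between benchmarks, so this may not match the exact evaluation protocol used in the study.

```python
import math

def pck(pred, gt, visible, bbox_size, threshold=0.05):
    """Fraction of visible keypoints within threshold * max(bbox side) of GT."""
    limit = threshold * max(bbox_size)
    hits = total = 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue  # unannotated keypoints do not count
        total += 1
        if math.hypot(px - gx, py - gy) <= limit:
            hits += 1
    return hits / total if total else 0.0

pred = [(10, 10), (50, 50), (0, 0)]
gt = [(13, 14), (70, 50), (100, 100)]
visible = [True, True, False]

# Limit is 0.05 * 200 = 10 px: the first keypoint is 5 px off (hit),
# the second is 20 px off (miss), and the third is skipped.
print(pck(pred, gt, visible, bbox_size=(200, 100)))  # 0.5
```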
3. Generalization Ability on the COCO Dataset
KITPose is not only suitable for animal pose estimation but can also be directly transferred to human pose estimation tasks. On the COCO validation set, KITPose-E2C4 achieves 77.3 AP at an input resolution of 384×288, surpassing existing state-of-the-art methods.
c) Conclusions and Significance
The KITPose model proposed in this study performs excellently across multiple datasets, showcasing its superiority and generalization ability in general mammal pose estimation. The significance of the research is mainly reflected in the following aspects:
1. Scientific Value: By introducing structure-supporting dependencies, KITPose reveals the intrinsic correlations among keypoints, providing new insights for future pose estimation research.
2. Application Value: This model can be widely applied in wildlife conservation, animal behavior analysis, and other fields, offering technical support for ecological research.
d) Research Highlights
- Novel KIT Module: Through self-attention mechanisms and body part prompts, the KIT module effectively captures structure-supporting dependencies among keypoints.
- Generalized Heatmap Regression Loss: Dynamically adjusts the sharpness of intermediate features, enhancing the model’s adaptability to keypoint distributions.
- Adaptive Weight Strategy: Solves the imbalance issue among different keypoints, improving the model’s robustness.
e) Other Valuable Information
The study also explores the impact of different hyperparameters on model performance, such as the number of body part prompts and the size of the Laplacian kernel. Experiments show that an appropriate number of body part prompts and kernel size can significantly enhance model performance.
Summary
This paper, authored by Tianyang Xu et al., was published in the International Journal of Computer Vision, proposing a new architecture named KITPose for general mammal pose estimation. By introducing the keypoint interactive transformer, generalized heatmap regression loss, and adaptive weight strategy, KITPose has achieved excellent performance across multiple datasets. The research not only advances the development of animal pose estimation but also provides valuable references for other tasks in the field of computer vision.