MetaCoorNet: An Improved Generative Residual Network for Grasping Pose Estimation
A New Breakthrough in Robotic Grasp Pose Estimation: The MetaCoorNet Network
Academic Background and Research Challenges
Robotic grasping is a fundamental challenge in robotics, centered on enabling robots to interact with their environment to pick up and manipulate objects. Despite the immense potential of automated grasping in areas such as industrial manufacturing, domestic assistance, and component assembly, significant challenges hinder its adoption. The diversity of object shapes, sizes, and materials, together with environmental factors such as occlusion and lighting variation, severely tests the stability and reliability of grasping algorithms. Sensor noise and the intricate mechanical design of grippers add further difficulty to achieving high-precision robotic grasping.
In this context, grasp pose estimation has become the critical technology governing the robotic grasping process. It is typically framed as a regression problem: predicting the optimal grasp points and corresponding angles from visual input (e.g., RGB images or point clouds). In recent years, with the rapid advancement of deep learning, researchers have increasingly turned to neural networks to solve this problem. However, existing methods still suffer from high computational complexity, large training-data requirements, and limited generalizability.
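To make the regression target concrete, below is a minimal sketch (illustrative, not from the paper) of the planar grasp representation this family of methods predicts: a grasp center, a rotation angle, a gripper opening width, and a quality score.

```python
from dataclasses import dataclass
import math

@dataclass
class Grasp2D:
    """Planar antipodal grasp as typically regressed from a single view."""
    x: float        # grasp center, image column (pixels)
    y: float        # grasp center, image row (pixels)
    theta: float    # gripper rotation about the camera axis (radians)
    width: float    # gripper opening width (pixels)
    quality: float  # predicted success probability in [0, 1]

    def normalized_angle(self) -> float:
        # Antipodal grasps are symmetric under a 180-degree flip,
        # so the angle only needs to live in [-pi/2, pi/2).
        return (self.theta + math.pi / 2) % math.pi - math.pi / 2
```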
To address these issues and advance the field, researchers Hejia Gao, Chuanfeng He, and Junjie Zhao from Anhui University, together with Changyin Sun from Southeast University, proposed a lightweight deep learning model called MetaCoorNet (MCN), which builds upon and refines the generative residual convolutional neural network. Their paper presents the innovations of the MetaCoorNet method and demonstrates its exceptional performance on the Cornell and Jacquard benchmark grasping datasets.
Source and Publication Information
This study was conducted by research teams from Anhui University, Southeast University, and their affiliated labs. It was published in the March 2025 issue of Science China Information Sciences, Volume 68, Issue 3. The paper was made available online in January 2025, and its corresponding DOI is 10.1007/s11432-024-4157-7.
Research Methodology and Workflow
Research Workflow:
The authors propose the MetaCoorNet network, which consists of four major components: an input layer, a feature extraction layer, a feature fusion layer, and an output layer. Each component integrates unique, efficient modules to enhance network performance. Experiments on the Cornell and Jacquard public datasets, together with real-world robot grasping trials, validate the effectiveness and robustness of the method.
1. Network Architecture Design:
Input Layer:
The input layer receives preprocessed multi-channel image data, such as RGB-D images, and extracts initial features through a convolutional layer with 32 filters.
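As a rough PyTorch sketch of this stem (the 32-filter count is from the paper; the kernel size, normalization, and four-channel RGB-D input layout are assumptions):

```python
import torch
import torch.nn as nn

class InputStem(nn.Module):
    """Illustrative input layer: multi-channel RGB-D image in, 32 feature maps out."""
    def __init__(self, in_channels: int = 4, filters: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, filters, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(filters)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))
```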
Feature Extraction Layer:
The feature extraction layer consists of two MetaCoorBlocks (MCBs) and three Residual Blocks, combined with Coordinate Attention (CA). The MCB modules improve feature selection efficiency by embedding positional information into the channel attention, while 3×3 convolution kernels capture spatial features in the image. The Residual Blocks mitigate vanishing gradients and allow the network to learn deeper features stably.
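The MCB internals are specific to this paper, but Coordinate Attention itself is a published mechanism (Hou et al., CVPR 2021) that performs exactly this embedding of position into channel attention. A minimal PyTorch rendering follows; the reduction ratio and the module's placement inside the MCB are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: channel attention factorized into two 1-D pools,
    so the resulting gates retain positional information along height and width."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Pool along width -> (n, c, h, 1); pool along height -> (n, c, w, 1).
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)
        y = torch.cat([x_h, x_w], dim=2)          # (n, c, h + w, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (n, c, 1, w)
        return x * a_h * a_w   # position-aware channel gating
```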
Feature Fusion Layer:
This layer includes a repeated spatial convolution module (RepSO), a channel refinement module (RefCO), and a convolution fusion block (CFB). RepSO enriches spatial information, RefCO improves feature discriminability through attention mechanisms, and CFB systematically integrates the spatial and channel features into high-dimensional, expressive representations.
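RepSO, RefCO, and CFB are modules introduced by this paper, and their internals are not reproduced here; the following is only a loose, hypothetical sketch of the roles just described (spatial enrichment, attention-based channel refinement, and fusion), not the authors' design.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Hypothetical stand-in for the RepSO -> RefCO -> CFB pipeline.
    Spatial branch: stacked 3x3 convs. Channel branch: squeeze-and-
    excitation-style gating. Fusion: concatenation plus a 1x1 conv."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Sequential(  # role of RepSO: enrich spatial detail
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.channel_gate = nn.Sequential(  # role of RefCO: refine channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # role of CFB: integrate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.spatial(x)
        c = x * self.channel_gate(x)
        return self.fuse(torch.cat([s, c], dim=1))
```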
Output Layer:
The output layer uses transpose convolutions to restore feature maps to the spatial resolution of the input image, then outputs the key grasp parameters, namely the grasp quality score, grasp angle, and gripper width, through several convolutional layers.
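A sketch of such an output layer in the generative-grasping style the paper builds on is shown below. Representing the angle as (cos 2θ, sin 2θ) is a common convention for antipodal grasps; whether MetaCoorNet adopts exactly this head layout, and the channel counts used here, are assumptions.

```python
import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Illustrative output layer: transpose convolutions restore input
    resolution, then small convolutional heads emit per-pixel grasp maps."""
    def __init__(self, in_channels: int = 128):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.quality = nn.Conv2d(32, 1, kernel_size=1)  # grasp quality score map
        self.cos2t = nn.Conv2d(32, 1, kernel_size=1)    # cos(2 * theta) map
        self.sin2t = nn.Conv2d(32, 1, kernel_size=1)    # sin(2 * theta) map
        self.width = nn.Conv2d(32, 1, kernel_size=1)    # gripper width map

    def forward(self, x: torch.Tensor):
        f = self.up(x)
        return self.quality(f), self.cos2t(f), self.sin2t(f), self.width(f)
```

Under this convention, the grasp angle at the best pixel (highest quality score) is recovered as θ = ½·atan2(sin 2θ, cos 2θ).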
2. Experiment Design:
Public Dataset Testing:
The Cornell dataset (8,019 grasp annotations) and the Jacquard dataset (4.96 million grasp annotations) were used. The Adam optimizer was employed with a learning rate of 0.001, a batch size of 8, and 50 epochs of training.
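A minimal training-loop sketch using these reported hyperparameters; the smooth-L1 regression loss over per-pixel target maps is an assumption, and the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def train(model: torch.nn.Module, loader, epochs: int = 50, lr: float = 1e-3):
    """Train a grasp network with the reported settings: Adam, lr 0.001,
    50 epochs; the loader is assumed to be built with batch_size=8 and to
    yield (image, target-map) pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            preds = model(images)  # e.g., quality, cos2t, sin2t, width maps
            loss = sum(F.smooth_l1_loss(p, t) for p, t in zip(preds, targets))
            loss.backward()
            optimizer.step()
    return model
```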
Real-Robot Experimental Validation:
A Kinova robotic arm with seven degrees of freedom and an Intel RealSense D435 camera were used to conduct real-world experiments. The experiments evaluated tasks such as single-object, multi-object, and transparent-object grasping, using metrics like grasp success rate and execution speed.
Key Results and Findings
Dataset Experiment Results:
MetaCoorNet achieved a prediction accuracy of 98% on the Cornell dataset and 91.2% on the Jacquard dataset, significantly outperforming existing methods. These results demonstrate MCN’s ability to adapt to various object shapes and environmental complexities.
Performance Analysis and Speed Comparison:
With an inference time of just 20 milliseconds (on par with the fastest models), MCN showcased exceptional efficiency and real-time capability.
Robotic Grasping Experiments:
In real-world environments, MCN excelled at handling occlusions, diverse physical characteristics, and scene variability. Its grasp success rate reached 93.6%, showcasing its industrial application potential.
Research Value and Significance
MetaCoorNet offers a lightweight, efficient approach to grasp pose estimation through its optimized network architecture, addressing many challenges faced by existing grasping algorithms. Additionally, the modules introduced in the study (e.g., MCB and CFB) have potential applications in other vision tasks, such as object detection and pose estimation.
Highlights:
1. Innovative integration of spatial and channel information with position embedding, improving grasping accuracy.
2. Lightweight and efficient design, suitable for real-time grasping tasks.
3. Handles complex scenarios, enabling multi-object grasping and transparent object handling.
Outlook and Future Directions
The authors identified certain limitations of the study, such as the reliance on synthetic data and challenges in multi-object grasping tasks. They proposed several future research directions:
- Integration with Real-World Data: Improve robustness and generalizability under sensor noise and varying lighting conditions.
- Adaptivity to Diverse Grippers: Design versatile grasp representations tailored to various end-effectors.
- Incorporation of Physical Constraints: Integrate robot kinematics, dynamics, and environmental constraints for more reliable grasp planning and execution.
- Exploration of Multi-Object Manipulation: Extend MCN capabilities to handle simultaneous multi-object grasping and manipulation.
MetaCoorNet provides a new perspective and technical approach for robotic grasping, holding significant promise for advancing industrial automation, service robotics, assistive technologies, and more.