Self-Supervised Feature Detection and 3D Reconstruction for Real-Time Neuroendoscopic Guidance
Research on Real-Time 3D Reconstruction and Navigation in Neuroendoscopy Based on Self-Supervised Learning
Academic Background
Neuroendoscopic surgery is a minimally invasive technique widely used to treat deep brain lesions, for example in endoscopic third ventriculostomy (ETV), choroid plexus cauterization, and cyst fenestration. During surgery, however, deep brain structures deform geometrically due to brain shift and cerebrospinal fluid (CSF) loss, which challenges traditional neuronavigation based on preoperative imaging. Conventional navigation systems typically rely on rigid registration of preoperative magnetic resonance imaging (MRI) or computed tomography (CT) images and cannot account for intraoperative tissue deformation in real time, resulting in reduced navigation accuracy.
To address this issue, the research team proposed a self-supervised feature detection method, combined with simultaneous localization and mapping (SLAM), to achieve real-time 3D reconstruction and navigation from neuroendoscopic video. By learning features from unlabeled endoscopic video data, the method aims to improve the robustness of feature detection and to provide accurate, real-time navigation support during surgery.
Paper Source
This paper was jointly completed by researchers from the Department of Computer Science and the Department of Biomedical Engineering at Johns Hopkins University; the main authors include Prasad Vagdargi, Ali Uneri, and Stephen Z. Liu, among others. The paper was published in 2025 in IEEE Transactions on Biomedical Engineering under the title "Self-Supervised Feature Detection and 3D Reconstruction for Real-Time Neuroendoscopic Guidance." The research was funded by the National Institutes of Health (NIH) and Medtronic.
Research Process and Results
1. Data Collection and Preprocessing
The research team collected 11,527 frames of video data from 15 clinical neuroendoscopic surgeries for training and validating the self-supervised learning model. The video segment for each surgery ranged from 10 to 47 seconds in length at a frame rate of 30 frames per second. The video data underwent geometric correction and cropping so that only the effective area within the endoscopic field of view was retained. In addition, the research team applied various data augmentations to the video frames, including spatial transformations (such as rotation, scaling, and perspective distortion) and intensity transformations (such as brightness, contrast, noise, and glare), to simulate image artifacts commonly encountered during surgery.
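A minimal sketch of such an augmentation pipeline is shown below, using OpenCV and NumPy. The parameter ranges, the synthetic glare model, and the assumption of a 3-channel BGR frame are illustrative choices, not values taken from the paper.

```python
# Illustrative augmentation sketch (OpenCV/NumPy); ranges are assumptions,
# not values reported in the paper. Assumes a 3-channel BGR uint8 frame.
import cv2
import numpy as np

def augment_frame(img, rng=np.random.default_rng()):
    h, w = img.shape[:2]

    # Spatial: random rotation + isotropic scaling about the image center.
    angle = rng.uniform(-30, 30)          # degrees (assumed range)
    scale = rng.uniform(0.8, 1.2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    img = cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Spatial: mild random perspective distortion of the image corners.
    jitter = 0.05 * min(h, w)
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + rng.uniform(-jitter, jitter, src.shape)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    img = cv2.warpPerspective(img, H, (w, h), borderMode=cv2.BORDER_REFLECT)

    # Intensity: brightness/contrast jitter plus Gaussian noise.
    img = cv2.convertScaleAbs(img, alpha=rng.uniform(0.8, 1.2), beta=rng.uniform(-20, 20))
    noise = rng.normal(0, 5, img.shape)
    img = np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Intensity: synthetic specular glare modeled as a bright Gaussian blob.
    cx, cy = rng.integers(0, w), rng.integers(0, h)
    yy, xx = np.mgrid[0:h, 0:w]
    glare = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * (0.05 * w) ** 2))
    img = np.clip(img.astype(np.float32) + 80 * glare[..., None], 0, 255).astype(np.uint8)
    return img
```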
2. Development and Training of the Self-Supervised Feature Detection Model
The research team developed a model named R2D2-E, based on the R2D2 (Repeatable and Reliable Detector and Descriptor) architecture and specifically adapted for feature detection in neuroendoscopic video. R2D2-E learns keypoint detection, local descriptors, and descriptor reliability through a dual-branch network structure. Training is self-supervised: pseudo-ground-truth correspondences are generated by applying random spatial and image-domain transformations to create image pairs, thereby avoiding dependence on manually labeled data.
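The core of this self-supervision can be illustrated with a short sketch: warping a frame with a known random homography yields dense pixel correspondences "for free," which can then supervise keypoint repeatability and descriptor reliability. The sampling range below is an assumption, not the paper's exact recipe.

```python
# Sketch of pseudo-ground-truth generation via a known random homography.
import cv2
import numpy as np

def make_training_pair(img, max_jitter=0.1, rng=np.random.default_rng()):
    h, w = img.shape[:2]
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = max_jitter * min(h, w)
    warped_corners = (corners + rng.uniform(-jitter, jitter, corners.shape)).astype(np.float32)

    H = cv2.getPerspectiveTransform(corners, warped_corners)   # exact A->B correspondence map
    warped = cv2.warpPerspective(img, H, (w, h), borderMode=cv2.BORDER_REFLECT)
    return img, warped, H   # (image A, image B, pseudo-ground-truth homography)
```

Given H, any keypoint detected in the original frame can be mapped into the warped frame (e.g., with cv2.perspectiveTransform), so repeatability and reliability losses can be computed without manual annotation.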
During training, the research team used 5-fold cross-validation; in each fold, the 15 cases were divided into 12 training cases and 3 validation cases. The model was optimized with the Adam optimizer at a learning rate of 10^-3 for 30 epochs. The team also conducted hyperparameter selection experiments, including adjustments to the learning rate and patch size, to determine the optimal parameter combination.
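The reported setup (5-fold cross-validation over 15 cases, Adam at a learning rate of 10^-3, 30 epochs) can be outlined as follows; build_r2d2e, train_one_epoch, and validate are hypothetical placeholders, not functions from the paper's released code.

```python
# Outline of the reported training protocol; the model/loop functions are placeholders.
import torch
from sklearn.model_selection import KFold

cases = list(range(15))  # 15 surgical cases, split 12 train / 3 validation per fold
for fold, (train_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(cases)):
    model = build_r2d2e()                                   # placeholder: dual-branch network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(30):
        train_one_epoch(model, optimizer, [cases[i] for i in train_idx])  # placeholder
        validate(model, [cases[i] for i in val_idx])                      # placeholder
```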
3. Feature Matching and 3D Reconstruction
The R2D2-E model performs feature matching by detecting keypoints in the images and computing their descriptors. During matching, the research team used the MAGSAC (Marginalizing Sample Consensus) algorithm to remove mismatches inconsistent with a homography model. The successfully matched feature points were used to estimate camera pose and to generate a sparse 3D point cloud through triangulation. The point cloud was statistically filtered to remove noise and finally registered with the preoperative MRI images.
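A sketch of this match-filter-triangulate stage is given below, using OpenCV's MAGSAC implementation (the cv2.USAC_MAGSAC flag) for homography-based outlier rejection and Open3D for statistical point-cloud filtering. The thresholds, the essential-matrix pose step, and the camera intrinsics K are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch of outlier rejection, pose estimation, triangulation, and point-cloud filtering.
import cv2
import numpy as np
import open3d as o3d

def reconstruct_pair(kpts1, kpts2, K):
    """kpts1/kpts2: Nx2 float32 matched keypoints from two frames; K: 3x3 intrinsics."""
    # Reject matches inconsistent with a homography using MAGSAC.
    _, inl = cv2.findHomography(kpts1, kpts2, cv2.USAC_MAGSAC, 2.0)
    p1, p2 = kpts1[inl.ravel() == 1], kpts2[inl.ravel() == 1]

    # Estimate relative camera pose from the essential matrix, then triangulate.
    E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, p1, p2, K, mask=mask)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    X = cv2.triangulatePoints(P1, P2, p1.T, p2.T)
    pts3d = (X[:3] / X[3]).T.astype(np.float64)

    # Statistical outlier removal on the sparse point cloud (Open3D).
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts3d))
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd, R, t
```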
4. Experimental Results and Performance Evaluation
The research team quantitatively evaluated the feature matching and 3D reconstruction performance of the R2D2-E model and compared it with traditional feature detection methods (such as SIFT, SURF) and learning-based methods (such as SuperPoint). The experimental results showed that R2D2-E demonstrated superior performance in feature matching and 3D reconstruction:
- Feature Matching: The median keypoint error (KPE) of R2D2-E was 0.83 pixels, significantly lower than 2.20 pixels for SIFT and 1.70 pixels for SURF. Additionally, the median track length of R2D2-E was 19 frames, outperforming other methods.
- 3D Reconstruction: The median projection error (PE) of R2D2-E was 0.64 mm, lower than 0.90 mm for SIFT and 0.99 mm for SURF. At a 1 mm distance threshold, R2D2-E achieved an F1 score of 0.72, improving by 14% and 25% over SIFT and SURF, respectively (a sketch of this distance-thresholded F1 metric follows the list).
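The distance-thresholded F1 score referenced above can be illustrated as below, assuming precision and recall are the fractions of reconstructed and reference points lying within the threshold of each other; only the 1 mm threshold comes from the paper, the exact formulation here is an assumption.

```python
# Sketch of an F1-at-threshold metric for a reconstructed vs. reference point cloud.
import numpy as np
from scipy.spatial import cKDTree

def f1_at_threshold(recon, reference, tau=1.0):
    """recon, reference: Nx3 arrays in mm; tau: distance threshold in mm."""
    d_recon_to_ref = cKDTree(reference).query(recon)[0]   # nearest-reference distance per recon point
    d_ref_to_recon = cKDTree(recon).query(reference)[0]   # nearest-recon distance per reference point
    precision = np.mean(d_recon_to_ref < tau)             # recon points close to the reference
    recall = np.mean(d_ref_to_recon < tau)                # reference covered by the reconstruction
    return 2 * precision * recall / (precision + recall + 1e-12)
```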
5. Real-Time Navigation and Augmented Visualization
The research team also developed an augmented visualization system that fuses target structures segmented from preoperative MRI images with real-time endoscopic video. Through point cloud registration and 3D rendering of target structures, the system provides real-time spatial context information during surgery, helping surgeons more accurately locate target structures.
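A possible implementation of this fusion step is sketched below: rigid ICP registration of the sparse endoscopic point cloud to an MRI-derived surface (Open3D), followed by projection of the segmented target into the live frame (OpenCV). The initial transform, correspondence distance, and simple point-overlay rendering are assumptions; the paper's exact registration and rendering pipeline may differ.

```python
# Sketch of point-cloud-to-MRI registration and target overlay on the endoscopic frame.
import cv2
import numpy as np
import open3d as o3d

def register_to_mri(endo_pcd, mri_surface_pcd, init=np.eye(4), max_dist=2.0):
    """Rigid ICP alignment of the sparse endoscopic cloud to the MRI surface (units: mm)."""
    result = o3d.pipelines.registration.registration_icp(
        endo_pcd, mri_surface_pcd, max_dist, init,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation          # 4x4 endoscope-to-MRI transform

def overlay_target(frame, target_pts_mri, T_mri_to_cam, K, color=(0, 255, 0)):
    """Project MRI-segmented target points into the endoscopic frame and draw them."""
    pts_cam = (T_mri_to_cam[:3, :3] @ target_pts_mri.T + T_mri_to_cam[:3, 3:4]).T
    uv, _ = cv2.projectPoints(pts_cam, np.zeros(3), np.zeros(3), K, np.zeros(5))
    for u, v in uv.reshape(-1, 2).astype(int):
        if 0 <= u < frame.shape[1] and 0 <= v < frame.shape[0]:
            cv2.circle(frame, (int(u), int(v)), 2, color, -1)
    return frame
```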
Conclusion and Significance
This study shows that the R2D2-E model can significantly improve the accuracy of feature detection and 3D reconstruction in neuroendoscopic surgery, providing strong support for real-time navigation. Compared with traditional feature detection methods, R2D2-E not only achieves higher matching accuracy and lower projection error but also shows greater robustness to common endoscopic artifacts such as glare and blur. In addition, the augmented visualization system provides a new navigation tool for neuroendoscopic surgery and has the potential to enhance the precision and safety of the procedure.
Research Highlights
- Self-Supervised Learning Method: The R2D2-E model learns features from unlabeled endoscopic video data through self-supervision, avoiding reliance on manually labeled data and improving the model's generality and robustness.
- Real-Time 3D Reconstruction and Navigation: Combined with SLAM technology, R2D2-E achieves real-time 3D reconstruction of neuroendoscopic video, providing accurate, up-to-date spatial information for intraoperative navigation.
- Augmented Visualization System: By fusing preoperative MRI images with real-time endoscopic video, the system can provide 3D visualization of target structures during surgery, helping surgeons more accurately locate targets.
Other Valuable Information
In the paper, the research team also described the implementation details of the R2D2-E model, including its network architecture, loss functions, and training strategy, providing a valuable reference for subsequent research. In addition, the team open-sourced the related code and datasets to promote further research and development in this field.
Building on these results, the R2D2-E model and its augmented visualization system are expected to see broad application in future neuroendoscopic surgery, providing more precise and safer navigation support for the treatment of deep brain lesions.