A Foundation Model for Joint Segmentation, Detection and Recognition of Biomedical Objects Across Nine Modalities

Decoding the Future of Biomedical Image Analysis: A Foundation Model for Multi-Modal Joint Segmentation, Detection, and Recognition

Background

In biomedical research, image analysis has become a crucial tool for advancing discoveries, enabling multi-scale studies ranging from organelles to organs. However, traditional biomedical image analysis often treats segmentation, detection, and recognition as separate tasks. This disjointed approach not only limits the exchange of information across tasks but also complicates the processing of diverse and complex biomedical image data.

For example, traditional segmentation methods often rely on manually drawn bounding boxes to mark regions of interest, which becomes impractical for irregularly shaped or numerous targets (e.g., segmenting all cells in whole-slide pathology images). Moreover, ignoring the interplay between object detection and semantic recognition (i.e., identifying what kind of object each region is) leaves room for improvement in segmentation quality.

To address these challenges, a research team from Microsoft Research, Providence Genomics, and the University of Washington has proposed a biomedical foundation model named “BiomedParse.” This model aims to unify the three tasks within a single framework and spans analysis across nine major imaging modalities. Published in the January 2025 issue of Nature Methods, this work introduces a new workflow for efficiently parsing biomedical images.


Study Overview and Workflow

The proposed “BiomedParse” represents an innovative image parsing framework capable of jointly performing segmentation, detection, and recognition, effectively overcoming the limitations of traditional methods. To train this model, the research team constructed a large-scale biomedical dataset named “BiomedParseData,” covering nine imaging modalities, including computed tomography (CT), magnetic resonance imaging (MRI), pathology, ultrasound, and more. The detailed steps of the study are described below:

Data Construction and Preprocessing

The research team harmonized 45 publicly available biomedical segmentation datasets, generating approximately 3.4 million image–segmentation mask–semantic label triples. Using the GPT-4 language model, they aligned the noisy, unstructured textual descriptions found in these datasets with a standardized biomedical object ontology, which includes:

  1. Three broad categories: organ, abnormality, and histology.
  2. 15 meta-object types: examples include “right kidney” and “tumor.”
  3. 82 specific object types.

In addition, the team used GPT-4 to generate synonymous text descriptions, increasing the diversity and robustness of linguistic expressions, ensuring that the model can accurately recognize targets regardless of variations in textual phrasing.
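To make the data construction concrete, here is a minimal sketch of what one ontology-aligned training example might look like. The class and field names below are illustrative assumptions, not BiomedParseData’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BiomedTriple:
    """One (image, mask, label) training example aligned to the ontology.

    All field names are illustrative assumptions, not the dataset's real schema.
    """
    image_path: str        # path to the 2-D image (or slice)
    mask_path: str         # path to the binary segmentation mask
    modality: str          # one of the nine imaging modalities, e.g. "CT"
    category: str          # broad category: "organ", "abnormality", or "histology"
    meta_object_type: str  # one of the 15 meta-object types
    object_type: str       # one of the 82 specific object types
    prompts: List[str] = field(default_factory=list)  # GPT-4-generated synonymous descriptions

# Hypothetical example record (values are made up for illustration).
example = BiomedTriple(
    image_path="ct/case_0001_slice_042.png",
    mask_path="ct/case_0001_slice_042_mask.png",
    modality="CT",
    category="organ",
    meta_object_type="kidney",
    object_type="right kidney",
    prompts=["right kidney in abdominal CT", "right renal parenchyma"],
)
```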

To accommodate three-dimensional imaging modalities (e.g., CT and MRI), the researchers preprocessed the data into two-dimensional slices to maintain a consistent input structure across all modalities.
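A minimal sketch of this preprocessing step, assuming volumes are stored in NIfTI format and read with nibabel (the paper does not specify the exact tooling, so treat the details as assumptions):

```python
import numpy as np
import nibabel as nib  # assumed I/O library for 3-D medical volumes

def volume_to_slices(volume_path: str, mask_path: str):
    """Split a 3-D volume and its segmentation into aligned 2-D axial slices."""
    vol = nib.load(volume_path).get_fdata()  # shape (H, W, D)
    seg = nib.load(mask_path).get_fdata()    # same shape as vol
    assert vol.shape == seg.shape

    slices = []
    for z in range(vol.shape[-1]):
        img2d, msk2d = vol[..., z], seg[..., z]
        if msk2d.any():  # keep only slices that contain the target object
            # min-max normalize each slice to [0, 1] for a consistent input range
            lo, hi = img2d.min(), img2d.max()
            img2d = (img2d - lo) / (hi - lo + 1e-8)
            slices.append((img2d.astype(np.float32), (msk2d > 0).astype(np.uint8)))
    return slices
```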

BiomedParse Model Architecture

BiomedParse employs a modular design comprising the following core components:

  1. Image Encoder: Extracts features from high-resolution input images. Options include the Focal Modulation Network (Focal) and the Segment Anything Model Vision Transformer (SAM-ViT).

  2. Text Encoder: Encodes user-specified text prompts into text embeddings; it can be initialized from the pretrained PubMedBERT model.

  3. Mask Decoder: Generates segmentation masks based on image and text embeddings, predicting the likelihood (0–1) of each pixel belonging to a target object.

  4. Meta-Object Classifier: Predicts the semantic (meta-object) class of the segmented target.

BiomedParse incorporates joint learning, enabling information sharing between segmentation and semantic classification tasks, thereby improving the ability to handle complex targets.
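This modular design can be sketched in PyTorch-style code as follows. The module wiring, fusion step, and dimensions are simplified assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class BiomedParseSketch(nn.Module):
    """Simplified sketch of a BiomedParse-like architecture (not the official code)."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512, num_meta_types=15):
        super().__init__()
        self.image_encoder = image_encoder  # e.g. a Focal or SAM-ViT backbone
        self.text_encoder = text_encoder    # e.g. a PubMedBERT-initialized encoder
        # Mask decoder: fuses pixel features with the prompt embedding.
        self.mask_decoder = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim, 1, 1),
        )
        # Meta-object classifier over the 15 meta-object types.
        self.meta_classifier = nn.Linear(embed_dim, num_meta_types)

    def forward(self, image, prompt_tokens):
        pix = self.image_encoder(image)         # (B, C, H, W) pixel features (assumed shape)
        txt = self.text_encoder(prompt_tokens)  # (B, C) prompt embedding (assumed shape)
        # Condition pixel features on the text prompt via simple feature modulation.
        fused = pix * txt[:, :, None, None]
        mask_logits = self.mask_decoder(fused)  # (B, 1, H, W)
        mask_prob = torch.sigmoid(mask_logits)  # per-pixel likelihood in [0, 1]
        meta_logits = self.meta_classifier(txt) # semantic (meta-object) prediction
        return mask_prob, meta_logits
```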

Model Training and Optimization

To train BiomedParse, the team used the BiomedParseData dataset, randomly splitting it into training (80%) and testing (20%) subsets. The training process optimized the following loss functions:

  • Binary cross-entropy and Dice loss for segmentation tasks.
  • Categorical cross-entropy loss for semantic classification tasks.
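A minimal sketch of this combined objective, assuming PyTorch and the standard formulations of these losses (the exact loss weighting in the paper may differ):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_prob, target, eps=1e-6):
    """Soft Dice loss on per-pixel probabilities in [0, 1]."""
    pred_prob = pred_prob.flatten(1)
    target = target.flatten(1).float()
    inter = (pred_prob * target).sum(dim=1)
    denom = pred_prob.sum(dim=1) + target.sum(dim=1)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def joint_loss(mask_prob, mask_target, meta_logits, meta_target):
    """Segmentation (BCE + Dice) plus semantic-classification (CE) objectives.

    Equal loss weights are an assumption made for this sketch.
    """
    seg_bce = F.binary_cross_entropy(mask_prob, mask_target.float())
    seg_dice = dice_loss(mask_prob, mask_target).mean()
    cls_ce = F.cross_entropy(meta_logits, meta_target)
    return seg_bce + seg_dice + cls_ce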

The entire training process required high-performance computational resources, completing in 58 hours on 16 NVIDIA A100 GPUs.


Key Results and Findings

Accuracy and Scalability in Multi-Modal Image Segmentation

In large-scale testing on 102,855 samples, BiomedParse achieved an average Dice score of 0.857 on segmentation tasks, a 39.6% improvement over the best competing method, MedSAM. The model performed especially well on targets with complex shapes, such as abnormal cells and tumor regions.
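For reference, the Dice score used throughout these evaluations measures the overlap between a predicted mask and the ground truth (1.0 means perfect agreement). A minimal computation:

```python
import numpy as np

def dice_score(pred_mask: np.ndarray, true_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary masks (1.0 = perfect overlap)."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    return float((2.0 * inter + eps) / (pred.sum() + true.sum() + eps))
```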

Notably, BiomedParse requires only a textual description to perform segmentation, whereas MedSAM and SAM need a bounding box for each target. In a test on 42 colon pathology images, a single text prompt, “glandular structure in colon pathology,” enabled BiomedParse to achieve a median Dice score of 0.942. In contrast, MedSAM required users to manually annotate 430 bounding boxes and still fell short of BiomedParse’s accuracy.

Superior Detection of Irregularly Shaped Objects

To evaluate BiomedParse’s ability to handle irregularly shaped targets, the researchers introduced three quantitative shape metrics: convex ratio, box ratio, and rotational inertia. These irregularity metrics correlated strongly with the model’s performance gains; in particular, BiomedParse’s advantage over traditional methods was largest for small or irregularly shaped targets.
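A sketch of how such shape-irregularity metrics can be computed from a binary mask. These follow common formulations (object area relative to convex-hull area, area relative to bounding-box area, and a normalized second moment); the paper’s exact definitions may differ.

```python
import numpy as np
from scipy.spatial import ConvexHull

def irregularity_metrics(mask: np.ndarray) -> dict:
    """Shape statistics for a binary 2-D mask; lower ratios indicate more irregular shapes."""
    ys, xs = np.nonzero(mask)
    area = float(len(xs))

    # Box ratio: object area relative to its axis-aligned bounding box.
    box_area = (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)
    box_ratio = area / box_area

    # Convex ratio: object area relative to its convex hull area.
    pts = np.stack([xs, ys], axis=1)
    hull_area = ConvexHull(pts).volume  # for 2-D points, .volume is the hull's area
    convex_ratio = area / hull_area

    # Rotational inertia: mean squared distance of pixels from the centroid,
    # normalized by area so the measure is scale-invariant.
    cx, cy = xs.mean(), ys.mean()
    inertia = ((xs - cx) ** 2 + (ys - cy) ** 2).mean() / area

    return {"convex_ratio": convex_ratio, "box_ratio": box_ratio, "rotational_inertia": inertia}
```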

Performance in Holistic Object Recognition

For object recognition, BiomedParse leverages its built-in segmentation ontology to iteratively identify and segment every object type in an image. This resulted in exceptional performance, with a weighted average Dice score of 0.94. Remarkably, BiomedParse significantly outperformed Grounding DINO, a method that only generates bounding boxes and struggles with images containing multiple objects.
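Conceptually, this “parse everything” mode amounts to sweeping over the ontology and keeping every object type whose segmentation is non-empty. A sketch of that loop, where `segment_fn` stands in for a text-prompted segmentation call and the ontology layout is an assumption:

```python
from typing import Callable, Dict, List
import numpy as np

def parse_all_objects(image: np.ndarray,
                      ontology: Dict[str, List[str]],
                      segment_fn: Callable[[np.ndarray, str], np.ndarray],
                      min_pixels: int = 1) -> Dict[str, np.ndarray]:
    """Iterate over every object type in the ontology, segment it via a text prompt,
    and keep only non-empty predictions. `segment_fn(image, prompt)` is assumed to
    return a binary mask (e.g. a thresholded prediction)."""
    detected = {}
    for object_types in ontology.values():
        for object_type in object_types:
            mask = segment_fn(image, object_type)
            if mask.sum() >= min_pixels:  # treat non-empty masks as detections
                detected[object_type] = mask
    return detected
```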

In addition, BiomedParse employs a statistical test to reject invalid text prompts (e.g., “identify left heart ventricle in a dermoscopy image”), preventing errors caused by prompts that do not match the image or by hallucinated objects.
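One way such a rejection test could work, sketched here under the assumption (not the paper’s documented procedure) that statistics of the predicted mask are compared against the distribution observed for that object type on training data, using a Kolmogorov–Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

def is_prompt_valid(pred_probs: np.ndarray,
                    reference_probs: np.ndarray,
                    alpha: float = 0.05) -> bool:
    """Reject a text prompt whose predicted pixel-probability distribution differs
    significantly from what was observed for that object type during training.

    `reference_probs` is a sample of pixel probabilities collected from training-set
    predictions of the same object type (an assumed calibration artifact).
    """
    stat, p_value = ks_2samp(pred_probs.ravel(), reference_probs.ravel())
    return p_value >= alpha  # small p-value -> distribution mismatch -> invalid prompt
```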


Significance and Highlights

  1. Innovative Unified Framework: BiomedParse is the first model to integrate segmentation, detection, and recognition tasks through joint learning, addressing the limitations of traditional disjointed methods.

  2. Bounding-Box-Free Analysis: By accepting text prompts instead of bounding boxes, BiomedParse significantly reduces the annotation burden on users, particularly for images with numerous objects.

  3. Excellence on Complex Shapes: The model excels at generalizing to irregularly shaped targets, such as scattered abnormal cells and tumors.

  4. Scalability and Practicality: Tested on real-world data from Providence Health System, BiomedParse accurately annotated immune and cancer cells, demonstrating its potential for clinical applications.

BiomedParse provides an efficient, accurate, and universal solution for biomedical image parsing, paving the way for large-scale, image-driven biomedical discoveries. With future extensions to three-dimensional imaging and interactive dialogue systems, this model holds immense promise for a wide range of clinical and research applications.