Generation of Synthetic Whole-Slide Image Tiles of Tumours from RNA-Sequencing Data via Cascaded Diffusion Models
A recent study published in Nature Biomedical Engineering, titled “Generation of Synthetic Whole-Slide Image Tiles of Tumours from RNA-Sequencing Data via Cascaded Diffusion Models,” has garnered significant attention. This research, conducted by scholars from Stanford University, Ghent University, Argonne National Laboratory, and other institutions, aims to address the issue of insufficient cancer data, thereby enhancing the performance of machine learning models in cancer detection. The authors of this paper include Francisco Carrillo-Perez, Marija Pizurica, Yuanning Zheng, Tarak Nath Nandi, Ravi Madduri, Jeanne Shen, and Olivier Gevaert.
Research Background and Motivation
Cancer is one of the leading causes of death worldwide, second only to cardiovascular disease. In clinical practice, cancer is typically diagnosed through multiple screening modalities, including visual inspection of digitized tissue slides and measurement of the upregulation or downregulation of specific genes. However, these screening methods are often not all applied to the same patient because of cost or logistical constraints. Cancer is a multi-scale, multifactorial disease whose effects manifest at many levels: genetic variation in tumor cells and in cells of the tumor microenvironment can lead to functional changes that alter cell physiology. As a result, incomplete screening may miss information that is crucial for early detection.
In recent years, machine learning, and deep learning (DL) in particular, has shown great potential in cancer detection and classification. Using multimodal data such as RNA sequencing (RNA-seq), whole-slide imaging (WSI), microRNA sequencing (miRNA-seq), and DNA methylation, many promising clinical decision support systems have been developed. However, cancer data pose two problems. First, DL models are data-driven and require large amounts of data for proper training. Second, although combining multiple biological data types has been shown to improve cancer detection and prognosis, most existing datasets are incomplete and lack one or more modalities.
Research Content
This study proposes a cascaded diffusion model, RNA-CDM, to address these problems. Whereas earlier generative approaches relied on architectures such as generative adversarial networks (GANs) and variational autoencoders (VAEs), this work demonstrates that cascaded diffusion models conditioned on a latent representation of tumor RNA-seq data can synthesize realistic whole-slide image tiles.
The study mainly includes the following processes:
a) Detailed Description of Research Process
Data Preprocessing and Acquisition: The study obtained data from the TCGA project database, which contains paired RNA-seq and WSI samples. After downloading, RNA-seq data underwent preprocessing, including aligning and quantifying the raw sequencing reads, resulting in expression data for 17,655 genes. These data were log-transformed and Z-score normalized.
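The normalization step described above (log transformation followed by per-gene Z-score normalization) can be sketched as follows. The function name and the toy matrix are illustrative, not taken from the paper's code:

```python
import numpy as np

def preprocess_expression(counts: np.ndarray) -> np.ndarray:
    """Log-transform and Z-score normalize a (samples x genes) expression matrix."""
    # log1p guards against zero counts
    logged = np.log1p(counts)
    # Z-score each gene (column) across samples
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    return (logged - mean) / std

# Toy example: 4 samples x 3 genes (the real matrix has 17,655 genes)
x = np.array([[0.0, 10.0, 5.0],
              [2.0, 8.0, 5.0],
              [1.0, 12.0, 5.0],
              [3.0, 9.0, 5.0]])
z = preprocess_expression(x)
```

After this step every gene has zero mean and unit variance across samples, which puts all genes on a comparable scale before they are fed to the encoder.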
Beta-VAE Generation of Multi-Cancer Latent Embedding Representations: Twelve cancer types were selected to train a beta-VAE model that produces a low-dimensional latent representation of the RNA-seq data. The encoder and decoder each consist of two hidden layers, and the latent space comprises 200 features. Trained for 250 epochs with a mean squared error (MSE) loss and the Adam optimizer, the resulting latent representations can accurately reconstruct the RNA-seq data.
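The beta-VAE objective combines the MSE reconstruction loss mentioned above with a beta-weighted KL-divergence term. A minimal NumPy sketch of the loss and the reparameterization trick, with illustrative toy values (the paper's actual network layers are omitted):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z ~ N(mu, sigma^2) differentiably via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Beta-VAE objective: MSE reconstruction plus a beta-weighted KL term
    pulling the approximate posterior N(mu, sigma^2) toward the N(0, I) prior."""
    recon = np.mean((x - x_recon) ** 2)
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch
    kl = -0.5 * np.mean(np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1))
    return recon + beta * kl

# Toy batch: 8 samples with a 200-dimensional latent space, as in the paper
rng = np.random.default_rng(0)
mu = np.zeros((8, 200))
logvar = np.zeros((8, 200))
z = reparameterize(mu, logvar, rng)
```

When beta > 1 the KL term is weighted more heavily, encouraging a more disentangled latent space at some cost in reconstruction fidelity.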
RNA-CDM: Cascaded Diffusion Model for Multi-Cancer RNA-to-Image Synthesis: The cascade comprises a low-resolution RNA-to-image diffusion model (64×64 pixels) and a super-resolution diffusion model (256×256 pixels), both conditioned on the latent representations produced by the beta-VAE. During training, noise is gradually added to images and the model learns to reverse this process by denoising. Given an RNA-seq latent encoding, the trained model can then generate high-resolution synthetic tile images.
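The "gradually applying noise" step has a standard closed form: the noised image at any timestep can be sampled directly from the clean image. A sketch of this forward process, with an illustrative linear noise schedule (the paper's exact schedule and network are not reproduced here):

```python
import numpy as np

def make_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alphas_cumprod, rng):
    """Closed-form forward process q(x_t | x_0): blend the clean image with
    Gaussian noise according to the cumulative schedule at timestep t.
    Returns the noised image and the noise the denoiser is trained to predict."""
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

# Toy 64x64 3-channel "tile", matching the base model's resolution
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 3))
acp = make_alphas_cumprod()
x_t, eps = forward_noise(x0, 500, acp, rng)
```

Training minimizes the error between the predicted and true noise `eps`, with the RNA-seq latent vector supplied to the denoiser as a conditioning signal; the super-resolution stage applies the same recipe at 256×256 conditioned additionally on the low-resolution output.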
Model Training and Evaluation: HoverNet was used to detect, classify, and segment cell types in both real and synthetic images to assess the quality of the generated images. In addition, the Uniform Manifold Approximation and Projection (UMAP) algorithm was used to project the reconstructed RNA-seq data from different cancer tissues, verifying the model's generalization capability.
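One simple way to compare HoverNet's cell detections between real and synthetic tiles is to compare cell-type proportions, for example via total variation distance. This is a simplified stand-in for the paper's evaluation; the cell labels and counts below are hypothetical:

```python
from collections import Counter

def cell_type_proportions(labels):
    """Fraction of each cell type among the detected cells of a tile set."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two proportion dicts (0 means identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical per-cell labels detected in real and synthetic tiles
real = ["neoplastic"] * 60 + ["inflammatory"] * 25 + ["connective"] * 15
synthetic = ["neoplastic"] * 58 + ["inflammatory"] * 27 + ["connective"] * 15

tv = total_variation(cell_type_proportions(real), cell_type_proportions(synthetic))
```

A distance near zero indicates that the synthetic tiles reproduce the cellular composition of the real tissue, which is the property the study's quantitative comparison checks.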
b) Main Results of the Research
The study verified the realism of the generated images by comparing HoverNet's cell detections in real and synthetic images. For five cancer types (lung adenocarcinoma, kidney cancer, cervical squamous cell carcinoma, colon cancer, and glioblastoma), cell-detection results were similar between real and synthetic images. Quantitative analysis further showed that the synthetic images preserved the cellular morphology and cell-type proportions of the real data. Moreover, altering the expression of marker genes in the input RNA-seq data changed the frequency with which the corresponding cell types appeared in the synthetic images.
The study also demonstrated that synthetic data can improve the performance of machine learning models on biomedical classification tasks. In experiments where portions of the real training set were replaced with synthetic data, classification performance was unaffected, showing that synthetic data can substitute for real data. Furthermore, pre-training models entirely on synthetic data and then fine-tuning on a small amount of real data significantly improved accuracy and F1 score.
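The replacement experiment above amounts to swapping a fraction of real training samples for synthetic ones while keeping the training-set size fixed. A minimal sketch with illustrative feature arrays (the function name and shapes are not from the paper):

```python
import numpy as np

def mix_training_set(real_x, synthetic_x, synthetic_fraction, rng):
    """Replace a fraction of the real training samples with synthetic ones,
    keeping the total training-set size fixed."""
    n = len(real_x)
    k = int(round(synthetic_fraction * n))
    keep_idx = rng.choice(n, size=n - k, replace=False)
    syn_idx = rng.choice(len(synthetic_x), size=k, replace=False)
    return np.concatenate([real_x[keep_idx], synthetic_x[syn_idx]], axis=0)

# Toy feature vectors: 100 real and 100 synthetic samples, 10 features each
rng = np.random.default_rng(0)
real_x = rng.standard_normal((100, 10))
synthetic_x = rng.standard_normal((100, 10))
mixed = mix_training_set(real_x, synthetic_x, 0.5, rng)
```

Sweeping `synthetic_fraction` from 0 to 1 and tracking downstream accuracy reproduces the shape of the study's substitution experiment.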
c) Conclusions and Value of the Research
The RNA-CDM model proposed in this study not only mitigates data scarcity but also accelerates the development and improves the performance of machine learning models by generating realistic synthetic tile images. RNA-CDM's multi-cancer RNA-to-image synthesis can serve practical purposes in data augmentation and in identifying new morphological features linked to clinically relevant molecular states that are currently unrecognizable to the human eye.
d) Research Highlights
Methodological Innovation: This study is the first to propose using a cascaded diffusion model for RNA-to-Image synthesis and employs a single architecture to generate tissue slide images of multiple cancer types. This is more efficient compared to previous methods that required separate models for each cancer type.
Broad Application Prospects: Synthetic data can be used for data augmentation and pre-training of machine learning models, effectively enhancing the performance of actual tasks.
e) Other Valuable Information
Future research can integrate spatial transcriptomics technology to generate benchmark data for local RNA expression, further improving the model’s accuracy. Additionally, future research should strive to develop innovative computational strategies to handle tasks of generating higher resolution or entire slide images. Such advancements will further enhance the application value of the RNA-CDM model in cancer detection and classification.