Evaluating Generalizability of Oncology Trial Results to Real-World Patients Using Machine Learning-Based Trial Emulations


Academic Background

Randomized Controlled Trials (RCTs) are the gold standard for evaluating the efficacy of anti-cancer drugs, but their results often do not generalize directly to real-world oncology patients. RCTs typically apply strict eligibility criteria, so trial populations can differ substantially from the patients treated in routine practice. In addition, selection biases related to patients' prognostic risk further limit the generalizability of trial results. To address this problem, the researchers developed a framework called TrialTranslator, which uses machine learning models to stratify real-world cancer patients by prognostic risk and emulates RCTs to systematically evaluate the generalizability of trial results.

This study aims to answer the following questions: Do real-world cancer patients achieve the survival gains reported in RCTs? Do survival times and treatment benefits differ significantly among patient groups with different prognostic risks? By combining Electronic Health Records (EHRs) with machine learning, this study provides new tools for individualized treatment decision-making and offers important references for future clinical trial design.

Paper Source

The study was conducted by Xavier Orcutt, Kan Chen, Ronac Mamtani, Qi Long, and Ravi B. Parikh, among others. The authors are affiliated with institutions including the Navajo Indian Health Service, Harvard University, the University of Pennsylvania, and Emory University. The paper, titled “Evaluating generalizability of oncology trial results to real-world patients using machine learning-based trial emulations,” was published in Nature Medicine in February 2025.

Research Workflow

1. Study Design

The research is divided into two main steps:

Step One: Prognostic Model Development

The goal of this step is to develop machine learning models capable of predicting the mortality risk of cancer patients. The research team used EHR data from the Flatiron Health database, which contains patient data from approximately 280 cancer clinics across the United States. The study focused on four of the most common advanced solid tumors: non-small cell lung cancer (NSCLC), metastatic breast cancer (MBC), metastatic prostate cancer (MPC), and metastatic colorectal cancer (mCRC).

  • Data Preprocessing: The research team split patient feature data into training and test sets and evaluated the models at fixed time points after the diagnosis of metastatic cancer (1 year for NSCLC and 2 years for the other cancers).
  • Model Construction: The research team developed various machine learning models, including Gradient Boosting Survival Models (GBM), Random Survival Forests (RSF), Linear Support Vector Machines (SVM), and Penalized Cox Proportional Hazards Models (pCox). For comparison, the study also constructed benchmark models based on the classic Cox Proportional Hazards Model.
  • Model Evaluation: Model performance was assessed using the area under the time-dependent ROC curve (AUC). The results showed that GBM exhibited the highest predictive performance across all four cancer types.
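The time-dependent AUC can be illustrated with a simplified cumulative/dynamic definition: at a horizon t, a "case" is a patient who died by t, a "control" is a patient still alive after t, and the AUC is the fraction of case-control pairs in which the case received the higher risk score. The sketch below is a minimal Python illustration under that assumption only. It drops patients censored before the horizon instead of applying the inverse-probability-of-censoring weights that standard estimators use, and the function name `time_dependent_auc` and the toy cohort are hypothetical, not taken from the study.

```python
from typing import Sequence


def time_dependent_auc(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
    horizon: float,
) -> float:
    """Unweighted cumulative/dynamic AUC at `horizon` (hypothetical helper).

    Cases: died at or before `horizon` (event == 1).
    Controls: observed alive beyond `horizon`.
    Patients censored at or before `horizon` are simply dropped; real
    estimators correct for this with censoring weights.
    """
    cases = [
        r for t, e, r in zip(times, events, risk_scores) if t <= horizon and e == 1
    ]
    controls = [r for t, r in zip(times, risk_scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")

    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0  # higher risk correctly assigned to the case
            elif case_score == control_score:
                concordant += 0.5  # ties count as half-concordant
    return concordant / (len(cases) * len(controls))


# Hypothetical toy cohort: follow-up in months, 1 = death observed, model risk scores.
times = [3, 8, 14, 20, 30, 40]
events = [1, 1, 0, 1, 0, 1]
scores = [0.9, 0.7, 0.5, 0.6, 0.2, 0.3]
print(time_dependent_auc(times, events, scores, horizon=12))  # → 1.0
```

Here every patient who died within 12 months (the NSCLC horizon) has a higher score than every patient alive at 12 months, so the AUC is 1.0. A value of 0.5 corresponds to random ranking, which puts the reported 0.783 versus 0.689 in context.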

Step Two: Trial Emulation

The goal of this step is to simulate RCTs and assess treatment effects across different prognostic risk groups.

  • Eligibility Matching: The research team selected real-world patients from the Flatiron Health database who met the key eligibility criteria of each RCT, including the relevant cancer type, receipt of the specified line of therapy, and the required biomarker status.
  • Prognostic Phenotyping: The GBM model assigned each patient a mortality risk score, and patients were stratified by this score into three prognostic phenotypes: low-risk, medium-risk, and high-risk.
  • Survival Analysis: Treatment effects were calculated for each prognostic phenotype using Kaplan-Meier survival curves adjusted by Inverse Probability of Treatment Weighting (IPTW). The study used Restricted Mean Survival Time (RMST) and Median Overall Survival (mOS) as primary metrics.
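The phenotyping and survival-analysis steps can be sketched in plain Python. Several details are assumptions for illustration, not the study's documented procedure: phenotypes are formed by splitting risk scores into tertiles, propensity scores are taken as already estimated (for example by a logistic model of treatment on baseline covariates), weights are unstabilized inverse probabilities of treatment, and RMST is the area under each arm's weighted Kaplan-Meier curve up to a chosen horizon `tau`. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Curve = List[Tuple[float, float]]  # (event time, survival probability just after it)


def assign_phenotypes(scores: Sequence[float]) -> List[str]:
    """Assumed rule: bottom/middle/top third of risk scores -> low/medium/high."""
    labels = ["low", "medium", "high"]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    phenotypes = [""] * len(scores)
    for rank, i in enumerate(order):
        phenotypes[i] = labels[rank * 3 // len(scores)]
    return phenotypes


def iptw_weights(treated: Sequence[int], propensity: Sequence[float]) -> List[float]:
    """Unstabilized IPTW: 1/e for treated patients, 1/(1 - e) for controls."""
    return [1.0 / p if a == 1 else 1.0 / (1.0 - p) for a, p in zip(treated, propensity)]


def weighted_km(times: Sequence[float], events: Sequence[int],
                weights: Sequence[float]) -> Curve:
    """Kaplan-Meier estimator in which each patient counts with its weight."""
    data = sorted(zip(times, events, weights))
    at_risk = sum(weights)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0.0
        while i < len(data) and data[i][0] == t:  # pool all records tied at t
            _, event, w = data[i]
            deaths += w if event == 1 else 0.0
            removed += w
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # deaths and censorings leave the risk set after t
    return curve


def rmst(curve: Curve, tau: float) -> float:
    """Area under the step-function survival curve from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


def rmst_difference(times: Sequence[float], events: Sequence[int],
                    treated: Sequence[int], propensity: Sequence[float],
                    tau: float) -> float:
    """IPTW-weighted RMST(treated) - RMST(control) within one phenotype."""
    w = iptw_weights(treated, propensity)
    arm_rmst: Dict[int, float] = {}
    for arm in (1, 0):
        idx = [i for i, a in enumerate(treated) if a == arm]
        curve = weighted_km([times[i] for i in idx], [events[i] for i in idx],
                            [w[i] for i in idx])
        arm_rmst[arm] = rmst(curve, tau)
    return arm_rmst[1] - arm_rmst[0]


# Hypothetical toy data for a single phenotype (months; all deaths observed).
print(rmst_difference([2, 4, 6, 1, 2, 3], [1] * 6, [1, 1, 1, 0, 0, 0], [0.5] * 6, tau=6))
```

In a full emulation, `rmst_difference` would be run separately within each phenotype returned by `assign_phenotypes`. Because the Kaplan-Meier estimate in each arm is unchanged by rescaling that arm's weights, stabilized weights would give the same curves here.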

2. Research Results

Prognostic Model Development

The GBM model demonstrated the highest predictive performance across all four cancer types. For example, in NSCLC, the 1-year survival AUC for GBM was 0.783, compared with 0.689 for the benchmark Cox model. Predictive features included age, weight change, ECOG performance status, cancer biomarkers, and serum markers (such as albumin and hemoglobin).

Trial Emulation

The study emulated 11 key RCTs covering four cancer types. The results showed that survival times and treatment benefits for low-risk and medium-risk patients were similar to those reported in RCTs, while survival times and treatment benefits for high-risk patients were significantly lower than those reported in RCTs. In more than half of the emulated trials, the treatment effect (RMST or mOS difference) for high-risk patients was less than 3 months, whereas low-risk and medium-risk patients were more likely to achieve clinically meaningful survival benefits.
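The 3-month comparison can be made concrete for the median-based metric. A minimal sketch, assuming each arm's (IPTW-weighted) Kaplan-Meier curve is available as a list of `(time, survival)` points in months: median overall survival is read off as the first time the curve falls to 0.5 or below, and the mOS difference is compared with a 3-month cut-off. The function names, the toy curves, and the convention of returning `None` when a median is not reached are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Curve = List[Tuple[float, float]]  # (time in months, survival just after that time)


def median_survival(curve: Curve) -> Optional[float]:
    """First time at which the step-function survival drops to <= 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


def mos_benefit(treated: Curve, control: Curve,
                threshold_months: float = 3.0) -> Optional[Tuple[float, bool]]:
    """Return (mOS difference, whether it reaches the threshold), or None."""
    m_treated, m_control = median_survival(treated), median_survival(control)
    if m_treated is None or m_control is None:
        return None  # difference is undefined if either median is not reached
    diff = m_treated - m_control
    # A gain of at least `threshold_months` is treated as meaningful here,
    # mirroring the 3-month figure in the text.
    return diff, diff >= threshold_months


# Hypothetical curves for one prognostic phenotype.
treated_curve = [(2.0, 0.8), (5.0, 0.6), (9.0, 0.45), (12.0, 0.2)]
control_curve = [(1.0, 0.7), (4.0, 0.5), (8.0, 0.3)]
print(mos_benefit(treated_curve, control_curve))  # → (5.0, True)
```

The same threshold check applies to an RMST difference. RMST has the advantage of always being defined up to the chosen horizon, whereas the median may not be reached in low-risk groups.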

3. Conclusion

This study shows that the survival times and treatment benefits reported in RCTs generalize well to some patient groups, particularly low-risk and medium-risk patients. For high-risk patients, however, survival times and treatment benefits are significantly lower than those reported in RCTs. This finding highlights the need for more sophisticated assessment of patients' prognosis at enrollment when designing clinical trials, so that trial results generalize better to real-world patients.

4. Research Highlights

  • Innovative Methodology: The TrialTranslator framework developed by the research team combines EHR data and machine learning technology to systematically evaluate the generalizability of RCT results.
  • Individualized Treatment Decisions: This framework provides support for individualized treatment decisions for clinicians and patients, helping them better understand the expected benefits of new therapies.
  • Optimization of Clinical Trial Design: The study results offer important references for future clinical trial design, suggesting the use of more complex prognostic evaluation methods during trial enrollment to enhance the generalizability of trial results.

5. Other Valuable Information

The research team also developed a web tool named TrialTranslator (https://www.trialtranslator.com/), allowing users to input patient information to obtain prognostic phenotypes and survival estimates from emulated trials. This tool is intended for research purposes, helping clinicians and patients better understand treatment options and expected benefits.

Summary

This study systematically evaluates the generalizability of RCT results to real-world cancer patients by combining EHR data with machine learning. The results indicate that prognostic risk stratification plays a significant role in predicting patients' survival times and treatment benefits. The study provides new tools and methods for individualized treatment decisions and clinical trial design, and it has substantial scientific and practical value.