Deep Learning Algorithms for Breast Cancer Detection in a UK Screening Cohort: As Stand-Alone Readers and Combined with Human Readers

Deep Learning Algorithms in Breast Cancer Screening

Academic Background

Breast cancer is one of the most common cancers among women worldwide, and early detection through screening is crucial for improving outcomes. Traditional Computer-Aided Detection (CAD) systems have been widely used in mammographic screening, particularly in the United States. However, while these systems have increased recall rates, they have done little to improve reader (i.e., radiologist) performance. In recent years, the application of Deep Learning (DL) algorithms to medical image analysis has grown rapidly, especially in breast cancer screening. Several systematic reviews and meta-analyses have documented a rapid increase since 2017 in evidence supporting the use of DL algorithms in mammographic screening. Although some studies have shown that DL algorithms perform no worse than human readers when used as single readers, no standalone algorithm has yet outperformed the standard double-reading system while maintaining acceptable recall rates. Therefore, DL algorithms currently cannot entirely replace human readers in double-reading systems.

However, existing studies have several limitations, such as the use of small test cohorts, lack of external validation, and absence of preset performance thresholds. Additionally, many studies do not include data on interval cancers and cancers detected in the next round of screening, which are essential for evaluating the effectiveness of DL algorithms in early detection. Therefore, this study aims to validate the performance of three DL algorithms in mammographic screening using an independent external dataset, exploring their performance as standalone readers and in combination with human readers.

Source of the Paper

This paper was authored by Sarah E. Hickman et al., from the Department of Radiology at the University of Cambridge School of Clinical Medicine, the Royal London Hospital, Cambridge University Hospitals NHS Foundation Trust, and other institutions. The paper, titled "Deep Learning Algorithms for Breast Cancer Detection in a UK Screening Cohort: As Stand-Alone Readers and Combined with Human Readers", was published in November 2024 in the journal Radiology.

Research Process and Results

Research Process

This retrospective study used mammographic data from two UK screening sites (Cambridge and Norwich) spanning January to December 2017. The study included 26,722 cases, of which 332 were screen-detected cancers, 174 were interval cancers, and 254 were cancers detected in the next round of screening. The primary objective was to validate the performance of three commercial DL algorithms (DL-1, DL-2, and DL-3) as standalone readers and in combination with human readers.
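
For a sense of scale, the cohort figures above can be turned into rates per 1,000 screened cases. The sketch below is plain arithmetic on the reported counts. The names are illustrative, and it assumes the three cancer categories do not overlap, which the summary does not state explicitly.

```python
# Cohort counts as reported in the summary above.
TOTAL_CASES = 26_722
CANCERS = {
    "screen-detected": 332,
    "interval": 174,
    "next-round": 254,
}


def per_thousand(count: int, total: int = TOTAL_CASES) -> float:
    """Return the number of cases per 1,000 screened."""
    return 1000 * count / total


for label, count in CANCERS.items():
    print(f"{label}: {count} cases, {per_thousand(count):.1f} per 1,000 screened")

# Assumption: the categories are mutually exclusive, so they can be summed.
all_cancers = sum(CANCERS.values())
print(f"all cancers: {all_cancers} cases, {per_thousand(all_cancers):.1f} per 1,000 screened")
```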

The study was conducted in the following steps:

  1. Data Collection and Processing: The study used mammographic data from the Cambridge Cohort–Mammography East Anglia Digital Imaging Archive (CC-MEDIA) database. All images were stored in DICOM format and included corresponding clinical metadata. Cases that did not meet the inclusion criteria, such as those lacking two-view mammograms or ground truth labels, were excluded.

  2. Deployment and Evaluation of DL Algorithms: The three DL algorithms were deployed at the Cambridge research institution from January to June 2022 and evaluated using the study dataset. Details of the algorithm training were described in a previous publication.

  3. Performance Evaluation: The study prespecified a specificity threshold matching that of a single human reader (96.5%) and evaluated the DL algorithms at that threshold, both as standalone readers and in combination with human readers. The primary evaluation metrics were sensitivity and specificity, with statistical significance defined as p < 0.025.
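
Step 3 fixes an operating point before readers are compared. One way this could be done is sketched below: pick the score cutoff at which the algorithm's specificity on non-cancer cases reaches the target, then report sensitivity and specificity at that cutoff. This is a minimal illustration under stated assumptions, not the study's procedure. The function names, the toy scores, and the 80% target are hypothetical (the study's target was 96.5%; 80% is used only so that a ten-case example works), and in practice the cutoff would be set on data separate from the evaluation set.

```python
import bisect
import math
from typing import Sequence, Tuple


def threshold_for_specificity(negative_scores: Sequence[float], target: float) -> float:
    """Lowest observed score t such that recalling cases with score >= t
    leaves at least `target` of the non-cancer cases un-recalled."""
    ordered = sorted(negative_scores)
    if not ordered:
        raise ValueError("need at least one non-cancer score")
    for t in sorted(set(ordered)):
        below = bisect.bisect_left(ordered, t)  # number of scores strictly below t
        if below / len(ordered) >= target:
            return t
    return math.inf  # no observed cutoff reaches the target, so recall nothing


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when cases with score >= threshold are recalled."""
    tp = fn = tn = fp = 0
    for score, is_cancer in zip(scores, labels):
        recalled = score >= threshold
        if is_cancer:
            tp += recalled
            fn += not recalled
        else:
            fp += recalled
            tn += not recalled
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy scores: 10 non-cancer cases and 4 cancer cases.
negatives = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.90]
positives = [0.30, 0.55, 0.70, 0.95]
scores = negatives + positives
labels = [False] * len(negatives) + [True] * len(positives)

cutoff = threshold_for_specificity(negatives, target=0.80)
sens, spec = sensitivity_specificity(scores, labels, cutoff)
print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Fixing the cutoff first and measuring sensitivity afterwards is what makes the comparison with a human reader at the same specificity meaningful.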

Key Results

  1. Comparison of Standalone DL Reading with Single Human Reading: At the preset threshold, DL-1 and DL-3 achieved sensitivities of 64.8% and 58.9%, respectively, both noninferior to the single human reader (62.8%). DL-1 and DL-2 achieved specificities of 92.8% and 96.8%, respectively, both noninferior to the single human reader (96.5%), while DL-3 achieved a specificity of 97.9%, superior to the single human reader.

  2. Comparison of Combined DL and Human Reading with Double Reading: When combined with human readers, the DL algorithms achieved sensitivities of 67.0%, 65.6%, and 65.4%, all noninferior to the double-reading system (67.4%). The specificities were 97.4%, 97.6%, and 97.6%, all superior to the double-reading system (97.1%). However, the arbitration rate (i.e., the proportion of cases requiring review due to discordant reader decisions) increased when DL was combined with human reading.

  3. Detection of Interval and Next-Round Cancers: The DL algorithms outperformed human readers in detecting interval and next-round cancers. DL-1, DL-2, and DL-3 detected 23.6%, 13.2%, and 13.2% of interval cancers, respectively, and 23.2%, 12.6%, and 7.1% of next-round cancers, respectively, while the human reader detected only 9.2% of interval cancers and 5.1% of next-round cancers.
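
The arbitration rate in result 2 depends on how the two reads are combined. The sketch below shows one simple scheme consistent with the definition given above: concordant decisions stand, and discordant decisions go to an arbitrator. It is an assumption-laden illustration, not the protocol used at the study sites, which this summary does not describe. The `Case` fields, the function names, and the five toy cases are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Case:
    reader_recall: bool   # human first reader's decision
    dl_recall: bool       # DL algorithm's decision at its preset threshold
    arbiter_recall: bool  # what an arbitrator would decide if consulted


def combined_read(case: Case) -> Tuple[bool, bool]:
    """Return (final_recall, went_to_arbitration) for one case."""
    if case.reader_recall == case.dl_recall:
        return case.reader_recall, False  # concordant: decision stands
    return case.arbiter_recall, True      # discordant: arbitrator decides


def summarize(cases: Iterable[Case]) -> Tuple[float, float]:
    """Return (recall_rate, arbitration_rate) across all cases."""
    results = [combined_read(c) for c in cases]
    n = len(results)
    recall_rate = sum(recalled for recalled, _ in results) / n
    arbitration_rate = sum(arbitrated for _, arbitrated in results) / n
    return recall_rate, arbitration_rate


# Hypothetical toy cases, not study data.
cases = [
    Case(True, True, True),     # both recall
    Case(False, False, False),  # both clear
    Case(True, False, False),   # discordant, arbitrator clears
    Case(False, True, True),    # discordant, arbitrator recalls
    Case(False, False, True),   # both clear, arbitrator never consulted
]
recall_rate, arbitration_rate = summarize(cases)
print(f"recall rate={recall_rate:.2f} arbitration rate={arbitration_rate:.2f}")
```

Under this scheme, every disagreement between the human reader and the algorithm adds an arbitration, which is one way a combined workflow can raise the arbitration rate even when its final sensitivity and specificity match double reading.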

Conclusion

This study indicates that commercial DL algorithms can perform comparably to a single human reader as standalone readers: DL-1 and DL-3 were noninferior in sensitivity, and all three algorithms were noninferior or superior in specificity. When combined with human readers, the algorithms achieved sensitivity noninferior to the double-reading system, with higher specificity. These findings support the use of DL algorithms as a supplement to human readers, with the potential to reduce reading workload and improve screening efficiency, although the higher arbitration rate observed with combined reading may offset part of that saving. DL algorithms currently cannot entirely replace human readers in double-reading systems, and future research needs to explore how best to apply them in different screening programs.

Research Highlights

  1. Independent Validation: The study validates three commercial DL algorithms on an independent external dataset, which supports the reliability and generalizability of the results.
  2. Multicenter Data: The study used data from two UK screening sites, covering mammographic equipment from different manufacturers, enhancing the broad applicability of the results.
  3. Detection of Interval and Next-Round Cancers: The DL algorithms outperformed human readers in detecting interval and next-round cancers, demonstrating their potential in early cancer detection.
  4. Advantages of Combining DL with Human Reading: When combined with human readers, the DL algorithms achieved sensitivity noninferior to, and specificity higher than, the double-reading system, with the potential to reduce reading workload at the cost of a higher arbitration rate. This offers new insights for future screening programs.

Research Significance

This study provides important empirical support for the application of DL algorithms in breast cancer screening, demonstrating that they can serve as an effective supplement to human readers, reducing workload and improving screening efficiency. Future research needs to further explore the optimal application of DL algorithms in different screening programs and evaluate their long-term effectiveness in real clinical settings.