Generalization of Deep Learning in Digital Pathology: Experience in Breast Cancer Metastasis Detection.
Sofia Jarkman, Micael Karlberg, Milda Pocevičiūtė, Anna Bodén, Péter Bándi, Geert Litjens, Claes Lundström, Darren Treanor, Jeroen van der Laak
Author Information
Sofia Jarkman: Department of Clinical Pathology, and Department of Biomedical and Clinical Sciences, Linköping University, 581 83 Linköping, Sweden.
Micael Karlberg: Center for Medical Image Science and Visualization (CMIV), Linköping University, 581 85 Linköping, Sweden.
Milda Pocevičiūtė: Center for Medical Image Science and Visualization (CMIV), Linköping University, 581 85 Linköping, Sweden.
Anna Bodén: Department of Clinical Pathology, and Department of Biomedical and Clinical Sciences, Linköping University, 581 83 Linköping, Sweden.
Péter Bándi: Department of Pathology, Radboud University Medical Center, P.O. Box 9101, 6500 HB Nijmegen, The Netherlands.
Geert Litjens: Department of Pathology, Radboud University Medical Center, P.O. Box 9101, 6500 HB Nijmegen, The Netherlands.
Claes Lundström: Center for Medical Image Science and Visualization (CMIV), Linköping University, 581 85 Linköping, Sweden.
Darren Treanor: Department of Clinical Pathology, and Department of Biomedical and Clinical Sciences, Linköping University, 581 83 Linköping, Sweden.
Jeroen van der Laak: Center for Medical Image Science and Visualization (CMIV), Linköping University, 581 85 Linköping, Sweden.
Poor generalizability is a major barrier to clinical implementation of artificial intelligence in digital pathology. The aim of this study was to test the generalizability of a pretrained deep learning model to a new diagnostic setting and to a small change in surgical indication. A deep learning model for breast cancer metastasis detection in sentinel lymph nodes, trained on CAMELYON multicenter data, was used as the base model. It achieved an AUC of 0.969 (95% CI 0.926-0.998) and a FROC of 0.838 (95% CI 0.757-0.913) on CAMELYON16 test data. On local sentinel node data, the base model's performance dropped to an AUC of 0.929 (95% CI 0.800-0.998) and a FROC of 0.744 (95% CI 0.566-0.912). On data with a change in surgical indication (axillary dissections), the base model's performance dropped further, to a FROC of 0.503 (95% CI 0.201-0.911). The model was retrained with the addition of local data, resulting in an increase of about 4% in both AUC and FROC for sentinel nodes, and an increase of 11% in AUC and 49% in FROC for axillary nodes. Qualitative evaluation of the retrained model's output by a pathologist showed no missed positive slides. False positives, false negatives, and one previously undetected micro-metastasis were observed. The study highlights the generalization challenge even when using a model trained on multicenter data, and shows that a small change in indication can considerably affect the model's performance.
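As an illustration of how a slide-level AUC with a 95% confidence interval of the kind reported above can be obtained, the following is a minimal sketch using a percentile bootstrap. The abstract does not state how the intervals were computed, so the bootstrap is an assumption here, not the authors' method. The function names `auc` and `bootstrap_auc_ci` and the toy labels and scores are hypothetical and are not study data.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative slide")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap interval for the AUC.
    Slides are resampled with replacement (an assumed procedure)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains one class only; AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auc(labels, scores), lower, upper


if __name__ == "__main__":
    # Hypothetical slide labels (1 = metastasis present) and model scores.
    toy_labels = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
    toy_scores = [0.05, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90, 0.95]
    point, lower, upper = bootstrap_auc_ci(toy_labels, toy_scores)
    print(f"AUC {point:.3f} (95% CI {lower:.3f}-{upper:.3f})")
```

With few slides the interval becomes wide, which is consistent with the broad intervals reported for the smaller local datasets. The FROC score would need lesion-level annotations and is not covered by this sketch.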