Zhaoxiang Cai (Westmead / AU), Emma Boys (Westmead / AU), Zainab Noor (Westmead / AU), Adel Aref (Westmead / AU), Dylan Xavier (Westmead / AU), Natasha Lucas (Westmead / AU), Steven Williams (Westmead / AU), Jennifer Koh (Westmead / AU), Rebecca Poulos (Westmead / AU), Yangxiu Wu (Westmead / AU), Michael Dausmann (Westmead / AU), Karen MacKenzie (Westmead / AU), Adriana Aguilar (Montreal / CA), Carolina Niell (Badalona / ES; Barcelona / ES), Maria Barranco (Badalona / ES; Barcelona / ES), Mark Basik (Montreal / CA), Elise Bowman (Bethesda, MD / US), Rory Clifton-Bligh (Sydney / AU; St Leonards / AU), Elizabeth Connolly (Westmead / AU), Wendy Cooper (Camperdown / AU), Bhavik Dalal (Bethesda, MD / US), Anna de Fazio (Westmead / AU; Woolloomooloo / AU), Martin Filipits (Vienna / AT), Peter Flynn (Kingswood / AU), J Dinny Graham (Westmead / AU), Jacob George (Sydney / AU; Westmead / AU), Anthony Gill (Sydney / AU; St Leonards / AU; Westmead / AU), Michael Gnant (Vienna / AT), Rosemary Habib (Westmead / AU; Blacktown / AU), Curtis Harris (Bethesda, MD / US), Kate Harvey (Darlinghurst / AU; Sydney / AU), Lisa Horvath (Sydney / AU; Camperdown / AU; Darlinghurst / AU), Christopher Jackson (Brisbane / AU), Maija Kohonen-Corish (Sydney / AU), Elgene Lim (Darlinghurst / AU), Georgina Long (Sydney / AU), Reginald Lord (Darlinghurst / AU), Graham Mann (Canberra / AU), Geoff McCaughan (Camperdown / AU), Lucy Morgan (Sydney / AU), Leigh Murphy (Winnipeg / CA), Adnan Nagrial (Sydney / AU; Westmead / AU), Ben Panizza (Brisbane / AU), Jas Samra (Sydney / AU; St Leonards / AU), Richard Scolyer (Sydney / AU), Ioannis Souglakos (Crete / GR), Alexander Swarbrick (Darlinghurst / AU; Sydney / AU; Crete / GR), David Thomas (Sydney / AU), Peter G. Hains (Westmead / AU), Rosemary Balleine (Westmead / AU), Phillip J. Robinson (Westmead / AU), Qing Zhong (Westmead / AU), Roger R. Reddel (Westmead / AU)
Introduction: The availability of large quantities of online data is underpinning the rapid development of many artificial intelligence (AI) applications. However, in biological and healthcare settings, AI applications face challenges, including data privacy and security. Federated learning (FL), a distributed machine learning method, enables analysis of large biomedical datasets while preserving data privacy, confidentiality and regulatory compliance. In this approach, local models are trained on data held at each of several participating sites, with the data remaining private behind local firewalls. Local model updates are then aggregated on a central server to improve the global model.
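The aggregation step described above can be illustrated with a minimal sketch. It assumes a sample-size-weighted average of local parameters (FedAvg-style); the abstract does not state which aggregation rule ProCanFL uses, and the function name `federated_average` and the flat-list parameter representation are hypothetical, chosen only for illustration.

```python
def federated_average(local_params, sample_counts):
    """Combine per-site parameter vectors into one global vector.

    ASSUMPTION: a sample-size-weighted mean (FedAvg-style). The source
    does not specify the aggregation rule actually used.

    local_params  -- one flat list of floats per site, all the same length
    sample_counts -- number of training samples held at each site
    """
    if len(local_params) != len(sample_counts):
        raise ValueError("need one sample count per site")
    total = sum(sample_counts)
    n_params = len(local_params[0])
    # Only parameters and counts reach the server; raw data stay on site.
    return [
        sum(params[i] * n for params, n in zip(local_params, sample_counts)) / total
        for i in range(n_params)
    ]


# Two hypothetical sites holding 1 and 3 samples respectively.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In this sketch a site with more samples pulls the global parameters further towards its own update, which is one common way to balance sites of unequal size.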
Methods: We obtained 30 distinct tissue sample cohorts, including a pan-cancer cohort (N = 1,260) from biobanks in Australia, with the remainder (N = 6,265) being largely single-cancer cohorts from centres in North America, Europe and Australia. We quantified 9,051 proteins across 7,525 tissue samples encompassing 57 cancer subtypes, using data-independent acquisition mass spectrometry (DIA-MS). From these data we developed ProCanFL, a deep learning-based FL model trained on multiple simulated local sites using samples from the pan-cancer cohort and 90% of the samples from the 29 other cohorts. We simulated four local sites. The first site was fixed and housed the proteomic data and histopathology labels of the pan-cancer cohort, while sites 2-4 contained data from random combinations of the 29 cohorts. This randomization process was repeated 10 times to create 10 cohort allocations. A local model was trained at each site, and the global model incorporated parameter updates from the local models.
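The site simulation described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract says that sites 2-4 received random combinations of the 29 cohorts but not how the combinations were drawn, so the sketch assumes a disjoint, near-equal split after shuffling. The function name `allocate_cohorts`, the integer cohort identifiers and the seed are all hypothetical.

```python
import random


def allocate_cohorts(cohort_ids, site_ids=(2, 3, 4), n_repeats=10, seed=0):
    """Randomly assign cohorts to the non-fixed sites, repeated n_repeats times.

    ASSUMPTION: each cohort goes to exactly one site and sites receive
    near-equal numbers of cohorts. Site 1 (the pan-cancer cohort) is
    fixed and therefore not part of the shuffle.
    """
    rng = random.Random(seed)
    allocations = []
    for _ in range(n_repeats):
        shuffled = list(cohort_ids)
        rng.shuffle(shuffled)
        # Deal the shuffled cohorts round-robin across the sites.
        allocation = {
            site: shuffled[k::len(site_ids)] for k, site in enumerate(site_ids)
        }
        allocations.append(allocation)
    return allocations


# 29 hypothetical cohort identifiers, allocated 10 times to sites 2-4.
allocations = allocate_cohorts(range(1, 30))
```

Each element of `allocations` maps a site number to the cohorts it holds in that repeat, so a local model could be trained per site and per allocation.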
Results: We used a hold-out test set (10% of the samples from the 29 cohorts) to evaluate the performance of individual local models in predicting 15 cancer subtypes for which a minimum number of samples was represented in the proteomic dataset. The average accuracy of the local models across the four sites, evaluated on the unseen samples in the test set, was 0.68±0.14. After application of ProCanFL, the global model showed significantly enhanced predictive performance, achieving an accuracy of 0.97±0.01.
Conclusion: We demonstrated that FL enables a global model to be trained using local model updates from multiple decentralized sites containing private data, achieving high performance in cancer subtype prediction. Our study underscores the potential role of FL in the biomedical domain, where data sharing among global consortium members is often limited by several barriers, including country-specific data regulations. Further, the prediction accuracy from the FL model can be improved as more local sites and data are added. By harnessing the full wealth of large global datasets while keeping local data behind firewalls, FL offers a promising solution for developing omic foundation models and advancing the future of digital health.