Project R-15179

Title

Statistical modelling of federated data through sufficient statistics (Research)

Abstract

Understanding data and extracting useful information from them are goals shared by data providers and data analysts. However, both parties must also respect the right to data privacy of the individuals from whom the data were collected. This restricts how much and what kind of data the providers can disclose to the analysts, which is a hurdle because classical estimation requires individual-level data to provide inferences that can be interpreted at the individual level. Federated learning tackles this hurdle by estimating parameters without retrieving individual-level data. Instead, it requires iterative communication of parameter-estimate updates between the data providers and the analysts. In this research, we propose an alternative framework to federated learning for fitting commonly used statistical models such as generalized linear mixed models (GLMMs). Specifically, our approach aims to use only summary statistics, supplied once by each data provider, thus eliminating iterative communication. It involves generating pseudo-data that match the supplied summary statistics and using them in the model estimation process in place of the unavailable actual data. We aim to accommodate models with multiple covariates, which may be a combination of categorical and continuous variables. We perform simulation experiments to evaluate the quality of the estimates produced by our proposed strategy and demonstrate its utility using publicly available real data. Simplicity, communication efficiency, generalisability, and a wider scope of implementation, in that it can be carried out in any statistical software, distinguish our approach from existing strategies in the literature. The approach is not limited to medical research; it can also be applied in other fields of study in which individual-level data are inaccessible.
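
As an illustration only, the pseudo-data idea described in the abstract can be sketched for the simplest possible case: an ordinary linear regression with continuous variables and a single data provider. This is not the project's actual procedure, which targets GLMMs, several providers, and mixed covariate types. The function names `pseudo_data` and `ols`, the simulated "real" data, and the choice of the sample mean and covariance as the summary statistics are all assumptions made for this sketch. Because least-squares coefficients depend on the data only through the sample size, means, and covariance matrix, pseudo-data that reproduce those summaries exactly should give the same coefficients up to floating-point error.

```python
import numpy as np


def pseudo_data(mean, cov, n, seed=0):
    """Hypothetical helper: return n rows whose sample mean and sample
    covariance (ddof=1) equal `mean` and `cov` exactly. Requires n > len(mean)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, mean.size))
    z -= z.mean(axis=0)                                   # exact zero mean
    whiten = np.linalg.cholesky(np.cov(z, rowvar=False))
    z = np.linalg.solve(whiten, z.T).T                    # exact identity covariance
    colour = np.linalg.cholesky(cov)
    return z @ colour.T + mean                            # impose target summaries


def ols(data):
    """Least-squares fit of the last column on the others, with an intercept."""
    X = np.column_stack([np.ones(len(data)), data[:, :-1]])
    return np.linalg.lstsq(X, data[:, -1], rcond=None)[0]


# Simulated stand-in for individual-level data held by a provider (assumption).
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.3, size=n)
real = np.column_stack([x, y])

# The provider discloses only these summaries, once.
summary_mean = real.mean(axis=0)
summary_cov = np.cov(real, rowvar=False)

# The analyst builds pseudo-data from the summaries and fits the model on them.
pseudo = pseudo_data(summary_mean, summary_cov, n)
beta_real = ols(real)
beta_pseudo = ols(pseudo)
print(beta_real)
print(beta_pseudo)
```

In this linear setting the agreement is exact by construction. For GLMMs the likelihood generally does not reduce to means and covariances, so the quality of the resulting estimates is the kind of question the simulation experiments mentioned in the abstract would need to address.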

Period of project

16 September 2024 - 31 August 2026