Federated analysis

On this page: federated analysis, federated learning, machine learning, distributed analysis, distributed learning, collaboration, harmonisation
Date of last review: 2023-04-02

Federated analysis extends the code-to-data scenario to cases where the data of interest are owned by multiple organisations. The data remain with each data provider, and the analysis script “travels” to each data source. Only the results of the local analyses are shared, and these are combined in a central location. Because intermediate results can also reveal sensitive information, techniques exist to hide them where necessary. If the script trains a machine learning model, the technique is called “federated (machine) learning”. You can learn more about federated analysis in this article.
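
As an illustration of the idea, here is a minimal sketch in Python of a federated mean. It assumes three hypothetical sites (`site_a`, `site_b`, `site_c`) and invented function names; it does not use the API of any real federated analysis platform. Each site computes only a sum and a count on its own data, and the central location combines those aggregates without ever seeing individual records.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LocalSummary:
    """Aggregate statistics that leave a site; no individual records."""
    total: float
    count: int


def local_summary(values: Sequence[float]) -> LocalSummary:
    # Runs inside each data provider's environment, on its own data only.
    return LocalSummary(total=float(sum(values)), count=len(values))


def combine(summaries: Sequence[LocalSummary]) -> float:
    # Runs centrally; it only ever sees the per-site aggregates.
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no observations at any site")
    return sum(s.total for s in summaries) / n


# Hypothetical data held at three separate organisations (illustrative values).
site_data = {
    "site_a": [4.0, 6.0],
    "site_b": [10.0],
    "site_c": [1.0, 2.0, 3.0],
}

summaries = [local_summary(values) for values in site_data.values()]
federated_mean = combine(summaries)
```

For a mean, the federated result equals the result on the pooled data. Note that a summary from a very small site (such as the single record at `site_b`) can itself be disclosive, which is why techniques for hiding intermediate results may be needed.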

When to use

Federated analysis is useful when there are multiple data providers who do not allow transferring their data outside of the organisation, or whose data are simply too large to share.

Implications for research

  • A prerequisite for analysing data in this way is often that the data at the different providers are similarly structured and use similar terminology (e.g., making sure that every party uses “male”, “female”, and “other” as levels for the variable Gender, instead of “girl” and “boy”, or 0 and 1).
  • Federated analysis works best for “horizontally partitioned” datasets, where different organisations have the same (types of) information, but from different people. It is not well-suited for “vertically partitioned” datasets, where the different organisations have different (types of) information on the same people and thus want to link those different data sources.
  • Setting up the infrastructure for federated analysis is challenging and can take considerable time (software installation, access rights, linking datasets, etc.). It is wise to first investigate whether this option is indeed the most suitable for your project.
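
The harmonisation prerequisite in the first point above can be sketched as a recoding step that each provider applies before any analysis runs. The site names and codebooks below are hypothetical, and in particular the meaning of the codes 0, 1, and 2 is an assumption made for illustration; a real mapping has to come from each provider's own data dictionary.

```python
from typing import Dict, Hashable, List, Sequence

# Levels that all parties have agreed to use for the variable Gender.
SHARED_LEVELS = {"male", "female", "other"}

# Hypothetical per-site codebooks (assumed for illustration only).
SITE_CODEBOOKS: Dict[str, Dict[Hashable, str]] = {
    "site_a": {"boy": "male", "girl": "female", "other": "other"},
    "site_b": {0: "female", 1: "male", 2: "other"},  # assumed coding
}


def harmonise_gender(site: str, raw_values: Sequence[Hashable]) -> List[str]:
    """Recode one site's raw Gender values to the shared levels."""
    codebook = SITE_CODEBOOKS[site]
    harmonised = []
    for value in raw_values:
        if value not in codebook:
            # Fail loudly instead of guessing what an unknown code means.
            raise ValueError(f"{site}: no mapping for value {value!r}")
        harmonised.append(codebook[value])
    return harmonised


site_a_gender = harmonise_gender("site_a", ["boy", "girl", "girl"])
site_b_gender = harmonise_gender("site_b", [0, 1, 2])
```

Raising an error on unmapped values is a deliberate choice in this sketch: silently dropping or guessing codes would make the sites' results incomparable without anyone noticing.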

Examples

  • A research team needs access to various datasets containing health data to determine which factors contribute to the health of Covid-19 patients at various hospitals. Each dataset contains health data from the patients treated at one hospital. Since each dataset contains sensitive personal data, it is not desirable to store the datasets in a central location in order to combine them. To answer the research question, the team therefore needs to analyse each dataset separately and combine the results. To make this possible, each hospital provides a computing facility. The research team submits their script to each of the computing facilities, where it is run on the local dataset. After each hospital’s staff have checked that the results do not contain any sensitive details, the results of the individual computations are combined centrally into one result. In this example, the result of the calculation at each hospital is a prediction model for Covid-19 patients, and the individual models are combined to create a more reliable prediction model.
  • Several university medical centres use the Personal Health Train from Health-RI, which relies on the vantage6 software.
  • DataSHIELD is an infrastructure and a series of R packages that allow researchers to co-analyse data hosted at different organisations. It requires harmonising the data at the different organisations and setting up the DataSHIELD infrastructure.
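
The final step of the first example, combining the per-hospital models, could look roughly like the Python sketch below. The text does not say how the models are combined, so the method here is an assumption: each hospital fits a simple linear model locally, and the coefficients are averaged with weights proportional to the number of local patients. The hospital names and data are invented, and the code does not use the vantage6 or DataSHIELD APIs.

```python
from typing import List, Sequence, Tuple

# A local model is ((intercept, slope), number_of_patients).
LocalModel = Tuple[Tuple[float, float], int]


def fit_local(xs: Sequence[float], ys: Sequence[float]) -> LocalModel:
    # Runs at the hospital: ordinary least squares for y = intercept + slope * x.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return (intercept, slope), n


def combine_models(models: Sequence[LocalModel]) -> List[float]:
    # Runs centrally: sample-size-weighted average of the coefficients.
    total_n = sum(n for _, n in models)
    combined = [0.0, 0.0]
    for coefficients, n in models:
        for i, value in enumerate(coefficients):
            combined[i] += value * n / total_n
    return combined


# Hypothetical local datasets; only the fitted models leave each hospital.
hospital_a = fit_local([0.0, 1.0], [1.0, 3.0])
hospital_b = fit_local(
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
)

combined_intercept, combined_slope = combine_models([hospital_a, hospital_b])
```

Unlike the mean of a variable, averaged coefficients are generally not identical to the coefficients of a model fitted on the pooled data, so this kind of combination is an approximation. Iterative schemes, in which sites exchange model updates over several rounds, are commonly used to get closer to the pooled result.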