Code-to-data (one data provider)

On this page: code-to-data, script-to-data, algorithm-to-data, tools-to-data, SANE, digital research environment, secure research environment, virtual research environment, access control
Date of last review: 2023-04-02

In this scenario, an analysis is run on the data without transferring the data outside the organisation. In many cases, only the results of the analysis can be exported, not the data itself.

We distinguish the following versions of this scenario:

‘Tinker’ version: interaction with the data
In the Tinker version, users can log in to the computing facility and directly interact with the data, but there may be technical limitations on the import and export of the data. Procedural limitations should be imposed through agreements with the user. This version can be implemented in multiple ways, such as:
  • Accessing and analysing locally stored data on premises. An example is analysing highly sensitive data in a dedicated room without an internet connection.
  • Accessing locally stored data through remote desktop. This usually does not impose technical limitations on what can be done with the data.
  • Working in a Virtual Research Environment (VRE), a temporary facility where you can interactively perform computations on data in the cloud. In this case, it is sometimes possible to impose technical limitations on what can be done with the data (such environments are called “Trusted Research Environments”). Examples of VREs are SURF Research Cloud and anDREa.
‘Blind’ version: remote execution

In the Blind version, users do not have access to the data at all. They only receive the results of an analysis, after the data owner(s) have reviewed those results to ensure that they do not contain sensitive details. In this case, a synthetic dataset can be provided on which to write and test the analysis script before it is run on the real dataset. This “blind” version could be run in a dedicated environment where researchers can upload their script, but it can also be implemented manually, for example when a researcher sends a script by email to be run on a dataset and receives the results back via email as well (which is possible when neither the script nor the results contain any sensitive details).
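
The role of the synthetic dataset can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the schema, the column names and the analysis function are hypothetical and do not describe any real dataset or the SANE environment. The fake records are drawn from a description of the columns only, so no real values are needed to create them.

```python
import random

# Hypothetical schema: column name -> how to draw a fake value.
# None of these names or ranges come from a real dataset.
SCHEMA = {
    "age": ("int", 18, 90),
    "sex": ("choice", ["F", "M", "X"]),
    "covid_positive": ("choice", [True, False]),
}


def make_synthetic_rows(schema, n_rows, seed=0):
    """Draw n_rows fake records, column by column, without touching real data."""
    rng = random.Random(seed)  # seeded so that test runs are reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for column, spec in schema.items():
            kind = spec[0]
            if kind == "int":
                row[column] = rng.randint(spec[1], spec[2])
            elif kind == "choice":
                row[column] = rng.choice(spec[1])
            else:
                raise ValueError(f"unknown column kind: {kind}")
        rows.append(row)
    return rows


def analysis(rows):
    """Stand-in for the researcher's script: count the positive cases."""
    return sum(1 for row in rows if row["covid_positive"])


# The researcher develops and tests the script on synthetic rows only.
synthetic = make_synthetic_rows(SCHEMA, n_rows=50)
print(analysis(synthetic))
```

The same analysis function would later be run by the data provider on the real records. Because each column is drawn independently here, the synthetic data only supports checking that the script runs; its results say nothing about the real data.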

At the moment, both the Tinker and Blind versions of this code-to-data scenario are being developed as virtual research environments in the Secure ANalysis Environment (SANE) project.

When to use

Reasons to use this scenario include:

  • You want to retain control over the data, e.g., to prevent any unnecessary copies from being made (data sovereignty).
  • You do not want to, or are not allowed to, transfer the data, because they contain personal data or intellectual property.
  • The dataset is too large to transfer.
  • In the ‘Blind’ version: You want to be sure that the analysis results do not contain any sensitive details.

Implications for research

Compared to the “data-to-code” scenario, the code-to-data approach offers more control over the data, but often requires more, sometimes manual, work, such as:

  • Checking the credentials of a user: can they be trusted? An agreement with the user may be desirable or even required. In SURF Research Cloud, credentials can be checked using SURF Research Access Management.
  • Preparing a protected computing environment that a user can use.
  • In the ‘Blind’ version:
    • Creating a synthetic dataset.
    • Reviewing the output of the script for sensitive elements. This requires the right expertise.
    • Reviewing whether the code that is run on the data is privacy-preserving. This also requires the right expertise.

To use this scenario, it is essential to have a well-described workflow that ensures the confidentiality of the personal data. Additionally, dedicated personnel may make the process easier and more consistent.

Examples

  • A research team needs to process a dataset containing health data to determine the number of Covid-19 patients at a certain hospital. The hospital providing this dataset does not allow the dataset to be transferred, but it does allow scripts to be run on it. To make that possible, the hospital provides a computing facility, owned by the hospital, on which scripts from research teams can be run. In addition, hospital staff inspect each result to see whether it contains personal data, and only if it does not is it passed on to the research team. Since a result like “100 patients at this hospital have had Covid-19 in 2021” does not contain personal data, it can be safely passed to the research team.
  • In the data donation approach, the software PORT can be run on data subjects’ locally stored data, and only the results of that analysis can be shared with the researcher, if allowed by the data subject. Note, however, that the sensitivity of the results depends fully on the analysis that was run.
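
As a rough illustration of the output inspection in the hospital example above, the Python sketch below pre-screens aggregate counts before a person reviews them. The minimum cell size of 10, the function name and the result labels are assumptions made for this sketch and do not come from any actual hospital policy. A check like this can only flag obvious cases; it does not replace the review by staff with the right expertise.

```python
# Hypothetical threshold: counts below this are withheld because small
# groups may be traceable to individuals. The real value is a policy choice.
MIN_CELL_SIZE = 10


def review_counts(counts, min_cell_size=MIN_CELL_SIZE):
    """Split aggregate counts into releasable cells and withheld labels."""
    released = {}
    withheld = []
    for label, count in counts.items():
        if count >= min_cell_size:
            released[label] = count
        else:
            withheld.append(label)
    return released, withheld


# Hypothetical output of a researcher's script run on the hospital's data.
results = {"covid_2021": 100, "covid_2021_age_90_plus": 3}

released, withheld = review_counts(results)
print(released)  # {'covid_2021': 100}
print(withheld)  # ['covid_2021_age_90_plus']
```

In this sketch only the total of 100 would be passed on, while the small subgroup count would be held back for a decision by the reviewer.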