“Regular” data analysis: data-to-code

On this page: analysis, data-to-code, data-to-script, transfer, sharing
Date of last review: 2023-04-02

In this scenario, you transfer the data to a computing facility and run an analysis (script) on the data.

In the most basic variant, this computing facility is your work computer or a faculty computing cluster, so the data do not leave your organisation for the analysis. In other cases, the data need to be transferred to a computing facility outside your organisation, such as a high-performance cluster from SURF, Microsoft, Amazon, etc.

When to use

If you have a relatively small dataset, “data-to-code” is the most common and flexible scenario:

  • It allows you to choose a computing facility that is best suited to your situation.
  • It allows you to interactively read, analyse, export and transport the data you want.

Disadvantages of this approach can be:

  • When transferring the data to a computing facility, new copies of the data are often created, which can make it more difficult to keep track of different versions of the data (see the checksum sketch after this list).
  • Transferring data always comes with an additional risk of a data breach. Besides protecting the data during storage, it is therefore crucial to also protect them during the transfer to the computing facility and when they are used at the computing facility itself.
  • Transferring the data to the computing facility is not always straightforward, especially if you have a large dataset.
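
To verify that a transferred copy is identical to the original, you can record a checksum of each file before the transfer and recompute it on the computing facility. A minimal sketch in Python (the file name is hypothetical):

```python
import hashlib
from pathlib import Path

def sha256_checksum(path: Path, chunk_size: int = 8192) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Before the transfer, on your own machine:
original = sha256_checksum(Path("dataset.csv"))

# After the transfer, on the computing facility:
copy = sha256_checksum(Path("dataset.csv"))
assert original == copy, "Checksum mismatch: the copy differs from the original"
```

The same checksums can also be stored alongside the data to tell different versions of a dataset apart.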

Implications for research

In this scenario, you need to make sure that:

  • You apply data minimisation, access control, and, if applicable, pseudonymisation and other protective measures to limit the amount of personal data that is transferred to the computing facility.
  • The data are also protected during the transfer to the computing facility (e.g., your work laptop or an external solution), for example through encryption (see the sketch after this list).
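
As an illustration of these measures, the sketch below keeps only the necessary columns (data minimisation), replaces a direct identifier with a keyed hash (one possible form of pseudonymisation), and encrypts the resulting file before transfer. It is a minimal sketch, assuming the third-party Python packages pandas and cryptography; all column and file names are hypothetical.

```python
import hashlib
import hmac

import pandas as pd
from cryptography.fernet import Fernet

HASH_KEY = b"replace-with-a-secret-key"  # store separately from the data!

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(HASH_KEY, identifier.encode(), hashlib.sha256).hexdigest()

# Data minimisation: keep only the columns the analysis actually needs.
df = pd.read_csv("participants.csv")
df = df[["participant_id", "age", "score"]]

# Pseudonymisation: replace the direct identifier before the data leave
# your organisation.
df["participant_id"] = df["participant_id"].astype(str).map(pseudonymise)
df.to_csv("participants_pseudonymised.csv", index=False)

# Encryption: encrypt the file so it is also protected during the transfer.
key = Fernet.generate_key()
with open("participants_pseudonymised.csv", "rb") as f:
    encrypted = Fernet(key).encrypt(f.read())
with open("participants_pseudonymised.csv.enc", "wb") as f:
    f.write(encrypted)
```

The hash key and the encryption key must be stored separately from the data and shared with the computing facility through a separate channel; otherwise the pseudonymisation and encryption add little protection.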

Additionally, if the computing facility is provided by an external processor (e.g., SURF, Amazon):

  • A data processing agreement with the provider of the computing facility is needed. If there is none, you cannot use the computing facility to analyse personal data.
  • The computing facility should be suitable (secure enough) for the sensitivity level of your (personal) data. For example, if your data are “critical” in terms of confidentiality, the computing facility should also have that “critical” classification.

Examples

  • You use your faculty’s high performance cluster to analyse a dataset that you collected at your organisation.
  • You use the High Performance Computing platform from SURF to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and SURF is needed to make sure that your organisation remains in control of the personal data at SURF’s servers.
  • You use Amazon Web Services (AWS) to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and AWS is even more important, because Amazon has servers located outside of the European Economic Area, which means that additional safeguards for international data transfers may be required.