“Regular” data analysis: data-to-code
On this page: analysis, data-to-code, data-to-script, transfer, sharing
Date of last review: 2023-04-02
In this scenario, you transfer the data to a computing facility, and run an
analysis (script) on the data.
In the most basic variant, the computing facility is your work computer or a
faculty computing cluster, so the data do not leave your organisation for the
analysis. In other cases, the data need to be transferred to a computing
facility outside your organisation, such as high-performance clusters from
SURF, Microsoft, Amazon, etc.
When to use
If you have a relatively small dataset, the “data-to-code” scenario is the most common and flexible option:
- It allows you to choose a computing facility that is best suited to your
situation.
- It allows you to interactively read, analyse, export and transport the data you want.
Disadvantages of this approach can be:
- When the data are transferred to a computing facility, new copies of the data are often created, which can make it more difficult to keep track of different versions of the data.
- Transferring data always comes with additional risks of a data breach. Besides
protection during data storage, it is therefore crucial to also protect the data
during the transfer to the computing facility, and when used at the computing
facility itself.
- Transferring the data to the computing facility is not always straightforward, especially if you have a large dataset.
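One way to keep track of copies created during a transfer is to compare checksums of the original file and the copy. The following is a minimal sketch using Python's standard `hashlib` module; the function names `sha256_of_file` and `copies_match` are illustrative and do not come from a specific tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original, copy):
    """Return True if both files have byte-for-byte identical content."""
    return sha256_of_file(original) == sha256_of_file(copy)
```

Recording the checksum next to each copy makes it possible to tell later whether a file at the computing facility is still the version that was transferred. A matching checksum only shows that the content is identical; it does not protect the data in any way.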
Implications for research
In this scenario, you need to make sure that:
- You apply data minimisation, access control, and, if applicable, pseudonymisation and other protective measures to limit the amount of personal data that is transferred to the computing facility.
- The data are also protected during the transfer to the computing facility (e.g., your work laptop or an external solution), for example through encryption.
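As an illustration of data minimisation and pseudonymisation before a transfer, the sketch below keeps only the fields needed for the analysis and replaces a direct identifier with a keyed hash (HMAC-SHA256). The field names, the example records, and the key value are all hypothetical. The sketch assumes the key is stored separately from the data and is not transferred with them.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimise(rows, id_field, keep_fields, secret_key):
    """Keep only the needed fields and swap the identifier for a pseudonym."""
    result = []
    for row in rows:
        slim = {field: row[field] for field in keep_fields}
        slim["pseudonym"] = pseudonymise(row[id_field], secret_key)
        result.append(slim)
    return result


# Hypothetical example records; real data would be read from a file.
participants = [
    {"email": "a@example.org", "name": "A. Person", "age": 34, "score": 12},
    {"email": "b@example.org", "name": "B. Person", "age": 51, "score": 9},
]

# Hypothetical key; in practice, generate it randomly and keep it
# within your organisation, separate from the transferred data.
SECRET_KEY = b"replace-with-a-key-stored-outside-the-dataset"

to_transfer = minimise(participants, "email", ["age", "score"], SECRET_KEY)
```

Here `to_transfer` contains only `age`, `score` and a `pseudonym` per record. Pseudonymised data are still personal data, so the transfer itself still needs protection such as encryption, which this sketch does not cover.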
Additionally, if the computing facility is provided by an external processor (e.g., SURF, Amazon):
- A data processing agreement with the provider of the computing facility is needed. If there is none, you cannot use the computing facility to analyse personal data.
- The computing facility should be suitable (secure enough) for the sensitivity level of your (personal) data. For example, if your data are “critical” in terms of confidentiality, the computing facility should also have that “critical” classification.
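The two conditions above can be read as a simple decision rule, sketched below for personal data. The classification scale in `LEVELS` is a hypothetical example (only “critical” is mentioned in this text), so replace it with the scale your own organisation uses.

```python
# Hypothetical classification scale, ordered from lowest to highest.
LEVELS = ["low", "basic", "sensitive", "critical"]


def facility_is_allowed(data_level, facility_level, is_external, has_dpa):
    """Return True if personal data of data_level may be analysed at the facility."""
    # An external processor without a data processing agreement is never allowed.
    if is_external and not has_dpa:
        return False
    # The facility must be classified at least as high as the data.
    return LEVELS.index(facility_level) >= LEVELS.index(data_level)
```

For example, `facility_is_allowed("critical", "sensitive", False, False)` returns `False`, because the facility's classification is lower than that of the data.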
Examples
- You use your faculty’s high-performance cluster to analyse a dataset that you collected at your organisation.
- You use the High Performance Computing platform from SURF to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and SURF is needed to make sure that your organisation remains in control of the personal data on SURF’s servers.
- You use Amazon Web Services (AWS) to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and AWS is even more important, because Amazon has servers that are located outside of the European Economic Area.