What are pseudonymisation and anonymisation?

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, identifiable, sdc, statistical disclosure control, disclosure risk
Date of last review: 2023-05-02

Pseudonymisation

Pseudonymisation is a safeguard that reduces the linkability of your data to your data subjects (rec. 28). It means that you de-identify the data in such a way that they can no longer lead to identification without additional information (art. 4(5)). In theory, removing this additional information should lead to anonymised data.

Pseudonymisation is often interpreted as replacing direct identifiers (e.g., names) with pseudonyms, and storing the link between the identifiers and the pseudonyms in a key file, separated from the research data. While this is a good practice (it makes sure that data are not directly identifiable anymore), this interpretation of pseudonymisation does not take into account indirectly identifiable information, and thus does not necessarily fulfil the GDPR’s definition of pseudonymisation!
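The key-file approach described above can be sketched as follows. This is a minimal illustration, not a complete pseudonymisation procedure: the function name `pseudonymise`, the field names and the example records are all hypothetical. Note that the output still contains age and postcode, which is exactly the kind of indirectly identifiable information that this interpretation overlooks.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace the direct identifier in each record with a random pseudonym.

    Returns (pseudonymised_records, key_table). The key table maps each
    pseudonym back to the original identifier and should be stored
    separately from the research data.
    """
    pseudonym_for = {}  # original identifier -> pseudonym
    used = set()        # pseudonyms already handed out
    pseudonymised = []
    for record in records:
        identifier = record[id_field]
        if identifier not in pseudonym_for:
            # Random pseudonym, so it cannot be derived from the identifier
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in used:
                pseudonym = "P-" + secrets.token_hex(4)
            used.add(pseudonym)
            pseudonym_for[identifier] = pseudonym
        # Copy every field except the direct identifier
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pseudonym"] = pseudonym_for[identifier]
        pseudonymised.append(cleaned)
    key_table = {p: i for i, p in pseudonym_for.items()}
    return pseudonymised, key_table


# Hypothetical example data
records = [
    {"name": "Alice Jansen", "age": 34, "postcode": "3584", "score": 7},
    {"name": "Bob de Vries", "age": 51, "postcode": "3512", "score": 5},
    {"name": "Alice Jansen", "age": 34, "postcode": "3584", "score": 8},
]
research_data, key_table = pseudonymise(records)
```

Here `research_data` would go into the research dataset and `key_table` into a separately stored key file. The same person always receives the same pseudonym, so records can still be linked within the dataset.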

Pseudonymous data are still personal data and thus subject to the GDPR. This is because the de-identification is reversible: identifying data subjects is still possible, just more difficult. This means that in order to use pseudonymous data, you still need to comply with all the rules in the GDPR.

Anonymisation

Anonymisation is a de-identification process that results in data that are “rendered anonymous in such a manner that the data subject is not or no longer identifiable” (rec. 26), neither directly nor indirectly, and by no one, including you. When data are anonymised, they are no longer personal data, and thus no longer subject to the GDPR. Note, however, that everything you do before the data are anonymised (including the anonymisation itself) is subject to the GDPR!

Anonymisation is very difficult to accomplish in practice! This video nicely illustrates why.

The identifiability spectrum

The relationship between (identifiable) personal data, pseudonymous data and anonymous data should be seen as lying on a spectrum. The more de-identified the data are, the closer they are to anonymous data and the lower the risk of re-identification. The visual guide below nicely illustrates this:

When are data anonymous?

Your data can be considered anonymous if data subjects can only be re-identified with an unreasonable amount of effort, i.e., taking into account the costs, required time and technology, and future technological developments (rec. 26).

In short, your data are not anonymous (i.e., they are still personal data) if any of the following characteristics of personal data applies:

There is directly identifiable information (e.g., name, email address, social security number, etc.).
Data subjects can be singled out (i.e., you can tell one data subject from another within a known group of data subjects).
It is possible to identify data subjects by linking records (“mosaic effect”), either within your own database or when using other data sources.
It is possible to identify a data subject by inferring information about them (e.g., inferring a disease from the variable “medication”), either within your own database or when using other data sources.
It is possible to reverse the de-identification.
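As a rough illustration of the singling-out criterion, the sketch below counts how often each combination of indirectly identifying values occurs and reports the records whose combination is unique. The function name `singled_out` and the example data are hypothetical, and a check like this covers only one of the characteristics listed above: an empty result does not by itself show that the data are anonymous.

```python
from collections import Counter


def singled_out(records, quasi_identifiers):
    """Return the records whose combination of quasi-identifier values is unique."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


# Hypothetical example: one brown chicken among white chickens
flock = [
    {"species": "chicken", "colour": "white"},
    {"species": "chicken", "colour": "white"},
    {"species": "chicken", "colour": "white"},
    {"species": "chicken", "colour": "brown"},
]
unique = singled_out(flock, ["species", "colour"])
print(unique)  # [{'species': 'chicken', 'colour': 'brown'}]
```

Even though only "general information" is stored, the combination of species and colour is enough to tell this one individual apart from the rest of the group.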

Whether data can be seen as anonymous strongly depends on the context of your research and how much information is available about the data subjects.

Comic about anonymous data. The left pane shows an animal that says ‘Don’t worry! We will only save general information about you, not anything that could identify you!’, while holding a paper that says ‘Brown chicken’. Next to the animal is a brown chicken giving a ‘thumbs up’. The right pane shows that one brown chicken in a crowd of white chickens.

When collaborating with research data centres, such as Statistics Netherlands (Centraal Bureau voor de Statistiek, CBS), output checking guidelines are often used to determine the risk of identification resulting from the analysis output of sensitive data.
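One kind of rule that output checking can involve is a minimum number of observations per table cell. The sketch below masks small cells in a frequency table. The function name `frequency_table`, the example data and the threshold of 10 are assumptions chosen for illustration, not the rule of any particular data centre, so always follow the guidelines of the organisation you work with.

```python
from collections import Counter


def frequency_table(values, min_count=10):
    """Count how often each value occurs, masking small cells with None.

    min_count is an assumed threshold for illustration only.
    """
    counts = Counter(values)
    return {v: (n if n >= min_count else None) for v, n in counts.items()}


# Hypothetical example data
diagnoses = ["asthma"] * 25 + ["diabetes"] * 12 + ["rare condition"] * 2
table = frequency_table(diagnoses)
print(table)  # {'asthma': 25, 'diabetes': 12, 'rare condition': None}
```

Masking individual cells is only a first step: if a table total is published as well, a masked cell can sometimes be recalculated from the remaining cells.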

Alternatives to anonymisation

Anonymisation is not the only solution. The best way to protect data subjects’ privacy is to collect and process their personal data only when necessary (data minimisation). Additionally, in many cases, full anonymisation is not even possible or desirable, for example if it results in too much information loss or incorrect inferences.

If you cannot anonymise the data, there are always other ways in which you can protect the data, such as:

De-identifying (pseudonymising) the data to the extent you can.
Controlling access to the data, for example using user agreements, authentication, encryption, secure analysis environments, etc.
Creating a synthetic version of your dataset to share with others.
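As an example of the first option, de-identifying to the extent you can may include making indirectly identifying values coarser. The sketch below groups ages into bands and shortens a postcode. The function name `generalise`, the field names and the chosen level of coarsening are hypothetical. Data treated this way remain personal data and should be handled as such.

```python
def generalise(record, age_band=10, postcode_digits=2):
    """Return a copy of the record with coarser age and postcode values."""
    coarse = dict(record)
    # Replace the exact age with a band, e.g. 34 -> "30-39"
    lower = (record["age"] // age_band) * age_band
    coarse["age"] = f"{lower}-{lower + age_band - 1}"
    # Keep only the first digits of the postcode, e.g. "3584" -> "35**"
    postcode = record["postcode"]
    coarse["postcode"] = postcode[:postcode_digits] + "*" * (len(postcode) - postcode_digits)
    return coarse


# Hypothetical example record
participant = {"age": 34, "postcode": "3584", "score": 7}
coarse = generalise(participant)
print(coarse)  # {'age': '30-39', 'postcode': '35**', 'score': 7}
```

How coarse the values need to be depends on the context of the research: the wider the bands, the lower the risk of re-identification, but the more information is lost.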