De-identification techniques
Date of last review: 2023-05-02
Below is a list of techniques you can apply to de-identify your dataset, resulting in a pseudonymised or possibly even anonymised dataset. Bear in mind that applying these techniques always results in some loss of information, so ask yourself how useful your dataset will still be after de-identification.
The techniques are: suppression, generalisation, replacement, top- and bottom-coding, adding noise, and permutation.
Statistical Disclosure Control (SDC)
The de-identification methods below are sometimes also referred to as methods
for Statistical Disclosure Control (SDC). You will most likely encounter
SDC when you collaborate with a research data centre such as Statistics
Netherlands (Centraal Bureau voor de Statistiek, CBS).
Suppression
Suppression (sometimes called “masking”) means removing variables, (parts of) values, or entire entries that you do not need from your dataset. Examples of data that you could consider removing:
- Name and contact information
- (Parts of) address
- Date, such as birthdate or participation date
- Social security number/Burgerservicenummer (BSN). NB. In the Netherlands, you are not allowed to use BSN in research at all!
- Medical record number
- IP address
- Facial features from neuroimaging data
- Automatically generated metadata such as GPS data in an image, author in a document, etc.
- Participants that form extreme outliers or are too unique
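As a minimal sketch of suppression, the Python snippet below drops whole variables and removes part of a value. The records, the field names (`name`, `email`, `postcode`, `score`), and the helper `suppress` are all hypothetical; which fields to drop depends on your own dataset.

```python
# Hypothetical records: all field names and values are invented for illustration.
records = [
    {"name": "A. Jansen", "email": "a.jansen@example.org", "postcode": "3584 CS", "score": 7.5},
    {"name": "B. de Vries", "email": "b.devries@example.org", "postcode": "1012 AB", "score": 6.0},
]

# Variables that are not needed for the analysis and are removed entirely.
DROP_FIELDS = {"name", "email"}


def suppress(record, drop_fields=DROP_FIELDS, postcode_chars=4):
    """Return a copy of the record without the dropped fields.

    The postcode is partially suppressed: only its first characters are kept.
    """
    kept = {key: value for key, value in record.items() if key not in drop_fields}
    if "postcode" in kept:
        kept["postcode"] = kept["postcode"][:postcode_chars]
    return kept


suppressed = [suppress(record) for record in records]
```

The original `records` list is left untouched, so the suppressed copy can be shared while the raw data stays in secure storage.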
Generalisation
Generalisation (also sometimes called abstraction, binning, aggregation, or categorisation) reduces the granularity of the data so that data subjects are less easily singled out. It can be applied to both qualitative (e.g., interview notes) and quantitative data (e.g., variables in a dataset). Here are some examples:
- Recoding date of birth into age.
- Categorising age into age groups.
- Recoding rare categories as “other” or as missing values.
- Replacing address with the name of a neighbourhood or town.
- Generalising specific persons in text into broader categories, e.g., “mother” to “[woman]”, “Bob” to “[colleague]”.
- Generalising specific locations into more general places, e.g., “Utrecht” to “[home town]”, or from point coordinates to larger geographical areas (e.g., polygon or linear features).
- Coding open-ended responses into categories of responses, or as “responded” vs. “not responded”.
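Three of the generalisation examples above could be sketched in Python as follows. The function names, the ten-year band width, and the minimum count of 2 for rare categories are illustrative assumptions, not recommended thresholds.

```python
from collections import Counter
from datetime import date


def age_on(birth_date, reference_date):
    """Recode a date of birth into age in whole years on a reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month,
        birth_date.day,
    )
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def age_group(age, width=10):
    """Categorise an age into a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def recode_rare(values, min_count=2, other="other"):
    """Recode categories that occur fewer than min_count times as `other`."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other for value in values]


age = age_on(date(1990, 6, 15), date(2023, 5, 2))
group = age_group(age)
jobs = recode_rare(["nurse", "nurse", "teacher", "astronaut", "teacher"])
```

Here the single "astronaut" is recoded as "other", because a category with one member would single out that participant.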
Replacement
In this case, you replace sensitive details with non-sensitive ones, which are usually less informative, for example:
- Replacing directly identifying information that you do need with pseudonyms. When doing this, always store the key file securely and separately from the research data (e.g., use access control, encryption). If you no longer need the links with the direct identifiers, delete the key file or replace the pseudonyms with random identifiers without saving the key.
- Replacing identifiable text with “[redacted]”. When redacting in-text, never just blank out the identifying value; always put a placeholder or pseudonym there, e.g., in [square brackets] or <seg>segments</seg>.
- Replacing unique values with a summary statistic, e.g., the mean.
- Rounding values, making the data less precise.
- Replacing one or multiple variables with a hash.
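The last two replacement options can be sketched with Python's standard library. The helpers `round_to` and `keyed_hash` are hypothetical names. Using a keyed hash (HMAC) rather than a plain hash is an assumption made here, because a plain hash of values from a small set (such as birth dates) can be reversed by hashing every candidate value.

```python
import hashlib
import hmac
import secrets


def round_to(value, base=5):
    """Round a value to the nearest multiple of `base`, making it less precise."""
    return base * round(value / base)


def keyed_hash(values, key):
    """Replace one or more variables with a single HMAC-SHA256 digest.

    The key must be kept secret and stored separately from the data,
    in the same way as a pseudonymisation key file.
    """
    message = "|".join(str(value) for value in values).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


secret_key = secrets.token_bytes(32)  # generate once, store securely
rounded_height = round_to(173)
hashed = keyed_hash(("1990-06-15", "3584"), secret_key)
```

The same input with the same key always yields the same digest, so records can still be linked within the dataset. If linking is no longer needed, discarding the key removes the ability to recompute the digests.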
Creating a pseudonym
A good pseudonym:
- Is not meaningful with respect to the data subjects: a random (unique) number or string is better than a code that contains parts of personal information, because the latter may reveal details about data subjects.
- Is managed securely, for example by appointing someone to be responsible for managing access to the key file.
- Can be a simple number, a random number, the output of a cryptographic hash function, a text string, etc.
Random identifiers can be generated in common software such as Excel, R, Python, and SPSS.
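As one possible illustration in Python, the sketch below generates random, non-meaningful pseudonyms and splits the result into a key (to be stored separately and securely) and the research data. The function `make_pseudonym_key`, the field names, and the example records are hypothetical.

```python
import secrets


def make_pseudonym_key(identifiers, n_bytes=8):
    """Map each direct identifier to a unique random hexadecimal pseudonym."""
    key = {}
    used = set()
    for identifier in identifiers:
        if identifier in key:
            continue  # the same person always gets the same pseudonym
        pseudonym = secrets.token_hex(n_bytes)
        while pseudonym in used:  # extremely unlikely, but guarantees uniqueness
            pseudonym = secrets.token_hex(n_bytes)
        used.add(pseudonym)
        key[identifier] = pseudonym
    return key


# Hypothetical input data with a direct identifier.
records = [
    {"name": "A. Jansen", "score": 7.5},
    {"name": "B. de Vries", "score": 6.0},
    {"name": "A. Jansen", "score": 8.0},
]

# The key links names to pseudonyms: store it apart from the research data.
pseudonym_key = make_pseudonym_key(record["name"] for record in records)

# The research data contains only the pseudonym, not the name.
research_data = [
    {"participant_id": pseudonym_key[record["name"]], "score": record["score"]}
    for record in records
]
```

Because the pseudonyms come from a random source, they carry no information about the participants. Deleting `pseudonym_key` leaves random identifiers that can no longer be linked back to names through the key.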
Top- and bottom-coding
Top- and bottom-coding are mostly useful for quantitative datasets that have some unique extreme values. It means that you set a maximum or minimum and recode all higher or lower values to that maximum or minimum, respectively. For example, you can top-code a variable “income” so that all incomes over €80,000 are set to €80,000. This does distort the distribution, but leaves a large part of the data intact.
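The income example can be sketched in Python as follows. The function `top_bottom_code` and the income values are hypothetical; only the €80,000 threshold comes from the example above.

```python
def top_bottom_code(values, lower=None, upper=None):
    """Recode values above `upper` to `upper` and values below `lower` to `lower`."""
    coded = []
    for value in values:
        if upper is not None and value > upper:
            value = upper
        if lower is not None and value < lower:
            value = lower
        coded.append(value)
    return coded


# Hypothetical yearly incomes in euros, with two extreme values.
incomes = [12_000, 35_000, 79_999, 80_000, 150_000, 2_400_000]
top_coded = top_bottom_code(incomes, upper=80_000)
# top_coded is [12000, 35000, 79999, 80000, 80000, 80000]
```

Values at or below the threshold are unchanged, which is why most of the data stays intact while the extreme incomes become indistinguishable from each other.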
Adding noise
Adding noise to data obfuscates sensitive details. It is mostly applied to quantitative datasets, but can also apply to other types of data. For example:
- Adding half a standard deviation to a variable.
- Multiplying a variable by a random number.
- Applying Differential Privacy guarantees to an algorithm.
- Blurring (pixelating) images and videos.
- Voice alteration in audio.
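For quantitative variables, the first three options could look roughly like the Python sketch below. It rests on several assumptions: the additive noise is drawn from a normal distribution whose standard deviation is half that of the variable (one possible reading of "half a standard deviation"), the random factor lies between 0.9 and 1.1, and the differential privacy example is the Laplace mechanism applied to a simple count. All function names and data are hypothetical, and a real differential privacy deployment would normally use a vetted library.

```python
import random
import statistics


def add_gaussian_noise(values, sd_fraction=0.5, rng=random):
    """Add normally distributed noise scaled to a fraction of the variable's SD."""
    sd = statistics.stdev(values)
    return [value + rng.gauss(0, sd_fraction * sd) for value in values]


def multiply_by_random_factor(values, low=0.9, high=1.1, rng=random):
    """Multiply each value by its own random factor between `low` and `high`."""
    return [value * rng.uniform(low, high) for value in values]


def laplace_count(true_count, epsilon, rng=random):
    """Release a count with Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one person is added or removed,
    so this scale corresponds to epsilon-differential privacy for the count.
    """
    scale = 1.0 / epsilon
    # The difference of two exponential draws follows a Laplace distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(2023)  # fixed seed only to make the sketch reproducible
heights = [158.0, 164.5, 171.0, 176.5, 183.0, 190.5]
noisy_heights = add_gaussian_noise(heights, rng=rng)
scaled_heights = multiply_by_random_factor(heights, rng=rng)
noisy_count = laplace_count(len(heights), epsilon=1.0, rng=rng)
```

When publishing data, do not share the seed or the noise values, since either would allow the original values to be recovered.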
Permutation
Permutation means swapping values between data subjects, so that it becomes more difficult to link information belonging to one data subject together. This keeps the distribution and summary statistics of each variable constant, but changes the correlations between variables, making some statistical analyses more difficult or impossible.
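A minimal Python sketch of permutation, assuming a dataset stored as a list of dictionaries; the function `permute_column` and the fields `age` and `income` are hypothetical.

```python
import random
import statistics


def permute_column(records, field, rng=random):
    """Return copies of the records with the values of one field shuffled."""
    values = [record[field] for record in records]
    rng.shuffle(values)
    return [{**record, field: value} for record, value in zip(records, values)]


# Hypothetical records in which income rises with age.
records = [
    {"age": 25, "income": 21_000},
    {"age": 35, "income": 34_000},
    {"age": 45, "income": 47_000},
    {"age": 55, "income": 60_000},
]

rng = random.Random(7)  # fixed seed only to make the sketch reproducible
permuted = permute_column(records, "income", rng=rng)

# The set of income values, and therefore the mean, is unchanged.
mean_before = statistics.mean(record["income"] for record in records)
mean_after = statistics.mean(record["income"] for record in permuted)
```

Only the link between `age` and `income` is affected: each income still appears exactly once, but it may now sit next to a different participant's age.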