De-identification techniques

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, technique, anonymisation method, privacy-preserving, privacy-enhancing, sdc, statistical disclosure control, disclosure risk
Date of last review: 2023-05-02

Below is a list of techniques you can apply to de-identify your dataset, resulting in a pseudonymised, or possibly even anonymised, dataset. Bear in mind that applying these techniques will always result in loss of information, so ask yourself how useful your dataset will still be after de-identification.

The techniques are: suppression, generalisation, replacement, top- and bottom-coding, adding noise, and permutation.

Statistical Disclosure Control (SDC)
The de-identification methods below are sometimes also referred to as Statistical Disclosure Control (SDC) methods. You will most likely encounter SDC when you collaborate with a research data centre such as Statistics Netherlands (Centraal Bureau voor de Statistiek, CBS).

Suppression

Suppression (sometimes called “masking”) means removing variables, (parts of) values, or entire entries that you do not need from your dataset. Examples of data that you could consider removing (see the sketch after this list):

  • Name and contact information
  • (Parts of) address
  • Date, such as birthdate or participation date
  • Social security number/Burgerservicenummer (BSN). NB. In the Netherlands, you are not allowed to use BSN in research at all!
  • Medical record number
  • IP address
  • Facial features from neuroimaging data
  • Automatically generated metadata such as GPS data in an image, author in a document, etc.
  • Participants that form extreme outliers or are too unique
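As a minimal sketch of suppression with pandas: the example below drops identifying columns and removes an entry with an extreme, potentially identifying value. The column names, values, and cut-off are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical dataset with direct identifiers and one extreme value
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "age": [34, 29, 102],          # 102 is an extreme, potentially identifying value
    "score": [7.2, 6.8, 9.9],
})

# Suppression: drop variables (columns) you do not need
df = df.drop(columns=["name", "email"])

# Suppression: remove entries (rows) that are too unique, e.g., extreme outliers
df = df[df["age"] <= 90]

print(df)
```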

Generalisation

Generalisation (also sometimes called abstraction, binning, aggregation, or categorisation) reduces the granularity of the data so that data subjects are less easily singled out. It can be applied to both qualitative (e.g., interview notes) and quantitative data (e.g., variables in a dataset). Here are some examples (see the sketch after this list):

  • Recoding date of birth into age.
  • Categorising age into age groups.
  • Recoding rare categories as “other” or as missing values.
  • Replacing address with the name of a neighbourhood or town.
  • Generalising specific persons in text into broader categories, e.g., “mother” to “[woman]”, “Bob” to “[colleague]”.
  • Generalising specific locations into more general places, e.g., “Utrecht” to “[home town]”, or from point coordinates to larger geographical areas (e.g., polygon or linear features).
  • Coding open-ended responses into categories of responses, or as “responded” vs. “not responded”.
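A minimal sketch of generalisation on a hypothetical data frame: age is recoded into age groups and rare categories are collapsed into “other”. The column names, bins, and rarity threshold are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 37, 45, 61, 84],
    "occupation": ["teacher", "teacher", "astronaut", "nurse", "nurse"],
})

# Generalisation: categorise age into broader age groups
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 70, 120],
    labels=["<30", "30-49", "50-69", "70+"],
)
df = df.drop(columns=["age"])

# Generalisation: recode rare categories (here: fewer than 2 occurrences) as "other"
counts = df["occupation"].value_counts()
rare = counts[counts < 2].index
df["occupation"] = df["occupation"].where(~df["occupation"].isin(rare), "other")

print(df)
```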

Replacement

With replacement, you substitute sensitive details with non-sensitive ones, which are usually less informative, for example:

  • Replacing directly identifying information that you do need with pseudonyms. When doing this, always store the key file securely and separately from the research data (e.g., use access control, encryption). If you do not need the links with direct identifiers anymore, remove the keyfile or replace the pseudonyms with random identifiers without saving the key.
    A good pseudonym:
    • Is not meaningful with respect to the data subjects: a random (unique) number or string is better than a code that contains parts of personal information, because the latter may reveal details about data subjects.
    • Is managed securely, for example by appointing someone to be responsible for managing access to the keyfile.
    • Can be a simple number, a random number, the output of a cryptographic hash function, a text string, etc.
  • Replacing identifiable text with “[redacted]”. When redacting text, never just blank out the identifying value; always put a placeholder or pseudonym in its place, e.g., in [square brackets] or <seg>segments</seg>.
  • Replacing unique values with a summary statistic, e.g., the mean.
  • Rounding values, making the data less precise.
  • Replacing one or multiple variables with a hash.
    What is hashing?

    Hashing is a way of obscuring data with a string of seemingly random characters of a fixed length. It can be used to create a “hashed” pseudonym, or to replace multiple variables with one unique value. There are many hash functions, each with their own strengths. It is usually quite difficult to reverse the hashing process, unless an attacker has knowledge about the type of information that was masked through hashing (e.g., for the MD5 algorithm, there are many lookup tables that can reverse common hashes). To prevent reversal, cryptographic hashing techniques add a “salt”, i.e., a random number or string, to the input before hashing (the output is called a “digest”). If the “salt” is kept confidential or is removed (similar to a keyfile), it is almost impossible to reverse the hashing process.
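To make the pseudonymisation and salted hashing above concrete, here is a minimal sketch using Python's standard library and pandas. The names, the key file name, and the salt length are assumptions made up for the example; in practice the key file and salt must be stored securely and separately from the research data.

```python
import hashlib
import secrets
import pandas as pd

df = pd.DataFrame({"participant": ["Alice", "Bob"], "score": [7, 9]})

# Replacement with pseudonyms: map direct identifiers to random strings,
# and keep the key file separately (and securely) from the research data.
key = {name: secrets.token_hex(8) for name in df["participant"].unique()}
df["pseudonym"] = df["participant"].map(key)
df = df.drop(columns=["participant"])
pd.Series(key).to_csv("keyfile.csv")  # hypothetical key file; store securely!

# Salted hashing: add a random salt to the value before hashing, so the
# digest cannot be reversed with a lookup table unless the salt is known.
salt = secrets.token_bytes(16)
digest = hashlib.sha256(salt + "Alice".encode("utf-8")).hexdigest()
print(digest)
```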

Top- and bottom-coding

Top- and bottom-coding are mostly useful for quantitative datasets that have some unique extreme values. It means that you set a maximum or minimum, and recode all values above that maximum to the maximum, or all values below that minimum to the minimum. For example, you can top-code a variable “income” so that all incomes over €80.000 are set to €80.000. This does distort the distribution, but leaves a large part of the data intact.
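A minimal sketch of top- and bottom-coding with pandas, using the €80.000 threshold from the example above; the income values and the lower bound are assumptions for illustration.

```python
import pandas as pd

income = pd.Series([25_000, 52_000, 78_000, 250_000])

# Top-coding: all incomes above €80.000 are set to €80.000
income_topcoded = income.clip(upper=80_000)

# Bottom-coding works analogously with a lower bound
income_bounded = income.clip(lower=10_000, upper=80_000)

print(income_topcoded.tolist())  # [25000, 52000, 78000, 80000]
```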

Adding noise

Adding noise to data obfuscates sensitive details. It is mostly applied to quantitative datasets, but can also be applied to other types of data. For example (see the sketch after this list):

  • Adding half a standard deviation to a variable.
  • Multiplying a variable by a random number.
  • Applying Differential Privacy guarantees to an algorithm.
  • Blurring (pixelating) images and videos.
  • Voice alteration in audio.
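For the first bullet, a minimal numeric sketch: Gaussian noise scaled to half the variable's standard deviation is added to a hypothetical column. The column name, values, and the fixed seed (used only to make the example reproducible) are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seed only for a reproducible example

df = pd.DataFrame({"reaction_time": [350.0, 420.0, 480.0, 510.0, 390.0]})

# Adding noise: perturb each value with random noise scaled to
# half a standard deviation of the original variable
noise = rng.normal(loc=0, scale=0.5 * df["reaction_time"].std(), size=len(df))
df["reaction_time_noisy"] = df["reaction_time"] + noise

print(df)
```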

Permutation

Permutation means swapping values between data subjects, so that it becomes more difficult to link information belonging to one data subject together. This will keep the distribution and summary statistics constant, but change correlations between variables, making some statistical analyses more difficult or impossible.
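A minimal sketch of permutation: one variable is shuffled across data subjects, so its distribution and summary statistics are preserved but its links to the other variables are broken. The column names, values, and seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)  # seed only for a reproducible example

df = pd.DataFrame({
    "pseudonym": ["p1", "p2", "p3", "p4"],
    "income": [28_000, 35_000, 61_000, 90_000],
})

# Permutation: swap the income values between data subjects.
# The distribution of "income" is unchanged, but its link to each
# pseudonym (and its correlation with other variables) is broken.
df["income"] = rng.permutation(df["income"].to_numpy())

print(df)
```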