K-anonymity, l-diversity and t-closeness

On this page: k-anonymous, l-diverse, t-close, privacy model, quantifying privacy, key attribute, sensitive attribute, quasi-identifier
Date of last review: 2023-05-30

K-anonymity, l-diversity and t-closeness are statistical approaches that quantify the level of identifiability within a tabular dataset, especially when variables within that dataset are combined. They are complementary approaches: a dataset can be k-anonymous, l-diverse and t-close, where k, l and t each represent a number.

Identifiers, quasi-identifiers, and sensitive attributes

Privacy models like k-anonymity, l-diversity and t-closeness distinguish between three types of variables in a dataset:

  • Identifiers (also known as key attributes): direct identifiers such as names, student numbers, email addresses, etc. These variables should in principle not be collected at all, or removed from the dataset if they are not necessary for your research project.
  • Quasi-identifiers: indirect identifiers that can lead to identification when combined with other quasi-identifiers in the dataset or external information. These are often demographic variables like age, sex, place of residence, etc., but could also be something entirely different like physical characteristics, timestamps, etc. In general, quasi-identifiers are usually variables that are likely to be known to someone in the outside world.
  • Sensitive attributes: the variables of interest, which should be protected but cannot be altered, because they are the main outcome variables. Examples are Medical condition in a healthcare dataset or Income in a financial dataset.

It is important to categorise the variables in your dataset correctly into these three types before applying k-anonymity, l-diversity or t-closeness, because this categorisation determines how the dataset will be de-identified.

How it works

K-anonymity

K-anonymity ensures that each individual in a dataset cannot be distinguished from at least k-1 other individuals with respect to the quasi-identifiers in the dataset. This is achieved through generalisation, suppression, and sometimes top- and bottom-coding. Applying k-anonymity makes it more difficult for an attacker to re-identify specific individuals in the dataset. It protects against singling out and, to some extent, against the mosaic effect.
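As a small illustration, the three transformations mentioned above might be sketched as follows (the function names and band edges are illustrative, not a fixed standard):

```python
def generalise_age(age: int) -> str:
    """Generalisation: replace an exact age with a coarser age band."""
    return "<=20" if age <= 20 else "20-30" if age <= 30 else ">30"

def suppress(value: str) -> str:
    """Suppression: replace a value that is too revealing with a placeholder."""
    return "*"

def top_code(income: int, cap: int = 100_000) -> int:
    """Top-coding: collapse extreme values above a cap into the cap itself."""
    return min(income, cap)

print(generalise_age(16))   # <=20
print(suppress("Utrecht"))  # *
print(top_code(250_000))    # 100000
```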

Original dataset
Nr Age Sex City Disease
1 16 Male Rotterdam Viral infection
2 18 Male Rotterdam Heart-related
3 19 Male Rotterdam Cancer
4 22 Female Rotterdam Viral infection
5 22 Male Zwolle No illness
6 23 Male Zwolle Tuberculosis
7 24 Male Zwolle Heart-related
8 25 Female Utrecht Cancer
9 26 Female Rotterdam Heart-related
10 28 Female Utrecht Tuberculosis
2-anonymous dataset
Nr Age Sex City Disease
1 ≤ 20 Male Rotterdam Viral infection
2 ≤ 20 Male Rotterdam Heart-related
3 ≤ 20 Male Rotterdam Cancer
4 20-30 Female Rotterdam Viral infection
5 20-30 Male Zwolle No illness
6 20-30 Male Zwolle Tuberculosis
7 20-30 Male Zwolle Heart-related
8 20-30 Female Utrecht Cancer
9 20-30 Female Rotterdam Heart-related
10 20-30 Female Utrecht Tuberculosis
Colours indicate an ‘equivalence class’ of quasi-identifiers

To make a dataset k-anonymous, you must first identify which variables in the dataset are identifiers, quasi-identifiers, and sensitive attributes. In the example above, Age, Sex and City are quasi-identifiers and Disease is the sensitive attribute. Next, you set a value for k. If we choose a k of 2, every row in the example dataset should have the same combination of Age, Sex and City as at least one other row. Finally, you generalise or suppress values so that every combination of quasi-identifiers occurs at least k times. In the example, this was done by generalising Age into age categories, but there may be other ways to reach 2-anonymity in this dataset.
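The final check can be scripted: group the rows by their combination of quasi-identifiers and take the smallest group size. A minimal sketch in Python, using the 2-anonymous example above (quasi-identifiers only; the sensitive attribute plays no role in the k-anonymity check):

```python
from collections import Counter

# (age band, sex, city) for the 10 rows of the generalised example dataset.
rows = [
    ("<=20",  "Male",   "Rotterdam"),
    ("<=20",  "Male",   "Rotterdam"),
    ("<=20",  "Male",   "Rotterdam"),
    ("20-30", "Female", "Rotterdam"),
    ("20-30", "Male",   "Zwolle"),
    ("20-30", "Male",   "Zwolle"),
    ("20-30", "Male",   "Zwolle"),
    ("20-30", "Female", "Utrecht"),
    ("20-30", "Female", "Rotterdam"),
    ("20-30", "Female", "Utrecht"),
]

def k_anonymity(rows):
    """Smallest equivalence-class size = the k this dataset satisfies."""
    return min(Counter(rows).values())

print(k_anonymity(rows))  # 2
```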

There is no single value for k which you should always choose. The higher the k, the more difficult it will be to identify someone, but likely your dataset will also become less granular and perhaps less informative. The value of k will be highly dependent on what you communicated to data subjects (e.g., you may have promised a certain k) and the risk of identification that you are willing to accept.

The video below gives an example of how k-anonymity can work in practice:

L-diversity

L-diversity is an extension of k-anonymity that ensures there is sufficient variation in a sensitive attribute. This is important because if all individuals in a (subset of a) dataset share the same value for the sensitive attribute, there is still a risk of inference. For example, in the 2-anonymous dataset below, you can infer that any female from Rotterdam between 20 and 30 who participated had a viral infection (a “homogeneity attack”). Similarly, if you know that your 25-year-old female neighbour from Utrecht participated in this study, you learn that she suffers from cancer (a “background knowledge attack”).

2-anonymous dataset
Nr Age Sex City Disease
1 ≤ 20 Male Rotterdam Viral infection
2 ≤ 20 Male Rotterdam Heart-related
3 ≤ 20 Male Rotterdam Cancer
4 20-30 Female Rotterdam Viral infection
5 20-30 Male Zwolle No illness
6 20-30 Male Zwolle Tuberculosis
7 20-30 Male Zwolle Heart-related
8 20-30 Female Utrecht Cancer
9 20-30 Female Rotterdam Viral infection
10 20-30 Female Utrecht Cancer
Colours indicate an ‘equivalence class’ of quasi-identifiers
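The background knowledge attack described above takes only a few lines to carry out: the attacker filters the published table on the quasi-identifier values they already know. A sketch against the 2-anonymous dataset above:

```python
# (age band, sex, city, disease) — the published 2-anonymous table above.
published = [
    ("<=20",  "Male",   "Rotterdam", "Viral infection"),
    ("<=20",  "Male",   "Rotterdam", "Heart-related"),
    ("<=20",  "Male",   "Rotterdam", "Cancer"),
    ("20-30", "Female", "Rotterdam", "Viral infection"),
    ("20-30", "Male",   "Zwolle",    "No illness"),
    ("20-30", "Male",   "Zwolle",    "Tuberculosis"),
    ("20-30", "Male",   "Zwolle",    "Heart-related"),
    ("20-30", "Female", "Utrecht",   "Cancer"),
    ("20-30", "Female", "Rotterdam", "Viral infection"),
    ("20-30", "Female", "Utrecht",   "Cancer"),
]

# The attacker knows a 25-year-old female neighbour from Utrecht participated.
candidates = {disease for age, sex, city, disease in published
              if age == "20-30" and sex == "Female" and city == "Utrecht"}
print(candidates)  # {'Cancer'} — only one disease remains, so it is disclosed
```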
2-anonymous 2-diverse dataset
Nr Age Sex City Disease
1 ≤ 20 Male Rotterdam Viral infection
2 ≤ 20 Male Rotterdam Heart-related
3 ≤ 20 Male Rotterdam Cancer
4 20-30 Female * Viral infection
5 20-30 Male Zwolle No illness
6 20-30 Male Zwolle Tuberculosis
7 20-30 Male Zwolle Heart-related
8 20-30 Female * Cancer
9 20-30 Female * Viral infection
10 20-30 Female * Cancer
Colours indicate an ‘equivalence class’ of quasi-identifiers and sensitive attributes (* = City suppressed)

K-anonymity does not protect against such homogeneity and background knowledge attacks. Therefore, l-diversity requires that there be at least l different values of the sensitive attribute per combination of quasi-identifiers. In the example above, if we choose an l of 2, that means that for each combination of Age, Sex and City, there are at least two distinct diseases. Here, we suppressed City for the homogeneous cases, so that any female between 20 and 30 years old could have either cancer or a viral infection.
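Checking l-diversity can be scripted in the same way as k-anonymity: per equivalence class, count the distinct sensitive values and take the minimum. A minimal sketch, using "*" for the suppressed City values:

```python
from collections import defaultdict

# (age band, sex, city, disease); "*" marks a suppressed City value.
rows = [
    ("<=20",  "Male",   "Rotterdam", "Viral infection"),
    ("<=20",  "Male",   "Rotterdam", "Heart-related"),
    ("<=20",  "Male",   "Rotterdam", "Cancer"),
    ("20-30", "Female", "*",         "Viral infection"),
    ("20-30", "Male",   "Zwolle",    "No illness"),
    ("20-30", "Male",   "Zwolle",    "Tuberculosis"),
    ("20-30", "Male",   "Zwolle",    "Heart-related"),
    ("20-30", "Female", "*",         "Cancer"),
    ("20-30", "Female", "*",         "Viral infection"),
    ("20-30", "Female", "*",         "Cancer"),
]

def l_diversity(rows):
    """Smallest number of distinct sensitive values in any equivalence class."""
    classes = defaultdict(set)
    for *quasi, sensitive in rows:
        classes[tuple(quasi)].add(sensitive)
    return min(len(diseases) for diseases in classes.values())

print(l_diversity(rows))  # 2
```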

As with k-anonymity, there is no perfect value of l, although it is usually greater than 1 and less than or equal to k.

The video below explains the concept of l-diversity using an example:

T-closeness

T-closeness ensures that the distribution of a sensitive attribute within each group of individuals who share the same (generalised) quasi-identifiers is close to the distribution of that attribute in the entire dataset. In other words, it ensures that the sensitive attribute is not skewed towards a specific value within a group of similar individuals, which could otherwise be used to infer something about someone. For example, if a dataset contains Age (quasi-identifier), Sex (quasi-identifier) and Income (sensitive attribute), and t-closeness is applied with t = 0.1, then for each combination of Age and Sex, the distance between the distribution of Income in that group and the distribution of Income in the entire dataset (typically measured with the Earth Mover's Distance) must be at most 0.1.
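A sketch of a t-closeness check. Here the distance between distributions is the total variation distance, which for a categorical attribute with all values "equally far apart" coincides with the Earth Mover's Distance used in the original t-closeness formulation; for a numerical attribute such as exact Income, the full Earth Mover's Distance over ordered values would be needed instead:

```python
from collections import Counter, defaultdict

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def t_closeness(rows):
    """Largest distance between any equivalence class's sensitive-value
    distribution and the overall distribution (total variation distance)."""
    overall = distribution([row[-1] for row in rows])
    classes = defaultdict(list)
    for *quasi, sensitive in rows:
        classes[tuple(quasi)].append(sensitive)
    worst = 0.0
    for values in classes.values():
        local = distribution(values)
        dist = 0.5 * sum(abs(local.get(v, 0.0) - overall.get(v, 0.0))
                         for v in set(overall) | set(local))
        worst = max(worst, dist)
    return worst

# (age band, sex, income band): the first class is all-"Low", so it sits
# 0.25 away from the overall 75/25 Low/High split.
rows = [
    ("20-30", "Female", "Low"),
    ("20-30", "Female", "Low"),
    ("30-40", "Male",   "Low"),
    ("30-40", "Male",   "High"),
]
print(t_closeness(rows))  # 0.25 — this dataset satisfies t-closeness for t >= 0.25
```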

T-closeness can get complicated quite quickly. If you’re curious how it works, the video below explains the concept of t-closeness using an example:

When to use

K-anonymity, l-diversity and t-closeness are usually applied to de-identify tabular datasets before they are shared. They are also most suitable for relatively large datasets (i.e., containing a large number of individuals), as such datasets are more likely to retain detail (utility) after de-identification (source).

Implications for research

  • It is very easy to lose much of the (granularity of the) data when satisfying the k-, l- or t-criteria: the stricter the criteria, the lower the risk of re-identification, but the more information you lose. The balance between privacy and utility is therefore an important consideration when applying these privacy models.
  • The more variables (quasi-identifiers) a dataset contains, and the more outliers there are, the more difficult de-identification will be without losing too much information (as shown here).
  • If a dataset is k-anonymous, l-diverse or t-close, that does not mean that the dataset is also considered anonymous under the GDPR. The degree of anonymity after applying these approaches depends entirely on your choices of k, l or t, on the variables you included, and on the context of your dataset. For example, if you failed to treat a quasi-identifier as such when k-anonymising your dataset, your dataset is in reality not k-anonymous.

Further reading