Statistical approaches to de-identification

To protect datasets that contain personal or otherwise sensitive data, a growing number of statistical approaches to de-identification are available. To some extent, these approaches quantify how identifiable the data remain after de-identification.

In this chapter, we discuss the following approaches, which are the most widely used:

K-anonymity
L-diversity
T-closeness
Differential privacy

These approaches (also called privacy models) are not yet widely used in research practice, because they have some disadvantages and require resources and/or expertise to apply and interpret correctly. However, they are implemented in many de-identification tools and are useful for detecting specific sensitivities in (tabular) datasets. For those reasons, we introduce them in this chapter.
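
To give a first impression of what "quantifying identifiability" can look like for a tabular dataset, the sketch below computes the k of k-anonymity: the size of the smallest group of records that share the same values on a set of quasi-identifiers. This is a minimal illustration, not the method of any particular de-identification tool. The function name, the column names, and the example records are all made up for this purpose, and the choice of which columns count as quasi-identifiers is an assumption that has to be made per dataset.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values on all quasi-identifier columns.

    Assumes `rows` is a non-empty list of dicts (hypothetical layout).
    """
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values())


# Made-up example records; values are already generalised
# (age in ranges, postcode truncated).
rows = [
    {"age_group": "20-29", "postcode": "35**", "diagnosis": "asthma"},
    {"age_group": "20-29", "postcode": "35**", "diagnosis": "flu"},
    {"age_group": "30-39", "postcode": "35**", "diagnosis": "flu"},
    {"age_group": "30-39", "postcode": "35**", "diagnosis": "diabetes"},
    {"age_group": "30-39", "postcode": "35**", "diagnosis": "flu"},
]

# Assumption: age group and postcode are the quasi-identifiers.
k = smallest_group_size(rows, ["age_group", "postcode"])
print(k)  # 2
```

In this example every record shares its age group and postcode with at least one other record, so the dataset is 2-anonymous with respect to those two columns. A result of 1 would mean that at least one record is unique on the chosen quasi-identifiers.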