k-anonymity, l-diversity, and t-closeness
There are several methodologies for protecting privacy in data, especially when we plan to publish data or share it with others. For example, we may need to send data to a service like Amazon Mechanical Turk for data labeling, and we don't want that transfer to expose anyone's private information. The first methodology is k-anonymity, which was introduced in 1998. A dataset has k-anonymity (where k is a positive integer) if every combination of quasi-identifier (QI) values that appears in it is shared by at least k records, so that each individual is indistinguishable from at least k - 1 others. QIs are attributes that do not identify a person on their own but can do so in combination, such as age and zip code; they are usually generalized before the data is released. For example, we can convert ages to ranges and remove the last few digits of zip codes, and the resulting pair of age range and zip code prefix forms a tuple of quasi-identifiers.
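To make the definition concrete, here is a minimal sketch of measuring k for a tiny table. The records, the ten-year age ranges, the three-digit zip prefix, and the function names `generalize` and `k_anonymity` are all assumptions made up for illustration; they are not taken from the chapter's dataset or code.

```python
from collections import Counter

# Hypothetical records (age, zip code, test result), invented for illustration.
records = [
    (23, "80301", "negative"),
    (27, "80302", "positive"),
    (29, "80305", "negative"),
    (34, "80421", "negative"),
    (38, "80422", "positive"),
    (52, "80423", "negative"),
]


def generalize(age, zip_code):
    """Turn a raw age and zip code into a quasi-identifier tuple."""
    decade = (age // 10) * 10
    age_range = f"{decade}-{decade + 9}"  # e.g. 23 -> "20-29"
    zip_prefix = zip_code[:3] + "**"      # e.g. "80301" -> "803**"
    return (age_range, zip_prefix)


def k_anonymity(rows):
    """Return the size of the smallest group of rows sharing a QI tuple."""
    counts = Counter(generalize(age, zip_code) for age, zip_code, _ in rows)
    return min(counts.values())


print(k_anonymity(records))      # the lone 52-year-old forms a group of one
print(k_anonymity(records[:5]))  # without that record, the smallest group has two
```

The k of a dataset is set by its smallest group, so a single unusual record is enough to bring the whole table down to k = 1. In practice we would either generalize further (for example, wider age ranges) or suppress such records until the target k is reached.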
As an example, let's look at the simple dataset in the GitHub repository for this chapter of the book. This is a mock dataset that has HIV test results of individuals. Let's first...