Anonymization and Pseudonymization in the context of GDPR
Nowadays, anonymization and pseudonymization are rarely mentioned outside the context of the European Union’s (EU) General Data Protection Regulation (GDPR). Anonymization involves protecting sensitive information by erasing the identifiers that associate an individual with certain data. Pseudonymization, on the other hand, is the processing of personal information in a way that prevents data from being attributed to a specific data subject unless additional information is used. The two may seem very similar, but there are important differences. Before we explore each process, though, it’s worth clarifying some of the key concepts defined in the GDPR.
In order to control the ways in which personal data is collected, stored and processed across the EU, the GDPR was implemented in 2018. By regulating how organizations handle data, the regulation increases the rights of individuals, giving them more control over their personal information and, therefore, their privacy.
The GDPR defines personal data as any information relating to either:
- an identified natural person (someone whose identity is already known)
- an identifiable natural person (someone who can be identified, directly or indirectly).
The direct identification of a person involves using (almost) unique identifiers to determine the identity of a natural person. Such identifiers include a person’s name and surname, their phone number or their social security number. In contrast, indirect identification uses one or more physical, genetic, economic, cultural or other characteristics that are specific to a natural person.
The GDPR states that personal data protection is a fundamental right of every individual and that the “processing of personal data should be designed to serve mankind”.
Processing of personal data
The processing of personal data is another key concept in the GDPR.
Personal data can only be collected for specified, explicit and legitimate purposes. For example, let’s say an art gallery uses the personal data of its patrons to run a workshop, which turns out to be a success. After the workshop, the gallery wants to use the same data for reasons unrelated to the workshop, such as sending the patrons unsolicited mail or contacting them in the hope that they will sign up for another workshop. Either way, such use of data after the workshop would not be allowed under the GDPR.
Organizations are also not allowed to collect data that is irrelevant to their activities. Using the above example, the same art gallery would not be able to collect medical data on its patrons, as such information would be irrelevant to its work.
Lastly, the GDPR states that data should be processed transparently and stored in a manner that ensures appropriate security. Such security includes storing data only for as long as it is necessary. Taking the example of the art gallery, this means two things: first, that the gallery would make its data storage and processing policy clear to its patrons; and second, that the data obtained from the patrons would be erased once the workshop had ended.
It may seem that there are so many things to watch out for when using personal data, but the GDPR does offer recommendations on how to handle personal data. One such recommendation is data anonymization – rendering data anonymous in such a way that natural persons can no longer be identified. The key benefit of data anonymization is that the processing of such anonymous data does not concern the GDPR as it falls out of scope.
The process of making data anonymous consists of two key points:
- Data must be stripped of sufficient elements so that it can no longer be used to identify a natural person by “all the means reasonably likely to be used”
- The process must be irreversible.
Returning to the example of our art gallery, if the gallery were to anonymize the personal data it holds about its workshop participants, the process would look something like this:
It’s important to note that the GDPR does not specify how, exactly, such a process should, or could, be performed; instead, the focus is on the outcome: personal data that has been appropriately anonymized. The regulation does, however, emphasize that anonymization techniques can provide privacy guarantees, but only if their application is engineered appropriately and implemented regularly. Moreover, given that practices in personal data re-identification are constantly being refined and improved, it’s also worth noting that anonymization techniques should be regularly reviewed and, where necessary, revised to ensure that stored data remains anonymized.
Now for some anonymization techniques! For an anonymization process to be successful, it’s often not enough to just erase elements of the data that directly identify an individual. An effective anonymization technique should, rather, encompass protection from the following key risks:
- Singling out – the ability to isolate some, or all, of the records that identify an individual
- Linkability – the ability to link at least two records concerning one individual
- Inference – the ability to deduce, with significant probability, the value of an attribute from the values of other attributes
That being said, any anonymization process should start with the removal of data that directly identifies an individual, such as their name and surname, a photograph and their physical, IP or email address. Here’s what the data from our art gallery example could look like after the removal of direct identifiers.
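As a rough illustration of this first step, here is a minimal Python sketch. The field names and the sample records are made up for this example and are not taken from the gallery's actual data; which fields count as direct identifiers is an assumption you would need to review for your own schema.

```python
# Hypothetical field names; adapt the set to your own schema.
DIRECT_IDENTIFIERS = {"name", "surname", "photo", "address", "ip_address", "email"}


def drop_direct_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the direct-identifier fields."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up sample records, for illustration only.
participants = [
    {"name": "Jane", "surname": "Doe", "email": "jane@example.com",
     "workshop": "Watercolor basics", "price": 45},
    {"name": "John", "surname": "Roe", "email": "john@example.com",
     "workshop": "Linocut printing", "price": 60},
]

stripped = drop_direct_identifiers(participants)
print(stripped)
```

Note that the remaining fields (here, the workshop and the price) can still act as indirect identifiers, which is why this step alone is not enough.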
To complete the anonymization process, generalization and randomization techniques should also be applied.
As you may have guessed, generalization techniques are used to generalize values of certain attributes. In the example below, the workshop price at our art gallery has been replaced with the workshop price range.
By doing this, it’s now much harder to re-identify the data as many records share the same attribute value. It’s worth noting, however, that the degree of generalization should not be set too high, as doing this will make every record fit into the same range, rendering the entire attribute useless.
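To make the trade-off concrete, the following Python sketch generalizes exact prices into ranges and counts how many records end up sharing each range. The prices, the function name and the bucket sizes are assumptions chosen for illustration, and the labels assume whole-dollar prices.

```python
from collections import Counter


def generalize_price(price, bucket_size=20):
    """Map an exact price to a range label such as '40-59'."""
    lower = int(price // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size - 1}"


prices = [42, 45, 58, 61, 75, 79]  # made-up workshop prices

# With a moderate bucket size, several records share each range.
group_sizes = Counter(generalize_price(p) for p in prices)
print(group_sizes)

# With an overly large bucket, every record falls into one range
# and the attribute no longer carries any information.
too_coarse = Counter(generalize_price(p, bucket_size=1000) for p in prices)
print(too_coarse)
```

Counting the records per range is a simple way to check whether a given bucket size leaves any record alone in its group, or lumps everything together.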
Randomization techniques alter the veracity of data, weakening the connections between data and individuals while still keeping the data useful. Such techniques include, but are not limited to, noise addition techniques and permutation techniques.
The purpose of noise addition techniques is to add ‘noise’ to an individual attribute’s values, making them imprecise without changing the overall distribution of the values. As an example, we could add or subtract a few dollars to or from our art gallery’s workshop price.
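A minimal sketch of noise addition in Python might look as follows. The prices, the `max_shift` parameter and the choice of uniform, symmetric integer noise are assumptions made for this example; symmetric noise keeps the overall distribution roughly, not exactly, the same.

```python
import random


def add_noise(prices, max_shift=3, seed=None):
    """Shift each price by a random whole amount in [-max_shift, max_shift]."""
    rng = random.Random(seed)  # a seed is used here only to make the demo repeatable
    return [price + rng.randint(-max_shift, max_shift) for price in prices]


prices = [42, 45, 58, 61, 75, 79]  # made-up workshop prices
noisy_prices = add_noise(prices, max_shift=3, seed=7)
print(noisy_prices)
```

With very low prices, the shifted value could become negative, so a real implementation would need to decide how to handle such edge cases.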
In contrast, as the name suggests, permutation techniques shuffle the values of an attribute among the records, weakening the connection between each record and its original attribute value.
In spite of their usefulness, however, these techniques can be tricky, especially the permutation technique. There’s no guarantee that it’ll weaken all the connections that the permuted data has with its original records, as illustrated in the example below. After applying the permutation technique, the first record says that the workshop “How to substitute humans in classical paintings with common household items?” was held in the Art Forge gallery. But, using data obtained elsewhere, one could find out that this workshop was organized by Acid Giselle’s Art Gallery. With this information, even after permutation, it would still be possible to determine the workshop’s location.
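The permutation idea can be sketched in Python as follows. The workshop names, prices and the function name are hypothetical; only the two gallery names come from the example above.

```python
import random


def permute_attribute(records, attribute, seed=None):
    """Return copies of the records with one attribute's values shuffled among them."""
    rng = random.Random(seed)  # a seed is used here only to make the demo repeatable
    values = [record[attribute] for record in records]
    rng.shuffle(values)
    return [{**record, attribute: value} for record, value in zip(records, values)]


# Made-up records, for illustration only.
workshops = [
    {"workshop": "Workshop A", "gallery": "Art Forge", "price": 45},
    {"workshop": "Workshop B", "gallery": "Acid Giselle's Art Gallery", "price": 60},
    {"workshop": "Workshop C", "gallery": "Art Forge", "price": 30},
]

shuffled = permute_attribute(workshops, "gallery", seed=1)
print(shuffled)
```

The set of gallery names is unchanged; only their assignment to records may differ. A shuffle can also leave some values where they were, and it does nothing against outside knowledge that ties a workshop to its real gallery, which is exactly the weakness described above.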
The other key process that’s used in data protection is pseudonymization. Pseudonymization reduces the likelihood of linking certain data with the original identity of a natural person.
The conceptual definition of pseudonymization can be summarized in a few key points:
- Data has reduced linkability with the original identity of a person
- The process is a useful security measure but is not a method of anonymization
- The process can be reversed – that is, the data can be re-identified
In practice, this means that:
- Direct identifiers are replaced with pseudonyms
- Data that links pseudonyms and identifiers is stored at a separate and more secure location
In spite of the apparent similarities between the two, it’s worth noting that pseudonymization is a process in its own right, not another form of anonymization. This means that pseudonymity can lead to re-identification, which, in turn, means that pseudonymized data is subject to the GDPR.
Getting back to our art gallery example, pseudonymized data could look something like this:
Unlike anonymization, at first it can be unclear how pseudonymization helps with data protection. But a closer look reveals that its benefits are twofold: first, it increases the security of sensitive data by using data that was stripped of identifiers in regular operations; and second, it allows data to be processed in any way, provided that its sensitive elements are replaced with pseudonyms.
Here comes the most interesting part: the many techniques of pseudonymization and their categorization! We’ll categorize them according to the way they generate pseudonyms:
- Techniques where pseudonyms are independent of the original data
  - Tokenization
  - Data masking
- Techniques where pseudonyms are dependent on the original data
  - Encryption with a secret key
Techniques with pseudonyms independent of the original data
We’ll start with one of the simplest pseudonymization techniques: tokenization. Tokenization uses an independent algorithm to generate pseudonyms – called tokens – and assigns them to records. For instance, tokens can be randomly generated numbers as demonstrated in the table below.
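As a hedged illustration, tokenization with random numeric tokens could be sketched in Python like this. The records, the 12-digit token format and the function name are assumptions made for the example, not a prescribed scheme.

```python
import secrets


def tokenize(records, identifier):
    """Replace one identifier field with random tokens.

    Returns the pseudonymized records and the lookup table that maps
    original values to tokens. The lookup table is what allows
    re-identification, so it should be stored separately and securely.
    """
    token_by_value = {}
    used_tokens = set()
    pseudonymized = []
    for record in records:
        value = record[identifier]
        if value not in token_by_value:
            # Tokens are random and do not depend on the original value.
            token = f"{secrets.randbelow(10**12):012d}"
            while token in used_tokens:  # extremely unlikely, but keep tokens unique
                token = f"{secrets.randbelow(10**12):012d}"
            used_tokens.add(token)
            token_by_value[value] = token
        pseudonymized.append({**record, identifier: token_by_value[value]})
    return pseudonymized, token_by_value


# Made-up records, for illustration only.
participants = [
    {"name": "Jane Doe", "workshop": "Watercolor basics"},
    {"name": "John Roe", "workshop": "Linocut printing"},
    {"name": "Jane Doe", "workshop": "Linocut printing"},
]

pseudonymized, lookup = tokenize(participants, "name")
print(pseudonymized)
```

The same person receives the same token in every record, so records can still be linked to each other without revealing the name.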
Another pseudonymization technique, data masking, generates pseudonyms in a different way. This technique replaces old attribute values with new ones but retains the format and the semantics of the relevant values. This means that the structure of such pseudonymized data is, as shown in the table below, indistinguishable from the original data and, therefore, doesn’t restrict data manipulation in any way.
It’s important to note that the original data can sometimes be preserved, although, when it is, it’s not used in any further data processing.
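One simple way to picture data masking is a character-by-character replacement that keeps the shape of a value. The Python sketch below is only an assumption about how such masking could be done: it preserves the format (digits stay digits, letters stay letters, punctuation is untouched), while production masking tools usually go further to keep values realistic. The sample values are made up.

```python
import random
import string


def mask_value(value, rng=None):
    """Replace letters and digits with random ones of the same kind, keeping the format."""
    rng = rng or random.SystemRandom()
    masked = []
    for char in value:
        if char.isdigit():
            masked.append(rng.choice(string.digits))
        elif char.isupper():
            masked.append(rng.choice(string.ascii_uppercase))
        elif char.islower():
            masked.append(rng.choice(string.ascii_lowercase))
        else:
            masked.append(char)  # separators such as '+', '-', '@' and '.' are kept
    return "".join(masked)


masked_phone = mask_value("+1 555-0100")
masked_email = mask_value("jane.doe@example.com")
print(masked_phone, masked_email)
```

Because the masked values look like ordinary phone numbers and email addresses, code that validates or displays these fields keeps working on the pseudonymized data.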
Techniques with pseudonyms dependent on the original data
Pseudonymization techniques that use encryption create pseudonyms by encrypting identifying data with a secret key. This means that the only extra bit of information that needs to be stored, besides the pseudonymized data, is the encryption key. This technique is one of the safest ways to pseudonymize data, provided that the encryption algorithm used is, indeed, secure, and there’s no reasonable way to obtain the encryption key.
To sum up…
And that, in a nutshell, is how anonymization and pseudonymization are used under the GDPR. As we’ve seen, each process has its advantages and disadvantages. Some techniques, such as generalization techniques in anonymization, may be just what your organization needs, while the usefulness of others may require some more research and testing. Either way, we do hope this snapshot has given you a taste of the vast world of anonymization and pseudonymization – and an idea of what might be right for you.