Anonymization with SQL scripts – masking or faking?
It has been a while since GDPR hit the headlines, and one might think that privacy protection challenges are far behind us. Unfortunately, this is far from the truth, especially when it comes to testing and other staging environments serving the internal IT part of the organization. IT crews are still struggling because they have been unable to escape the old way of doing things – using copies and clones of production databases containing real and sensitive data.
It seems that nowadays we have a ‘new normal’: using SQL scripts to mask only some of the sensitive information, usually the fields where missing masking is easy to spot, such as a person’s name. The rest of the sensitive data routinely remains unchanged.
This is why homemade scripts have failed to solve the problem. To make things even worse, organizations have claimed to have solved the problem so many times that they have started to believe the lie themselves.
Faking is now at an all-new level.
You shouldn’t be on your own and forced to take care of everything
We wrote about script-related challenges in anonymization long ago. Going that route is hard:
- Masking involves implementing a number of algorithms, such as credit card number generation and smart date-time manipulation.
- Additional development challenges arise, such as how to preserve referential integrity between tables, databases, and systems.
- SQL scripts are hard to develop and even harder to read and understand later, which leads to reusability and transparency issues.
- The best performance attainable with parallel processing is not available when using scripts. If you’ve met a person who could handle parallelism with scripts, please let us know, we would like to hire the genius (not kidding).
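To make the first two bullets concrete, here is a minimal Python sketch (all function names and the BIN prefix are hypothetical, not taken from any particular tool) of two typical masking algorithms: generating a fake but Luhn-valid card number, and shifting dates by a key-derived offset so intervals between dates belonging to the same record stay intact.

```python
import hashlib
from datetime import date, timedelta

def luhn_check_digit(digits: str) -> str:
    """Compute the Luhn check digit so the masked number still validates."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 0:          # double every second digit, counted from the check digit
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return str((10 - total % 10) % 10)

def mask_card_number(original: str, bin_prefix: str = "400000") -> str:
    """Deterministically derive a fake, Luhn-valid 16-digit card number."""
    h = hashlib.sha256(original.encode()).hexdigest()
    body = bin_prefix + "".join(c for c in h if c.isdigit())[:9]
    return body + luhn_check_digit(body)

def shift_date(d: date, key: str, max_days: int = 30) -> date:
    """Shift a date by an offset derived from a record key, so all dates of
    the same record move together and their intervals are preserved."""
    offset = int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)
```

Even this toy version hints at the effort involved: every data type needs its own generator, and each one has validity rules (check digits, date ranges, formats) that a naive random replacement would break.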
It may take weeks, if not months, before dedicated resources complete the first viable version of a data masking script. If an organization is unable to assign dedicated resources, it won’t just take more time, often it won’t happen at all.
DIY data masking is typically poorly documented and difficult to maintain in the face of growing data sets, evolving requirements, and changing personnel. IT environments change and management has other priorities, so it is common for scripts to be neglected and become unusable.
This is why 72% of scripts are not even functional.
With scripts one has to take care of too many things and improvise a lot. If you are really committed, you may manage the initial effort, but you will fall short on maintenance in the long run.
Consistency is important
Consistently masking data in every table, database and system should not be a utopian goal.
Yet again, it is very hard to accomplish when we have to deal with specific column variations and types: for example, separate columns for the first name and last name in one table, the full name in a second table, and a shortened version – the whole last name concatenated to just the first letter of the first name – in the table where email addresses are stored. The data masking script would have to handle all these variations and use substitutions that fit each one.
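The usual way to keep such variants consistent is deterministic substitution: derive the replacement from the original value, then build every column variant from the same replacement. A minimal Python sketch, assuming hypothetical name pools and column names (a real tool would use large substitution dictionaries):

```python
import hashlib

# Hypothetical substitution pools; a production tool would ship large dictionaries.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Erin"]
LAST_NAMES = ["Archer", "Baker", "Carter", "Dawson", "Ellis"]

def _pick(pool, key):
    """Deterministically pick a replacement: same input, same output, every run."""
    idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(pool)
    return pool[idx]

def mask_person(first, last):
    """Build every column variant the schema needs from one substitution,
    so the same person gets the same fake name in every table."""
    f = _pick(FIRST_NAMES, first.lower())
    l = _pick(LAST_NAMES, last.lower())
    return {
        "first_name": f,               # table one: separate columns
        "last_name": l,
        "full_name": f + " " + l,      # table two: full name
        "short_name": f[0] + l,        # email table: first initial + last name
    }
```

Because the replacement depends only on the original value, this also buys temporal consistency for free: rerunning the masking on a refreshed copy yields the same fake names as last time.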
Temporal consistency makes it possible to refresh the test environment and get the data that testers expect, not something completely new and random every two weeks. The masking process must be applied on every refresh, and for you to be able to prove that it really was, logging mechanisms are needed for traceability, compliance, and reference during audits.
Being consistent and thorough in finding all the places where sensitive data may be stored is probably reason enough to justify getting a specialized sensitive data discovery and tracking solution.
Databases were not designed with data masking in mind
Databases shine when it comes to storing huge amounts of data for a very long time or processing transactions that change only a few rows at a time. Data masking is a whole different type of endeavour, one that requires changing all the rows – millions or even billions of them – at once. SQL and scripts are database tools designed for transactions, and they suck at data masking. If you’ve tried it, you probably know that executing a data masking script is time-consuming, resource-intensive, and consequently costly and disruptive to other activities that rely on the affected database systems.
On the other hand, solid specialized data masking solutions are capable of processing huge numbers of database records without hurting the database that much. Hundreds of millions of records can be processed per hour while avoiding the uncontrolled growth of a database system’s transaction log. What happens if, after a significant amount of data has been processed, we run into an error? We would need magic for a script to be able to pick up where it left off instead of starting over.
Specialized solutions can do exactly that, and the feature may be just one press of a button away.
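The underlying pattern – commit in bounded batches and record a checkpoint so a failed run can resume where it stopped – can be sketched in a few lines of Python. This is an illustrative sketch using SQLite with hypothetical table and column names (`customers`, `email`, `mask_checkpoint`), not the implementation of any particular product:

```python
import sqlite3

BATCH = 1000  # commit in small batches to keep the transaction log bounded

def mask_in_batches(conn, mask_fn):
    """Mask rows batch by batch, recording the last processed id so the run
    can resume after an error instead of starting over."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS mask_checkpoint (last_id INTEGER)")
    row = cur.execute("SELECT last_id FROM mask_checkpoint").fetchone()
    last_id = row[0] if row else 0          # resume from the saved checkpoint
    while True:
        rows = cur.execute(
            "SELECT id, email FROM customers WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH),
        ).fetchall()
        if not rows:
            break
        for rid, email in rows:
            cur.execute("UPDATE customers SET email = ? WHERE id = ?",
                        (mask_fn(email), rid))
        last_id = rows[-1][0]
        cur.execute("DELETE FROM mask_checkpoint")
        cur.execute("INSERT INTO mask_checkpoint VALUES (?)", (last_id,))
        conn.commit()                       # one transaction per batch, not per run
```

Writing, tuning, and hardening this for every table and every database engine is exactly the kind of plumbing a specialized solution handles for you, parallelized and behind a button.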
Stop faking and start masking
While masking with DIY scripts may be better than doing nothing, scripts are not the right solution to the anonymization challenge. They may serve as a temporary fix, but beware even then: your team’s resources could be put to better use than fighting the challenges outlined above just to produce something that is more faking than masking.