Data masking techniques: SQL scripts for data anonymization?
Recently, we have found ourselves in a discussion with a person that leads a testing department in a bank. We have been discussing the recent attempt by his organization to break up with the everyday practice of using copies of production databases for testing and development.
They were using copies for years and even though everyone was aware that a lot of people could access private information, no one was really making a big deal out of it until recently.
When someone finally realized that this was unjustly neglected, DBA team was requested to mask sensitive information before releasing copies to the rest of the world. They only had to mask some names and IDs in a few database tables, so they decided to write an SQL script. After some time, the things got complicated, the script became hard to maintain as they became aware of additional sensitive data that required some different data masking techniques.
The things were far from ideal with the execution of the scripts as well. Executions were resource intensive and started to affect normal database operations. Eventually, they started affecting not only testing and development but also production environments.
The point of the story is – experienced and skillful DBAs can write database scripts and solve the problem of masking sensitive data if things are kept simple. This, however, is rarely the case. Keep your eyes open, start with the big picture in mind, before taking that route. Here are the questions you should go through before writing a script and choosing between existing data masking techniques.
How complex could data masking (scripts) really be?
We want our masked data to be as good as the original, production data, right? To produce such data, our scripts must come with several masking algorithms that create data. Examples include names, birth dates, credit card numbers, account numbers, etc. Defining and developing such data masking techniques can take quite some time.
What about the referential integrity that needs to remain between tables and databases? This can be a challenging task if you consider that e.g. data about a certain customer can be stored in account, transaction and credit card tables, while at the same time records from other tables could hold references to it.
Anonymized scripts must perform well! We want millions or even billions of records to be masked in a reasonable time, i.e. hours. If a script execution takes hours, the script also has to provide some progress reporting because we don’t want to have a nervous DBA wondering about how much longer other database operations might be sluggish. In the end, we need verification to tell us if we masked the target data. Would that be a script number 2? More data masking techniques?
Ask yourself a question: how much time and effort are we willing to invest in developing and maintaining a system of scripts?
What about heterogeneous environments with different database types, such as Oracle and MS SQL?
In some special scenarios, scripts allow execution across different databases and database engines. However, due to the specific requirements put upon scripts for data anonymization (e. g. performance), it is more likely that data masking techniques such as scripts will be tightly coupled to the exact database and technology.
The worst-case scenario would be that your team must rewrite the script from scratch to be able to reuse the same masking logic on top of the other vendor’s database technology. Rewriting the scripts several times may seem doable, but what about the maintenance? Sure, it is still doable, but make sure you don’t underestimate the total amount of effort. For instance, Oracle databases are widely used and there are many options if you wan’t to mask data in Oracle databases.
Applications and data masking techniques evolve so scripts should evolve too!
Databases and applications are live organisms that grow and evolve continuously. New tables and columns of sensitive data may be added as part of the upgrade process. Data masking and the corresponding techniques should really be a part of the software life cycle. It requires not only database specialists, but also business experts, application programmers and testers, as well as security, auditing, and compliance professionals.
At some point, expect that the entire script needs a revision and update to accommodate new tables and columns added as part of an application patch or an upgrade. Changes introduced to source code and other IT system artifacts should be tracked using source control systems. We all know that scripts tend to become even less practical in that scenario.
The point is, maintaining the system of scripts for masking and validation may influence your day-to-day operations way more than you might think.
Whose responsibility is data masking?
Implementation of data masking belongs to the IT department in the organization, but the initiative will most likely start from the other end. Risk management, compliance and auditing departments will push IT people to find a way to protect sensitive data.
Since scripts tend to be monolithic programs, auditors would have no transparency into the masking procedures used in the scripts. The auditors would find it extremely difficult to offer any recommendation on whether the masking process based on a script is secure and offers the appropriate protection.
The introduction of the proper data masking technique should start with the analysis and design phases. First sensitive data is defined and business rules are set up. After that, the programming and testing phases define how to mask data for unit and system testing. If you think ‘scripts first’, you’ll most probably miss the target.
Is there a viable alternative to scripts?
If you are reading this article, it is very likely that you have missed the hype and you have not realized that data masking market has reached the midpoint of maturity. The adoption lags a bit behind the maturity of the technology, but it is getting momentum. Vendors and external service providers are thinking in terms of additional features for the users. Some complement data masking with test data management, others complement it with data security or application development.
Finally, it doesn’t really matter how good your organization is with developing SQL scripts. You’ll inevitably be unable to compete with dedicated data masking vendors in ever-changing requirements, regulations, and platforms.
You should probably conduct an appropriate selection process to make an optimal technology and vendor choice. There are many good vendors on the market, including BizDataX. Start by contacting us, and we will share our experience with you. Leave your information, and we will show what a specialized data masking tool can do for you.