Frequently asked questions for data masking tools

by Filip Pelivanović January 28, 2022

Data masking – also known as data anonymization – is as relevant today as it has ever been, given the new data privacy regulations coming into effect every year in various countries and territories, from the EU’s GDPR and Brazil’s LGPD to California’s CCPA and Serbia’s ZZPL. Failing to comply with such regulations, or to make a sufficient effort to protect personal and sensitive data, can result in heavy fines. Needless to say, when data is compromised, an organisation’s reputation can also be damaged.


We spent the last year talking to companies in various industries across Europe and the US to learn what kind of operational support they need. As our contacts told us, many companies wanted to become and remain compliant even before the relevant regulations came into force, mostly for information security reasons. But if that is the case, why do so many companies still struggle to comply with those regulations?

So what does data masking actually involve? Data masking anonymizes personal data by generating entirely new, but coherent, data with the same format, type and width as the original. When data is shared with other users in non-production environments, organizations must mask it to avoid exposing critical information.
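As an illustration of what “same format, type and width” means in practice, here is a minimal sketch (not BizDataX’s actual algorithm) that replaces each character with a random character of the same class, so a credit card number keeps its exact shape:

```python
import random
import string

def mask_value(value, seed=None):
    """Replace letters and digits with random characters of the same
    class, preserving format, type and width; punctuation is kept."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators such as '-' or ' '
    return "".join(out)

masked = mask_value("4539-1488-0343-6467")
# The result has the same ####-####-####-#### shape as the input.
assert len(masked) == 19 and masked.count("-") == 3
```

The masked value is new data, but any downstream code that expects a 19-character card number with dashes in fixed positions still works.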

Regulations aside, we also noticed that many companies across several industries still struggle to anonymize the personal data in their databases. From an information security point of view, data anonymization is worthwhile even with the regulations taken out of the equation. Testing teams, database admins and developers in these companies keep urging their superiors to put more effort into the anonymization process and to transform current practices so the company can become compliant. Many of these companies still use production clones that are not anonymized. Others rely on masking scripts or other improvised tools, which only partially solve some of the challenges.

What are the things you still struggle with when it comes to data anonymization? Here’s a list of some of the questions we have often been asked, as well as explanations of how our BizDataX data masking software works.

Usual steps in data masking

The first step is to locate the data, a phase often known as data discovery. The tool you use should be able to scan a database and detect sensitive data regardless of its structure or size. Once you have defined rules for tables, columns and data within the selected database, you label the sensitive data. The same rules can be run again in the next anonymization cycle, so once they are set you can repeat the procedure.
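Discovery rules are often pattern-based. The sketch below (an illustrative assumption, not BizDataX internals) labels a column as sensitive when most of its sampled values match a known pattern:

```python
import re

# Hypothetical discovery rules: regexes that flag likely sensitive columns.
PATTERNS = {
    "email":       re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "card_number": re.compile(r"^(?:\d[ -]?){13,19}$"),
    "ssn":         re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def discover(columns, threshold=0.8):
    """Label a column as sensitive if at least `threshold` of its
    sampled values match one of the discovery rules."""
    labels = {}
    for column, samples in columns.items():
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in samples if pattern.match(v))
            if samples and hits / len(samples) >= threshold:
                labels[column] = label
    return labels

sample = {
    "contact": ["ana@example.com", "joe@example.org", "mia@example.net"],
    "note":    ["call back", "left message"],
}
print(discover(sample))  # → {'contact': 'email'}
```

Sampling values rather than scanning whole columns keeps discovery fast on large databases; the threshold tolerates a few stray values in an otherwise sensitive column.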

The most used databases

We noticed that most of the companies we talked to use the Oracle database. Many also use Microsoft SQL Server, DB2, Informix, PostgreSQL and other databases as alternatives. Regardless of the database you use, you should be able to anonymize its data correctly and without difficulty. BizDataX works with all the databases mentioned, as well as many others.

What data to mask?

Sensitive data typically includes first and last names, addresses, credit card numbers, social security numbers and other personally identifiable information. Searching through thousands of tables, databases and systems, you will find a great deal of personal data, some of which you can then specifically label as sensitive. Different units need to collaborate on this, especially when there is doubt about the sensitivity of data that DBAs, testers or developers have found.

What is important is the consistency and referential integrity of the masked data. When you mask first names, for example, ‘Mark’ should always become ‘Josh’ across all databases, which is why deterministic masking is desirable.
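One common way to get that consistency – sketched here as an assumption, not as BizDataX’s implementation – is to derive the substitute from a keyed hash of the original value, so the same input always maps to the same output without storing a lookup table:

```python
import hashlib
import hmac

# Hypothetical substitution pool; a real tool ships large dictionaries.
FIRST_NAMES = ["Josh", "Anna", "Liam", "Maya", "Noah", "Iris"]
SECRET_KEY = b"rotate-me-per-project"  # keep out of source control

def mask_first_name(name):
    """Deterministically map a real name to a substitute: the same
    input yields the same output across runs and across databases."""
    digest = hmac.new(SECRET_KEY, name.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FIRST_NAMES)
    return FIRST_NAMES[index]

# 'Mark' gets the same substitute every time, in every database.
assert mask_first_name("Mark") == mask_first_name("Mark")
```

Because the mapping is keyed, rotating the secret key for a new project produces a fresh, unrelated mapping, while within one project referential integrity across tables and databases is preserved automatically.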


Do you need to mask and subset data?

Data subsetting is the process of extracting a smaller, representative section of a data set from a huge database. When utilizing production data in testing, the sheer volume of data is one of the most difficult technological hurdles. To address this, create realistic test databases that are small enough to allow for quick test runs but big enough to accurately represent the variety of production data.
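The key constraint when subsetting is referential integrity: child rows must only be kept when their parent rows are kept. A minimal sketch (illustrative, with made-up tables):

```python
# Start from a driver filter on the parent table, then pull only the
# child rows that reference the selected keys.
customers = [
    {"id": 1, "country": "DE"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "DE"},
]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},
    {"id": 12, "customer_id": 3},
]

subset_customers = [c for c in customers if c["country"] == "DE"]
kept_ids = {c["id"] for c in subset_customers}
subset_orders = [o for o in orders if o["customer_id"] in kept_ids]

# Every order in the subset still points at an existing customer.
assert all(o["customer_id"] in kept_ids for o in subset_orders)
```

In a real database this cascades through the whole foreign-key graph, which is exactly the logic a subsetting tool has to encode once and then reuse.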

It costs money to manage massive datasets in a testing environment. Once BizDataX is implemented, the subsetting logic can be reused via a web application available to business users: with the click of a button, they can set parameter values and create a new instance.

Database subsetting also allows you to decrease infrastructure and database query costs because smaller test databases might have a direct impact on your profits.

Data masking performance

How long does the execution of a masking script take? Assume, for example, one million records with three fields (e.g. ID number, first name, last name). The main limit is server speed: on some systems throughput is around 40,000 records per second, on others up to 200,000 records per second.

Worried about execution performance? Add a degree of parallelism: assigning more processors to the job helps optimize masking throughput.
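Since per-record masking is CPU-bound and records are independent, the work parallelizes naturally. A minimal sketch of spreading the load across worker processes (the masking function here is a trivial stand-in):

```python
from multiprocessing import Pool

def mask_record(record):
    """CPU-bound per-record masking (stand-in for the real transform)."""
    return {**record, "first_name": "MASKED"}

def mask_in_parallel(records, degree=4):
    """Spread the workload across `degree` worker processes."""
    with Pool(processes=degree) as pool:
        return pool.map(mask_record, records, chunksize=1000)

if __name__ == "__main__":
    rows = [{"id": i, "first_name": "Mark"} for i in range(10_000)]
    masked = mask_in_parallel(rows)
    assert all(r["first_name"] == "MASKED" for r in masked)
```

On an otherwise idle server, throughput scales roughly with the degree of parallelism until the database I/O, rather than the CPU, becomes the bottleneck.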

Remember, masking and subsetting sensitive data in non-production environments will help improve information security and significantly minimize compliance and infrastructure costs.


Are you curious about how to improve your data anonymization process overall?

Ask yourself and your team the following questions to get a better understanding of how tools like BizDataX can help you.

  1. Which databases do you want to anonymize? Oracle, SQL Server, or others?
  2. How many databases are you going to anonymize in the first run?
  3. What is the size of your databases (approximately)?
    • What is the amount of data in your largest table(s) that contains sensitive data?
    • What is the number of tables/columns (approx.) you need to mask?
    • What are the types of data you need to mask?
  4. Do you have information on where all the sensitive data is?
  5. Do you already have a specification for data masking?
