Masked data? Synthetic data? Or both?
Copies of production databases with real data can make your work easier, especially when you need to test your application with a large amount of production-like data. Because of this, they’re used frequently and at all stages of software development. Preparing test data this way is fast and gives everyone involved realistic data to work with. But data protection regulations such as the GDPR place strict limits on using real personal data this way, so working on unmasked production copies is no longer an acceptable default. We therefore need a solution that protects the privacy of users and maintains the quality of the data without compromising development processes. There are three options: data masking, creating synthetic data, and a combination of both.
What is data masking?
Data masking involves replacing sensitive data with fabricated, yet realistic, data. The replacement preserves the format and integrity of the data and, once it is complete, the original values cannot be recovered from the masked ones. Apart from software development and testing, masked data is also often used in other non-production settings, such as user training.
Data masking doesn’t change the database that contains the original data. Instead, a copy of the database is made (whole or a subset), which is then masked and used for a specific purpose.
Here’s the result of some simple data masking. As you can see, the values have been changed in a way that makes their unmasking impossible. The structure, however, is the same, which makes the masked data look genuine.
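As a rough illustration of this kind of simple masking, here is a minimal Python sketch. It assumes the rows are plain dictionaries and that you already know which columns are sensitive; the column names, the sample records and the helper names `mask_value` and `mask_rows` are made up for this example. Each letter is swapped for a random letter and each digit for a random digit, so the structure survives but the original values do not. A real masking tool would normally substitute realistic names rather than random letters.

```python
import secrets
import string


def mask_value(value: str) -> str:
    """Replace every letter and digit with a random one of the same kind."""
    masked = []
    for ch in value:
        if ch.isdigit():
            masked.append(secrets.choice(string.digits))
        elif ch.isalpha() and ch.isupper():
            masked.append(secrets.choice(string.ascii_uppercase))
        elif ch.isalpha():
            masked.append(secrets.choice(string.ascii_lowercase))
        else:
            masked.append(ch)  # keep separators such as '-', '@' and spaces
    return "".join(masked)


def mask_rows(rows, sensitive_columns):
    """Return a masked copy of the rows; the original rows are not modified."""
    return [
        {
            col: mask_value(str(val)) if col in sensitive_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Made-up sample records standing in for a copy of production data.
original = [
    {"id": 1, "name": "Ana Novak", "phone": "041-555-123", "plan": "basic"},
    {"id": 2, "name": "Marko Kovac", "phone": "031-444-987", "plan": "premium"},
]

masked = mask_rows(original, {"name", "phone"})

for row in masked:
    print(row)
```

Because the replacement characters are drawn at random and no mapping is stored, the masked copy carries no information that leads back to the original values, while lengths, capitalisation and separators stay the same.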
Masking data – is it really that simple?
Before you start masking your organisation’s data, it’s worth considering some questions:
- Will I be sure I’ve masked everything that has to be masked? When you’re working with a copy of a production database, you don’t always know what, exactly, is in it.
- Will the integrity of the database be preserved? The application must still work when it runs against the new database with masked data. For example, the same first and last name will probably have to be masked to the same replacement value wherever they appear in the database.
- What domain and application knowledge will I need? Such knowledge will help you determine how to mask the data, yet at least some of this information won’t be in the database itself. Instead, you’ll probably have to look for it elsewhere. The database does not know everything.
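The integrity question above can be sketched in code. One possible approach, shown here only as an assumption and not as the one right method, is deterministic masking: a keyed hash (HMAC) of the original value picks the replacement from a pool, so the same input always gets the same replacement in every table. The pool of names, the key and the table layouts are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical replacement pool; a real project would use a much larger list.
REPLACEMENT_NAMES = ["Alex", "Jamie", "Morgan", "Robin", "Sam", "Taylor"]


def consistent_mask(value: str, key: bytes, pool: Sequence[str]) -> str:
    """Map a value to a pool entry; equal inputs always give equal outputs."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Assumption: the key is kept secret and never shipped with the masked copy.
key = b"demo-key-keep-out-of-test-environments"

# Two made-up tables that both contain the same person's first name.
customers = [
    {"customer_id": 1, "first_name": "Ana"},
    {"customer_id": 2, "first_name": "Marko"},
]
invoices = [
    {"invoice_id": 10, "first_name": "Ana"},
    {"invoice_id": 11, "first_name": "Ana"},
]

masked_customers = [
    {**row, "first_name": consistent_mask(row["first_name"], key, REPLACEMENT_NAMES)}
    for row in customers
]
masked_invoices = [
    {**row, "first_name": consistent_mask(row["first_name"], key, REPLACEMENT_NAMES)}
    for row in invoices
]

print(masked_customers)
print(masked_invoices)
```

With a small pool, different original names can end up with the same replacement. That is usually harmless for names, but it matters for columns that must stay unique, which is exactly the kind of domain knowledge the last question refers to.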
Masked data, or synthetic data?
In spite of data masking’s popularity, it may sometimes be better to produce new data, known as synthetic data, from scratch instead of masking existing data. When deciding between the two, it’s important to determine the kind of data you’ll need, as well as the time and cost required to prepare it. In my experience, masking is faster than generating new data because you only have to produce replacement values for the sensitive data, rather than for all of the data. Masking is also easier because the database structure and the rest of the data stay as they are and can be reused once you’re done. In contrast, when you produce new data, you have to generate a complete set that’s meaningful, which means answering questions such as “Is this field numeric or text?” for every field. In short, unless you’ve got very specific requirements and a lot of time on your hands, data masking is the way to go.
That being said, there are situations where creating synthetic data might be your only option. For example, if you’re working on a new application, you can’t mask the data because it hasn’t been created yet. Because end users haven’t been able to use the application yet, your production database won’t contain the data needed to test new features. In this case, your only option is to generate synthetic data.
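To give a feel for what generating data from scratch involves, here is a minimal sketch under stated assumptions: the schema, the column names and the value ranges are invented for the example, and every field has to be described explicitly (numeric or text, which range, which length) before any row can be produced.

```python
import random
import string

# Hypothetical schema for a table that has no production data yet.
SCHEMA = {
    "customer_id": {"type": "int", "min": 1, "max": 99999},
    "name": {"type": "text", "length": 8},
    "balance": {"type": "float", "min": 0.0, "max": 5000.0},
}


def generate_value(spec, rng):
    """Produce one value that matches the declared type of a field."""
    if spec["type"] == "int":
        return rng.randint(spec["min"], spec["max"])
    if spec["type"] == "float":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if spec["type"] == "text":
        letters = rng.choices(string.ascii_lowercase, k=spec["length"])
        return "".join(letters).capitalize()
    raise ValueError(f"Unsupported field type: {spec['type']}")


def generate_rows(schema, count, seed=None):
    """Generate `count` synthetic rows; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [
        {col: generate_value(spec, rng) for col, spec in schema.items()}
        for _ in range(count)
    ]


rows = generate_rows(SCHEMA, 5, seed=42)
for row in rows:
    print(row)
```

Random values like these satisfy the types but say nothing about relationships between fields or tables, which is where most of the real effort in synthetic data goes.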
At other times, especially when huge amounts of production data exist, a combination of data masking and synthetic data might be best. One such situation is when part of the production data cannot be masked for some reason and has to be thrown out and replaced with new records. Another is when an existing data set is incomplete and, therefore, not good enough for some test cases. The data set then has to be supplemented with newly created synthetic data.
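The second case, supplementing an incomplete data set, might look roughly like the sketch below. It assumes the masked rows are already available as dictionaries and that a test case needs at least three overdrawn accounts, which the masked data happens not to contain. All names, fields and values are hypothetical.

```python
import random


def supplement(masked_rows, is_covered, make_row, needed, seed=None):
    """Add synthetic rows until `needed` rows satisfy the `is_covered` check."""
    rng = random.Random(seed)
    existing = sum(1 for row in masked_rows if is_covered(row))
    missing = max(0, needed - existing)
    next_id = max((row["id"] for row in masked_rows), default=0) + 1
    extra = [make_row(next_id + i, rng) for i in range(missing)]
    return masked_rows + extra


def is_overdrawn(row):
    return row["balance"] < 0


def make_overdrawn_row(row_id, rng):
    """Create one synthetic account with a negative balance."""
    return {
        "id": row_id,
        "name": f"Synthetic Customer {row_id}",
        "balance": -round(rng.uniform(1, 500), 2),
        "synthetic": True,  # flag so synthetic rows can be told apart later
    }


# Made-up rows standing in for already masked production data.
masked_rows = [
    {"id": 1, "name": "Qzt Lpwre", "balance": 120.50, "synthetic": False},
    {"id": 2, "name": "Mkd Horvs", "balance": 0.00, "synthetic": False},
]

combined = supplement(masked_rows, is_overdrawn, make_overdrawn_row, needed=3, seed=7)
for row in combined:
    print(row)
```

Marking the added rows with a flag is a design choice for this sketch; it keeps the masked and the synthetic parts of the data set distinguishable when test results are analysed.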
In short, the solution you choose will depend on your situation. Data masking will probably be your first choice, as it’s a lot easier to mask existing data than to create data from scratch. By its nature, masked data is also better at mimicking the characteristics of your original data. But as we’ve seen, there are also situations where creating synthetic data might be your only viable option. In the end, it may very well be a combination of both that works best for you: data masking supplemented, where appropriate, with some synthetic data.
To find out more about the possibilities of data masking and synthetic data, book an appointment and talk to us directly.