How to use data subsetting to improve test data management

by Valent Cerovečki August 13, 2020

In IT, subsetting is the process of making a large database smaller in order to aid the development of software that uses databases for information storage, as well as bug fixing and testing the software. A subset is the result of subsetting.

Transforming a big database into many smaller subsets

In the beginning, it’s all about what the users want

Just as every good storyteller slowly builds the plot of his or her story towards its inevitable conclusion, when explaining technical concepts, we also start by fleshing out the answers to two of the most basic questions: who, and why?

The who would be a developer, who has an idea, or an idea is handed to him, and the database is the world, lore, and the canon where the story will be set. He has to work within it, follow its rules and in the end listen to the editor. Depending on the target audience he must limit the vocabulary and grammar structures he uses to enable the readers to enjoy his work. If it’s a children’s story it will have a completely different structure, pace, and words than if it’s an adult’s book, even if both have the same topic. We could call this process, subsetting of the volume of vocabulary the author has in order to help his readers relate to the story mode.

Almost every application today has a database behind it. Depending on an application’s scope, its corresponding database can be huge. But subsetting isn’t normally considered until a database’s capacity has reached about a hundred gigabytes. That doesn’t mean that smaller databases can’t or shouldn’t be subsetted; rather, it just means that, given the current cost of disk space, subsetting a small database might be overkill.

For example, if we take a live application whose database has a couple of terabytes of real data, and a developer who needs to implement a new feature or fix a bug, we can easily see some problems. For starters, he definitely shouldn’t do such work on a production database. To copy such a database to his local system or a test database, his local system or test database would need the same amount of disk space. But of course, our developer’s smart, so he just copies the data he needs.

But one day, he needs data from more tables, and those tables have their dependencies, and those dependencies rely on other tables and so on. He ends up writing hundreds of lines of SQL code to get the data he needs to his test database. If the data set is still large, or the production system has capping or throttling issues, the process will take up a lot of time and resources. It will be error-prone, and if the definition of the production database changes, the scripts will need to be refreshed.

In spite of these issues, our developers finish his work, and the code is ready to be tested. Now the tester needs to set up the data, to satisfy his own destructive needs. He also needs the scripts, but perhaps accompanied by some other data. Perhaps he’s testing the code on a mobile app, which needs more data to do extra housekeeping. Along the way, he remembers that he doesn’t know SQL, nor does he have any idea what the couple of hundred tables with names like T125_CnTaRf mean. He wastes some of the database developers and database administrators’ time and manages to test the code. It goes into production. Life goes on.

One day, a bug appears, and both the developer and the tester will need to revisit their old scripts and make new copies of the database in order to fix it. And so the cycle repeats itself.

If we add more developers to the team, then the option of a shared database becomes almost an impossibility. Who knows what that new guy or girl in the other team will do to the development database?

The plot thickens

Linked tables which need to be subsetted

How would one even start with subsetting a database? What tools to use if any to use? The possible answers fall into three broad categories. They all have their pros and cons, which we’ll briefly mention, but our focus will be on technical aspects and higher-level concepts.

1.  Dedicated tools for subsetting seem like the obvious choice when we need to make copies of huge databases. But almost all of them have one or more of the following problems:

  • They’re expensive
  • Using them properly requires a lot of technical knowledge
  • Users need a lot of domain knowledge about their database’s definition
  • They’re targeted at a specific database
  • If you find a sensible tool that someone with technical knowledge – but not necessarily SQL knowledge – can use, give us a shout.


2.  Making a couple of subsets and having them as backups somewhere, perhaps as a Docker image, is another possible solution. Each developer can restore a copy that best suits his needs and use it. Of course, someone will need to maintain those copies, and when the production database definition changes he will need to add the changes to the subsets and propagate the changes to the users. He will also need to make many subsets to cover various scenarios (eg. minimal, one user with a lot of transactions, performance).


3.  Finally, there’s the option of writing custom scripts. Although it’s the most powerful solution since we have complete control over every aspect of data transformation and migration. We believe it isn’t really necessary to explain the problems of writing custom scripts for a thousand table systems and a few dozen different scenarios.

Deus ex Machina

Logical entity level

To ease communication between various people in the subsetting process chain, we’ll outline an approach that we’ve found quite useful and versatile.

In talking about databases, it’s only natural to think of data in the context of rows in a table. While a row of data might have a meaning all of its own, it’s usually useless without some surrounding data to give it meaning. A bank transaction data row, for example, can carry some useful data, but in a broader sense, it’s of limited use without the data specifying who paid whom. Similarly, an address won’t be of much use without the name of the person who works or resides at that address.

Here’s where the concept of a logical entity comes in, which is a different way of looking at the data stored in rows. A logical entity named ‘customer’, for example, could derive their first and last name from a Customer table, their home city from an Address table, their email address from a Contacts table, and so on.

A logical entity is also domain-oriented, depending on the perspective of the person looking at it. For a customer service representative, a logical entity could have just a first and last name and a contact phone number. For a technician, it might also have an address where a customer is located.

If we were to add more and more domain views to a customer, we would eventually get the entire database definition. Because of this, it’s more practical to differentiate logical entities solely on needs. With this approach, our previous examples could be named ‘Customer for CS’ and ‘Customer for TS’.

As we add more and more logical entities of different types, we get to the semantic level of data definition. At this, semantic level, a customer simply has transactions, instead of table ‘Customer’ being linked to table ‘Account’ which is linked to table ‘Transaction’ via either the ‘payor’ or ‘payee’ key.

We can also add specializations to our logical entities, so the logical entity ‘customer’, for example, could have four specializations based on the date of birth, underage, senior, and adult. Such specializations can again help in subset explanation and definition by expressing the request in simpler terms. (e.g. Give me transaction records of customers in the senior – underage range around the time of any underage customer’s birthday.)

The resolution

Now that we’ve set everything up, we get to the heart of the matter. We can divide the responsibilities and roles of people without much overlap. The tester doesn’t have to do his SQL queries; the database developer doesn’t have to know the business logic, and both of them don’t have to explain in great detail to one another what they need.

We can divide the subsetting application into various parts. On the most abstract level, the owner of the test can specify his logical entities and how they interact, without providing any details. She could, for instance, say that:

  • a customer has a first name, the last name, date of birth and credit card accounts
  • a company is a business account
  • company A is a company with the registration number XXX
  • a transaction has an amount and a date
  • my wanted subset focuses on customers that paid an amount above $100 to company A.

It is then up to the database developer or administrator to map all the logical entity parameters to the actual tables. An example of a logical entity customer could be:

  • a first name, last name and date of birth from a ‘Private Customers’ table
  • a credit card account from a ‘Credit Cards’ table that’s linked to an ‘Accounts’ table through a foreign key, and the ‘Accounts’ table’ is linked to the ‘Private_Customer’ table via another foreign key.

The subset is the logical entity ‘Customer’, which is linked to the logical entity ‘Transaction’ through the ‘Payer_ID’ column, which, in turn, is linked to the ‘Transaction’ entity through the ‘Payee_ID’ column.

Later on, the subset designer can easily add a couple of more properties to the logical entity and notify the implementation to update the definition.

Once the subset has been defined and implemented, it’s a matter of pressing a button, selecting the source and target, and waiting for the progress bar to finish. The rest of the subsetting process is a matter of technical implementation, but the main benefit of the laid-out process is the separation of responsibility, the ease of communication, the centralized place of definition and storage, and the versatility.

If all of this sounds logical and interesting, we’re on the same wavelength.

Do you have any questions regarding subsetting and/or data anonymization software?

Related Articles