Smart subsetting intro (part 1)

by Ekobit December 20, 2016
BizDataX smart data subsetting

Managing large databases in a testing environment comes at a cost. More often than not, testing teams require exclusive access to perform various testing activities, which results in several copies of the original (large) database. If we assume our production database is 3 TB in size, the numbers stack up quickly: in the time required to create and manage those copies, and in the hardware and licenses needed to run them.

Not only that: when teams run their tests, it makes a big difference whether they run against a table containing 1.5 billion records or one containing only 150 000 records.

Wouldn’t it be great if we could reduce the size of the database – let’s say to 10% of the original size – and have sensitive data masked along the way? Here is the good news: BizDataX can do exactly that. It can produce a small target database for testing teams that contains both realistic and consistent data. So, how do we achieve that, starting from a complex database model with 1000+ tables, some containing millions of records?

Database model

An easy way to define a subset of our database is to say:

Let’s take only 10% of customers and all the data in other tables that has anything to do with selected customers.

Selecting 10% of customers is easy enough, but what about the data that has ‘anything to do with customers’?

If a customer is in the subset, we expect all of their addresses and orders to be there as well. Similarly, for each order we expect all of its order details, and for each order detail we expect the related product.

This is where the database model comes in handy. We already have the relationships among the tables defined as foreign keys, so why not use that information to guide the subsetting algorithm and let it extract all the data for us?
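As a rough illustration of the idea, the sketch below walks a foreign-key graph from a set of seed rows: it pulls in dependent (child) rows and any referenced (parent) rows needed for consistency. The tables, columns, and traversal rules are simplified assumptions for this example, not the actual BizDataX algorithm.

```python
# Sketch: foreign-key-driven subsetting over in-memory "tables".
# All names here are illustrative; a real engine reads the database catalog.
from collections import deque

# rows keyed by primary key
tables = {
    "Customer":    {1: {}, 2: {}},
    "Order":       {10: {"CustomerId": 1}, 11: {"CustomerId": 2}},
    "OrderDetail": {100: {"OrderId": 10, "ProductId": 500},
                    101: {"OrderId": 11, "ProductId": 500}},
    "Product":     {500: {}},
}

# foreign keys: (child_table, fk_column) -> parent_table
foreign_keys = {
    ("Order", "CustomerId"):      "Customer",
    ("OrderDetail", "OrderId"):   "Order",
    ("OrderDetail", "ProductId"): "Product",
}

def subset(seed_table, seed_keys):
    """Collect the primary keys to keep per table, starting from seed rows."""
    included = {seed_table: set(seed_keys)}
    # (table, key, follow_children): rows pulled in only to satisfy a
    # foreign key (parents) must not drag in all of their own children,
    # otherwise a shared Product row would pull in every order in the database.
    queue = deque((seed_table, key, True) for key in seed_keys)
    while queue:
        table, key, follow_children = queue.popleft()
        if follow_children:
            # child rows referencing this row
            for (child, fk_col), parent in foreign_keys.items():
                if parent != table:
                    continue
                for child_key, row in tables[child].items():
                    if row.get(fk_col) == key and child_key not in included.setdefault(child, set()):
                        included[child].add(child_key)
                        queue.append((child, child_key, True))
        # parent rows this row references (needed for referential integrity)
        for (child, fk_col), parent in foreign_keys.items():
            if child != table:
                continue
            parent_key = tables[table][key].get(fk_col)
            if parent_key is not None and parent_key not in included.setdefault(parent, set()):
                included[parent].add(parent_key)
                queue.append((parent, parent_key, False))
    return included

result = subset("Customer", [1])
# keeps customer 1's order, its details, and the shared product,
# but none of customer 2's data
```

Note the `follow_children` flag: it is one possible way to stop a row that is shared among many customers (a product, for example) from dragging the whole database back into the subset.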

Simple example

The Customer table in our model has around 100k records, and 10k of those customers are from Germany. Let’s subset our database to contain only those customers and their related data.

What information do we have to provide to the algorithm?

  • Source database connection string
  • Target (empty) database connection string
  • Starting (seed) table(s) + filter
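To make the three inputs concrete, here is a minimal configuration sketch. The keys, connection strings, and filter syntax are made up for illustration; this is not BizDataX notation.

```python
# Hypothetical subset settings (all names and values are illustrative).
subset_settings = {
    # where to read the full data from
    "source_connection": "Data Source=PRODSRV;Initial Catalog=ShopDb;Integrated Security=True",
    # an empty database that will receive the subset
    "target_connection": "Data Source=TESTSRV;Initial Catalog=ShopDb_Subset;Integrated Security=True",
    # the table(s) the algorithm starts from, plus the filter that picks the seed rows
    "seed": {
        "table": "Customer",
        "filter": "Country = 'Germany'",
    },
}
```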

Here is the same information in BizDataX notation:

Subset settings example


If we configure a package with these subset settings and execute it, we get our subset in the target database. Here is a comparison of the number of records in the source and target databases.

We can see that most of the tables end up with around 10% of the original data. The Product table keeps more, because there is only a small number of products and they are repeated across orders.
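A quick way to sanity-check a subset is to compare per-table row counts. The counts below are made up for illustration (only the 100k/10k Customer figures come from the example above):

```python
# Illustrative row counts; only the Customer figures match the example,
# the rest are invented to show the shape of the check.
source_counts = {"Customer": 100_000, "Order": 400_000,
                 "OrderDetail": 1_200_000, "Product": 500}
target_counts = {"Customer": 10_000, "Order": 40_100,
                 "OrderDetail": 120_300, "Product": 460}

# percentage of source rows kept per table
percentages = {table: round(100 * target_counts[table] / source_counts[table], 1)
               for table in source_counts}
```

Most tables land near 10%, while a small, heavily shared table like Product retains a much larger share.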

Demo execution

Real world example

Here is an example of a subset of a real-world banking database. In this case, the goal was to create a database with a small number of customers (2 600) and all of their related data.

We can see that the percentages in some tables are larger than expected. This is because the customer subset includes several banks, and banks have a large number of transactions.

Real world execution


Seeding options

Many options are available for selecting the records that serve as input to the subsetting algorithm. We can take a given number of records, provide a filter to select them, or execute custom code and supply the algorithm with the primary keys of the records we want to include.

  • First N records
    • First 10 000 customers
  • Filter criteria
    • All customers from Germany
  • Key lists
    • All customers having keys in {key1, key2, key3, …}
  • Sampling/distribution
    • Random sample of 10% of customers
  • Complex query
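The seeding strategies above can be sketched in a few lines. The `customers` list and its columns are invented for this example; each strategy simply produces the set of primary keys that seeds the algorithm:

```python
# Sketch of seed selection strategies; table and column names are hypothetical.
import random

customers = [{"id": i, "country": "Germany" if i % 10 == 0 else "Other"}
             for i in range(1, 101)]

# First N records
first_n = [c["id"] for c in customers[:10]]

# Filter criteria
germans = [c["id"] for c in customers if c["country"] == "Germany"]

# Key list (explicit primary keys, e.g. produced by custom code)
key_list = [1, 2, 3]

# Random sample of 10% of customers
random.seed(42)  # fixed seed so the sample is reproducible
sample = [c["id"] for c in random.sample(customers, k=len(customers) // 10)]
```

Whichever strategy is used, the output is the same kind of thing: a set of seed keys from which the algorithm starts expanding the subset.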

To be continued

Follow our blog and find out more on advanced subsetting topics:

  • What are an association’s principal and dependent tables, and how do we configure them to be included in or excluded from a subset?
  • What is data explosion and how to prevent it?
  • What if we don’t have all the foreign keys defined on database level?
  • How to optimize algorithm performance?