Sensitive data discovery

by Filip Levačić October 26, 2020

When data anonymization becomes a necessity, our focus usually turns to picking the right data masking tool – one that will mask data correctly and in a timely manner while also smoothly integrating data anonymization into other business processes. Because regulations have to be met and a solution is needed at once, plenty of important steps in the data anonymization process are skipped over or even completely ignored. Preparing the process by specifying which data has to be masked, in what way and in which circumstances can save a lot of time during the actual anonymization process. However, knowing which data in your database is actually sensitive is a crucial step in defining a specification. Analyzing a database in order to find sensitive data is what we call sensitive data discovery, and this blog post will explain the basics of this activity. In particular, we’ll look at how sensitive data discovery tools are used, who should use them and the usual pitfalls to be aware of when using them.


What is sensitive data discovery?

Sensitive data discovery (SDD) is the process of analyzing a database (or another kind of data source) in order to find potentially sensitive data. Sensitive data ranges from various ID numbers, important dates, names and contact information to any other kind of data that can be used to uniquely identify a person. We stress the word potentially because whether certain data is sensitive or not often depends on the particular circumstances. A first name, for instance, isn’t necessarily sensitive by itself, especially when presented without a specific context. However, when combined with other information such as the last name and an address, it can be used to identify a person and, therefore, can become sensitive. Similarly, first names that are relatively rare or unique can also be considered sensitive.

In the data anonymization process, SDD is one of the first steps that has to be taken. Even if you think you know your database, sensitive data can often be found in unexpected places, especially if the data has been entered by many different users. SDD tools can provide a clearer view of the database and help define a proper masking specification. A masking specification is a document that contains information on data columns that have to be masked, including guidance on how to mask them, under what conditions, in which order and so on. Such a document can save a lot of time when data anonymization algorithms are being implemented. The first step in creating a masking specification is detecting columns that have to be masked, and SDD does just that. SDD results can even be used as a basis for creating a masking specification, saving precious time.

SDD can also be used separately, and its results can inform processes other than data anonymization. Data can be considered sensitive in scenarios other than GDPR compliance, such as data security. Detecting business data that can be dangerous if obtained by a malicious third party is the first step in providing more security to databases and servers containing such data. Moreover, gaining greater insight into the patterns in your data is always useful, especially in large and legacy databases that have seen many data administrators come and go. While simply querying the database can provide some information about the kind of data you store, it can be tricky to query for data patterns or for data whose presence in the database you’re not aware of. With SDD, however, it becomes easy to find various data patterns within your database, giving you a greater understanding of its contents. Once you’re familiar with your database’s data patterns, querying becomes far more effective, as you’re able to make your queries more precise.


How does it work?



In order to determine whether a column in a database contains sensitive data, its data has to be analyzed. While the best analysis would involve analyzing every database record, with large databases such analysis can be time-consuming and, often, pointless. A smaller data sample of several thousand records is often more than enough to run a comprehensive analysis. Different ways to extract a sample from a database can affect the speed of the analysis and the data quality, which makes choosing the right data sample an important step. Two ways of obtaining a data sample represent the extremes in analysis speed and data quality, with other methods falling somewhere in between.

The first of these key data sampling methods is taking a sample of grouped and indexed data, which makes it easy to access. This, in turn, reduces the overall analysis time. Although this method is fast, it produces a sample with similar data patterns, which can potentially make it less representative and lower its quality in the SDD process. For example, company policy could dictate that dates be entered in the transaction description, but if we take a sample from a period before the policy came into effect, these dates will not be detected, and the column containing them will not be marked as one containing sensitive data. However, if you only need to analyze and mask a certain section of your database, taking a sample from only that section could provide fast and reliable results, even with very large samples.

The data sampling method on the other end of the spectrum is random sampling. Taking data records from random locations in a database can be time-consuming when compared to the method we just looked at, but it does provide an unbiased sample that’s representative of the entire database table. A representative sample is very important in situations where you are not sure what your database contains, or where you want to run a second comprehensive analysis to compare with your own. The downside of this method is that, if you’re taking a large data sample, it tends to slow the process down. That being said, with the processing power of today’s computers, taking a random sample of, say, a hundred thousand records shouldn’t be much of a problem.

Both methods have their advantages and disadvantages, but they represent the extremes of the analysis speed/data quality spectrum. Custom sampling methods, such as ones using probabilities, can be used if you want a method that falls between these two extremes. The size of a sample can affect the results, with samples of several thousand (or several tens of thousands) records having proven to be a reliable choice in our experience. It should be remembered, though, that results can also depend on the size of a database and its tables. The default option for BizDataX SDD is random sampling of ten thousand records. This has proved to be the most reliable sampling method as it provides the best of both worlds: a fast sampling time and a representative sample.
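The difference between the two extremes can be sketched in a few lines. The example below is only an illustration, not how BizDataX takes its samples: it assumes an in-memory SQLite database and a hypothetical transactions table in which dates started appearing in descriptions halfway through the table's history.

```python
import sqlite3


def contiguous_sample(conn, table, n):
    """Take the first n rows in primary-key order: fast, but possibly biased."""
    # Table names cannot be bound as parameters; `table` is assumed trusted here.
    return conn.execute(
        f"SELECT * FROM {table} ORDER BY id LIMIT ?", (n,)
    ).fetchall()


def random_sample(conn, table, n):
    """Take n rows picked at random: slower on big tables, but unbiased."""
    return conn.execute(
        f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()


# Hypothetical data: only the newer half of the rows has a date in the description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, description TEXT)")
rows = [(i, "payment") for i in range(1, 5001)]
rows += [(i, "payment on 2020-10-26") for i in range(5001, 10001)]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)

first = contiguous_sample(conn, "transactions", 1000)
rand = random_sample(conn, "transactions", 1000)

# The contiguous sample only sees old rows, so it contains no dates at all.
print(sum("2020" in description for _, description in first))
# The random sample draws from the whole table, so about half its rows have dates.
print(sum("2020" in description for _, description in rand))
```

The contiguous sample misses the dates entirely, which is exactly the policy-change pitfall described earlier, while the random sample picks them up at the cost of a more expensive query.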



Once we have acquired a data sample, we have to analyze the data to determine which columns contain sensitive data. A simple way to do that is to look at table metadata. If, for instance, the name of a column is City, it is highly likely that it contains the names of cities, which can be regarded as sensitive as they are often a part of an address. Other types of sensitive data can also be stored in columns that have easily distinguishable names in the database, such as FirstName, Email, Phone and so on. Table metadata alone, however, can never, and should never, be used as a final result, but it can complement the analysis of actual data.
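A metadata check of this kind can be as simple as matching column names against a list of keywords. The sketch below is a minimal illustration: the keyword list and the function names are assumptions made for the example, not part of any real tool.

```python
import re

# Hypothetical keyword hints; a real tool would use a far larger, tuned list.
COLUMN_NAME_HINTS = {
    "first name": ("firstname", "fname", "givenname"),
    "last name": ("lastname", "surname", "familyname"),
    "email": ("email", "mail"),
    "phone": ("phone", "mobile", "tel"),
    "city": ("city", "town"),
}


def normalize(column_name):
    """Lowercase a column name and drop separators: 'First_Name' -> 'firstname'."""
    return re.sub(r"[^a-z0-9]", "", column_name.lower())


def hints_from_metadata(column_names):
    """Map each column to the sensitive-data types its name suggests."""
    result = {}
    for name in column_names:
        norm = normalize(name)
        matches = [
            kind
            for kind, keywords in COLUMN_NAME_HINTS.items()
            if any(keyword in norm for keyword in keywords)
        ]
        if matches:
            result[name] = matches
    return result


print(hints_from_metadata(["FirstName", "E_Mail", "Capacity", "Amount"]))
```

In this run FirstName and E_Mail are flagged as expected, but so is Capacity, simply because the name happens to contain "city". That false positive is a small reminder of why metadata can only complement the analysis of the data itself.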

Analyzing the data in a sample itself, however, can be more challenging. Various ID numbers, such as driver’s license numbers or social security numbers, can often be difficult for humans to tell apart, but they are also the easiest type of data to discover automatically. They usually have a defined length and include a control number (check digit) that is calculated in a very specific manner and can easily be used to detect these types of data. E-mails and phone numbers follow a strict format and are also fairly easy to detect. Discovering dates is also pretty straightforward due to a limited number of standard formats, but problems can arise if a date’s components (days, months, years) are saved in separate columns instead of being kept together according to a prescribed format. Dates that are saved in unconventional formats can also be problematic.
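Format-based checks like these can be sketched roughly as follows. Everything here is an assumption made for illustration: the e-mail pattern is deliberately loose, the list of date formats is far from complete, and the Luhn algorithm merely stands in for whatever control-number rule a particular ID type actually uses.

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # loose, illustrative pattern
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")  # assumed, not exhaustive


def is_email(value):
    return EMAIL_RE.fullmatch(value) is not None


def is_date(value):
    """True if the value parses with any of the known formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def passes_luhn(value):
    """Luhn check digit, used here only as a stand-in for an ID-specific rule."""
    if not (value.isascii() and value.isdigit()) or len(value) < 2:
        return False
    total = 0
    for position, char in enumerate(reversed(value)):
        digit = int(char)
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def match_rate(values, detector):
    """Share of non-empty values in a column sample accepted by the detector."""
    values = [v for v in values if v]
    if not values:
        return 0.0
    return sum(detector(v) for v in values) / len(values)


print(match_rate(["ana@example.com", "ivan@example.org", "n/a"], is_email))
print(is_date("26.10.2020"), is_date("October 2020"))
print(passes_luhn("79927398713"), passes_luhn("79927398710"))
```

A column whose sample scores a high match rate for one detector becomes a candidate for that data type. Note that a date split across separate day, month and year columns would slip past `is_date` entirely, as would any format missing from the list.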

First and last names are types of sensitive data that are probably the simplest to detect with the naked eye, but there’s no format or rule for a computer to follow in finding them. One way to solve this is by comparing values suspected of being first or last names with values from a large list of first or last names. These lists can also become useful during the masking process, as the names from such lists can be used as substitute values for masking real names. They can often be obtained from statistics bureaus or similar organizations. With a large and comprehensive list, results can be pretty accurate: even if only, say, 70% of records are found in the list, we can say with confidence that the analyzed column contains first names. We should, however, be wary of 100% accuracy results when dealing with these kinds of data, even if such wariness seems counterintuitive. This is because we can never assemble a list containing every possible name, so apparently perfect accuracy in a large sample is usually something that we should check. For example, a table containing a log of Oracle database errors will contain the string ORA in every record, and with Ora being a common name in certain countries, the SDD process will say that every record contains a name. A simple glance at other information, such as the table name, however, will tell us that this is not the case. Such an example can serve as a cautionary tale to carefully analyze and use results, instead of blindly trusting them.
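The list-based approach, including the wariness about perfect scores, might look something like the sketch below. The name list is a tiny stand-in for a real one, and the 70% threshold and the three verdict labels are assumptions chosen to mirror the discussion above.

```python
import re

# Tiny illustrative list; a real one would hold many thousands of names.
FIRST_NAMES = {"ana", "filip", "ivan", "maria", "john", "ora"}


def contains_name(value, names=FIRST_NAMES):
    """True if any alphabetic token in the value appears in the name list."""
    tokens = re.findall(r"[^\W\d_]+", value.lower())
    return any(token in names for token in tokens)


def assess_name_column(values, threshold=0.7):
    """Classify a column sample as 'unlikely', 'likely' or 'review'."""
    values = [v for v in values if v and v.strip()]
    if not values:
        return "unlikely"
    rate = sum(contains_name(v) for v in values) / len(values)
    if rate < threshold:
        return "unlikely"
    # No list holds every name, so a perfect score deserves a human look.
    if rate == 1.0:
        return "review"
    return "likely"


people = ["Ana", "Filip", "Ivan", "Maria", "Zvonimir"]
error_log = [
    "ORA-00942: table or view does not exist",
    "ORA-01017: invalid username/password; logon denied",
]
print(assess_name_column(people))     # 4 of 5 values are in the list
print(assess_name_column(error_log))  # every row "contains" the name Ora
```

The error log scores a perfect 100% purely because of the ORA prefix, so instead of being accepted outright it is flagged for review, where a glance at the table name settles the matter.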

Addresses are a type of sensitive data that can combine every other type of data previously discussed. Names of countries and cities can be found in easily obtainable lists. ZIP codes can be found in lists, and there are even country-specific formats. House or floor numbers are almost impossible to detect within sample data, as they can’t be distinguished from other simple numerical data; but all of these types of sensitive data often have meaningful metadata and are mostly found near one another in a database.

While you can’t account for every known data type, along with the specific logic targeting each type, many data patterns can still be discovered using more general techniques. Searching for certain keywords, custom lists of values or regular expressions can still produce a lot of information that can be useful when working with data patterns in your database.
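Such general-purpose rules are easy to express as a table of named regular expressions. The rule names, patterns and sample values below are hypothetical examples rather than a BizDataX rule format, and the IBAN-like pattern only checks the overall shape of a value, not its validity.

```python
import re

# Hypothetical rules: a format pattern, a keyword and a custom list of values.
RULES = {
    "iban-like": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "keyword: password": re.compile(r"password", re.IGNORECASE),
    "country (custom list)": re.compile(
        r"\b(?:Croatia|Germany|France)\b", re.IGNORECASE
    ),
}


def scan_column(values, rules=RULES):
    """Return, per rule, the share of non-empty values in which it matches."""
    values = [v for v in values if v]
    if not values:
        return {}
    return {
        name: sum(bool(pattern.search(v)) for v in values) / len(values)
        for name, pattern in rules.items()
    }


notes = [
    "Paid to HR1210010051863000160",
    "password reset requested",
    "shipped to Germany",
    "misc",
]
for rule, share in scan_column(notes).items():
    print(f"{rule}: {share:.0%}")
```

Even a low share can be worth a closer look: a free-text column in which only a fraction of the rows contain an account number is still a column that contains account numbers.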


In short…

To recap, SDD is used to find potentially sensitive data in your database. It can be used to search for data patterns to simply gain a better understanding of your data. In the context of data masking, its main use is providing information that helps you to decide what parts of your database need to be masked. SDD tools can provide many useful features in detecting sensitive data, but at the end of the day they are just tools. They cannot and should not be used in an automatic process; instead, they just provide the information that a user needs to decide whether certain columns should be masked or not.

It’s best that SDD is done by someone who knows the relevant database well, as their knowledge can complement the features that SDD tools provide. This will ensure that the analysis is done as thoroughly as possible. Production databases often contain massive amounts of data that can be decades old and that were entered by multiple users. Because of this, even the most skilled analysts can benefit from using a good set of data discovery tools in their work. Such analysts must make the most of the information provided by the tool they’re using in order for the resulting masking specification to be as good as possible. A thorough SDD is an important prerequisite for a quality masking specification, which, in turn, can help make the anonymization process simpler and faster. It can also help you avoid unexpected and costly surprises.

The BizDataX Discovery tool can be used completely separately from the data masking process, but its main value lies in being the basis of a good masking specification. The results of SDD can be analyzed and converted to a basic masking specification, which can then be further improved and refined in the BizDataX Portal. Good, step-by-step documentation is provided to guide you through the process and make the entire masking process as fast and painless as possible.

