Introduction to Data Masking
What is Data Masking
Data masking is a method of creating a structurally similar but inauthentic version of an organization’s data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.
Data is said to be masked if the original value is replaced with a similar value while keeping the format and semantics of the data the same. The data may be altered in a number of ways, including encryption, character shuffling, and character or word substitution. Whatever method is chosen, the values must be changed in a way that makes detection or reverse engineering impossible.
Types of Data Masking
There are two types of data masking:
- Static Data Masking
- Dynamic Data Masking
Static Data Masking
Organizations use static data masking when they create test, QA, or development environments. They often outsource development and testing to contractors or other IT companies in a different geographical location. In doing so, they must give the developers production-like data for testing, so that the system being developed works as intended in production, without leaking any sensitive customer data. Static data masking ensures that sensitive data is masked before it leaves the organization’s production environment.
Dynamic Data Masking
Dynamic data masking is designed to mask data in real time for live production systems. Sensitive data is masked at the moment it is accessed, so the real values never leave the database. When a user who is not authorized to see the sensitive values queries the production database, the data shown is masked and the real data is never exposed.
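The read-time behaviour described above can be sketched in a few lines. This is a minimal illustration, not how any particular database implements dynamic masking: the `CUSTOMERS` table, the role names, and the `read_customer` and `mask_value` helpers are all hypothetical, and real products usually apply such rules inside the database or a proxy in front of it.

```python
# Hypothetical in-memory "production" table; the stored values are never modified.
CUSTOMERS = {
    1: {"name": "John Smith", "ssn": "123-45-6789"},
}

# Assumption: only these roles may see unmasked sensitive fields.
PRIVILEGED_ROLES = {"compliance_officer"}
SENSITIVE_FIELDS = {"ssn"}


def mask_value(value: str, visible: int = 4) -> str:
    """Replace every letter or digit with 'X', except in the last `visible` characters."""
    cutoff = len(value) - visible
    return "".join(
        "X" if ch.isalnum() and i < cutoff else ch
        for i, ch in enumerate(value)
    )


def read_customer(customer_id: int, role: str) -> dict:
    """Return a copy of the row, masking sensitive fields at read time."""
    row = dict(CUSTOMERS[customer_id])  # copy, so the stored row is untouched
    if role not in PRIVILEGED_ROLES:
        for field in SENSITIVE_FIELDS:
            row[field] = mask_value(row[field])
    return row
```

Calling `read_customer(1, "support_agent")` returns the row with the SSN masked, while the stored record stays unchanged. The masking decision is made per request, which is what distinguishes this from static masking.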
Why Mask Data?
Legal Requirements
The regulatory environment surrounding the duties and obligations of data holders to protect the information they maintain is becoming increasingly rigorous in just about every legal jurisdiction.
Loss of Confidence and Public Relations Disasters
In most locations it can reasonably be said that if a data escape happens at your organization, the formal legal sanctions applied by governmental bodies are not the only problem you will face.
Malicious Exposure
Most people think the major risk to the information they hold is external entities out to break in and steal the data. The assumption then follows that protecting the network and firewalls is the appropriate and sufficient response. There is no denying that such protection is necessary. However, it has been shown that in many cases the data is stolen by malicious insiders who have been granted access to it.
Accidental Exposure
The risk of accidental exposure of information is often neglected when considering the security risks associated with real test data.
Key Requirements of Data Masking
Data Masking should be irreversible
- Once the data is masked, there should be no way to get back the original data.
- There should be no correlation between the input and the masked data from which the original data can be traced.
Preserve the semantics & structure of original data
- Masked data should preserve the format of the original data. Numbers should be masked to numbers only, and letters to letters only.
There should be no way to differentiate masked data from original data
- Masked data should not look different from the original data in a way that lets someone tell the two apart.
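The format requirement can be checked mechanically. The sketch below is one possible check, under the assumption that "format" means the same length and the same character class (digit, upper-case letter, lower-case letter, or identical punctuation) at every position; the function names are illustrative only.

```python
def char_class(ch: str) -> str:
    """Classify a character; punctuation and spaces are their own class."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return ch  # separators such as '-' or ' ' must match exactly


def preserves_format(original: str, masked: str) -> bool:
    """True if masked has the same length and per-position character classes."""
    return len(original) == len(masked) and all(
        char_class(a) == char_class(b) for a, b in zip(original, masked)
    )
```

For example, `preserves_format("ABCDE1234F", "QWXTZ9081K")` is true, while a value blanked out as `"XXXXXXXXXX"` fails the check because digits were replaced with letters. A check like this covers only the format requirement; it says nothing about irreversibility.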
What data to mask
The following are some of the categories of data that should be masked:
- Names (First Name, Last Name, Full Name, Initials)
- Email Addresses
- Identification Numbers (SSN, Driving License, PAN Card, Aadhaar Card, Passport, etc.)
- Face, Fingerprints
- Medical Reports, X-Ray Images
- Credit Card Numbers
- Date Of Birth, Marriage Date
- Addresses
- Bank Account numbers
- Bank Balance
Where does sensitive data reside
Sensitive data can be found in any of the following data stores:
- Databases such as Oracle and Sybase
- Files
- XML files
- Messages (SWIFT etc.)
Software systems are inherently complex, and data is often transferred between internal systems, between companies, and even across countries. Hence it is of utmost importance to protect sensitive data.
Data Masking tool Architecture
In its simplest form, a data masking setup contains the following components:
Source data store
- This can be a database, file, message, or any other store that contains sensitive data.
Masked data store
- This can again be a database or file, where the masked data is stored after data masking is applied.
A data masking tool, which contains the following components:
- Masking engine
- Policy data store
- Algorithms repository
Data Masking process
The following steps are involved in the data masking process:
Data Discovery
The process of identifying fields that contain sensitive data, for example the tables and columns in a database that hold sensitive values.
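One common way to approach discovery is to sample rows and test the values in each column against known patterns. The sketch below assumes that approach; the `PATTERNS` table, the 0.8 threshold, and the `discover_sensitive_columns` function are hypothetical choices for illustration, and real tools typically also inspect column names and metadata.

```python
import re

# Illustrative patterns only; real discovery rules are broader and locale-aware.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "card_number": re.compile(r"\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}"),
}


def discover_sensitive_columns(rows, threshold=0.8):
    """Return {column: category} for columns whose sampled values mostly match a pattern.

    rows is a list of dicts sharing the same keys (a sample of one table).
    """
    findings = {}
    columns = rows[0].keys() if rows else []
    for column in columns:
        # Ignore empty cells so sparse columns are judged on real values.
        values = [str(r[column]) for r in rows if r.get(column) not in (None, "")]
        if not values:
            continue
        for category, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v))
            if hits / len(values) >= threshold:
                findings[column] = category
                break
    return findings
```

Given sample rows with a `contact` column of email addresses and a `tax_id` column of SSN-shaped values, the function flags both columns and leaves ordinary columns such as an `id` or `city` alone.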
Application Configuration
Configuring the application in the data masking tool and assigning the appropriate algorithm to each sensitive field.
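As a rough sketch of what such a configuration might look like, the example below models a policy as a list of rules that map a table and column to a named algorithm, which the engine resolves through an algorithm registry. The `MaskingRule` class, the `ALGORITHMS` registry, and the two placeholder algorithms are assumptions made for illustration, not the format of any specific tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Algorithms repository: name -> masking function.
# Both entries are deliberately trivial placeholders, not real masking algorithms.
ALGORITHMS: Dict[str, Callable[[str], str]] = {
    "zero_digits": lambda v: "".join("0" if c.isdigit() else c for c in v),
    "fixed_name": lambda v: "Peter Parker",
}


@dataclass(frozen=True)
class MaskingRule:
    table: str
    column: str
    algorithm: str  # must be a key in ALGORITHMS


# Policy data store: which algorithm applies to which field.
POLICY: List[MaskingRule] = [
    MaskingRule("customers", "full_name", "fixed_name"),
    MaskingRule("customers", "phone", "zero_digits"),
]


def apply_policy(table: str, row: dict) -> dict:
    """Masking engine: apply the configured algorithm to each configured column."""
    rules = {r.column: ALGORITHMS[r.algorithm] for r in POLICY if r.table == table}
    return {col: rules[col](val) if col in rules else val for col, val in row.items()}
```

Keeping the policy separate from the algorithms means a new field can be protected by adding one rule, without touching the engine or the algorithm code.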
Running the masking process
After configuration, the masking process is initiated. It involves three steps:
- Extraction: sensitive data is queried from the source data store and held in memory.
- Masking: based on the configuration, the masking process applies the rules to the sensitive data.
- Loading: the masked data is loaded into the target data store.
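The extraction, masking, and loading steps can be sketched with Python's built-in `sqlite3` module. The `customers` table, its columns, and the `mask_digits` rule are hypothetical; in particular, shifting digits is a reversible toy rule used only to make the data flow visible, not a suitable masking algorithm.

```python
import sqlite3


def mask_digits(value: str) -> str:
    """Toy rule for illustration only: shift each ASCII digit by 3 (mod 10)."""
    return "".join(
        str((int(c) + 3) % 10) if "0" <= c <= "9" else c for c in value
    )


def run_masking(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    """Copy customers from source to target, masking account_no on the way."""
    # Extraction: read the sensitive rows from the source store into memory.
    rows = source.execute("SELECT id, account_no FROM customers").fetchall()

    # Masking: apply the configured rule to the sensitive column.
    masked = [(row_id, mask_digits(account_no)) for row_id, account_no in rows]

    # Loading: write the masked rows into the target store.
    target.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, account_no TEXT)"
    )
    target.executemany("INSERT INTO customers VALUES (?, ?)", masked)
    target.commit()
    return len(masked)
```

The source connection is only read from, so the production data is left untouched and only the masked copy reaches the target store.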
Data Masking Algorithms
Algorithms are the rules used to transform the source data into appropriate masked data. The following are the common types of algorithms:
Substitution Algorithms
- These algorithms replace the original value with a different value computed by the algorithm.
- The algorithm uses a key to generate the output from the input value provided.
- The algorithm generates a numeric value for a numeric input and an alphabetic value for an alphabetic input.
- These algorithms work well for fields such as numbers and alphanumeric values (for example, a PAN card number).
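A keyed, format-preserving substitution of the kind described in this list might look like the sketch below. It is an assumption about one way to do it, not a standard algorithm: it derives bytes from an HMAC-SHA256 of the whole input value and uses them to pick a replacement of the same character class at each position. The `substitute` function name and the key are illustrative.

```python
import hashlib
import hmac
import string


def substitute(value: str, key: bytes) -> str:
    """Replace each digit with a digit and each letter with a letter, driven by a key.

    The same key and input always give the same output. Values longer than
    32 characters reuse digest bytes, which a real implementation should avoid.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        else:
            out.append(ch)  # keep separators so the format is preserved
    return "".join(out)
```

For a PAN-shaped input such as `"ABCDE1234F"`, the output has five upper-case letters, four digits, and one upper-case letter, so downstream format validation should still pass. Because the output is deterministic for a given key, the same source value masks to the same result across tables, which keeps joins consistent; the key must therefore be kept secret.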
Lookup Algorithms
- These algorithms rely on dummy data, which they use to replace the original value in the given field.
- Data masking tools contain dummy data for names, addresses, etc.
- The original value (John Smith) is replaced with one of the dummy values, selected at random (Peter Parker).
- These algorithms work well for personal information such as names, company names, and addresses.
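A lookup replacement can be sketched as a random pick from a dummy list. The `DUMMY_NAMES` list and the `lookup_mask` function are hypothetical; real tools ship much larger dummy data sets.

```python
import random

# Hypothetical dummy data set; a real tool would ship thousands of entries.
DUMMY_NAMES = ["Peter Parker", "Mary Jane", "Bruce Banner", "Diana Prince"]


def lookup_mask(original: str, dummy_values: list, rng: random.Random) -> str:
    """Return a randomly chosen dummy value that differs from the original."""
    candidates = [v for v in dummy_values if v != original]
    # rng.choice raises IndexError if no candidate is left, which signals
    # that the dummy data set is too small to mask this value.
    return rng.choice(candidates)
```

Passing a `random.Random` instance rather than using the module-level functions makes the selection reproducible in tests, for example `lookup_mask("John Smith", DUMMY_NAMES, random.Random(0))`.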
Shuffle Algorithms
- These algorithms use neither lookup nor substitution.
- They shuffle the original values into different rows within the same table.
- They are used for data such as states and countries, where shuffling alone achieves the masking.
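Shuffling a single column across rows can be sketched as follows. The `shuffle_column` function and the row layout are illustrative assumptions; note that a plain random shuffle may leave some values in their original row, so tools that need every row to change would have to add a further check.

```python
import random


def shuffle_column(rows: list, column: str, rng: random.Random) -> list:
    """Return new rows with the values of one column shuffled across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)  # in-place shuffle of the extracted values only
    # Rebuild each row with its other columns intact and a reassigned value.
    return [{**row, column: value} for row, value in zip(rows, values)]
```

The set of values in the column stays exactly the same, so aggregate statistics such as the number of customers per state are preserved while the link between a row and its original value is broken.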