What is Data Masking? Techniques, Types and Best Practices (2024)

Michael Cobb

What is data masking?

Data masking is a method of creating a structurally similar but inauthentic version of an organization's data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required.

Although most organizations have stringent security controls in place to protect production data in storage and in business use, sometimes that same data element has been used for operations that are less secure. The issue is often compounded if these operations are outsourced and the organization has less control over the environment. In the wake of compliance legislation, most organizations are no longer comfortable exposing real data unnecessarily.

Data masking substitutes original values in a data set with randomized data using various data shuffling and manipulation techniques. The obfuscated data maintains the unique characteristics of the original data so that it yields the same results as the original data set.

How does data masking work?

Encryption is the best way to securely store and transfer sensitive data. Unfortunately, encrypted data is difficult to query and analyze. For example, you cannot filter users based on age if their data of birth is encrypted. As a result, organizations need another way to keep data secure and private when it is being used for research, development and testing.

This article is part of

What is data security? The ultimate guide

Which also includes:
The importance of data security in the enterprise
5 data security challenges enterprises face today
How to create a data security policy, with template

Data masking, which is also called data sanitization, keeps sensitive information private by making it unrecognizable but still usable. This lets developers, researchers and analysts use a data set without exposing the data to any risk.

Data masking is different from encryption. Encrypted data can be decrypted and returned to its original state with the correct encryption key. With masked data, there is no algorithm to recover the original values. Masking generates a characteristically accurate but fictitious version of a data set that has zero value to hackers. It also cannot be reverse engineered, and statistical outputs cannot be used to identify individuals.

Like data encryption, not every data field needs to be masked, although some fields must be completely hidden.

What is Data Masking? Techniques, Types and Best Practices (2)

Why is data masking important?

Various data protection standards and regulations require that businesses and other organizations protect personally identifiable information, or PII, and protected health information and keep it confidential. These standards and regulations include the following:

California Consumer Privacy Act (CCPA)
General Data Protection Regulation (GDPR)
Health Insurance Portability and Accountability Act (HIPAA)
Payment Card Industry Data Security Standard (PCI DSS)

These regulations and standards play an important role in establishing appropriate levels of data protection and preventing unauthorized users from accessing the data. However, they also make it challenging for companies that want to analyze or share their data with others. Data masking reduces the risks of sensitive data being exposed and lets enterprises comply with various standards and regulations while handling regulated data.

Data masking techniques

A variety of data management techniques can be used to mask or anonymize PII and other private and sensitive data depending on the data type. These masking methods include the following:

Scrambling

Scrambling randomly reorders alphanumeric characters to obscure the original content. For example, a customer complaint ticket number of 3429871 in a production environment could appear as 8840162 in a test environment after being scrambled. Although scrambling is easy to implement, it only works on certain types of data. Data obfuscated this way is not as secure as other techniques.

Substitution

This technique replaces the original data with another value from a supply of credible values. Lookup tables are often used to provide alternative values to the original, sensitive data. The values must pass rule constraints and preserve the original characteristics of the data.

It is harder to apply substitution than scrambling, but it can be applied to several data types and provides good security. For example, credit card numbers can be substituted with numbers that pass card provider validation rules.

Shuffling

Values within a column, such as user surnames, are shuffled to randomly reorder them. For example, if customer surnames are shuffled, the results look accurate but won't reveal any personal information. However, it is essential that the shuffling masking algorithm is kept secure so it cannot be used to reverse-engineer the data masking process.

Date aging

This method increases or decreases a date field by a specific date range. Again, the range value used must be kept secure.

Variance

A variance is applied to a number or date field. This approach is often used for masking financial and transaction value and date information. The variance algorithm modifies each number or date in a column by a random percentage of its real value. For instance, a column of employees' salaries could have a variance of plus or minus 5% applied to it. This would provide a reasonable disguise for the data while maintaining the range and distribution of salaries within existing limits.

Masking out

Masking out only scrambles part of a value and is commonly applied to credit card numbers where only the last four digits remain visible.

Nullifying

Nullifying replaces the real values in a data column with a null value, completely removing the data from view. Although this sort of deletion is simple to implement, the nullified column cannot be used in queries or analysis. As a result, it can degrade the integrity and quality of the data set for development and testing environments.

What is Data Masking? Techniques, Types and Best Practices (3)

Types of data masking

The process of masking data can be initiated in different ways depending on where and when the data is needed. Various types of data masking include the following:

Static data masking creates a separate masked data set from a production database that can be used in non-production environments, such as research, development and modelling. The masked data values must generate test and analytical results that mirror those of the original data and persist over time to ensure accurate and repeatable results.
Dynamic data masking provides role-based security, particularly in production systems, when users require real data. In response to a request for data, dynamic masking transforms, obscures or blocks access to sensitive information fields in real-time based on the user's role. For example, when a financial call center operator handles a customer query, sensitive fields such as birth date, social security number and credit score will be masked unless the operator has the privileges required to view those fields. By streaming data from the production environment in response to a request, it avoids the need to store the masked data in a separate database. However, if the data is located across multiple systems, masking consistency can be an issue.
On-the-fly data masking lets development teams read and mask a small subset of production data directly into a test environment. Data is masked as it is copied from one environment to another, so it is never present in an unmasked form in the target environment or the target database's transaction log. This approach eliminates delays incurred when a staging environment is used to prepare data. This makes it ideal for continuous software development environments.

Data masking challenges

There are several challenges involved with data masking:

Complicated. Data masking is not a simple, one-step activity. The data must be transformed to eliminate the risk of exposing sensitive information through inference attacks. At the same time, the system must maintain the complexity and unique characteristics of the original unmasked data, such as frequency distribution. This ensures that queries and analysis yield the same results as they would from the original.

Referential integrity. Masked data must maintain referential integrity across systems and databases. This is called deterministic data masking and ensures consistency of the masked output across databases and among servers, ensuring a value is always replaced with the same masked value. For example, the name "Helen" is always replaced by the value "Denise" wherever it occurs in the data being masked to preserve primary and foreign keys and relationships.

This may sound straightforward, but most databases are normalized for performance, and sensitive data gets stored in a variety of tables across one or more databases. Structured text, such as Extensible Markup Language (XML), can contain values that must be obfuscated, so they have to be parsed and masked.

Governance. For data used in software testing, the masking process must ensure data still matches internal rules governing the data attributes, such as zip codes or bank account numbers. If care is not taken here, the application will fail an update test because the data validation checks fail.

To overcome these challenges, database administrators must do a detailed review of the data being masked. Others involved with the data -- developers, testers, data scientists and security teams -- should contribute to the review. Input from these stakeholders ensures the appropriate masking techniques are used, data sources of valid replacement values are generated, referential integrity is maintained across all systems and the masked data maintains the characteristics of the original data.

Data masking best practices

Data masking is an important part of how organizations comply with privacy regulations and still are able to mine data for valuable insights. However, the ever-growing volume of structured and unstructured data that organizations collect has increased the complexity of data masking at scale. To remain compliant, it is essential to follow best practices:

Organize and track. Enterprise data is spread across a variety of technologies and locations in different databases, tables and columns. To ensure the correct data is protected, it's important to locate and classify sensitive data that requires protection.
Consider unstructured data. Images, PDFs, and XML- and text-based files must be protected. Images of items, such as passports, drivers licenses, receipts, checks and contracts, must be replaced with fake alternatives. Optical character recognition can be used to detect and mask sensitive content in files.
Secure. Access given to masked data must comply with security policies related to roles, locations and permissions.
Assess. Test the results of data masking techniques to ensure they provide the correct security level and that query results are comparable to those from the original data.

Data masking use cases

Data masking is used in many industries to drive innovation and improve services. For example, the banking and finance industries use it to develop and test new systems as well as improve fraud detection algorithms.

While data masking is used for different reasons, the main driver is data security and personal data privacy.

Data breach prevention. Data obfuscation techniques are the main defense against data breaches. Even if a network's defenses fail to prevent an attacker exfiltrating sensitive data, if it is correctly masked or encrypted, the attackers won't be able to associate the data with individuals.
Privacy by design. Implementing data masking as part of the application development, testing and analysis lifecycles lets important data be shared both internally and externally while remaining compliant with data privacy regulations.
Role-based access control. Many tasks require access to some, but not all, of a subject's data attributes. Data masking ensures employees can perform their duties without seeing data they are not authorized to view.
Faster, safer test data. Masked data retains the integrity and quality needed for testing without compromising the actual data. Automated data masking can be integrated with existing systems to remove time-consuming manual steps. This way, test data can be secured and released from a production environment quickly and without disruption. Data masking ensures production data is not exposed by more relaxed nonproduction environments or third-party developer access.

What is Data Masking? Techniques, Types and Best Practices (4)

Many database vendors offer features and services that simplify data masking and integrate them with existing systems.

Find out more about keeping data safe beyond data masking in our comprehensive guide to data security.

This was last updated in September 2022

Continue Reading About data masking

Data masking vs. data encryption: How do they differ?

Top 7 types of data security technology

How privacy compliance rules will affect IT security

How to secure data at rest, in use and in motion

What are some tips for storage of sensitive data?

Related Terms

backup storage device: A backup storage device is a hardware component for storing copies of data.Seecompletedefinition
chief privacy officer (CPO): A chief privacy officer (CPO) is a corporate executive charged with developing and implementing policies designed to protect ...Seecompletedefinition
cryptanalysis: Cryptanalysis is the study of ciphertext, ciphers and cryptosystems to understand how they work and to find and improve ...Seecompletedefinition

Dig Deeper on Data security and privacy

How to develop a test data management strategyBy: DavidWeldon
How to balance data access and security in fintech testingBy: MattHeusser
synthetic dataBy: KinzaYasar
Data security as a layer in defense in depth against ransomwareBy: JackPoller

FAQs

What is Data Masking? Techniques, Types and Best Practices? ›

Data masking replaces personal or sensitive information with random values, without any way to reveal the original ones. Data tokenization obfuscates sensitive data by replacing it with a meaningless token, for use in databases or internal systems. The tokenization of data process secures data at rest, and in motion.

Read On ›

What are data masking techniques? ›

Data masking is a way to create a fake, but a realistic version of your organizational data. The goal is to protect sensitive data, while providing a functional alternative when real data is not needed—for example, in user training, sales demos, or software testing.

Discover More Details ›

What are the different types of masking? ›

Static Data Masking (SDM)

It involves creating a copy of an existing data set and using masking techniques to scrub it of all sensitive information. This makes the data shareable without the risk of leaking sensitive information or giving access to unauthorized users.

What is the data masking technique that processes personal data in such a way that the data can no longer? ›

Specifically, the GDPR defines pseudonymization in Article 3, as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” To pseudonymise a data set, the “additional information” must be “kept separately and subject ...

See Details ›

What type of data should be marked for masking? ›

There are many types of data that you can protect using masking, but common data types ripe for data masking include: PII: Personally identifiable information. PHI: Protected health information. PCI-DSS: Payment card information.

Find Out More ›

What is masking explain with using example? ›

Image editing: Masking is often used in image editing applications to select part of an image for editing. For example, you might use a mask to select all the blue pixels in an image, so that you can replace them with another color. Create composite images: Masking can also be used to create composite images.

Tell Me More ›

What are the different types of masking in research? ›

A clinical trial design strategy in which one or more parties involved with the trial, such as the investigator or participant, do not know which participants have been assigned which interventions. Types of masking include none Open label, Single blind masking, and Double blind masking.

Show Me More ›

What is the type mask tool? ›

When you use the Horizontal Type Mask tool or Vertical Type Mask tool, you create a selection in the shape of the type. Type selections appear on the active layer, and can be moved, copied, filled, or stroked just like any other selection.

Explore More ›

What is a masking tool? ›

Masking tools let you apply an adjustment to one specific area of a photo, rather than the entire image.

What is another word for data masking? ›

Data Masking is a method to hide sensitive information.

As people interchangeably use such terms as data anonymization, data de-identification, data scrambling, data scrubbing, and data obfuscation , understanding which term defines which process gets confusing.

Show Me More ›

What are the benefits of data masking? ›

Data masking retains the characteristics and integrity of the original production data and helps organizations minimize data security issues while utilizing data in a non-production environment. This masked data can be used for analytics, training, or testing.

Read The Full Story ›

What is the difference between data hiding and data masking? ›

The difference between hiding and masking is that hidden variables don't appear at all, and masked values are replaced with asterisks in Trace and debug sessions.

See Details ›

What is masking format? ›

A format mask is a character string that can be used to define how data is to be displayed or printed in PxPlus (see PRINT directive). Masks can also be applied to filter data being received from the keyboard (see INPUT directive) or in the conversion/validation of a string (see STR( ) function).

Get More Info Here ›

What does masking include? ›

Masking may involve suppressing certain behaviours we find soothing but that others think are 'weird', such as stimming or intense interests. It can also mean mimicking the behaviour of those around us, such as copying non-verbal behaviours, and developing complex social scripts to get by in social situations.

How to test masked data? ›

Make sure that the result of the operation is still a valid date or time, and that start dates in the original data still occur before end dates in the original data. Names, such as customer names or personal names. Replace characters in names in the original data using a fixed or random substitution scheme.

What is the meaning of masked data? ›

Data masking or data obfuscation is the process of modifying sensitive data in such a way that it is of no or little value to unauthorized intruders while still being usable by software or authorized personnel. Data masking can also be referred as anonymization, or tokenization, depending on different context.

View Details ›

What is the difference between data redaction and data masking? ›

Data Redaction vs Data Masking

While data redaction is the process of removing certain pieces of sensitive or personally identifiable information, data masking is a process in which sensitive and authentic information is replaced with inauthentic information that has the same structure.

What is the difference between data masking and encryption? ›

Masking is clearly more secure than encryption but it renders data useless. Masked data has no value for anyone who intercepts or steals it. This information cannot be used for anything other than to run tests on software in a development state. Hackers don't want or care about masked development data.

Learn More ›