Cryptography for Mere Mortals #4

An occasional feature, Cryptography for Mere Mortals attempts to provide clear, accessible answers to questions about cryptography for those who are not cryptographers or mathematicians.

Q: What do people mean by “Data Masking”?

A: This is a confusing term, because it can mean at least two different things, both related to data protection/privacy:

1. Encrypted or tokenized data that is converted back to plaintext, but with some of the characters “masked” by characters such as “x” or “*”; for example, a Social Security number of “999-88-1234” might be returned as “XXX-XX-1234”

2. Production data that is obscured or obfuscated for testing or development

In the first case, the goal is to avoid any possibility that end-users will be able to access the entire value—typically a credit card number, Social Security number, or other account number—while allowing them to see enough of the value to verify it. For example, credit card customer service representatives typically ask the customer to provide the last four digits of their credit card number to make sure they are viewing the correct account record. Those customer service representatives do not need to see the entire number; indeed, allowing them to do so exposes the company to risk, because an unscrupulous employee could conceivably copy those numbers down and use or sell them. The Voltage SecureData Web Services Server provides this form of masking, as does the Voltage SecureData z/Protect product for z/OS.
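For the curious, here is a minimal Python sketch of what this kind of display-time masking might look like. The function name and formatting rules are illustrative assumptions, not the actual Voltage SecureData API:

    # Minimal sketch of display-time masking: everything except the last
    # four alphanumeric characters is replaced with "X", and separators
    # such as "-" or spaces are left alone. Illustrative only.
    def mask_value(value: str, visible: int = 4, mask_char: str = "X") -> str:
        kept = 0
        out = []
        # Walk from the end so the trailing `visible` characters stay clear.
        for ch in reversed(value):
            if ch.isalnum():
                out.append(ch if kept < visible else mask_char)
                kept += 1
            else:
                out.append(ch)  # keep separators as-is
        return "".join(reversed(out))

    print(mask_value("999-88-1234"))          # XXX-XX-1234
    print(mask_value("4111 1111 1111 1111"))  # XXXX XXXX XXXX 1111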

The second case relates to the fact that while test and development systems need realistic-looking data to operate upon, such systems are typically less secure than production systems, so the risk of a data breach is higher. In addition, developers and testers should not have access to live data, for the same reasons as the customer service representatives above.

Further complicating the picture, there are multiple techniques for performing this obfuscation:

1. Shuffling

2. Tables & rules

3. Random values

4. Format-Preserving Encryption

Each of these has different strengths and weaknesses.

Shuffling: Shuffling is useful when the set of possible values is known—for example, the 50 US states. Existing values are replaced by other values from within the set, hopefully in a consistent fashion to maintain referential integrity (e.g., NY always becomes CA). This can be appealing because the obfuscated data strongly resembles real data: names are still recognizable names, etc.

The problem with this approach is, of course, that the set of possible values is often not predictable (addresses, for example). A tool can analyze the existing data and shuffle the current set of values, but when a new value is added, there is then no good way to shuffle it, since the existing set is already mapped.
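A minimal Python sketch of shuffling over a known value set follows; the state list, fixed seed, and function names are assumptions made purely for illustration, with the seed serving to keep the mapping consistent across runs:

    # Shuffling sketch: a keyed (seeded) permutation of a known value set,
    # so a given value is always replaced by the same other value.
    import random

    STATES = ["NY", "CA", "TX", "FL", "WA", "IL"]

    def build_shuffle_map(values, seed=42):
        shuffled = list(values)
        random.Random(seed).shuffle(shuffled)
        return dict(zip(values, shuffled))

    SHUFFLE_MAP = build_shuffle_map(STATES)

    def shuffle_mask(state: str) -> str:
        # A value outside the original set has no mapping -- the
        # "new value" weakness described above.
        return SHUFFLE_MAP[state]

    print(shuffle_mask("NY"))  # always the same replacement state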

Tables and rules: This approach is similar to shuffling, in that tables of values are built and rules are applied to replace production values. Again, the resulting data looks and feels “real”, and integrity is maintained since a given value is always replaced by the same value. Because tables of values are maintained, “extra” values can be pre-allocated, solving the “new value” problem.

The downside is that maintaining these tables is cumbersome, requiring configuration decisions, databases, and maintenance as the data grows. It is also worth noting that for audit purposes, test data that is not instantly identifiable as being obfuscated can require additional effort to prove to an auditor that the obfuscation is, in fact, in place and working.
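A sketch of what a table-based approach might look like, including a pool of pre-allocated “spare” values for the new-value problem; every name and table entry here is made up for illustration:

    # Tables-and-rules sketch: a hand-maintained replacement table plus
    # pre-allocated spares. In practice the table lives in a database and
    # must be persisted and maintained as the data grows.
    REPLACEMENTS = {
        "Alice Smith": "Carol Jones",
        "Bob Brown":   "Dave White",
    }
    SPARES = ["Erin Green", "Frank Black", "Grace Gray"]  # pre-allocated extras

    def table_mask(name: str) -> str:
        if name not in REPLACEMENTS:
            # New production value: consume a spare and record the mapping.
            REPLACEMENTS[name] = SPARES.pop(0)
        return REPLACEMENTS[name]

    print(table_mask("Alice Smith"))  # Carol Jones, every time
    print(table_mask("Hank Hall"))    # takes a spare, e.g. Erin Green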

Random values: This involves replacing the production data with randomly generated values. The resulting data is easily identified as not being “real” because values are gibberish (names are not normal names, etc.).

However, unless a database of mappings is built, referred to, and maintained, referential integrity will be lost, because a given value in one obfuscated data source will not produce the same value in another obfuscated source. Many or most applications using that data will then fail when they try to combine data from disparate sources. If such a database is built, it adds significant ongoing management cost.
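A sketch of random-value replacement; the in-memory dictionary here stands in for the mapping database that would have to be persisted and shared across data sources to preserve referential integrity:

    # Random-values sketch: without the mapping, the same input would get
    # different gibberish in each data source, breaking referential
    # integrity; keeping the mapping is the ongoing management cost.
    import random
    import string

    _mapping = {}  # stands in for a persisted mapping database

    def random_mask(value: str) -> str:
        if value not in _mapping:
            _mapping[value] = "".join(
                random.choice(string.ascii_uppercase) for _ in value
            )
        return _mapping[value]

    print(random_mask("Alice"))  # e.g. "QZKPT" -- clearly not real data
    print(random_mask("Alice"))  # same gibberish, because the mapping is kept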

Format-Preserving Encryption: Using Format-Preserving Encryption to obfuscate data avoids most of the pitfalls of the other approaches. It guarantees referential integrity, requires no ongoing management of an ever-growing database and ruleset, and provides test data real enough for most or all applications, yet easily identified as not being live data that was somehow leaked into test.

FPE allows continued use of existing index fields and can be applied to many different data types. Care should be taken to apply appropriate policy in some cases, such as birth and death dates, to avoid, for example, a death date that precedes the birth date. This caveat applies to all masking technologies, and there are ways to handle such situations.
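To make the format-preserving idea concrete, here is a deliberately toy Python sketch. It is not a real FPE algorithm and is not cryptographically secure (the SecureData products implement vetted FPE modes); it only illustrates the properties that matter here: the output keeps the original format, the mapping is deterministic so referential integrity holds, and it is reversible with the key.

    # Toy keyed digit transform (NOT real FPE, NOT secure) illustrating
    # format preservation, determinism, and reversibility.
    import hmac, hashlib

    def toy_fpe(value: str, key: bytes, decrypt: bool = False) -> str:
        out, i = [], 0
        for ch in value:
            if ch.isdigit():
                # Per-position keystream digit derived from the key.
                k = hmac.new(key, str(i).encode(), hashlib.sha256).digest()[0] % 10
                shift = -k if decrypt else k
                out.append(str((int(ch) + shift) % 10))
                i += 1
            else:
                out.append(ch)  # preserve formatting characters like "-"
        return "".join(out)

    key = b"demo-key"
    masked = toy_fpe("999-88-1234", key)
    print(masked)                              # same shape: ddd-dd-dddd
    print(toy_fpe(masked, key, decrypt=True))  # 999-88-1234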

Obviously Voltage Security, Inc., with the SecureData product family providing Format-Preserving Encryption, believes that this is the best choice for most data obfuscation projects. It is usually trivial to add an encryption step to an existing process that copies data from production to test. In cases where no such process exists, using tables often seems appealing, but it is important to recognize the ongoing management that such a solution requires.
