A casual comment overheard at a recent conference struck me:

“Hackers are pretty good at data discovery.”

While hackers might not agree with that statement, in the context of data protection, it’s a trenchant remark. What it means is that hackers are good at finding the valuable data on a system or network.

“Data discovery” is a neologism, and does not yet have a single, well-accepted meaning. In the context of the remark above, it meant “Looking at a bunch of data sources and determining which ones are of interest”. This applies equally to data protection: before you can protect the data, you need to understand what you have and what needs protection.

If the problem to be solved is tightly focused—for example, achieving PCI DSS compliance by protecting payment card information—this may seem fairly straightforward: Find the PANs and protect them! But it may not be that simple:

  • Are there “Comment” fields in customer records, into which reps may have typed card numbers, perhaps in violation of existing procedures? Security cannot count on everyone following the rules; if it were that easy, we wouldn’t need data protection solutions. And if an auditor notices such a field, you may fail the audit.
  • Are there undocumented processes that use the data? Everyone has had the experience of rolling out a change with plenty of notice and feedback, then having the phone ring after deployment and hearing that something is now broken because nobody knew of a dependency.
  • Can you be sure you’ve heard from all affected constituents? This is a variation of the preceding item, differing in that there may be teams who simply ignore the call for comments. This is their fault, but becomes your problem, since you’re the one who “broke it”.
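To make the “Comment” field problem concrete, here is a minimal sketch of a scan for card numbers hiding in free text. It is an illustration rather than a complete discovery tool: the function names are invented for this example, the pattern assumes PANs of 13 to 19 digits optionally separated by spaces or hyphens, and the Luhn checksum is used only to weed out digit runs that cannot be card numbers.

```python
import re

# Assumption: a candidate PAN is 13 to 19 digits, optionally separated
# by single spaces or hyphens, and not embedded in a longer digit run.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_candidate_pans(text: str) -> list[str]:
    """Return digit strings in free text that look like PANs and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely used test number, not a live card.
comment = "Cust called, card 4111-1111-1111-1111, wants callback"
print(find_candidate_pans(comment))
```

A passing checksum does not prove a value is a card number, and a scan like this will miss numbers that were split or mistyped, so treat the hits as leads to investigate rather than a complete inventory.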

So how do you perform effective data discovery for a data protection project, with reasonable confidence that you’ve covered most of the bases?

The answer is, of course, “it depends”. Installations with well-documented processes and procedures, long-term staff who know the history of why things work the way they do, and good internal communication will find it relatively easy (such sites must exist, right?). But in far too many installations, production relies on a bunch of critical but poorly documented applications, the people who wrote the code are long gone, and some of the groups don’t get along.

This is a challenge, and the usual principles apply: executive sponsorship is critical to success. Yes, this sounds like “Do it or I’m telling Mom”, but if that’s how things have to be, then at least it’s a method that works. If possible, bring the laggards on board with a pilot effort that demonstrates the practicality of data protection. And offering to help them meet the “mandate from above” can sometimes turn them into reluctant allies.

The actual data discovery process involves a certain amount of drudgework: examining database schemas and data set layouts, then determining which programs use the sensitive fields (lather, rinse, repeat). There are software tools that can help with this process; which one best fits a given use case depends heavily on the platforms, languages, and data stores involved. Be careful: simply searching for “data discovery tools” will find a wide variety of products that do very different things.
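As a rough illustration of the schema-examination step, the sketch below walks a database catalog and flags columns whose names suggest card data or free text. It uses SQLite from the Python standard library purely as a stand-in for whatever data store you actually have; the name pattern, the function name, and the sample tables are all assumptions made for this example.

```python
import re
import sqlite3

# Assumption: column names containing these fragments deserve a closer look.
# Short fragments such as "pan" also match innocent names like "company",
# so expect false positives.
SUSPECT_NAME = re.compile(r"card|pan|acct|account|cc_?num|comment|note|memo",
                          re.IGNORECASE)


def flag_columns(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (table, column) pairs whose column names look sensitive."""
    flagged = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, column, *_rest in conn.execute(
                f'PRAGMA table_info("{table}")'):
            if SUSPECT_NAME.search(column):
                flagged.append((table, column))
    return flagged


# Hypothetical sample schema, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers "
             "(id INTEGER, name TEXT, card_number TEXT, comments TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
print(flag_columns(conn))
```

Column names only tell you where to look first. A field called “misc1” can hold card numbers too, which is why a name-based pass needs to be followed by a look at the data itself and at the programs that use it.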

But back to the hackers: In the context of unauthorized system access, “data discovery” means scanning through the data stores, looking for the needles in the haystack. This doesn’t actually mean finding all the needles, however—and finding them all is your challenge in implementing data protection.

We all know that finding the needle in a haystack is difficult; finding one of a thousand needles, not so much. The hacker’s goal is to find one or a few needles; if (s)he misses the other 997, that’s OK. Your goal of finding all of the needles is more difficult, but you have the advantage of legitimate access to schema, copybooks, and colleagues. Of course, you don’t want to report to management, “Yeah, we found and locked down most of the data”, but if you do miss a needle or two, it’s going to be that much harder for the black hats to find in that haystack.

This is also where Format-Preserving Encryption and Tokenization technologies show another, less-than-obvious benefit: if someone is scanning for PANs, looking for 15- and 16-digit numbers, your format-preserved data will pop right up. Even if it’s then obvious that it’s not “good” data—perhaps because the names are also protected, and are obviously not names—all the false positives will slow down the attack. And if they don’t notice, they’ll scurry off thinking they’ve scored the data gold, only to later realize that it is, after all, just data straw. In either case, you have more of a chance to notice and correct the weakness.