Securing Hadoop Data – What Are Your Options?
This blog post was originally published on Thursday, March 12, 2015 by MapR guest writer Carole Murphy.
Securing data is a problem not just for a subset of people but for everyone dealing with sensitive data, from executives and business stakeholders to data scientists and developers. A data security strategy must be defined up front, not later, as part of the roadmap for integrating Hadoop into the enterprise information ecosystem.
Hadoop, like any other platform brought into the data center, will likely hold sensitive data subject to privacy and security regulations and audits. Hadoop brings many benefits to the enterprise, but it is also changing the cyber-attack landscape. Attackers are constantly looking for systems to target, and Hadoop becomes a natural starting point because of all the data stored in it. The following are some technology options to consider in developing a Hadoop data security strategy.
Smaller or early-stage Hadoop deployments may be intended for unstructured data, or at least for data where the structure has not yet been determined. Even though the initial use case might involve documents that do not (yet) contain sensitive data, it usually expands quickly to documents and data with sensitive information in the clear. Determining a strategy should involve prioritizing among:
- Structured data (fields in data streams, feeds or transaction flows)
- Semi-structured data (fields in files, often batch in nature)
- Unstructured data (binary objects, scans, files and documents)
For cases where the structure is unknown in advance, volume-level and/or HDFS file/folder-level encryption should be general practice for any Hadoop deployment – for basic compliance and data-at-rest security. This level of protection is effective for all data types, and access controls can still be maintained over who can see the data in the clear.
But volume-level encryption protects the data only on the disk drive itself and does not prevent unauthorized access to the data from within the system. The data is decrypted automatically when the operating system reads it, so live data is fully exposed to any user of the system, authorized or not. Every read and write invokes decryption or encryption, which adds overhead even if the data being read or written is not sensitive.
HDFS file/folder-level encryption provides more granular protection against unauthorized access by individuals, but the data is still decrypted or encrypted every time a read or write operation touches it. Compared with volume-level encryption, the overhead is even higher, because the encryption and decryption take place at the application level rather than at the operating system level.
Transport-level encryption is another common technique, used to protect data while it is being transferred across a network, and is sometimes called “data-in-motion” or “wire” encryption. The SSL/TLS protocols are an example of this. It is an important data security technique, as data moving “across the wire” is often most vulnerable to interception and theft, whether it is being transferred across clusters in a rack within the datacenter, between nodes in the cloud, or especially across a public network (i.e. the Internet).
However, the protection this method affords is limited to the actual transfer itself. Clear, unprotected data exists at the source, and clear, unprotected data is automatically reassembled at the destination. This provides many vulnerable points of access for the potential thief, and like storage-level volume encryption above, does not protect the data from unauthorized users.
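To make the limits of wire encryption concrete, here is a minimal Python sketch using only the standard library `ssl` and `socket` modules. The function names `build_client_context` and `send_record`, and the host and port passed to them, are hypothetical and chosen for illustration; they are not part of Hadoop, MapR, or any HP product.

```python
import socket
import ssl


def build_client_context() -> ssl.SSLContext:
    """Return a client-side TLS context with certificate checks enabled."""
    # create_default_context() verifies certificates and hostnames by default.
    ctx = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def send_record(host: str, port: int, record: bytes) -> None:
    """Send one record over TLS; host and port are placeholders (assumptions)."""
    ctx = build_client_context()
    with socket.create_connection((host, port)) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            # `record` is plaintext in this process, is encrypted only while
            # on the wire, and is plaintext again once the peer reads it.
            tls.sendall(record)
```

Note that nothing in `send_record` changes the record itself: the sensitive values are in the clear before `sendall` is called and again after the receiver reads them, which is the gap described above.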
Data-Centric Security – for Data at Rest, in Motion, and in Use
A data-centric security approach is quite different from the above techniques. With data-centric security, sensitive field-level data elements are replaced with usable, yet de-identified, equivalents that retain their format, behavior and meaning. This means you modify only the sensitive data elements so that they are no longer real values, and thus are no longer sensitive, but they still look like legitimate data. This format-preserving approach can be used with both structured and semi-structured data. It is also called “end-to-end data protection” and provides an enterprise-wide solution for data protection that extends into Hadoop and beyond the Hadoop environment. The protected form of the data can then be used in subsequent applications, analytic engines, data transfers and data stores. A major benefit is that a majority of analytics can be performed on de-identified data protected with data-centric techniques: data scientists do not need access to live payment card, protected health or personally identifiable information in order to achieve the needed business insights.
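The idea of a replacement value that keeps its format can be illustrated with a toy Python sketch. This is deliberately not HP FPE, not a standardized format-preserving encryption mode, and not safe for production use; it simply shifts each digit by a keyed pseudo-random amount and leaves separators alone. The function names, the demo key and the `tweak` parameter are assumptions made for the example.

```python
import hashlib
import hmac

DIGITS = "0123456789"


def _digit_keystream(key: bytes, tweak: bytes, n: int) -> list:
    """Derive n pseudo-random digits (0-9) from HMAC-SHA256 output."""
    digits = []
    counter = 0
    while len(digits) < n:
        msg = tweak + counter.to_bytes(4, "big")
        block = hmac.new(key, msg, hashlib.sha256).digest()
        # Skip bytes >= 250 so that byte % 10 is evenly distributed.
        digits.extend(b % 10 for b in block if b < 250)
        counter += 1
    return digits[:n]


def protect(value: str, key: bytes, tweak: bytes = b"", decrypt: bool = False) -> str:
    """Toy format-preserving transform: only ASCII digits are changed."""
    n = sum(ch in DIGITS for ch in value)
    stream = iter(_digit_keystream(key, tweak, n))
    sign = -1 if decrypt else 1
    out = []
    for ch in value:
        if ch in DIGITS:
            # Shift the digit modulo 10; subtracting the same shift reverses it.
            out.append(str((int(ch) + sign * next(stream)) % 10))
        else:
            out.append(ch)  # separators such as "-" keep their position
    return "".join(out)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    token = protect("123-45-6789", key, tweak=b"ssn")
    print(token)  # same length and dash positions as the input
    print(protect(token, key, tweak=b"ssn", decrypt=True))
```

Because the output has the same length, character classes and separators as the input, downstream schemas, joins and validations that depend only on format continue to work on the protected value.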
HP SecureData supports data-centric methods for data protection:
- HP Format-Preserving Encryption (FPE) and HP Secure Stateless Tokenization (SST) secure both structured data (fields in data streams, feeds or transaction flows) and semi-structured data (fields in files, often batch in nature).
- Both HP FPE and HP SST use HP Stateless Key Management, which enables keys for encryption and decryption to be securely derived on the fly after authentication. As a result, keys do not need to be persisted at end-points. Datacenters can be equipped with Key Servers enabling automated failover, and avoiding data loss, e-discovery and other complexities of traditional key management.
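The general idea of deriving keys on the fly instead of storing them can be sketched in Python as follows. This is an illustrative assumption, not a description of how HP Stateless Key Management works internally: it derives a per-identity key from a master secret with HMAC-SHA256, so the same inputs always reproduce the same key and nothing needs to be persisted at the end-point. The names `derive_key`, `master_secret` and `identity` are hypothetical.

```python
import hashlib
import hmac


def derive_key(master_secret: bytes, identity: str, purpose: str = "fpe") -> bytes:
    """Derive a 32-byte key for (identity, purpose) from a master secret.

    Hypothetical sketch: a key server would run this only after the caller
    has authenticated, and would never hand out `master_secret` itself.
    """
    # Length-prefix the purpose so ("ab", "c") and ("a", "bc") cannot collide.
    purpose_bytes = purpose.encode("utf-8")
    info = len(purpose_bytes).to_bytes(2, "big") + purpose_bytes + identity.encode("utf-8")
    return hmac.new(master_secret, info, hashlib.sha256).digest()


if __name__ == "__main__":
    master = b"example-master-secret"  # placeholder value for the demo
    k1 = derive_key(master, "analytics-team@example.com")
    k2 = derive_key(master, "analytics-team@example.com")
    print(k1 == k2)  # the key is re-derived on demand rather than stored
```

Since the derived key is a pure function of the master secret and the identity, any key server holding the same master secret can answer the same request, which is one way to read the failover property mentioned above.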
HP Security Voltage integrations with Sqoop, UDFs, MapReduce and Hive enable data-centric security in Hadoop. These data-centric technologies provide granular, policy-controlled field-level encryption and tokenization from a single data protection platform, with unified key management and the ability to use existing Identity and Access Management (IAM) systems, including AD and LDAP.
HP Security Voltage is a certified technology partner with MapR.
Global Product Marketing, HP Security Voltage