How to Secure Data in Hadoop: An Interview with Sudeep Venkatesh

Recently, Sudeep Venkatesh, Vice President of Solution Architecture, HP Security Voltage, and noted expert in data protection solutions, was interviewed by Data Informed Magazine for an article titled, “How to Secure Data in Hadoop.” Data Informed is the leading resource for business and IT professionals looking for news and best practices to plan and implement their data analytics and management strategies.


In the interview Sudeep explained why Hadoop, an open-source software framework for storing and processing big data, has become so popular. The unique architecture of Hadoop enables “organizations to gain new analytic insights and operational efficiencies through the use of multiple-standard, low-cost, high-speed, parallel processing nodes operating on very large sets of data.”

The use of Hadoop accomplishes two goals: massive data storage and faster processing of that data. However, Sudeep pointed out that big data security was not a primary goal in the design of Hadoop, and that it has a reputation of “being difficult to secure.” He cited a variety of reasons for this. Hadoop has its roots in the open-source community, which has resulted in unprecedented flexibility, performance, and scalability in the handling of huge amounts of data – any kind of data – quickly. But because a Hadoop deployment ingests data from multiple enterprise systems in real time, it creates a data lake that is “accessed by many different users with varying analytics needs. Organizations face the risk of even further reduced control if Hadoop clusters are deployed in a cloud environment.”

Security Paramount with Big Data

Sudeep pointed out that when big data is used in an enterprise environment, “the importance of security becomes paramount.” The reason, Sudeep explains, is that “organizations must protect sensitive customer, partner, and internal information and adhere to an ever-increasing set of compliance requirements.”

“There are a number of traditional IT security controls that should be put in place as the basis for securing Hadoop,” says Sudeep, “such as standard perimeter protection of the computing environment and monitoring user and network activity with log management. But even in the most tightly controlled computing environments, infrastructure protection by itself cannot protect an organization from cyberattacks and data breaches.”

Hadoop a Target

Hadoop is an alluring target for hackers and data thieves, and too open to fully protect. Sudeep states that this “presents brand new challenges to data risk management,” and that new methods of data protection “are essential to prevent these potentially huge big data exposures.”

Sudeep points out that traditional data de-identification approaches can be deployed to improve security in the Hadoop environment, such as storage level encryption, traditional field-level encryption, and data masking. However, he cautions, “each of these approaches has limitations.”
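The interview names data masking as one of these traditional approaches; a minimal sketch of what masking looks like in practice might be the following (the field format and masking rule are illustrative assumptions, not part of the interview):

```python
def mask_pan(pan: str) -> str:
    """Mask a payment card number, keeping only the last four digits visible.

    Non-digit separators (spaces, dashes) are preserved, so the masked value
    keeps the original layout. Illustrative only: masking is one-way and
    destroys the original value, which is one of the limitations Sudeep notes.
    """
    total_digits = sum(c.isdigit() for c in pan)
    digits_seen = 0
    out = []
    for c in pan:
        if c.isdigit():
            digits_seen += 1
            # Keep only the final four digits; replace the rest with '*'.
            out.append(c if digits_seen > total_digits - 4 else "*")
        else:
            out.append(c)
    return "".join(out)

print(mask_pan("4111-1111-1111-1234"))  # ****-****-****-1234
```

Because the masked value is no longer the real number, analytics that need to join on or decrypt the original field cannot use it, which is the kind of limitation the interview alludes to.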

Hadoop brings many benefits, but it is also changing the cyberattack landscape. Hadoop becomes a prime target for hackers because of all the sensitive and valuable data stored in it: credit card numbers, Social Security numbers, bank statements, and health records. Sudeep advises, “A data security strategy must be defined up front, not later, as part of the roadmap for integrating Hadoop into the enterprise information ecosystem.”

Determine a Strategy

Determining a strategy should involve prioritizing the protection of the following data types:

  • Structured data (fields in data streams, feeds or transaction flows)
  • Semi-structured data (fields in files, often batch in nature)
  • Unstructured data (binary objects, scans, files and documents)

Data in Hadoop exists in three different forms: data-in-use, data-at-rest, and data-in-motion. “Be aware that data-at-rest protection does not secure data-in-motion, leaving the potential for major compliance and exploitable security gaps. Make sure your security posture includes in-motion protection,” advises Sudeep.

Data-Centric Strategy

For best practices, Sudeep calls for a data-centric security approach, which allows for de-identifying the data as close to its source as possible. “With data-centric security,” explains Sudeep, “sensitive field-level data elements are replaced with usable, yet de-identified, equivalents that retain their format, behavior, and meaning. This means you modify only the sensitive data elements so they are no longer real values – and thus are no longer sensitive – but they still look like legitimate data.”

“This format-preserving approach can be used with both structured and semi-structured data,” continues Sudeep. “This is also called ‘end-to-end data protection’ and provides an enterprise-wide solution for data protection that extends into Hadoop and beyond the Hadoop environment. This protected form of the data can then be used in subsequent applications, analytic engines, data transfers, and data stores.”
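As a rough illustration of the format-preserving idea Sudeep describes (this is a toy keyed substitution, not Voltage's actual product, which implements true, reversible format-preserving encryption; the key and sample value are hypothetical):

```python
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical key; real deployments use managed keys


def deidentify_digits(value: str, key: bytes = SECRET) -> str:
    """Replace each digit with a keyed pseudorandom digit, preserving length,
    separators, and overall format.

    Deterministic: the same input always yields the same token, so joins and
    analytics on the de-identified field still work. NOTE: this sketch is
    one-way; real format-preserving encryption (e.g., NIST FF1) is reversible
    with the key.
    """
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for c in value:
        if c.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(c)
    return "".join(out)


ssn = "123-45-6789"
token = deidentify_digits(ssn)
print(token)  # same shape as a real SSN: ddd-dd-dddd, but no longer sensitive
assert deidentify_digits(ssn) == token  # deterministic, so joins still work
```

Because the token retains the format, behavior, and referential consistency of the original field, downstream applications and analytic engines can consume it unchanged, which is the “end-to-end” property described above.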

De-Identified Data is Protected Data

For true Hadoop security, Sudeep advises, “the best practice is to never allow sensitive information to reach the HDFS (Hadoop Distributed File System) in its live and vulnerable form. De-identified data in Hadoop is protected data. Even in the event of a data breach, it yields nothing of value, avoiding the penalties and costs such an event otherwise would have triggered.”
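The practice of de-identifying records before they reach HDFS can be sketched as a small ingestion step; the field names, tokenizer, and record shape below are hypothetical stand-ins, not part of the interview:

```python
import hashlib
import json

SENSITIVE_FIELDS = {"ssn", "card_number"}  # hypothetical field names


def tokenize(value: str) -> str:
    """Stand-in tokenizer: a truncated one-way hash.

    A real deployment would use format-preserving encryption or a token
    vault so the value stays usable (and, if needed, recoverable).
    """
    return hashlib.sha256(value.encode()).hexdigest()[:12]


def protect(record: dict) -> dict:
    """De-identify sensitive fields before the record ever reaches HDFS."""
    return {k: tokenize(v) if k in SENSITIVE_FIELDS else v
            for k, v in record.items()}


record = {"name": "A. Customer", "ssn": "123-45-6789", "balance": "1024.50"}
safe = protect(record)
assert safe["ssn"] != record["ssn"]  # live value never enters the cluster
print(json.dumps(safe))              # only the protected form is written out
```

Even if this record is later exposed in a breach, only the de-identified token leaks, which is the point of the best practice above.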
