Taking advantage of HPE Vertica and HPE SecureData
Data-Centric Security with HPE: Vertica and SecureData
Does your organization use the SQL on Hadoop capabilities provided by Vertica? Are you concerned about protecting PII/PHI data when it flows across your Hadoop and Vertica clusters? Are you looking for a security solution that will allow you to meet PCI and other industry compliance and data privacy regulations?
You can take advantage of the recent integration of HPE Vertica and HPE SecureData. This integration allows you to keep your data protected when it flows across multiple systems spread across multiple data centers on premise or in the cloud.
This blog describes how, using Vertica SQL on Hadoop, we tested the Vertica integration with HPE SecureData. We protected the data in the Hadoop layer and then accessed the data in Vertica. The test results are described in detail in this blog.
Before explaining how the integration works, let’s see how big data systems add to the complexities of protecting data.
Big Data Security Challenges
MPP systems like Vertica can store and process huge amounts of data in real time, but that involves hundreds of nodes spread in the same or different data centers.
Security for data at rest ensures that data is protected when it is present on the disk. But security at rest does not protect you when data is flowing across such systems, or when data is in use for analytics and business processes.
Data lakes are becoming very popular these days. Enterprise organizations are trying to store data in a single place like Hadoop HDFS and use Vertica SQL on Hadoop engine capabilities. With Vertica for SQL on Hadoop, you can run advanced SQL queries on data that resides in your Hadoop cluster. . End-to-end data protection comes to the rescue to ensure compliance with industry guidelines.
Let’s see how data-centric security helps you protect your data.
What is Data-Centric Security?
Data-centric security techniques protect data as early as possible—when it is generated or when it is in the landing zone in HDFS—and ensures that it remains protected when accessed with Vertica for SQL on Hadoop.
Hewlett Packard Enterprise offers one data-centric solution—HPE SecureData—which allows you to protect your sensitive data in a data lake like Hadoop, Vertica, and other big data platforms. HPE SecureData is a data-centric framework that protects data end to end, and enables PCI compliance, scope reduction, and secures PII/PHI for data privacy.
- Performs encryption and tokenization using identity-based key management systems to protect your enterprise data across big data systems.
- Preserves the format of the data when it is protected. You are not required to change your applications or schema, as you must with traditional encryption technologies like AES.
- Maintains the referential integrity of the data and allows most of the analytical operations to be performed on the protected data itself.
So how does Vertica take advantage of data-centric security with HPE SecureData?
Achieving Protection in Hadoop and Vertica using HPE SecureData
HPE Secure Data provides developer templates that you can leverage to protect your data at multiple levels in the Hadoop technology stack, for example, MapReduce, Sqoop, or Hive.
Vertica can access the protected data residing on HDFS using its ORC parser and can perform analytics on that data without being copied data into Vertica. Vertica has developed UDFs that make use of HPE SecureData client libraries to encrypt or tokenize or even retrieve the plaintext.
The following figure illustrates how all these systems work together.
For our testing, we used a three-node HDP 2.3 cluster and a three-node Vertica 7.2.x cluster.
We protected sample files containing PII information in Hadoop during this test. Vertica performed most of the queries on the protected data. Vertica also used UDFs to access that protected data if required. To learn how the Vertica UDFs can be configured, see the following Vertica Knowledge Base article: Vertica Integration with Secure Data-Protecting Enterprise Data.
Protection in Hadoop
HPE SecureData provides templates for Hadoop that include sample code that demonstrates how data can be protected in Hadoop. To learn how to configure these templates, refer to the HPE SecureData Hadoop Developer Guide that ships with HPE SecureData.
The plaintext.csv and creditscore.csv files contained the PII information for the customer and the credit score data.
plaintext.csv has entries that contain PII information such as SSNs and credit card numbers, as in the following example:
1,Tyshawn Medhurst,Verl Plaza, New Lianemouth,LA,44638,(405)920-0731,oheidenreich@gma
creditscore.csv has entries like the following example:
As part of the testing, we had to copy these CSV files, along with vsconfig.properties, using copy-sample-data-to-hdfs.sh, which is located in the bin directory in the Hadoop templates. The vsconfig.properties file has information about the HPE SecureData server, simple API, or WS configuration, along with the format information that has to be applied to the sample data.
You can protect data using any of the following, depending on your use case:
Run run-mr-protect-job.sh to do the protection using MapReduce. It performs protection on the sample data stored in HDFS.
hadoop jar voltage-hadoop.jar com.voltage.securedata.hadoop.mapreduce.Protector \ -libjars support-3rdparty-libs.jar,../simpleapi/vibesimplej
ava.jar \ voltage/mr-sample-data \ voltage/protected-sample-data
This job uses the information in the vconfig.properties to apply protection like:
# ssn column mr.field.3.column.index = 11 mr.field.3.format.name = ssn mr.field.3.api = simpleapi mr.field.3.auth.identity = firstname.lastname@example.org mr.field.3.auth.method = SharedSecret mr.field.3.auth.sharedSecret = voltage123
You can run the Sqoop job to import data from operational database sources like MySQL, as in the following example. Make sure to run codegen.sh before running run-sqoop-import.sh:
Run run-sqoop-import.sh: DATABASE_HOST=xxx.xxx.xxx.xxx DATABASE_NAME=voltage TABLE_NAME=voltage_sample DATABASE_USERNAME=root set -x #echo on hdfs dfs -rm -r voltage/protected-sqoop-import sqoop import \ -libjars com.voltage.sqoop.DataRecord.jar,support-3rdparty-
libs.jar,../simpleapi/vibesimplejava.jar \ --username $DATABASE_USERNAME \ -P \ --connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \ --table $TABLE_NAME \ --jar-file voltage-hadoop.jar \ --class-name com.voltage.securedata.hadoop.sqoop.SqoopImportPro tector \ --target-dir voltage/protected-sqoop-import exit $?
This job also uses vsconfig.properties to apply protection:
# ssn column sqoop.field.3.column.name = ssn sqoop.field.3.format.name = ssn sqoop.field.3.api = simpleapi sqoop.field.3.auth.identity = email@example.com sqoop.field.3.auth.method = SharedSecret sqoop.field.3.auth.sharedSecret = voltage123
- Hive UDFs
Run create-hive-table.hql to protect and create Hive tables storing protected data.
# ssn data hive.field.3.alias.name = ssn hive.field.3.format.name = ssn hive.field.3.api = simpleapi hive.field.3.auth.identity = firstname.lastname@example.org hive.field.3.auth.method = SharedSecret hive.field.3.auth.sharedSecret = voltage123
Accessing Data in Vertica Using UDFs
After data has been protected in the Hadoop technology stack, Vertica can access it using the ORC parser. In addition, you can access the Vertica UDFs through any of the connectors. To use the Vertica ORC parser to access the data, you must store the data in the ORC file format:
=> CREATE TABLE voltage_sample_creditscore_orc (ssn string,creditscore string) STORED AS ORC; => INSERT OVERWRITE TABLE ssn STRING, creditscore STRING SELECT * FROM voltage_sample_creditscore; => CREATE EXTERNAL TABLE voltage.voltage_sample_creditscore_orc (ssn VARCHAR, creditscore VARCHAR) AS COPY FROM 'webhdfs://partg9-004:50070/apps/hive/warehouse/vo
ltage_sample_creditscore_orc/*' ON ANY NODE ORC;
Similarly, you create voltage_sample_plaintext_orc for the plaintext data.
Since referential integrity is maintained, you can run your queries to perform joins on the protected data itself:
=> SELECT s.id, s.name, s.email, s.birth_date, s.cc, s.ssn, cs.creditscore FROM voltage_sample_orc s JOIN voltage_sample_creditscore_orc cs ON (s.ssn = cs.ssn)
You can even filter and access the plaintext values in Vertica as follows:
=> SELECT s.id, s.name,s.email, s.birth_date, s.cc, ACCESS(s.ssn USING PARAMETERS FORMAT ='SSN') FROM voltage_sample s WHERE s.ssn = PROTECT('675-03-4941' USING PARAMETERS FORMAT ='SSN');
Now you understand how to leverage Vertica and HPE SecureData to protect your data and achieve data-centric security.
For more information, visit the following sites:
- HPE SecureData Enterprise
- HPE SecureData Suite for Hadoop
- HPE Vertica Home Page
- HPE Vertica Integration with HPE SecureData – Protecting Enterprise Data
- Vertica Documentation on ORC: Reading ORC Files Directly
- Video on ORC and Vertica: HP Vertica 7.1.2 New Features: Reading Apache Hive ORC Files
This post first appeared on the HPE Dev Community/Big Data & Analytics blog and is written by Praveen Jain, Systems Engineer – Big Data/HPE Vertica.