How to Load Unstructured Data into Apache Hive

Aniket Rajput

Data Engineering

Tags:

Apache Hive

Apache Hadoop

In today’s world, a lot of data is being generated daily. To process data that is large and very complex, traditional tools can’t be used. Huge volumes of complex data is simply called Big Data. Converting this raw data into meaningful insights, organizations can make better decisions with their products. We need a dedicated tool to help this raw data to be converted into meaningful data or knowledge. Thankfully, there are certain tools that can help.

Hadoop is one of the most popular frameworks used to process and store Big Data. Hive, in turn, is a tool that is designed to be used alongside Hadoop. In the blog, we are going to discuss the different ways we can load semi-structured and unstructured data into Hive. We will also be discussing what Hive is and how it works. How does the performance of Hive differ from working with structured vs. semi-structured vs. unstructured data?

What is Hive?

Hive is a data warehousing infrastructure tool developed on top of the Hadoop Distributed File System(HDFS). Hive can be used on top of any DFS.) Hive uses Hive query language(HQL), which is very much similar to structured query language(SQL). If you are familiar with SQL, then it is much easier to get started with HQL.

It is used for data querying and analysis over a large amount of data distributed over the Hadoop Distributed File System(HDFS). Hive supports reading, writing, and managing a large amount of data that is residing in the Hadoop Distributed File System(HDFS). Hive is mostly used for structured data but in this blog, we will see how we can load unstructured data.

Initially, Hive was developed at Facebook(Meta), and later it became an open-source project of Apache Software Foundation.

How does Hive work?

Source - AnalyticsVidya

Hive was created to allow non-programmers familiar with SQL to work with large datasets, using an HQL interface that is similar to SQL interface. Traditional databases are designed for small or medium datasets and not large ones. But Hive uses a distributed file system and batch processing to process large datasets very efficiently.

Hive transforms HQL queries into one or more Map-Reduce jobs or Tez jobs, and then these jobs run on Hadoop’s scheduler, YARN. Basically, HQL is an abstraction over Map-Reduce programs. After the execution of the job/query, the resulting data is stored in HDFS.

What is SerDe in Hive?

SerDe is short for "Serializer and Deserializer" in Hive. It's going to be an important topic for this blog. So, you should have a basic understanding of what SerDe is and how it works.

If not, don't worry, first of all, we will understand what Serialization and Deserialization is. When an object is converted into a byte stream, it’s into a binary format so that the object can be transmitted over a network or written into persistent storage like HDFS. This process of converting data objects into byte streams is called Serialization.

Now, we can transmit data objects or write data objects into persistent storage. But how can we receive transmitted data over the network again in a meaningful way because we will not be able to understand binary data properly? So, the process of converting byte stream or binary data back into objects is called Deserialization.

In Hive, tables are converted into row objects and row objects are written into HDFS using a Built-in Hive Serializer. And these row objects are converted back into tables using a Built-in Hive Deserializer.

Built-in SerDes:

• Avro (Hive 0.9.1 and later)

• ORC (Hive 0.11 and later)

• RegEx

• Thrift

• Parquet (Hive 0.13 and later)

• CSV (Hive 0.14 and later)

• JsonSerDe (Hive 0.12 and later in hcatalog-core)

Now, suppose we have data in a format that Hive’s Built-in SerDe can’t process. In such a scenario, we can write our own custom SerDe. Here, we will discuss how the row is converted into a table and vice versa. But writing your own custom SerDe is a complicated and complex process.

There is another way: we can use RegexSerDe. RegexSerDe uses a regular expression to serialize and deserialize the data using regex. RegexSerDe extracts groups as columns. Here, group means regular expressions capturing groups. In a regular expression, capturing groups are a way to treat multiple characters as a single unit. Groups can be created using placing parentheses. For example, the regular expression “(velotio)” creates a single group containing the characters “v,” “e,” “l,” “o,” ”t,” “i,” and “o.”

This is just an overview of SerDe in Hive, but you can deep dive into SerDe. Also, the following image shows the flow of How Hive reads and writes records.

Source: Dummies.com

Types of data :

Big data can be classified in three ways

Structured data:

The data that can be organized into a well-defined structure is called Structured Data. Structured data can be easily stored, read, or transferred in the same defined structure. The best example of structured data is the table stored in Relational Databases. Tables have columns and rows that define a well-organized and fixed structure to data. Another example of structured data is an Excel file. An Excel file also has rows and columns that define a proper structure to data.

Source: O'Reilly

Semi-structured data:

The data that can not be organized into a fixed structure like a table but can be represented with properties such as tags, metadata, or other markers that separate data fields are called semi-structured data. Examples of semi-structured data are JSON and XML files. JSON files contain “key” and “values” pairs, where the key is a tag, and the value is actual data to be stored.

Source: Software Testing Help

Unstructured data:

The data which can not be organized into any structure is called unstructured data. The social media messages fall under the unstructured data category as they can not be organized into either a fixed structure like a table or even with tags or markers that will separate data fields. More examples of unstructured data are text files, multimedia content like images and videos.

Source: Fluxicon

Performance impact of working with structured vs, semi-structured vs, unstructured data

Storage:

Structured data is always stored in RDBMS. Structured data have a high organization level among all three. Semi-structured data has no schema but has some properties or tags. Structured data have less organization level compared to structured data but higher organization level than unstructured data. While unstructured data has no schema, so it has the lowest organization level.

Data manipulation:

Data manipulation includes updating and deleting data. Consider an example where we want to update the name of a student using his roll number. Data manipulation in structured data is easy to perform as we have defined structure and we can manipulate specific records very easily. We can easily update the student's name using his roll number in structured data. Whereas in unstructured data, there is no schema available, so it is not easy to manipulate data in unstructured data as compared to structured data.

Searching of data:

Searching for particular data in structured data is easy compared to searching for data in unstructured data. In unstructured data, we will need to go through all lines, and each word and searching of data in unstructured data will get complex. Searching data in semi-structured data is also easy as we just need to specify the key to get the data.

Scaling of data:

Scaling structured data is very hard. It can be scaled vertically by adding an existing machine’s RAM or CPU, but scaling it horizontally is hard to do. However, scaling semi-structured data and unstructured data is easy.

Data sets we are using:

1) Video_Games_5.json

This dataset contains product item reviews and metadata from Amazon, including 142.8 million audits spreading over May 1996 - July 2014. Reviews (ratings, text, helpfulness votes), item metadata (depictions, category information, price, brand, and image features), and links (also viewed/also bought graphs) are all included in this dataset. The dataset represents data in a semi-structured manner.

The following image shows one record of the entire dataset.

	{
	"reviewerID": "A2HD75EMZR8QLN",
	"asin": "0700099867",
	"reviewerName": "123",
	"helpful": [
	8,
	12
	],
	"reviewText": "Installing the game was a struggle (because of games for windows live bugs).Some championship races and cars can only be \"unlocked\" by buying them as an addon to the game. I paid nearly 30 dollars when the game was new. I don't like the idea that I have to keep paying to keep playing.I noticed no improvement in the physics or graphics compared to Dirt 2.I tossed it in the garbage and vowed never to buy another codemasters game. I'm really tired of arcade style rally/racing games anyway.I'll continue to get my fix from Richard Burns Rally, and you should to. :)http://www.amazon.com/Richard-Burns-Rally-PC/dp/B000C97156/ref=sr_1_1?ie=UTF8&qid;=1341886844&sr;=8-1&keywords;=richard+burns+rallyThank you for reading my review! If you enjoyed it, be sure to rate it as helpful.",
	"overall": 1,
	"summary": "Pay to unlock content? I don't think so.",
	"unixReviewTime": 1341792000,
	"reviewTime": "07 9, 2012"}

view raw video_games_dataset.json hosted with ❤ by GitHub

2) sparkLog.txt

Apache Spark (https://spark.apache.org) is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Currently, Spark has been widely deployed in the industry.

The Dataset / Log Set was collected by aggregating logs from the Spark system in a lab at CUHK, which comprises a total of 32 machines. The logs are aggregated at the machine level. Logs are provided as-is without further modification or labeling, which involve both normal and abnormal application runs.

The dataset represents data in an un-structured manner.

	17/06/09 20:10:40 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
	17/06/09 20:10:40 INFO spark.SecurityManager: Changing view acls to: yarn,curi
	17/06/09 20:10:40 INFO spark.SecurityManager: Changing modify acls to: yarn,curi
	17/06/09 20:10:40 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, curi); users with modify permissions: Set(yarn, curi)
	17/06/09 20:10:41 INFO spark.SecurityManager: Changing view acls to: yarn,curi
	17/06/09 20:10:41 INFO spark.SecurityManager: Changing modify acls to: yarn,curi
	17/06/09 20:10:41 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, curi); users with modify permissions: Set(yarn, curi)
	17/06/09 20:10:41 INFO slf4j.Slf4jLogger: Slf4jLogger started
	17/06/09 20:10:41 INFO Remoting: Starting remoting
	17/06/09 20:10:41 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@mesos-slave-07:55904]

view raw spark_log_dataset.txt hosted with ❤ by GitHub

Now, we will look into different ways of loading unstructured data into Hive.

How to load semi-structured data into Hive?

1) Using Spark

If you are aware of Spark, loading semi-structured data into the spark is very easy. Spark can read JSON files, XML files, and convert them into Spark DataFrame. In Spark, DataFrame is a distributed collection of data that is organized into columns and rows. It is logically similar to tables in relational databases.

Now, we have our semi-structured data in an organized way. We can now write this organized DataFrame into Hive as a table from Spark.

Below is the code to read the JSON file and write it as a table in Hive.

‍

	from pyspark.sql import SparkSession

	## creating sparkSession to get entrypoint to spark application
	sparkSession = SparkSession\
	.builder\
	.appName('Write_table_to_hive')\
	.enableHiveSupport()\
	.getOrCreate()

	## reading data from dataset "Video_Games_5.json"
	GamesReviewDataFrame = sparkSession.read.format("json") \
	.format("json") \
	.option("path", "/home/velotio/Downloads/UnstructuredData/Video_Games_5.json")\
	.load()

	## we can modify data the way we want to represent in table here
	GamesReviewDataFrame.show()

	## writing dataframe "GamesReviewDataFrame" as a table in HIVE.
	GamesReviewDataFrame.write.saveAsTable("GameReviewTable")

	sparkSession.stop()

view raw read_from_hive_program.py hosted with ❤ by GitHub

Output for above code:

As you can see in the output, a few records of the DataFrame are displayed in an organized table.

2) Using built-in SerDe, JSON SerDe

Hive provides us with a few built-in SerDe. Using this built-in SerDe, we can load data into Hive. In our case, we have used the Video_Games_5.json file as a dataset for semi-structured data, which is a JSON file. So, we will be using built-in JsonSerDe to load Video_Games_5.json data into Hive. This JsonSerDe can be used to read data in JSON format.

We will need to add JsonSerDe.jar to Hive.

You can download JsonSerDe here.

1) Copy dataset Video_Games_5.json from the local file system to the docker container.

To load data into the Hive table, we need to copy the dataset Video_Games_5.json into HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the Local File System to the docker container.

docker cp /home/velotio/Downloads/UnstructuredData/Video_Games_5.json 0fde53f41006:/aniket

view raw json_serde_step_1.txt hosted with ❤ by GitHub

2) Copy dataset Video_Games_5.json from a docker container to the HDFS file system.

	ls
	ls /aniket
	hdfs dfs -ls /
	hdfs dfs -ls /aniket
	hdfs dfs -put /aniket/Video_Games_5.json /aniket
	hdfs dfs -ls /aniket

view raw json_serde_step_2.txt hosted with ❤ by GitHub

3) Copy json-serde.jar from the local file system to the docker container

To use JsonSerDe, add the json-serde.jar file to Hive so that Hive can use it.

We store this json-serde.jar file to HDFS storage where our dataset is also present. As Hive is running on top of HDFS, we can access the HDFS path from Hive. But to store json-serde.jar on HDFS, the file needs to be present in the docker container. For that, we copy json-serde.jar to the docker container first.

docker cp /home/velotio/Downloads/json-serde-1.3.7.3.jar 0fde53f41006:/aniket

view raw json_serde_step_3.txt hosted with ❤ by GitHub

4) Copy json-serde.jar from a docker container to the HDFS file system.

	ls /aniket
	hdfs dfs -ls /aniket

	hdfs dfs -put /aniket/json-serde-1.3.7.3.jar /aniket
	hdfs dfs -ls /aniket

view raw json_serde_step_4.txt hosted with ❤ by GitHub

5) Add json-serde.jar file to Hive

ADD JAR hdfs:///aniket/json-serde-1.3.7.3.jar;

view raw json_serde_step_5.txt hosted with ❤ by GitHub

6) Create Hive table GameReviews

To load data into Hive and define the structure of our data, we must create a table in Hive before loading the data. The table holds the data in an organized manner.

While creating the table, we are specifying “row format serde,” which tells Hive to use the provided SerDe for reading and writing Hive data.

	create table GameReviews(
	reviewerID string,
	asin string,
	reviewerName string,
	helpful array<int>,
	reviewText string,
	overall int,
	summary string,
	unixReviewTime int,
	reviewTime string
	)
	row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
	stored as textfile;

view raw json_serde_step_6.txt hosted with ❤ by GitHub

‍

7) Load data from the Video_Games_5.json dataset into the table.

We are loading data from Video_Games_5.json into the Hive table. With the help of SerDe provided while creating a table, Hive will parse this data and load it into the table.

load data inpath '/aniket/Video_Games_5.json' into table GameReviews;

view raw json_serde_step_7.txt hosted with ❤ by GitHub

8) Check data from the table.

Just cross-check if the data is loaded properly into the table.

select reviewerID,asin,reviewerName,overall from GameReviews limit 10;

view raw json_serde_step_8.txt> hosted with ❤ by GitHub

How to load unstructured data into Hive ?

1. Using Regex SerDe

For unstructured data, the built-in SerDe can’t work with excluded RegxSerDe. To load unstructured data into Hive, we can use RegexSerde. First of all, we will need to figure out what unstructured data is useful. After knowing what data is useful, we can extract data using pattern matching. For that, we can use regular expressions. With regular expressions, we will load unstructured data of the SparkLog.txt dataset into Hive.

In our case, we are going to use the following regular expression:

"([0-9]{2}/[0-9]{2}/[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [a-zA-Z]* ([a-zA-Z0-9.]*): (.*)$"

“([0-9]{2}/[0-9]{2}/[0-9]{2})”: First group in Regular Expression matches date values.

“([0-9]{2}:[0-9]{2}:[0-9]{2})”: Second Group in Regular Expression matches timestamp values.

“[a-zA-Z]*”: This pattern matches any string with multiple occurrences of char a to z and A to Z; this pattern will be ignored in the Hive table as we are not collecting this pattern as a group.

“([a-zA-Z0-9.]*):”: Third group in regular expression matches with multiple occurrences of char a to z, A to Z, 0 to 9 and “.”

“(.*)$”: Fourth and last group matches with all characters in the remaining string.

1) Copy dataset sparkLog.txt from local file system to docker container.

To load data into Hive table, we need dataset sparkLog.txt into HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the Local File System to the docker container first.

docker cp /home/velotio/Downloads/UnstructuredData/sparkLog.txt 6d94029a1f34:/aniket

view raw regx_serde_step_1.txt hosted with ❤ by GitHub

2) Copy dataset sparkLog.txt from a docker container to HDFS file system.

	ls
	ls /aniket
	hdfs dfs -ls /
	hdfs dfs -ls /aniket
	hdfs dfs -put /aniket/sparkLog.txt /aniket
	hdfs dfs -ls /aniket

view raw regx_serde_step_2.txt hosted with ❤ by GitHub

3) Create a Hive table sparkLog.

To load data into Hive and define the structure to our data, we must create a table in Hive before loading the data. The table holds the data in an organized manner.

While creating the table, we are specifying “row format SerDe,” which tells Hive to use the provided SerDe for reading and writing Hive data. For RegexSerDe, we must specify serdeproperties: "input.regex" and "output.format.string."

	create table sparkLog
	(
	datedata string
	,time string
	,component string
	,action string
	)
	row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
	with serdeproperties
	(
	"input.regex" = "([0-9]{2}/[0-9]{2}/[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [a-zA-Z]* ([a-zA-Z0-9.]): (.)$",
	"output.format.string" = "%1$s %2$s %3$s %4$s"
	)
	STORED AS TEXTFILE;

view raw regx_serde_step_3.txt hosted with ❤ by GitHub

4) Load data from the sparkLog.txt dataset into the table.

We are loading data from sparkLog.txt into the Hive table. With the help of the SerDe provided while creating the table, Hive will parse this data and will load it into the table.

load data inpath '/aniket/sparkLog.txt’ into table sparkLog;

view raw regx_serde_step_4.txt hosted with ❤ by GitHub

5) Check the data from the table.

Cross-check if the data is loaded properly into the table.

select * from sparkLog;

view raw regx_serde_step_5.txt hosted with ❤ by GitHub

2) Using HQL functions

For unstructured data, we have already seen how to use RegexSerDe to load unstructured data into Hive. But what if I am not aware of regular expressions or can’t write complex regular expressions to match patterns in a string? There is another way to load unstructured data into Hive using some HQL user-defined functions.

What we need to do is create a dummy table and load unstructured data as it is into Hive in just one column in the table named “line.” We are loading unstructured data into a dummy Hive table column named as a line. The first record of the "line" column will contain the first line of DataSet, and the second record of the line column will contain the second line of DataSet. Like this, the entire Dataset will be loaded into a dummy table.

Now, using HQL user-defined functions on the dummy Hive table, we can write specific data to specific columns into the main table using the “insert into” statement. You should be able to extract the data that you want using HQL user-defined functions.

1) Copy dataset sparkLog.txt from local file system to docker container

To load data into Hive table, we need dataset sparkLog.txt into HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the local file system to the docker container.

docker cp /home/velotio/Downloads/UnstructuredData/sparkLog.txt 6d94029a1f34:/aniket

view raw hql_step_1.txt hosted with ❤ by GitHub

2) Copy the dataset sparkLog.txt from a docker container to HDFS file system.

	ls
	ls /aniket
	hdfs dfs -ls /
	hdfs dfs -ls /aniket
	hdfs dfs -put /aniket/sparkLog.txt /aniket
	hdfs dfs -ls /aniket

view raw hql_step_2.txt hosted with ❤ by GitHub

3) Create Hive table log

We are creating a dummy Hive table as a log. We are specifying “row format delimited lines terminated by ‘/n’,” which tells Hive to consider default value for fields delimiter and ‘/n’ for line delimiter.

	create table if not exists log
	(
	line string
	)
	row format delimited
	lines terminated by ‘\n’
	STORED AS TEXTFILE;

view raw hql_step_3.txt hosted with ❤ by GitHub

4) Load the data from sparkLog.txt dataset into the table log.

We are loading data from sparkLog.txt into Hive table log.

load data inpath '/aniket/sparkLog.txt’ into table log;

view raw hql_step_4.txt hosted with ❤ by GitHub

5) Create Hive table log sparkLog.

We are creating a Hive table sparkLog to keep our organized data. This organized data will be extracted from a dummy Hive table log.

	create table sparkLog
	(
	datedata string
	,time string
	,component string
	,action string
	)
	row format delimited
	lines terminated by ‘\n’
	STORED AS TEXTFILE;

view raw hql_step_5.txt hosted with ❤ by GitHub

6) Parse the data from log table using a case statement and insert records into the sparkLog table.

We are using HQL user-defined functions to get the specific data and inserting this data into our sparkLog table using insert into statement.

	insert into sparkLog select
	split(line, ' ')[0] as datedata,
	split(line, ' ')[1] as timedata,
	split(split(line, ': ')[0],' ')[3] as component,
	split(line, ': ')[1] as action
	from log ;

view raw hql_step_6.txt hosted with ❤ by GitHub

7) Check data from the table.

Crosscheck if the data is loaded properly into the table.