Using relational data stores with Hadoop: Hive & Sqoop

HIVE

You can actually make your Hadoop cluster look like a relational database through a technology called Hive. And there are also ways of integrating your Hadoop cluster with a MySQL database.


What is Hive?

Hive lets you write standard SQL queries that look just like the ones you'd run on MySQL, but it actually executes them on data stored across your entire cluster, for example on HDFS.


Why Hive


Why not Hive


It's not really meant to be hit with tons of queries all at once from a website or something like that; that's where you'd use something like HBase instead.

Hive is a bunch of smoke and mirrors to make it look like a database, so you can issue SQL queries on it, but it isn't really.

HiveQL


https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963170#overview


For example, a query to count ratings per movie can be written as:

SELECT names.title, COUNT(ratings.movieID) AS ratingCount
FROM ratings JOIN names ON ratings.movieID = names.movieID
GROUP BY ratings.movieID, names.title
ORDER BY ratingCount DESC;

We used views in Hive to split up a more complicated query into something more readable and more manageable.
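
A sketch of that pattern, reconstructing the view from the topMovieIDs name used below (column names assumed from the ratings and names tables above):

CREATE VIEW topMovieIDs AS
SELECT movieID, COUNT(movieID) AS ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC;

SELECT names.title, topMovieIDs.ratingCount
FROM topMovieIDs JOIN names ON topMovieIDs.movieID = names.movieID;

The view holds the grouped counts, so the final query reads as a simple two-table join.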

Executing it takes a little while; remember, this isn't really a relational database, so things like joins are kind of painful when you're not dealing with normalized data.

After it runs, you'll see the newly created view appear in the database.


Your views are persistent; they're stored to disk.

If you run that again at this point, it will error out; instead you can write: CREATE VIEW IF NOT EXISTS topMovieIDs ...

Just to clean up our mess and not leave that around: DROP VIEW topMovieIDs;

How does Hive work?


Schema on read


One of the basic concepts of Hive is something called schema on read, and this is what separates it from a more traditional database. A real relational database uses something called schema on write: you define the schema of your database before you load the data into it, and it's actually enforced at the time you write the data to disk.

Hive flips that on its head. It takes unstructured data and just sort of applies a schema to it as it's being read instead.

So Hive isn't creating some structured relational database under the hood (that would be very inefficient); it's just taking the existing data you have on your cluster and imparting a schema to it when it's read in.
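
For example, a Hive table definition is just metadata plus a delimiter. A minimal sketch, assuming the tab-delimited MovieLens u.data file (the path and column names here are assumptions):

CREATE TABLE ratings (
  userID INT,
  movieID INT,
  rating INT,
  ratingTime INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/maria_dev/ml-100k/u.data'
OVERWRITE INTO TABLE ratings;

LOAD DATA just copies the file into Hive's warehouse directory as-is; nothing is converted or validated at write time, and the schema is only applied when you query.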

Where is the data? (managed vs. external tables)


For a managed table, if you issue a "DROP TABLE" command from Hive, then that data is gone.

But sometimes you want to share that data with other systems outside of Hive. That's where external tables come in. "CREATE EXTERNAL TABLE" with a "LOCATION" clause says, "I'm going to use Hive on this data here, but I'm not going to own it." If you drop this table, it drops the metadata, but the underlying data is not deleted.
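
A minimal sketch of an external table (the HDFS path is hypothetical):

CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
  userID INT,
  movieID INT,
  rating INT,
  ratingTime INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/data/ml-100k/';

Dropping this table removes only Hive's metadata; the files under /data/ml-100k/ stay where they are.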

Partitioning

If you do have a massive dataset and your queries tend to be on a specific partition of that data, partitioning in Hive can be a very important optimization.


You wouldn't have to scan over the entire customer database; you could just scan the files specific to the country you're interested in.
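
A sketch of a partitioned table (the customers schema is an assumption); each country value gets its own subdirectory under the table's storage location:

CREATE TABLE customers (
  name STRING,
  address STRUCT<street:STRING, city:STRING, zip:INT>)
PARTITIONED BY (country STRING);

SELECT name FROM customers WHERE country = 'CA';

The WHERE clause on the partition column means Hive only reads the files under .../customers/country=CA/ instead of scanning every partition.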

Ways to use Hive


[Activity] Use Hive to find the movie with the highest average rating


One way to write this as a single query:

SELECT names.title, AVG(ratings.rating) AS avgRating
FROM ratings JOIN names ON ratings.movieID = names.movieID
GROUP BY ratings.movieID, names.title
HAVING COUNT(ratings.movieID) > 10
ORDER BY avgRating DESC;

Sqoop

Integrating MySQL with Hadoop


Import/Export data (MySQL <-> HDFS/Hadoop)


Importing data from your database into Hadoop works like this:

  1. Sqoop kicks off a bunch of mappers (mappers only: all we're doing is moving data and transforming it from one place to another, so there's no real reduce step going on here).
  2. All these mappers talk to HDFS on your Hadoop cluster and populate a big table on HDFS: a giant text file (for example, comma-delimited data). From there, you can use tools like Hive or Pig on it. A sample command is sketched below.
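
As a sketch, assuming a local MySQL database named movielens with a movies table (the names come from the activities below):

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies -m 1

The -m 1 flag caps the job at a single mapper, which is enough for a small table.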

REMEMBER, the power of HDFS is that the file might be distributed across many different hosts as blocks, and also stored redundantly. We've done more than just dump our database to a file here: we've dumped it into HDFS, which opens up a whole world of possibilities for analyzing it in a robust, scalable manner that can be resilient to failure.


Just like importing into HDFS, importing into Hive kicks off a MapReduce job.
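
The same sketch targeting Hive directly; adding --hive-import makes Sqoop create the Hive table definition and load the data into Hive's warehouse instead of leaving a bare text file in HDFS:

sqoop import --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table movies -m 1 --hive-import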


Play with MySQL and Sqoop


[Activity] Install MySQL and import our movie data

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963236#overview

[Activity] Use Sqoop to import data from MySQL to HDFS/Hive

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963266#overview

After importing from MySQL into HDFS, the table appears in HDFS as comma-delimited text files.

After importing from MySQL into Hive, the table appears in Hive and can be queried like any other Hive table.

Note: these sorts of tools that we're using with Hadoop are really intended for big datasets, so there's a lot of overhead for what is actually a pretty simple operation. It doesn't make sense for small datasets; if your dataset is small enough to fit on a single host, there's no need for these tools, and MySQL alone is enough.

[Activity] Use Sqoop to export data from Hadoop to MySQL

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963272#overview

  1. First, you need to know where the data in Hive actually resides. REMEMBER, Hive is just a schema-on-read sort of deal. The actual data itself is just stored as plain old text somewhere, and all Hive is doing is imparting structure to it when it's read.

On Hortonworks, managed Hive tables live under Hive's warehouse directory on HDFS (/apps/hive/warehouse by default).

The data is not comma-delimited; Hive actually uses low ASCII values, like 1, 2, and 3, to delimit its data.
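
If you're unsure where a table's files live or how they're delimited, Hive can tell you directly (the movies table name is assumed from the earlier import):

DESCRIBE FORMATTED movies;

The Location field in the output gives the backing HDFS directory, and the storage section lists the SerDe and field delimiter.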

  2. Before exporting, the target table must already exist in MySQL.
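
A sketch of the export command (the exported_movies target table and the warehouse path are assumptions); note --input-fields-terminated-by '\0001', matching the low-ASCII delimiter Hive uses:

sqoop export --connect jdbc:mysql://localhost/movielens --driver com.mysql.jdbc.Driver --table exported_movies -m 1 --export-dir /apps/hive/warehouse/movies --input-fields-terminated-by '\0001'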

Just like an import, an export kicks off a bunch of mappers using MapReduce.

When would you export from Hive into MySQL? From a practical standpoint, when you need to expose your data to a database that's better suited for OLTP. Sometimes you'll run some huge operation using Hive or some other tool on your cluster, but the output of that operation might be small enough to fit in a single database.

Wrap-up

We talked about making your Hadoop cluster look like a MySQL database, or any SQL database, using Hive.

And we also talked about using your Hadoop cluster with a real MySQL database as well. Kind of two different directions there.

Next, let's talk about NoSQL. Sometimes you actually want to expose your data in a way that's more amenable to real-time queries, and there's a bunch of ways to do that and also ways of integrating Hadoop with systems that are more real-time in nature.