Feeding Data to your Cluster - Kafka & Flume

Kafka


What is streaming?


With streaming technologies such as Kafka, you can process new data as it's generated and get it into your cluster: maybe you'll save it into HDFS, maybe into HBase or some other database, or maybe you'll actually process it in real time as it comes in.

Usually when we're talking about Big Data, there's a big flow of it coming in all the time and you want to be dealing with it as it comes instead of storing it up and dealing with it in batches.


Enter Kafka


The good thing is that because Kafka stores the data, consumers can catch up from wherever they last left off. Kafka maintains the point where each consumer left off and lets it pick up again whenever it wants. So Kafka can publish data to your consumers in real time, but if a consumer goes offline or just wants to catch up from some point in the past, it can do that too.
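You can see this behavior with the console consumer that ships with Kafka. A minimal sketch, assuming a single broker on localhost:9092 and a hypothetical topic named fred (the group name is also made up for illustration):

```
# Replay the topic from the very beginning, as far back as Kafka has retained it
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic fred --from-beginning

# Read as part of a consumer group; Kafka tracks this group's offset,
# so if the consumer dies and restarts it resumes right where it left off
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic fred --group weblog-readers

# Inspect how far the group has read versus the end of the topic (its lag)
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group weblog-readers
```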

Kafka architecture


Even though the architecture diagram says "app", that doesn't necessarily mean you're going to be developing an app in order to use Kafka. Often there are connectors you can just use off the shelf, and Kafka even comes with some built in that might serve your purposes.

You can build pretty fancy systems this way, very scalable and very fast, that transform data, store it, and do pretty much whatever you want with it as it comes in.

How Kafka scales


Let's play


[Activity] Setting up Kafka, and publishing some data

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963832#overview

Kafka depends on ZooKeeper to keep track of what topics exist.
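A minimal sketch of the activity, assuming ZooKeeper on localhost:2181 and a broker on localhost:9092; the topic name is just an example, and newer Kafka versions take --bootstrap-server on these tools instead of --zookeeper:

```
# Create a topic with one partition and one replica
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic fred

# Ask what topics exist
bin/kafka-topics.sh --list --zookeeper localhost:2181

# Type lines into the console producer; each line becomes a message on the topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic fred

# In another terminal, watch the messages arrive
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic fred
```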


In the real world, this would be massive floods of data coming in from massive web server fleets and the like. Kafka can handle that and publish it scalably and reliably to wherever it needs to go. So that's really what it's all about: a scalable and reliable mechanism for publishing and subscribing to massive data streams.

[Activity] Publishing web logs with Kafka

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963846#overview

So let's do something a little closer to what you might do in reality. Let's use a built-in Kafka connector to monitor a file and publish new lines in that file to a given Kafka topic, which then get written out to some other file somewhere else.

Some of the Linux commands used are sketched below.

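Roughly, this uses Kafka Connect's standalone mode with the built-in file source and file sink connectors. The file names, topic name, and paths below are illustrative assumptions, not necessarily the exact ones from the activity:

```
# connect-file-source.properties: tail a file and publish each new line to a topic
name=log-file-source
connector.class=FileStreamSource
tasks.max=1
file=/home/maria_dev/access_log.txt
topic=weblogs

# connect-file-sink.properties: read the same topic and append messages to another file
name=log-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/home/maria_dev/logs_dump.txt
topics=weblogs

# Run both connectors in a single standalone Connect worker
bin/connect-standalone.sh config/connect-standalone.properties \
  config/connect-file-source.properties config/connect-file-sink.properties
```

With that running, appending lines to the source file should make them show up on the topic and, a moment later, in the sink file.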

Flume


What is Flume?

You can really think of Flume as a bona fide part of the Hadoop ecosystem, as opposed to Kafka, which is much more general-purpose in its design.


Now you might think to yourself, "Why can't I just transfer data directly across to my HDFS cluster?" Well, you need some sort of a buffer in between these two things, so a sudden burst of incoming data doesn't overwhelm the cluster.

Flume Agent


One difference between Flume and Kafka: Flume is not really like Kafka, where the data is stored up (until it expires after a configurable retention period) and consumers can pull it whenever they want. With Flume, your sink grabs data from the channel, and once it's been grabbed it gets deleted. Flume does not hang onto your messages or events any longer than it needs to; as soon as your sink processes an event, it's thrown away. Kafka is a little easier to set up if you have data going to multiple different places that might be pulling at different rates.

Built-in Source Types


Built-in Sink Types


It's basically a way of smoothing out the traffic between whatever your source might be (your logs, for example) and wherever you're writing the data to.

Let's play

[Activity] Set up Flume and publish logs with it

https://www.udemy.com/course/the-ultimate-hands-on-hadoop-tame-your-big-data/learn/lecture/5963868#overview


All you really need to do to get Flume running is write a configuration file that defines what sources, channels, and sinks are associated with each agent you want in your Flume setup.
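A minimal sketch of such a file, along the lines of the standard netcat-to-logger example in the Flume documentation (the agent name a1 and the port are just examples):

```
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: just log each event to the console
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

You would then start the agent with something like: bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console, and throw lines at it with telnet localhost 44444.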


[Activity] Set up Flume to monitor a directory and store its data in HDFS


What ends up happening is that we get new subdirectories generated in HDFS for each 10-minute interval that goes by.
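That 10-minute bucketing comes from the time escape sequences and rounding options on the HDFS sink. A sketch of a configuration along these lines, where the spool directory and HDFS path are assumptions for illustration:

```
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: watch a local directory for new files dropped into it
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/maria_dev/spool
# Stamp each event with the current time so the sink can build time-based paths
a1.sources.r1.interceptors = ts
a1.sources.r1.interceptors.ts.type = timestamp

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: write events into HDFS, with a new subdirectory per 10-minute interval
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/maria_dev/flume/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

# Wire everything together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```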
