With streaming technologies such as Kafka, you can process new data as
it's generated on your cluster: maybe you'll save it into HDFS, maybe
into HBase or some other database, or maybe you'll actually process it
in real time as it comes in.
Usually when we're talking about Big Data, there's a constant flow of it
coming in, and you want to deal with it as it arrives instead of storing
it up and processing it in batches.
Enter Kafka
The good thing is that because Kafka stores the data, consumers can
catch up from where they last left off: Kafka maintains the point where
each consumer stopped and lets it pick up again whenever it wants to. So
Kafka can publish data in real time to your consumers, but if a consumer
goes offline or just wants to catch up from some point in the past, it
can do that too.
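To make that concrete, here's a minimal consumer sketch using the
kafka-python library; the topic name, broker address, and consumer group
are assumptions made up for illustration:

```python
from kafka import KafkaConsumer

# Hypothetical topic, broker, and group id - just for illustration.
consumer = KafkaConsumer(
    'page-views',
    bootstrap_servers=['localhost:9092'],
    group_id='analytics',          # Kafka tracks offsets per consumer group
    auto_offset_reset='earliest',  # with no saved offset, start at the beginning
)

# Because Kafka persists messages, a consumer that went offline
# resumes from its last committed offset instead of losing data.
for message in consumer:
    print(message.value)
```

The group_id is the key point here: Kafka remembers each group's
position in the stream, which is exactly what lets a consumer pick up
where it left off.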
We've talked about integrating Hadoop with SQL solutions - MySQL and
other RDBMSs, if you will - relational database management systems - and
those are very handy for giving you the power of a rich analytical query
language like SQL to answer your business questions. But, you know, they
do take a little bit of time to execute. So if you're doing analytic
work, relational databases are awesome. And if you're running something
small, say an internal web site or a very small-scale web site,
something like MySQL can even vend that data to the outside world pretty
well.
BUT let's imagine you need to take things up to the next level.
You're going to start to run into some limitations with SQL and
relational database systems.
Maybe you don't really need the ability to issue arbitrary queries
across your entire dataset. Maybe all you need is just the ability to
very quickly answer a specific question like "What
movie should I recommend for this customer?" or "What web pages has this
customer looked at in the past?"
And if you need to do that at a very large scale very quickly
across a massive dataset, something like MySQL might not cut
it. You know, if you're an Amazon or a Google, you might need something
that can even handle tens of thousands of transactions per
second without breaking a sweat. And that's where NoSQL comes
in.
These are alternative database systems that give up a rich query
language like SQL in exchange for the ability to answer very simple
questions very quickly and at great scale. For systems like that you
want NoSQL, also known as non-relational databases, or "not only SQL" -
that's a term that comes up sometimes, too. These systems are built to
scale horizontally forever, and also built to be very fast and very
resilient.
Up first, let's talk about HBase. HBase is actually built on top of
HDFS, so it gives you a very fast, very scalable transactional system to
query your data that's stored on a horizontally partitioned HDFS file
system. So if you need to expose the massive data that's sitting on your
Hadoop cluster, HBase can be a great way to vend that data to a web
service, to web applications, anything that needs to operate very
quickly and at a very high scale.
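As a rough sketch of what that looks like from an application, here's a
point lookup using the happybase Python client; the table name, column
family, and row key are assumptions for this example:

```python
import happybase

# Connect to an HBase Thrift server (the host is an assumption).
connection = happybase.Connection('localhost')
table = connection.table('user_pages')

# Record a page view under this user's row key.
table.put(b'user123', {b'views:last_page': b'/products/42'})

# Fast point lookup: "what do we know about this one user?"
print(table.row(b'user123'))
```

Notice there's no query language at all here - just a get and a put
keyed on a row, which is exactly the kind of simple question HBase is
built to answer at scale.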
We can actually make your Hadoop cluster look like a relational
database through a technology called Hive. And there are also ways of
integrating a Hadoop cluster with a MySQL database.
What is Hive?
It lets you write standard SQL queries that look just like the ones
you'd use on MySQL, but actually executes them on data that's stored
across your entire cluster, for example on HDFS.
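For instance, a query against Hive from Python might look something
like this sketch using the PyHive library; the host, port, and the
ratings table are assumptions:

```python
from pyhive import hive

# Connect to HiveServer2 (host and port are typical defaults, not a given).
cursor = hive.connect(host='localhost', port=10000).cursor()

# Plain SQL, but Hive translates it into jobs that run across the cluster.
cursor.execute("""
    SELECT movie_id, COUNT(*) AS rating_count
    FROM ratings
    GROUP BY movie_id
    ORDER BY rating_count DESC
    LIMIT 10
""")
for movie_id, rating_count in cursor.fetchall():
    print(movie_id, rating_count)
```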
Why not Hive
It's not really meant to be hit with tons of queries all at once from a
website or something like that. That's where you'd use something like
HBase instead.
Hive is a bunch of smoke and mirrors that makes your cluster look like
a database, so you can issue SQL queries against it, but it isn't really
one underneath.
Besides being faster than MapReduce, Spark has another advantage:
MapReduce is very limited in what it can do. You have to think about
things in terms of mappers and reducers, whereas Spark provides a
framework that removes that level of thought from you. You can just
think more about your end result and program toward that, and think
less about how to actually distribute it across the cluster.
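For example, with PySpark's DataFrame API you describe the result you
want - say, the average rating per movie - and let Spark plan the
distributed execution; the input path and column names here are
assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("AvgRatings").getOrCreate()

# Hypothetical input path and schema, just for illustration.
ratings = spark.read.csv("hdfs:///data/ratings.csv",
                         header=True, inferSchema=True)

# Describe the end result; no mappers or reducers in sight.
avg_ratings = ratings.groupBy("movie_id").agg(
    F.avg("rating").alias("avg_rating"))

avg_ratings.show(10)
spark.stop()
```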