qypx の blog

机会是留给有准备的人的.

References: https://blog.csdn.net/PowerBlogger/article/details/83626449, https://blog.csdn.net/u010886217/article/details/83796151, and the book 《Hadoop构建数据仓库实践》.

For more, see the official Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation

1. What is Hive?

Hive is the data warehouse software of the Hadoop ecosystem. It has the concept of tables and uses an SQL-like language (HiveQL) to read, write, and manage large data sets on distributed storage. It is built on top of Hadoop and has the following capabilities and characteristics:

  • Data can be accessed conveniently through HiveQL, which makes Hive well suited to data warehouse tasks such as ETL, report queries, and data analysis.
  • It provides a mechanism for imposing structure on a wide variety of data formats.
  • It accesses files on HDFS directly, or other data stores such as HBase.
  • Queries can be executed through several computing frameworks, such as MapReduce, Spark, or Tez.

Hive is designed to be a scalable, high-performance, fault-tolerant system that is loosely coupled to its input data formats. It is suited to data warehouse tasks such as summarization, analysis, and batch queries, not to online transaction processing (OLTP) workloads. Hive includes two components, HCatalog and WebHCat. HCatalog is Hadoop's table and storage management layer; it makes it easier for users of data processing tools such as Pig and MapReduce to read and write data in the cluster. WebHCat provides a service for running MapReduce (or YARN), Pig, and Hive jobs or performing metadata operations over an HTTP interface.

HiveQL only handles structured data and, like SQL, is case-insensitive.

Hive data is ultimately stored in files on HDFS. The four commonly used data file formats are TEXTFILE, SEQUENCEFILE, RCFILE, and ORCFILE (see 3.1.2 file format for more detail on file formats). In Hive, the file format refers to how records are encoded when they are stored in a file. The main differences between file formats are their data encoding, compression ratio, space usage, and disk I/O. When loading data, Hive does not modify the data itself; it only copies or moves the data into the corresponding HDFS directory.

When a user adds data to a traditional database, the system checks whether the written data matches the table schema and rejects the insert if it does not; this is called schema on write. Hive is different: it uses schema on read, meaning the data is validated only when it is read (no format validation happens when the data is loaded, and invalid values show up as NULL when read; the advantage of this mode is that loading data is fast). When loading data into Hive, it does not verify that the data matches the table schema, but it does check that the file format matches the table definition.


DDL vs. DML

DDL: maintains the structure of the database (CREATE, DROP, ALTER, RENAME).

DML: maintains the contents of the database (SELECT, INSERT, DELETE, UPDATE).

(figure: overview of the SQL language)

References:

https://blog.csdn.net/lovezhaohaimig/article/details/80184994

https://www.cnblogs.com/xianyao/p/11613021.html

1. drop table table_name

drop (drop the table): deletes both the table contents and the table definition, and frees the space.

Simply put, the whole table is removed; adding data to it later is impossible unless you create a new table.

Because the table structure is deleted as well, the next time you want to use it you have to recreate the table structure and then insert data.

A drop statement also removes the constraints, triggers, and indexes that depend on the table; stored procedures/functions that depend on the table are kept, but their status becomes invalid.

Cannot be rolled back.

2. truncate table table_name

truncate (empty the table): deletes the table contents and frees the space, but keeps the table definition. Unlike drop, it only clears the data; the table structure, its columns, constraints, indexes, and so on stay unchanged.

After truncate table test, rows inserted into test get ids that start again from 1, 2, 3... (showing that truncate releases the space).

In MySQL, truncate resets AUTO_INCREMENT primary keys.

Cannot be rolled back.

Note: truncate cannot delete individual rows; to delete with it you must clear the whole table.

3. delete from table_name where column_name = value

delete (delete data from the table): removes rows from the table; the table structure is not deleted.

It deletes the contents but not the definition, and does not free the space.

If you delete data with delete and then insert again, you can see that the new ids are not contiguous (showing that delete does not release the space).

If rows are inserted again, the ids continue from where they left off: 4, 5, 6...

A delete statement removes one row at a time and records each deletion in the transaction log, so the operation can be rolled back.

That is, a delete without a WHERE clause deletes the table contents but keeps the table structure.

Can be rolled back.

4. Summary

  • Execution speed, in general: drop > truncate > delete.

  • delete is data manipulation language (DML); the operation goes into the rollback segment and takes effect only after the transaction commits. If the table has triggers, they fire when it executes.

  • truncate and drop are data definition language (DDL); they take effect immediately, the original data is not written to the rollback segment, they cannot be rolled back, and they do not fire triggers.

  • In practice, the three are used for clearly different purposes:

    • When you no longer need the table, use drop;

    • When you want to keep the table but delete all of its rows, use truncate;

    • When you want to delete only some rows (always with a WHERE clause), use delete.

  • Comparing truncate and delete:

    • truncate table is functionally the same as a delete statement without a WHERE clause: both remove all rows from the table.

    • truncate is faster than delete and uses fewer system and transaction-log resources.

    • Operations on a truncated table are much faster than on a table cleared with delete.

    • After the table is truncated, the table and its indexes are reset to their initial size; delete does not do this.

    • A truncate cannot be recovered; data removed with delete can be recovered.

(http://www.itpub.net/thread-392126-1-1.html)

Q: When truncate table is executed, the indexes are supposed to be deleted along with the rows. But after running it, user_indexes still shows the index, and after inserting new data a query can still use the index. Does the "deletion" here just mean deleting the index contents? When we use delete, isn't the index content deleted along with the rows as well? So how does index deletion differ between truncate and delete?

A: After a truncate you have only deleted the data belonging to the index, i.e. its contents as you put it; the index definition is not deleted, so the data dictionary still holds its information. After you insert data again, the originally defined index is of course still used. To really remove an index you have to drop it with drop index index_name. With delete, the index contents are removed just as with truncate; physically they may not actually be removed right away, but you don't need to worry about that; just know that the index entries are deleted accordingly.

Two Sum (difficulty: easy)

(problem statement figure)

Method 1: Brute Force

Iterate over each element x and check whether there is another element equal to target - x.

class Solution {
    public int[] twoSum(int[] nums, int target) {
        for (int i = 0; i < nums.length; i++) {
            for (int j = i + 1; j < nums.length; j++) {
                if (nums[j] == target - nums[i]) {
                    return new int[] { i, j };
                }
            }
        }
        throw new IllegalArgumentException("No two sum solution");
    }
}

Complexity

Time complexity: O(n^2)

Space complexity: O(1)


Method 2: Two-pass Hash Table

To optimize the running time, we need a more efficient way to check whether the target element exists in the array and, if it does, to find its index. What is the best way to keep each element of the array associated with its index? A hash table.

By trading space for speed, we can reduce the lookup time from O(n) to O(1). Hash tables are built for exactly this purpose: they support fast lookups in near-constant time. I say "near" because once collisions occur, a lookup may degrade to O(n); but as long as the hash function is chosen carefully, lookups in a hash table are amortized O(1).

We use two passes. In the first pass, we add each element's value and index to the table. Then, in the second pass, we check whether each element's complement (target - nums[i]) exists in the table. Note that the complement must not be nums[i] itself!

class Solution {
    public int[] twoSum(int[] nums, int target) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < nums.length; i++) {
            map.put(nums[i], i);
        }
        for (int i = 0; i < nums.length; i++) {
            int complement = target - nums[i];
            if (map.containsKey(complement) && map.get(complement) != i) {
                return new int[] { i, map.get(complement) };
            }
        }
        throw new IllegalArgumentException("No two sum solution");
    }
}

Complexity

Time complexity: O(n). We traverse the array twice, and since the hash table reduces each lookup to O(1), the overall time is O(n).

Space complexity: O(n). The extra space depends on the number of elements stored in the hash table.


Method 3: One-pass Hash Table

While iterating and inserting elements into the table, we also look back and check whether the current element's complement is already in the table. If it is, we have found a solution and return it immediately.

class Solution {
    public int[] twoSum(int[] nums, int target) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < nums.length; i++) {
            if (map.containsKey(target - nums[i]))
                return new int[] { map.get(target - nums[i]), i };
            map.put(nums[i], i);
        }
        return new int[] {};
    }
}
Complexity

Time complexity: O(n); the array is traversed only once.

Space complexity: O(n).

Longest Continuous Increasing Subsequence (difficulty: easy)

(problem statement figure)

Code

class Solution {
    public int findLengthOfLCIS(int[] nums) {
        if (nums.length <= 1)
            return nums.length;
        int count = 1;
        int max = 1;
        for (int i = 1; i < nums.length; i++) {
            if (nums[i] > nums[i - 1]) {
                count++;
                max = Math.max(max, count);
            } else
                count = 1;
        }
        return max;
    }
}

Stability of a sorting algorithm, informally, means that two equal elements keep their relative order: if Ai = Aj and Ai comes before Aj before sorting, then Ai is still before Aj after sorting.

* Radix, selection, merge, and heap sort have running times that do not change with the initial ordering of the input.

* Quick, selection, Shell, and heap sort are not stable.

Quicksort:

Idea:

1. Pick any element of the sequence as the pivot (usually the first element, although the best choice is to pick one at random); call it the pivot element.

2. Partition the elements: elements larger than the pivot go to its right, smaller ones to its left.

3. Repeat the steps above on the left and right partitions until all elements are in order.
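To make these steps concrete, here is a minimal Python sketch of quicksort (my own illustration, not from the original notes), using the first element as the pivot:

def quick_sort(arr):
    # Base case: 0 or 1 elements are already sorted.
    if len(arr) <= 1:
        return arr
    pivot = arr[0]                                  # take the first element as the pivot
    smaller = [x for x in arr[1:] if x < pivot]     # elements smaller than the pivot go left
    larger = [x for x in arr[1:] if x >= pivot]     # the rest go right
    return quick_sort(smaller) + [pivot] + quick_sort(larger)

print(quick_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]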

Bubble sort:

1. Compare adjacent elements. If the first is larger than the second, swap them.

2. Do the same for every pair of adjacent elements, from the first pair at the beginning to the last pair at the end. At that point, the last element should be the largest.

3. Repeat the steps above for all elements except the last one.

4. Keep repeating over fewer and fewer elements each pass, until there is no pair left to compare.
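A short, illustrative Python sketch of those passes:

def bubble_sort(arr):
    n = len(arr)
    for i in range(n - 1):              # after each pass the largest remaining element is in place
        for j in range(n - 1 - i):      # compare adjacent pairs in the unsorted prefix
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

print(bubble_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]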

Merge sort:

Step 1: Allocate space equal to the combined size of the two sorted sequences; this space will hold the merged sequence.

Step 2: Set two pointers, initially at the start of each of the two sorted sequences.

Step 3: Compare the elements the two pointers point to, copy the smaller one into the merge space, and advance that pointer.

Repeat step 3 until one pointer passes the end of its sequence.

Copy all remaining elements of the other sequence to the end of the merged sequence.
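A small illustrative Python version of the merge step and the recursion around it (my own sketch):

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    # Merge the two sorted halves as described above.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # take the smaller of the two pointed-to elements
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])              # copy whatever remains of either half
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]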

Insertion sort:

1. Start with the sorted sequence {a1} and the unsorted sequence {a2, a3, ..., an}.

2. When handling the i-th element (i = 2, 3, ..., n), the sequence {a1, a2, ..., a(i-1)} is already sorted and {ai, a(i+1), ..., an} is not. Compare ai with a(i-1), a(i-2), ..., a1, find the appropriate position, and insert ai there.

3. Repeat step 2; after n-1 insertions in total, the whole sequence is sorted.
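An illustrative Python sketch of insertion sort:

def insertion_sort(arr):
    for i in range(1, len(arr)):        # arr[:i] is already sorted
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:  # shift larger elements right to make room
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key                # insert arr[i] at its proper position
    return arr

print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]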

Selection sort:

A simple and intuitive sorting algorithm. On each pass it selects the smallest (or largest) element from the remaining unsorted elements and places it at the start of the unsorted part, until all elements have been placed; every pass therefore puts one element in its final position.

On the i-th pass, find the smallest element in L[i...n] and swap it with L[i]; this guarantees that each pass fixes the final position of one element.
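In Python, that pass might look like this (illustrative sketch):

def selection_sort(arr):
    n = len(arr)
    for i in range(n - 1):
        min_idx = i
        for j in range(i + 1, n):       # find the smallest element in arr[i:]
            if arr[j] < arr[min_idx]:
                min_idx = j
        arr[i], arr[min_idx] = arr[min_idx], arr[i]  # pass i fixes one element's final position
    return arr

print(selection_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]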

Heap sort:

For ascending order build a max-heap, for descending order a min-heap. The heap top is the largest (or smallest) element; swap it with the element in the last position, then restore the remaining elements into a max- (or min-) heap. Each pass selects the largest or smallest of the remaining unsorted elements and moves it to its final position.
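An illustrative Python sketch using a max-heap and sift-down, following the description above:

def sift_down(arr, start, end):
    # Restore the max-heap property for the subtree rooted at `start`, within arr[:end].
    root = start
    while 2 * root + 1 < end:
        child = 2 * root + 1
        if child + 1 < end and arr[child + 1] > arr[child]:
            child += 1                  # pick the larger child
        if arr[root] >= arr[child]:
            return
        arr[root], arr[child] = arr[child], arr[root]
        root = child

def heap_sort(arr):
    n = len(arr)
    for start in range(n // 2 - 1, -1, -1):   # build the initial max-heap
        sift_down(arr, start, n)
    for end in range(n - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]   # move the current maximum to its final position
        sift_down(arr, 0, end)                # restore the heap on the remaining elements
    return arr

print(heap_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]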

Shell sort (Shell's sort):

Shell sort, also called "diminishing increment sort", is a variant of insertion sort and a more efficient improvement of straight insertion sort. Shell sort is not a stable sorting algorithm.

Shell sort groups records by a gap between their indices and sorts each group with straight insertion sort. As the gap gradually decreases, each group contains more and more elements; when the gap reaches 1, the whole file forms a single group and the algorithm terminates.

Because elements are sorted gap by gap, an element does not necessarily reach its final position on any given pass.
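A small illustrative Python sketch of Shell sort with gaps n/2, n/4, ..., 1:

def shell_sort(arr):
    n = len(arr)
    gap = n // 2
    while gap > 0:                       # diminishing increments
        for i in range(gap, n):
            key = arr[i]
            j = i
            while j >= gap and arr[j - gap] > key:  # insertion sort within the gap-separated group
                arr[j] = arr[j - gap]
                j -= gap
            arr[j] = key
        gap //= 2
    return arr

print(shell_sort([8, 9, 1, 7, 2, 3, 5, 4, 6, 0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]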

Collaborative Filtering has been criticized as having limited scalability, since computing similarity matrices on very large sets of items or users can take a lot of computing horsepower. I don't really buy this, however. Technologies such as Apache Spark allow you to distribute the construction of this matrix across a cluster if you need to. A legitimate problem with collaborative filtering is that it's sensitive to noisy data and sparse data. You'll only get really good results if you have a large data set to work with that's nice and clean.

Next come some model-based methods. Instead of finding similar items or similar users, we apply data science and machine learning techniques to extract predictions from our ratings data.


There are a wide variety of techniques that fall under the category of matrix factorization. They manage to find broader features of users and items on their own, like action or romance, and these features are described by matrices. The general idea is to describe users and movies as combinations of different amounts of each feature. For example, Bob might be defined as 80% an action fan and 20% a comedy fan. We'd then know to match him up with movies that are a blend of about 80% action and 20% comedy.
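A minimal sketch of that idea (my own illustration with made-up numbers, not code from the course): if each user and each movie is described by a small vector of feature weights, a match score is just the dot product of the two vectors.

import numpy as np

# Hypothetical 2-feature space: [action, comedy]
bob = np.array([0.8, 0.2])                    # Bob: 80% action fan, 20% comedy fan
movies = {
    "action_flick": np.array([0.9, 0.1]),
    "romcom":       np.array([0.1, 0.9]),
}

for title, factors in movies.items():
    score = bob @ factors                     # higher dot product = better match for Bob
    print(title, round(float(score), 2))      # action_flick scores higher than romcom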


This is the idea of leveraging the behavior of others to inform what you might enjoy. At a very high level, it means finding other people like you and recommending stuff they liked. Or it might mean finding other things similar to the thing that you like.

That's why we call it collaborative filtering. It's recommending stuff based on other people's collaborative behavior.

The heart of neighborhood-based collaborative filtering is the ability to find people similar to you, or items similar to items you've liked.

1. Measuring Similarity and Sparsity


The difference between cosine similarity here and cosine similarity in the earlier content-based approach lies in the dimensions: here a dimension is something like "did this user like this thing?" or "was this thing liked by this user?". Every user or every item can constitute its own dimension, and the dimensions are based on user behavior instead of content attributes.

The big challenge in measuring these similarities based on behavior data is the sparsity of the data we're working with. This means that it's tough for collaborative filtering to work well unless you have a lot of user behavior data to work with. You can't compute a meaningful cosine similarity between two people when they have nothing in common, or between two items when they have no people in common.
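For concreteness, here is a small sketch (my own, with made-up ratings) of cosine similarity between two users computed only over the items they have both rated; with nothing in common the similarity is simply reported as zero:

import math

def cosine_similarity(ratings1, ratings2):
    # ratings are dicts of {item_id: rating}; use only the items both users rated
    common = set(ratings1) & set(ratings2)
    if not common:
        return 0.0                       # nothing in common: no meaningful similarity
    sum_xx = sum(ratings1[i] ** 2 for i in common)
    sum_yy = sum(ratings2[i] ** 2 for i in common)
    sum_xy = sum(ratings1[i] * ratings2[i] for i in common)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

alice = {"star_wars": 5, "casablanca": 1}
bob = {"star_wars": 4, "casablanca": 2, "up": 5}
print(cosine_similarity(alice, bob))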


This is why collaborative filtering works well for big companies like Amazon and Netflix. They have millions of users and so they have enough data to generate meaningful relations in spite of the data sparsity.

Sparsity also introduces some computational challenges. You don't want to waste time storing and processing all of that missing data, so under the hood we end up using structures like sparse arrays that avoid storing all of that empty space in the matrix.

2. Similarity Metrics

Adjusted Cosine

It's applicable mostly to measuring the similarity between users based on their ratings. It's based on the idea that different people might have different baselines: what Bob considers a three-star movie may be different from what Alice considers a three-star movie.


Adjusted cosine attempts to normalize these differences. Instead of measuring similarities between people based on their raw rating values, we instead measure similarities based on the difference between a user's rating for an item and their average rating for all items.

Now this sounds good on paper, but in practice data sparsity can really mess you up here. You can only get a meaningful average, or baseline, of an individual's ratings if they have rated enough things to take an average of in the first place. If many users have rated only one movie, that data will be totally wasted with the adjusted cosine metric: no matter what rating they gave, the difference between it and that user's mean will be zero.

So, adjusted cosine might be worth experimenting with, but only if you know that most of your users have rated a lot of stuff implicitly or explicitly. And if you have that much data to begin with, these differences between individuals will start to work themselves out anyway. So, you're not likely to see as much of a difference as you might expect when using adjusted cosine.
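A rough sketch of the mean-centering step (my own illustration of how it is typically implemented, not the course's code): subtract each user's average rating before computing the cosine over co-rated items.

import math

def adjusted_cosine(ratings1, ratings2):
    # Subtract each user's mean rating, then compute cosine over the co-rated items.
    mean1 = sum(ratings1.values()) / len(ratings1)
    mean2 = sum(ratings2.values()) / len(ratings2)
    common = set(ratings1) & set(ratings2)
    if not common:
        return 0.0
    sum_xx = sum((ratings1[i] - mean1) ** 2 for i in common)
    sum_yy = sum((ratings2[i] - mean2) ** 2 for i in common)
    sum_xy = sum((ratings1[i] - mean1) * (ratings2[i] - mean2) for i in common)
    if sum_xx == 0 or sum_yy == 0:
        return 0.0      # e.g. a user who rated only one movie: every difference from their mean is zero
    return sum_xy / math.sqrt(sum_xx * sum_yy)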

(item-based) Pearson similarity

The difference from adjusted cosine is that here we look at the difference between a rating and the average rating that item received from all users.


You can think of Pearson similarity as measuring the similarity between people by how much they diverge from the average person's behavior. For example, suppose most people love Star Wars; people who hate Star Wars will get a very strong similarity score from Pearson similarity, because they share opinions that are not mainstream.

Note that the only difference between this and adjusted cosine is whether we're talking about users or items. The Surprise library used in this course refers to adjusted cosine as user-based Pearson similarity, because it's basically the same thing.

spearman rank correlation

  • Instead of using an average rating value for a movie, we use its rank amongst all movies based on their average rating.
  • Instead of an individual's ratings for a movie, we rank that movie amongst all of that individual's ratings.

Spearman's main advantage is that it can deal with ordinal data effectively, for example a rating scale where the difference in meaning between rating values is not the same. I've never seen this actually used in real-world applications, but you may encounter it in the academic literature.

mean squared difference


Alternatively, we can apply it to items: x and y then refer to two different items, and we look at the differences between the ratings those two items received from the people they have in common (i.e., how differently everyone who rated both items rated each of them).

There are two ways of doing collaborative filtering, item based and user based, and it's important to remember that most of these similarity metrics can apply to either approach.

MSD is easier to understand than cosine similarity, but in practice you'll usually find that cosine works better.
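A sketch of the MSD idea (my own, following the common definition the Surprise library documents: similarity is the inverse of one plus the mean squared difference over co-rated items):

def msd_similarity(ratings1, ratings2):
    common = set(ratings1) & set(ratings2)
    if not common:
        return 0.0
    msd = sum((ratings1[i] - ratings2[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)    # identical ratings give 1; larger differences push this toward 0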

jaccard similarity


Jaccard similarity does not use the actual rating values. If you are dealing with implicit ratings, for example just the fact that somebody watched something, Jaccard can be a reasonable choice that's very fast to compute.
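A sketch of Jaccard similarity on implicit data, where each user is represented simply as the set of things they interacted with (toy example of my own):

def jaccard_similarity(items_a, items_b):
    # items_a, items_b: sets of item ids a user watched or purchased
    intersection = len(items_a & items_b)
    union = len(items_a | items_b)
    return intersection / union if union else 0.0

print(jaccard_similarity({"star_wars", "up", "alien"}, {"star_wars", "alien", "casablanca"}))  # 0.5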

Recap

  • Cosine similarity is almost always a reasonable thing to start with.
  • Adjusted cosine and Pearson are two different terms for basically the same thing: a mean-centered cosine similarity. The idea is to deal with unusual rating behavior that deviates from the mean, but in practice it can sometimes do more harm than good.
  • Spearman rank correlation is the same idea as Pearson, using rankings instead of raw ratings, and it's not something you are likely to use in practice.
  • MSD is mean squared difference, which is just an easier similarity metric to wrap your head around than cosine similarity, but in practice it usually doesn't perform better.
  • Jaccard similarity just looks at how many items two users have in common, or how many users two items have in common, divided by how many items or users there are between both of them. It's really simple and well suited to implicit ratings, such as binary actions like purchasing or viewing something. But you can also apply cosine similarity to implicit ratings.
  • So, at the end of the day, cosine similarity remains my default go-to similarity metric.

3. Collaborative Filtering

3.1 User-based Collaborative Filtering (UserCF)

This is similar to item-based collaborative filtering, but the other way around. Item-based collaborative filtering works like this: user U bought item A, so recommend to U items B, C, and D that are similar to A. User-based collaborative filtering instead first computes the similarity between user U and the other users, then takes the few users most similar to U and recommends the items they have bought to user U.


3.1.1 Computing similarity between users

Reference: https://github.com/NLP-LOVE/ML-NLP/tree/master/Project/17.%20Recommendation%20System

To compute user similarity, we first convert the index of items bought by each user into an index of users who bought each item, i.e. an inverted index of the items:

(figure: the inverted index from items to the users who bought them)

After the inverted index of items is built, user similarity can be computed with the similarity formula:

\[ w_{ab} = \frac{N(a) \cap N(b)}{\sqrt{N(a)\, N(b)}} \]

where \(N(a)\) is the number of items bought by user \(a\), \(N(b)\) is the number of items bought by user \(b\), and \(N(a) \cap N(b)\) is the number of items bought by both \(a\) and \(b\). With these user similarities, pick the k users most similar to user U and recommend to U the items they have bought that U has not bought yet.
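To make the procedure concrete, here is a small sketch (my own illustration with toy data) that builds the item-to-user inverted index and then scores user pairs with the formula above:

import math
from collections import defaultdict
from itertools import combinations

# user -> set of items the user bought (toy data)
user_items = {
    "A": {"i1", "i2", "i4"},
    "B": {"i1", "i3"},
    "C": {"i2", "i4", "i5"},
}

# Inverted index: item -> set of users who bought it
item_users = defaultdict(set)
for user, items in user_items.items():
    for item in items:
        item_users[item].add(user)

# Count co-bought items for each pair of users via the inverted index
overlap = defaultdict(int)
for users in item_users.values():
    for a, b in combinations(sorted(users), 2):
        overlap[(a, b)] += 1

# Similarity: N(a) ∩ N(b) / sqrt(N(a) * N(b))
for (a, b), common in overlap.items():
    sim = common / math.sqrt(len(user_items[a]) * len(user_items[b]))
    print(a, b, round(sim, 3))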


3.1.2 Example

  • Every user's cosine similarity with themselves is 1.
  • Bob and Ted's cosine similarity is 0, because they have not rated any movies in common.
  • The upper and lower triangles of the matrix are symmetric.
  • For Bob and Ann: although they have rated different movies, when we compute similarity we only look at the movies they have both rated. In this example they have exactly one movie in common, and both rated it 5, so their similarity score is 1 (100%).

Note: a 100% similarity between two users does not necessarily mean they like the same things; it can also mean they hate the same things (for example, if Bob and Ann both gave Star Wars 1 star, they would still be 100% similar). In fact, the math behind cosine similarity works out such that if you only have one movie in common, you end up with 100% similarity no matter what. Even if Bob loved Star Wars and Ann hated it, in a sparse data situation they both end up 100% similar. Sparse data is a huge problem with collaborative filtering, and it can lead to weird results. Sometimes you need to enforce a minimum threshold on how many movies users have in common before you consider them at all.


Besides normalizing, you could also, for example, convert ratings of 1 or 2 into negative scores. There is no single standard approach; you experiment to see what works best for the data you have.

If we encounter a given movie more than once, we should add its scores together.

3.1.3 Recap
3.1.4 User-based Collaborative Filtering, Hands-On

code walkthrough

In the CollaborativeFiltering folder:

Run SimpleUserCF.py

3.2 Item-based Collaborative Filtering (ItemCF)

Look at the things you liked, and recommend stuff that's similar to those things.

Core idea: recommend to users items that are similar to the items they liked before.


Some reasons why using similarities between items may work better than using similarities between users:

  • Items tend to be of a more permanent nature than people. An individual's taste may change quickly over the span of their life, but your math book will always be similar to other math books. As such, you can get away with computing an item similarity matrix less often than user similarities, because it won't change very quickly.
  • You usually have far fewer items to deal with than people; in most cases there are way more people than there are things to recommend to them. The item similarity matrix is therefore much smaller than the user similarity matrix, which not only makes it simpler to store, but also faster to compute. When you're dealing with massive systems like Amazon and Netflix, computational efficiency is very important. Not only does it require fewer resources, it means you can regenerate the similarities between items more often, making your system more responsive when new items are introduced.
  • Using item similarities also makes for a better experience for new users. As soon as a new user shows interest in some item, you can recommend items similar to it, whereas with user-based collaborative filtering you wouldn't have any recommendations for a new user at all until they make it into the next build of your user similarity matrix.

In this example all the similarities are 0 or 1 because the data set is so small. In the real world, you'd see more interesting and meaningful numbers here.


3.2.1 Methods for computing item similarity

Reference: https://github.com/NLP-LOVE/ML-NLP/tree/master/Project/17.%20Recommendation%20System

Item-based collaborative filtering first computes the similarity between items. There are several ways to compute it:

1. Based on the users who liked both items

\[ w_{ij} = \frac{N(i) \cap N(j)}{\sqrt{N(i)\, N(j)}} \]

Here, \(N(i)\) in the denominator is the number of users who bought item \(i\), \(N(j)\) is the number of users who bought item \(j\), and the numerator \(N(i) \cap N(j)\) is the number of users who bought both item \(i\) and item \(j\). The core of this formula is the proportion of people who bought the two items together: the more people bought both, the higher their similarity. Note also that the denominator uses each item's total number of buyers as a penalty: a very popular item tends to be bought together with many other items, so dividing by its total number of buyers lowers its similarity scores with other items.

2. Cosine-based similarity

The method above computes item similarity simply from the number of people who bought both items. But a user may buy something and not like it, so if the data set contains explicit ratings, we can bring the ratings into the similarity computation.

\[ w_{ij} = \frac{\sum_{k} n_{ki}\, n_{kj}}{\sqrt{\sum_{k} n_{ki}^{2}}\,\sqrt{\sum_{k} n_{kj}^{2}}} \]

where \(n_{ki}\) is user \(k\)'s rating of item \(i\), or 0 if the user has not rated it.

3. Penalizing popular items

The popular-item problem can be addressed with the following formula:

\[ w_{ij} = \frac{N(i) \cap N(j)}{N(i)^{1-\alpha}\, N(j)^{\alpha}} \]

When \(\alpha \in (0, 0.5)\), the smaller \(N(i)\) is, the heavier the penalty, which pushes down the similarity scores of popular items.
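A small sketch (toy data, my own illustration) of method 1 combined with the popularity penalty of method 3: count co-purchases, then divide by the penalized item popularities.

import math
from collections import defaultdict
from itertools import combinations

# user -> items bought (toy data)
user_items = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
}

item_count = defaultdict(int)        # N(i): how many users bought item i
co_count = defaultdict(int)          # N(i) ∩ N(j): how many users bought both i and j
for items in user_items.values():
    for i in items:
        item_count[i] += 1
    for i, j in combinations(sorted(items), 2):
        co_count[(i, j)] += 1

alpha = 0.4                          # popularity penalty exponent, alpha in (0, 0.5)
for (i, j), both in co_count.items():
    sim = both / (item_count[i] ** (1 - alpha) * item_count[j] ** alpha)
    print(i, j, round(sim, 3))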


3.2.2 Item-based Collaborative Filtering, Hands-On

code walkthrough

In the CollaborativeFiltering folder:

Run SimpleItemCF.py

Item-based collaborative filtering is what Amazon used with outstanding success.

You need to test it out on several real people if possible, and then move to a large scale A/B test to see if this algorithm really is better or worse than whatever you might have today.


3.3 Matrix Factorization

Reference: https://github.com/NLP-LOVE/ML-NLP/tree/master/Project/17.%20Recommendation%20System

The computation above produces a similarity matrix whose size and dimensionality are very large, so it needs dimensionality reduction; SVD is the method used for this. See: dimensionality reduction methods.

Matrix factorization based on sparse autoencoders

Matrix factorization is a mature technique in the recommendation field, but as the previous section showed, it essentially approximates the original matrix with a single decomposition, so its feature extraction is not very deep. Matrix factorization also does not use the items' own content features, such as a book's category or a piece of music's genre. With the rise of neural networks, deeper feature representations can be obtained with multi-layer perceptrons, and content classification features can be incorporated as well. First, the design of a sparse autoencoder network.

  1. Basic autoencoder structure

    The simplest autoencoder is a three-layer neural network: we make the output layer equal the input layer, and give the middle layer far lower dimensionality than the input and output layers, which yields a first level of feature compression (a minimal sketch follows this list).

    Simply put, the autoencoder tries to learn a function whose output approximately equals its input; in other words, it tries to approximate the identity function. If the network's input data were completely random, say each input an independent, identically distributed Gaussian random variable unrelated to the other features, this compressed representation would be very hard to learn. But if the input data contains some particular structure, for example if certain input features are correlated with each other, the algorithm can discover those correlations.

  2. Multi-layer structure

    Starting from this single-hidden-layer structure, we can extend to a multi-layer network and learn higher-level, more abstract features.
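One possible way to write the basic structure from item 1, as a minimal Keras sketch (my own illustration; the layer sizes are made up): the network is trained to reproduce its input through a much smaller hidden layer.

from tensorflow import keras

n_inputs = 1000        # hypothetical input dimension (e.g. one slot per item)
n_hidden = 64          # much smaller middle layer = the compressed representation

inputs = keras.Input(shape=(n_inputs,))
encoded = keras.layers.Dense(n_hidden, activation="relu")(inputs)
decoded = keras.layers.Dense(n_inputs, activation="sigmoid")(encoded)   # output layer mirrors the input layer

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
# Training makes the output approximate the input (an identity-like mapping):
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)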

4. Exercise

So far we have taken the top N. For example, item-based: take the user's 10 highest-rated movies, then find similar items; user-based: take the 10 users most similar to the user, then proceed from there.

Maybe it would be better if, instead of taking the top k sources for recommendation candidates, we just used any source above some given quality threshold. For example, any movie the user rated 4 stars or higher should generate item-based recommendation candidates, no matter how many or how few of them there may be.


A small change can make the results quite different. Often, you don't just want to test different recommendation algorithms, you want to test different variations and parameters of those algorithms. In this case, I'd not only want to test the idea of using a threshold instead of a top-k approach, I'd also want to test many different threshold values to find the best one. In the real world, you'll find that your biggest problem is just not having enough time to run all of the different experiments you want to run to make your recommendations better.

5. Evaluating Collaborative Filtering Systems Offline


Now, although we can't measure accuracy with user based or item based collaborative filtering, because they don't make rating predictions, we can still measure hit rate, because it is still a top-N recommender.

code walkthrough

In the CollaborativeFiltering folder:

Run EvaluateUserCF.py


That was surprisingly fast. One really nice property of collaborative filtering is how quickly it can generate recommendations for any given individual, once we've built up the similarity matrix.

The resulting hit rate of 5.5% is pretty good.

Exercise


What needs to change compared with EvaluateUserCF.py:

The item-based hit rate is only 0.5%, somewhat worse than the user-based 5.5%.

Item-based should be a superior approach, and that's been proven in industry. The weaker item-based result here is probably due to the data, and this is only an offline evaluation. If we were to test both algorithms on real-world people using real-world data in an A/B test, the results could end up being very different.

6. KNN Recommenders

The concept of collaborative filtering has been applied to recommender systems that do make rating predictions, and these are generally referred to in the literature as KNN recommenders.


In this sort of system, we generate recommendation candidates by predicting the ratings of everything a user hasn't already rated, and selecting the top k items with the highest predicted ratings. This obviously isn't a terribly efficient approach, but since we're predicting rating values, we can measure the offline accuracy of the system using train/test or cross-validation, which is useful in the research world.
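In the Surprise library the course uses, this kind of rating-predicting KNN recommender looks roughly like the following sketch (my own minimal example against the built-in MovieLens 100k data, not the course's script):

from surprise import Dataset, KNNBasic, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-100k")                   # downloads MovieLens 100k on first use
trainset, testset = train_test_split(data, test_size=0.25)

sim_options = {"name": "cosine", "user_based": False}    # False = item-based KNN
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainset)

predictions = algo.test(testset)
accuracy.rmse(predictions)                               # offline accuracy of the predicted ratings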


Running User and Item-based KNN on MovieLens

code walkthrough

In the CollaborativeFiltering folder:

Run KNNBakeOff.py


Looking only at accuracy, KNN recommendation seems like a good approach, but if you look at the top-N recommendations, both the item-based and user-based versions recommend obscure movies nobody has heard of; random recommendation actually looks a lot better from a subjective standpoint.

So, on the surface, it looks like we may have made a system that's pretty good at predicting ratings of movies people have already seen, but might not be very good at producing top-n recommendations.

Exercise

Experiment with different KNN parameters.


Although msd gives a slightly lower RMSE than cosine, the two recommend the same things.


Well, it's actually pretty well known that KNN doesn't work well in practice. Unfortunately, some people conclude that collaborative filtering in general is some naive approach that should be replaced with completely different techniques. But as we've seen, collaborative filtering isn't the problem; forcing collaborative filtering to make rating predictions is the problem. We had some pretty exciting results when we just focused on making top-N recommendations and completely forgot about optimizing for rating accuracy, and it turns out that's what's at the heart of the problem. With both user-based and item-based CF, the top-N recommendations were quite good.

Ratings are not continuous in nature, and KNN treats them as though they are continuous values that can be predicted on a continuous scale. If you really want to go with KNN, it would be more appropriate to treat it as a rating classification problem than as a rating prediction problem. KNN is also very sensitive to sparse data.

The most fundamental thing is that accuracy isn't everything. The main reason KNN produces underwhelming results is that it's trying to solve the wrong problem: the KNN recommender gives poor recommendations because it is solving a rating-prediction problem, not a recommendation problem.

7. Bleeding Edge Alert


The idea behind it is that users are modeled as vectors moving from one item to another in a multidimensional space. And you can predict sequences of events, like which movie a user is likely to watch next, by modeling these vectors.

The reason this paper is exciting is because it outperformed all of the best existing methods for recommending sequences of events in all but one case in one data set.


Basic idea: you position individual items, like movies, in a transition space, where neighborhoods within this space represent similarity between items; items close together in this space are similar to each other. The dimensions correspond to complex transition relationships between items. Since this technique depends on arranging items together into local, similar neighborhoods, I still classify it as a neighborhood-based method.

In this space, we can learn the vectors associated with individual users. Maybe a user who watches a Tom Cruise movie is likely to move along to the next Tom Cruise movie, for example, and that transition would be represented by a vector in this space. We can then predict the next movie a user is likely to watch by extrapolating along the vector we've associated with that user.

The paper provides code in C++.

It's all very advanced stuff, but it seems to work. So if you find yourself in a situation where you need to predict a sequence of events, like which movies or videos a person is likely to watch next given their past history, you might want to do a search for translation-based recommendations and see how it's coming along in the real world.

Recommending items just based on the attributes of those items, instead of trying to use aggregate user behavior data.

Cosine Similarity

For simplicity, assume there are only two movie genres: adventure and comedy. A movie gets a 1 for a genre it belongs to and a 0 otherwise.

The angle θ between the two genre vectors captures, to some degree, how similar they are. We want to express similarity as a number in [0, 1], where zero means not at all similar and one means totally the same thing. The cosine of θ does exactly that: when θ is 90 degrees the cosine is 0, and when θ is 0 degrees the cosine is 1.

def computeGenreSimilarity(self, movie1, movie2, genres):
    genres1 = genres[movie1]
    genres2 = genres[movie2]
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(genres1)):
        x = genres1[i]
        y = genres2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

How do we assign a similarity score based on release years alone?

This is where some of the art of recommender systems comes in. You have to think about the nature of the data you have and what makes sense.

How far apart would two movies have to be for their release date alone to signify they are substantially different? A decade seems like a reasonable starting point.

Now we need to come up with some sort of mathematical function that smoothly scales that into the range zero to one. An exponential decay function is one option.

def computeYearSimilarity(self, movie1, movie2, years):
    diff = abs(years[movie1] - years[movie2])
    sim = math.exp(-diff / 10.0)
    return sim

The choice of this function is completely arbitrary, but it seems like a reasonable starting point. In the real world, you'd test many variations of this function to see what really produces the best recommendations with real people.

K-nearest-neighbors

So how do we turn these similarities between movies based on their attributes into actual rating predictions?
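One common answer, sketched below (this is the general k-nearest-neighbors idea, not necessarily exactly what the course's ContentKNNAlgorithm.py does): to predict a user's rating for a candidate movie, take the k movies the user has already rated that are most similar to it, and average their ratings weighted by similarity.

import heapq

def predict_rating(candidate, user_ratings, similarity, k=10):
    # user_ratings: {movie_id: rating the user gave}
    # similarity(a, b): content-based similarity, e.g. genre similarity * year similarity
    neighbors = [(similarity(candidate, movie), rating)
                 for movie, rating in user_ratings.items()]
    top_k = heapq.nlargest(k, neighbors)            # k most similar movies the user has rated
    sim_total = sum(sim for sim, _ in top_k)
    if sim_total == 0:
        return None                                 # nothing similar enough to base a prediction on
    return sum(sim * rating for sim, rating in top_k) / sim_total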

Code Walkthrough

Files in the ContentBased folder:

The results show that the content-based algorithm performs better than random recommendations.

Bleeding Edge Alert

Note: We often refer to the current state of the art as leading edge, but technology that's still so new that it's unproven in the real world can be risky to work with, so we call that bleeding edge.

This is where we highlight some new research that looks interesting and promising, but hasn't really made it into mainstream yet with recommender systems.

mise en scene

Some recent research and content based filtering has surrounded the use of mise en scene data. Technically, mise en scene refers to the placement of objects in a scene, but the researchers are using this term a bit more loosely to refer to the properties of the scenes in a movie or movie trailer.

The idea is to extract properties from the film itself that can be quantified and analyzed, and see if we can come up with better movie recommendations by examining the content of the movie itself scene by scene.

What sort of attributes are we talking about:

In principle, this should give us a feel as to the pacing and mood of the film, just based on the film itself.

Q: Is this kind of data more effective than the human-generated genre classifications used before?

Code changes:

Uncomment line 45 of ContentKNNAlgorithm.py, and append * mesSimilarity on line 46:

Then run ContentRec.py, which gives the following results:

From the results, the RMSE actually got a lot worse. On the one hand, this could just be an artifact of how we chose to compute the mise en scene similarity scores. On the other hand, accuracy isn't really what we're concerned with.

Again, sometimes developing recommendation systems is more an art than a science. You can't really predict how real people will react to new recommendations they haven't seen before. Personally, I'd be tempted to test this in an A/B test to see how it performs.

If you look at the research literature associated with mise en scene recommendations, however, they note that it doesn't do any favors to accuracy, but it does increase diversity. But again, increased diversity isn't always a good thing when it comes to recommendations. It may just mean that you're recommending random stuff that has no correlation to the user's actual interests. Still, it was interesting to experiment with, and it would be even more interesting to experiment with it using real people.

Here's a reference to the original research paper:

Exercise

  • Use genre, release year, and mise en scene data independently.

  • See if you can improve the release-year-based recommendations by sorting the k nearest neighbors within a given year by popularity (i.e., sort the year-based recommendations by popularity as a secondary sort).

Now by using popularity data, technically we're no longer limiting our recommendations to content attributes. Popularity is behavior-based data.
