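Recommender Systems (Part 3)

(Original article. Please credit the source when reposting!)

A recommender system is concerned with people and items, and aims to predict how much a person will like an item. Different people can share similar tastes (for example, they all like martial-arts novels), and different items can share similar features (for example, they are all martial-arts novels). To predict the rating that user A would give to an item T that A has not yet rated, there are two angles of attack. The first is to find people whose tastes are close to user A's and estimate A's rating of item T from their ratings of T. The second is to look at the items user A has already rated, find those that are close to item T, and estimate A's likely rating of T from A's ratings of those similar items. These two ideas lead to two methods for computing ratings in a recommender system: user-based collaborative filtering and item-based collaborative filtering.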

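I. User-Based Collaborative Filtering (UBCF)

1. Finding similar users

Approach 1: compute the similarity between user A and every other user who has rated item T, then use all of these users and their similarities when computing the predicted rating.

Approach 2: compute the similarity between user A and every other user who has rated item T, keep the K users with the highest similarity, and use only those K when computing the predicted rating.

Approach 3: compute the similarity between user A and every other user who has rated item T, set a threshold, keep the users whose similarity exceeds the threshold, and use only those when computing the predicted rating.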
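The three selection strategies differ only in which neighbours are kept. A minimal sketch in R, assuming the similarities have already been computed; the user names, similarity values, K and threshold are all made up for illustration:

```r
# Hypothetical similarities between user A and five users who rated item T.
sims <- c(u1 = 0.91, u2 = 0.15, u3 = 0.78, u4 = 0.40, u5 = 0.66)

# Approach 1: keep every user who rated item T.
allNeighbours <- names(sims)

# Approach 2: keep the K most similar users.
K <- 3
topK <- names(sims)[head(order(sims, decreasing = TRUE), K)]

# Approach 3: keep the users whose similarity exceeds a threshold.
threshold <- 0.7
aboveThreshold <- names(sims)[sims > threshold]

print(allNeighbours)
print(topK)
print(aboveThreshold)
```

Approach 2 always returns K neighbours (if that many exist), while approach 3 may return anything from none to all of them, depending on the threshold.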

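Several measures can be used to compute the similarity: the Pearson correlation coefficient, cosine similarity, Euclidean distance, and so on. (In R, the cor function computes the Pearson correlation coefficient and the dist function computes Euclidean distance; the daisy function can also be used, but it requires installing the cluster package first.)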
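As a small illustration, the three measures can be computed in base R for two users who rated the same five items. The rating vectors are invented, and the conversion from distance to similarity at the end is just one common convention, not something the article prescribes:

```r
# Hypothetical ratings of the same five items by two users.
a <- c(5, 3, 4, 4, 2)
b <- c(4, 2, 5, 3, 1)

# Pearson correlation coefficient, in [-1, 1].
pearson <- cor(a, b, method = "pearson")

# Cosine of the angle between the two rating vectors.
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Euclidean distance: dist() works on the rows of a matrix.
euclid <- as.numeric(dist(rbind(a, b), method = "euclidean"))

# A distance grows as users differ, so it has to be turned into a similarity;
# 1 / (1 + d) is one simple choice (an assumption of this sketch).
euclidSim <- 1 / (1 + euclid)

print(c(pearson = pearson, cosine = cosine, euclid = euclid, euclidSim = euclidSim))
```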

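2. Computing the predicted rating of item T for user A

Once the similar users have been found, the mean of their ratings of item T can serve as the prediction of user A's rating of T. Because the similar users are not all equally similar to user A, the prediction can instead be a weighted average of their ratings, with the similarities as weights.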
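The difference between the two predictions is easy to see on a tiny example. This is a sketch with invented ratings and similarities for three neighbours:

```r
# Hypothetical ratings of item T by three similar users,
# and each user's similarity to user A.
neighbourRatings <- c(4, 5, 2)
neighbourSims    <- c(0.9, 0.8, 0.3)

# Plain mean: every neighbour counts equally.
simpleMean <- mean(neighbourRatings)

# Similarity-weighted mean: closer neighbours count more.
weightedMean <- sum(neighbourSims * neighbourRatings) / sum(neighbourSims)

# stats::weighted.mean computes the same quantity.
weightedMean2 <- weighted.mean(neighbourRatings, w = neighbourSims)

print(c(simpleMean = simpleMean, weightedMean = weightedMean))
```

The weighted prediction is pulled towards the ratings of the two most similar users and away from the low rating of the least similar one.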

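3. Implementation

The code below measures similarity with the cosine of the angle between rating vectors, selects the K most similar users, and uses their ratings to compute the predicted rating. The training data is a matrix in which each row holds all the ratings received by one item and each column holds one user's ratings of all items. Ratings range from 1 to 5, and an unrated entry is NA. The code is as follows: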

## Normalize each column of a matrix with the z-score method ( (x-u)/sigma )
## Args :
##     x - a matrix in which 0 marks a missing (un-rated) entry
## Returns :
##     a list containing the mean of each column,
##                       the standard deviation of each column,
##                       and the normalized x
zScoreNormalization <- function(x)
{
    meanOfcol <- rep(NA_real_, ncol(x))
    sdOfcol <- rep(NA_real_, ncol(x))
    for (i in seq_len(ncol(x))) {
        v <- x[, i]
        idx <- which(v != 0)   # rated entries only
        ## too few ratings, or identical ratings (sd == 0, which would divide by zero):
        ## leave the column unchanged and report NA for its mean and sd
        if (length(idx) <= 1 || sd(v[idx]) == 0) {
            next
        }
        meanOfcol[i] <- mean(v[idx])
        sdOfcol[i] <- sd(v[idx])
        x[idx, i] <- (v[idx] - meanOfcol[i]) / sdOfcol[i]   # z-score
    }

    return ( list(meanOfcol = meanOfcol, sdOfcol = sdOfcol, xNormalized = x) )
}
## Invert the z-score normalization
## Args :
##     x  -  a vector of normalized values to convert back
##     u  -  mean of the original x
##     sigma  -  standard deviation of the original x
## Returns :
##     x on the original scale
zScoreNormalizationInverse <- function(x, u, sigma)
{
    return (x*sigma + u)
}

## Calculate the cosine of the angle between two vectors
## Args :
##      x  -  a vector
##      y  -  a vector
## Returns :
##      the cosine of the angle, rescaled to [0,1];
##      0 if the two vectors share no non-zero entry
cosineSimilarity <- function(x, y) {
    if (length(x) != length(y)) {
        stop("Function cosineSimilarity : the two vectors have different lengths!")
    }
    xx <- x
    yy <- y
    xx[is.na(xx)] <- 0
    yy[is.na(yy)] <- 0
    ## if x and y have no common non-zero entry, return 0 without calculating
    if ( sum(abs(xx*yy)) == 0 ) {
        return (0)
    }

    sim <- sum(xx*yy) / ( sqrt(sum(xx^2)) * sqrt(sum(yy^2)) ) # cosine of the angle
    return ( 0.5 + 0.5*sim )  # map [-1,1] to [0,1]
}

## Find the top n items to recommend with the User-Based Collaborative Filtering algorithm
## Args :
##      x  -  a matrix containing all rating results.
##            Each column holds the ratings given by one user, each row the ratings of one movie.
##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##      userI - index of the specified user
##      k  -  number of nearest neighbours of user I
##      n  -  number of top items to recommend to user I
## Returns :
##      a list containing the recommendation result
recommendationUBCF <- function(x, userI, k, n)
{
    ## record the un-rated items while they are still NA: after normalization
    ## a rating equal to the user's mean also becomes 0
    unRatedItems <- which( is.na(x[, userI]) )
    x[is.na(x)] <- 0
    ## normalize the data
    normlizedResult <- zScoreNormalization(x)
    x <- normlizedResult$xNormalized

    ## find the k most similar users
    userSimilarity <- numeric(ncol(x))
    for (i in seq_len(ncol(x))) {
        if (i == userI) {
            userSimilarity[i] <- -1   # never pick user I as its own neighbour
            next
        }
        userSimilarity[i] <- cosineSimilarity(x[, i], x[, userI])
    }
    KSimilarUserIdx <- head( order(userSimilarity, decreasing=TRUE), k )

    ## predict the rating of un-rated items as the similarity-weighted average of the
    ## neighbours' normalized ratings (a neighbour who has not rated item i contributes 0)
    ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
    for (i in unRatedItems) {
        ratingOfUnRatedItems[i] <- sum( x[i,KSimilarUserIdx] * userSimilarity[KSimilarUserIdx] ) /
                                   sum( userSimilarity[KSimilarUserIdx] )
    }
    ratingOfUnRatedItems <- zScoreNormalizationInverse( ratingOfUnRatedItems,
                                                        normlizedResult$meanOfcol[userI],
                                                        normlizedResult$sdOfcol[userI] )

    ## find the Top-N items; already-rated items are NA and na.last=NA drops them
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing=TRUE, na.last=NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}

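II. Item-Based Collaborative Filtering (IBCF)

1. Algorithm

1) Find all the items that the specified user has not yet rated.

2) For each unrated item, find the k items most similar to it among those the user has already rated, and predict the rating of the unrated item from the user's ratings of these k items together with their similarity values.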
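The two steps can be sketched on a toy matrix. This is a simplified illustration rather than the article's implementation: it works on raw ratings without normalization, lets cor skip missing values with use = "pairwise.complete.obs", and all data, names and the value of k are made up:

```r
# Toy rating matrix: rows are items, columns are users, NA means "not rated".
ratings <- matrix(c(5,  4, 5, 1, 2,
                    4,  5, 4, 2, 1,
                    1,  2, 1, 5, 4,
                    2,  1, 2, 4, 5,
                    NA, 5, 4, 1, 2),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("item", 1:5), paste0("user", 1:5)))
user <- "user1"
k <- 2

# Step 1: split the items into those the user has and has not rated.
unrated <- rownames(ratings)[is.na(ratings[, user])]
rated   <- rownames(ratings)[!is.na(ratings[, user])]

# Item-item Pearson similarity, computed over the users who rated both items.
itemSim <- cor(t(ratings), use = "pairwise.complete.obs")

# Step 2: for one unrated item, take the k most similar rated items and
# average the user's ratings of them, weighted by similarity.
predictItem <- function(item) {
    sims <- itemSim[item, rated]
    nearest <- names(sims)[head(order(sims, decreasing = TRUE), k)]
    sum(sims[nearest] * ratings[nearest, user]) / sum(abs(sims[nearest]))
}
predictions <- sapply(unrated, predictItem)
print(predictions)
```

Here item5 is most similar to item1 and item2, which user1 rated 5 and 4, so the prediction lands between those two values.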

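2. Implementation

The Pearson correlation coefficient is used to compute the similarity between items. The training data has the same layout as for UBCF. The implementation is as follows: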

## Find the top n items to recommend with the Item-Based Collaborative Filtering algorithm
## Args :
##      x  -  a matrix containing all rating results.
##            Each column holds the ratings given by one user, each row the ratings of one movie.
##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##      userI - index of the specified user
##      k  -  number of nearest neighbours of each un-rated item
##      n  -  number of top items to recommend to user I
## Returns :
##      a list containing the recommendation result
recommendationIBCF <- function(x, userI, k, n)
{
    # Pearson correlation coefficient between two vectors of length m :
    # sum((x - u_x)*(y - u_y)) / ((m - 1) * sd_x * sd_y)

    ## record rated and un-rated items while the un-rated ones are still NA:
    ## after normalization a rating equal to the item's mean also becomes 0
    unRatedIdx <- which( is.na(x[, userI]) )
    ratedIdx <- which( !is.na(x[, userI]) )
    x[is.na(x)] <- 0
    ## normalize the data per item (items become columns after the transpose)
    normlizedResult <- zScoreNormalization( t(x) )
    x <- t( normlizedResult$xNormalized )

    ## predict the rating of user I's un-rated items
    r <- x[ratedIdx, , drop=FALSE]   # drop=FALSE keeps a matrix even with one rated item
    ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
    for (i in unRatedIdx) {
        ## item i could not be normalized, so its prediction cannot be
        ## converted back to the rating scale: leave it as NA
        if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
            next
        }
        # calculate the Pearson correlation coefficient between item i and each rated item
        itemSim <- as.vector( cor( x = x[i,], y = t(r), use = "everything", method = "pearson" ) )

        # find the k nearest items to item i; na.last=NA drops undefined correlations
        KSimilarItemIdx <- head( order(itemSim, decreasing=TRUE, na.last=NA), k )

        # predict the rating of un-rated item i; Pearson similarities can be negative,
        # so the weights are normalized by the sum of their absolute values
        ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx,userI] * itemSim[KSimilarItemIdx] ) /
                                   sum( abs(itemSim[KSimilarItemIdx]) )
        ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                            normlizedResult$meanOfcol[i],
                                                            normlizedResult$sdOfcol[i] )
    }

    ## find the Top-N items; already-rated items are NA and na.last=NA drops them
    topnIdx <- head( order(ratingOfUnRatedItems, decreasing=TRUE, na.last=NA), n )
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}

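III. Rating Normalization

Different users have different rating habits: some tend to give uniformly low ratings, while others tend to give uniformly high ones. The data therefore needs a normalization preprocessing step to remove the effect of these habits. The guiding principle in choosing a normalization method is that normalized values can still be converted back to the original scale. Common methods are mean normalization and z-score normalization.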
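A short sketch of both methods and their inverses, using two invented users who rank three items identically but on different parts of the rating scale. The helper name zScore is hypothetical and only used here:

```r
# Hypothetical ratings: same preferences, different rating habits.
strict   <- c(1, 2, 3)   # tends to rate low
generous <- c(3, 4, 5)   # tends to rate high

# Mean normalization: subtract the user's own mean rating.
centeredStrict   <- strict - mean(strict)
centeredGenerous <- generous - mean(generous)

# Z-score normalization: also divide by the user's standard deviation.
# The mean and sd are kept so that the transform can be undone later.
zScore <- function(v) {
    list(u = mean(v), sigma = sd(v), z = (v - mean(v)) / sd(v))
}
s <- zScore(strict)
g <- zScore(generous)

# After normalization the two users look the same.
print(s$z)
print(g$z)

# Inverse transforms recover the original ratings.
restoredFromMean <- centeredStrict + mean(strict)
restoredFromZ    <- g$z * g$sigma + g$u
```

Because the mean (and, for z-scores, the standard deviation) is stored per user, a prediction made on the normalized scale can be mapped back to that user's own rating scale.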

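The code for mean normalization was given in the article Recommender Systems (Part 2); the z-score normalization implementation is in the code earlier in this article.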

Original article: https://www.cnblogs.com/activeshj/p/3973918.html