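  • Recommender Systems (Part 3)

    (Original article; please credit the source when reposting!)

    A recommender system is concerned with people and items, and aims to predict how much a person will like an item. Different people can share similar tastes (for example, they all like martial-arts novels), and different items can share similar features (for example, they are all martial-arts novels). To predict the rating that user A would give to an item T that A has not yet rated, there are two angles of attack. The first is to find users whose tastes are close to A's and estimate A's rating of T from their ratings of T. The second is to look at the items A has already rated, find those that are close to T, and estimate A's likely rating of T from A's ratings of those items. These two ideas lead to two methods for computing ratings: user-based collaborative filtering and item-based collaborative filtering.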

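    I. User-Based Collaborative Filtering (UBCF)

    1. Finding similar users

    Approach 1: compute the similarity between user A and every other user who has rated item T, then use all of these users and their similarities when computing the predicted rating;

    Approach 2: compute the similarity between user A and every other user who has rated item T, keep the K users with the highest similarity, and use only these K users when computing the predicted rating;

    Approach 3: compute the similarity between user A and every other user who has rated item T, set a threshold, keep the users whose similarity exceeds it, and use only these users when computing the predicted rating.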
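
    The three selection strategies above can be sketched in a few lines of R. This is only an illustration: the similarity values, the user names, k and the threshold are all made up, and the similarities are assumed to have been computed already.

```r
## Hypothetical similarities between user A and five users who rated item T
sims <- c(u1 = 0.91, u2 = 0.15, u3 = 0.78, u4 = 0.40, u5 = 0.66)

## Approach 1: keep every user who rated item T
allUsers <- names(sims)

## Approach 2: keep the k most similar users (k is an assumed value)
k <- 2
topK <- names(sort(sims, decreasing = TRUE))[seq_len(min(k, length(sims)))]

## Approach 3: keep users whose similarity exceeds an assumed threshold
threshold <- 0.5
aboveThreshold <- names(sims)[sims > threshold]

print(topK)
print(aboveThreshold)
```

    Approach 2 fixes the neighbourhood size, while approach 3 fixes the minimum quality of a neighbour and lets the size vary; with sparse data the threshold variant may leave very few users, or none.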

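    Several similarity measures are available: the Pearson correlation coefficient, cosine similarity, Euclidean distance, and so on. (In R, the cor function computes the Pearson correlation coefficient and the dist function computes Euclidean distance; the daisy function can also be used, but it requires the cluster package to be installed first.)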
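
    As a rough illustration of these measures, the sketch below compares two made-up rating vectors with base R only. The vectors a and b are assumptions for the example. Note that Euclidean distance is a dissimilarity (smaller means closer), so it has to be converted before it can be used as a similarity weight; the 1 / (1 + d) conversion shown here is one common choice, not something prescribed by the article.

```r
## Hypothetical ratings given by two users to the same four items
a <- c(5, 3, 4, 4)
b <- c(3, 1, 2, 3)

## Pearson correlation coefficient
pearson <- cor(a, b, method = "pearson")

## Cosine of the angle between the two vectors
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

## Euclidean distance; dist() works on the rows of a matrix
euclid <- as.numeric(dist(rbind(a, b), method = "euclidean"))

## One possible way to turn the distance into a similarity in (0, 1]
euclidSim <- 1 / (1 + euclid)

print(c(pearson = pearson, cosine = cosine, euclid = euclid, euclidSim = euclidSim))
```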

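    2. Computing the predicted rating of item T for user A

    Once the similar users have been found, the mean of their ratings of item T can serve as the prediction of user A's rating of T. Because the similar users are not all equally similar to A, the prediction can instead be a weighted average of their ratings, with the similarities as the weights.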
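
    A small worked example of the two options, using made-up numbers (three neighbours, their ratings of item T and their similarities to user A are all assumptions):

```r
## Hypothetical neighbours of user A: their ratings of item T and their similarity to A
neighbourRatings <- c(4, 5, 2)
neighbourSims    <- c(0.9, 0.6, 0.3)

## Option 1: plain mean of the neighbours' ratings
simpleMean <- mean(neighbourRatings)

## Option 2: similarity-weighted average, so closer neighbours count more
weightedMean <- sum(neighbourSims * neighbourRatings) / sum(neighbourSims)

print(c(simpleMean = simpleMean, weightedMean = weightedMean))
```

    Here the plain mean is 11/3 (about 3.67), while the weighted average is 7.2 / 1.8 = 4, because the least similar neighbour's low rating is given less weight.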

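    3. Implementation

    The code below uses cosine similarity, picks the K most similar users, and uses their ratings to compute the predicted rating. The training data is a matrix in which each row holds all the ratings received by one item and each column holds one user's ratings of all items. Ratings range from 1 to 5, and an entry that has not been rated is NA. The code is as follows: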

    ## normalize each column of a matrix with the z-score method ( (x-u)/sigma )
    ## zero entries are treated as "not rated" and are left untouched
    ## Args :
    ##     x - a matrix (0 marks a missing rating)
    ## Returns :
    ##     a list containing the mean of each column,
    ##                       the standard deviation of each column,
    ##                       the normalized x
    zScoreNormalization <- function(x)
    {
        meanOfcol <- numeric(ncol(x))
        sdOfcol <- numeric(ncol(x))
        for (i in seq_len(ncol(x))) {
            ratings <- x[, i]
            idx <- which(ratings != 0)   # rated entries only
            ## fewer than two ratings, or identical ratings (sd == 0) : cannot normalize
            if (length(idx) <= 1 || sd(ratings[idx]) == 0) {
                meanOfcol[i] <- NA
                sdOfcol[i] <- NA
                next
            }
            meanOfcol[i] <- mean(ratings[idx])
            sdOfcol[i] <- sd(ratings[idx])
            x[idx, i] <- (ratings[idx] - meanOfcol[i]) / sdOfcol[i]   # z-score
        }

        return ( list(meanOfcol = meanOfcol, sdOfcol = sdOfcol, xNormalized = x) )
    }

    ## invert the z-score normalization
    ## Args :
    ##     x  -  a vector of normalized values to convert back
    ##     u  -  mean of the original x
    ##     sigma  -  standard deviation of the original x
    ## Returns :
    ##     the vector x on its original scale
    zScoreNormalizationInverse <- function(x, u, sigma)
    {
        return (x*sigma + u)
    }

    ## calculate the cosine of the angle between two vectors
    ## Args :
    ##      x  -  a vector
    ##      y  -  a vector
    ## Returns :
    ##      cosine of the angle between the two vectors, rescaled to [0,1]
    cosineSimilarity <- function(x, y) {
        if (length(x) != length(y)) {
            stop("Function cosineSimilarity : the two vectors have different lengths!")
        }
        xx <- x
        yy <- y
        xx[which(is.na(xx))] <- 0
        yy[which(is.na(yy))] <- 0
        ## if x and y have no non-zero entry in common, return 0 without calculating
        if ( sum(abs(xx*yy)) == 0 ) {
            return (0)
        }

        sim <- sum(xx*yy) / ( sqrt(sum(xx^2)) * sqrt(sum(yy^2)) ) # cosine of vector angle
        return ( 0.5 + 0.5*sim )  # ensure the similarity is in range [0,1]
    }

    ## find the top n items as the item recommendation list with the User-Based Collaborative Filtering algorithm
    ## Args :
    ##      x  -  a matrix containing all rating results.
    ##            Each column holds the ratings by one user, each row holds the ratings of one movie.
    ##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
    ##      userI - index of the specified user
    ##      k  -  number of nearest neighbours of user I
    ##      n  -  number of top items that will be recommended to user I
    ## Returns :
    ##      a list containing the recommendation result
    recommendationUBCF <- function(x, userI, k, n)
    {
        ## record the un-rated items before normalizing, because a z-score can itself be 0
        unRatedItems <- which( is.na(x[, userI]) )
        x[which(is.na(x))] <- 0
        ## normalize the data
        normalizedResult <- zScoreNormalization(x)
        x <- normalizedResult$xNormalized

        ## find the k most similar users
        userSimilarity <- numeric(ncol(x))
        for (i in seq_len(ncol(x))) {
            if (i == userI) {
                userSimilarity[i] <- -1   # never pick the user as their own neighbour
                next
            }
            userSimilarity[i] <- cosineSimilarity(x[, i], x[, userI])
        }
        k <- min(k, ncol(x) - 1)
        KSimilarUserIdx <- head( order(userSimilarity, decreasing = TRUE), k )

        ## predict the rating of un-rated items; items the user already rated stay NA
        ## (a neighbour who has not rated item i contributes 0, i.e. an average rating)
        ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
        for (i in unRatedItems) {
            ratingOfUnRatedItems[i] <- sum( x[i, KSimilarUserIdx] * userSimilarity[KSimilarUserIdx] ) /
                                       sum( userSimilarity[KSimilarUserIdx] )
        }
        ratingOfUnRatedItems <- zScoreNormalizationInverse( ratingOfUnRatedItems,
                                                            normalizedResult$meanOfcol[userI],
                                                            normalizedResult$sdOfcol[userI] )

        ## find the Top-N items (na.last = NA drops the items that were already rated)
        topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
        recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
        return( recommendList )
    }

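    II. Item-Based Collaborative Filtering (IBCF)

    1. Algorithm

    1) Find all items that the specified user has not yet rated

    2) For each unrated item, find the k items most similar to it among the items the user has already rated, then use the ratings and similarity values of these k items to predict the rating of the unrated item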
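
    The two steps can be sketched on a tiny example. Everything here is an assumption made for illustration: the 4 x 4 rating matrix, the item and user names, k = 2, and the use of pairwise-complete Pearson correlation (cor with use = "pairwise.complete.obs") so that NA entries are skipped rather than replaced by 0.

```r
## Hypothetical ratings: rows are items, columns are users, NA = not rated
ratings <- matrix(c( 5, 4, 1, 2,
                     4, 5, 2, 1,
                     1, 2, 5, 4,
                    NA, 5, 1, 2),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("item", 1:4), paste0("user", 1:4)))
user <- "user1"
k <- 2

## Step 1: items the user has not rated yet, and items the user has rated
unrated <- rownames(ratings)[is.na(ratings[, user])]
rated   <- setdiff(rownames(ratings), unrated)

predictions <- sapply(unrated, function(target) {
    ## Step 2a: similarity between the target item and each item the user rated
    sims <- sapply(rated, function(other)
        cor(ratings[target, ], ratings[other, ], use = "pairwise.complete.obs"))
    ## Step 2b: keep the k most similar rated items
    nearest <- names(sort(sims, decreasing = TRUE))[seq_len(min(k, length(sims)))]
    ## Step 2c: similarity-weighted average of the user's own ratings of those items
    sum(sims[nearest] * ratings[nearest, user]) / sum(abs(sims[nearest]))
})

print(predictions)
```

    In this toy data item4 is rated much like item1 and item2 by the other users, so the prediction for user1 lands between user1's ratings of those two items (4 and 5). Taking abs() in the denominator is a defensive choice for the case where a selected neighbour has a negative correlation.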

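    2. Implementation

    The Pearson correlation coefficient is used to compute the similarity between items. The training data is the same as for UBCF. The implementation is as follows: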

    ## find the top n items as the item recommendation list with the Item-Based Collaborative Filtering algorithm
    ## Args :
    ##      x  -  a matrix containing all rating results.
    ##            Each column holds the ratings by one user, each row holds the ratings of one movie.
    ##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
    ##      userI - index of the specified user
    ##      k  -  number of nearest neighbour items to use for each un-rated item
    ##      n  -  number of top items that will be recommended to user-I
    ## Returns :
    ##      a list containing the recommendation result
    recommendationIBCF <- function(x, userI, k, n)
    {
        # Pearson correlation coefficient between two vectors of length m :
        # sum((x - u_x)*(y - u_y)) / ((m - 1) * sd_x * sd_y)

        ## record rated / un-rated items before normalizing, because a z-score can itself be 0
        unRatedIdx <- which( is.na(x[, userI]) )
        ratedIdx <- which( !is.na(x[, userI]) )
        x[which(is.na(x))] <- 0
        ## normalize the data per item (the items are the columns of t(x))
        normalizedResult <- zScoreNormalization( t(x) )
        x <- t( normalizedResult$xNormalized )

        ## predict the rating of user-I's un-rated items; rated items stay NA
        ratingOfUnRatedItems <- rep(NA_real_, nrow(x))
        r <- x[ratedIdx, , drop = FALSE]
        for (i in unRatedIdx) {
            ## skip items whose ratings could not be normalized
            if ( is.na(normalizedResult$meanOfcol[i]) || is.na(normalizedResult$sdOfcol[i]) ) {
                next
            }
            # calculate the Pearson correlation coefficient between item-i and each rated item
            # (an item with constant values yields NA, with a warning)
            itemSim <- as.vector( cor( x = x[i, ], y = t(r), use = "everything", method = "pearson" ) )

            # find the k nearest items to item-i, dropping undefined similarities
            KSimilarItemIdx <- head( order(itemSim, decreasing = TRUE, na.last = NA), k )

            # predict the rating of un-rated item-i
            ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx, userI] * itemSim[KSimilarItemIdx] ) /
                                       sum( itemSim[KSimilarItemIdx] )
            ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                                   normalizedResult$meanOfcol[i],
                                                                   normalizedResult$sdOfcol[i] )
        }

        ## find the Top-N items (na.last = NA drops rated and unpredictable items)
        topnIdx <- head( order(ratingOfUnRatedItems, decreasing = TRUE, na.last = NA), n )
        recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
        return( recommendList )
    }

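    III. Rating Normalization

    Different users have different rating habits: some tend to give uniformly low scores, while others tend to give uniformly high ones. The data therefore needs a normalization preprocessing step to remove the effect of these habits. The guiding principle in choosing a normalization method is that the normalized values can still be converted back to the original scale. Common choices are mean normalization and z-score normalization.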
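
    A minimal sketch of both methods on two made-up users, one who rates low and one who rates high. The helper names meanCentre and zScore and the rating vectors are assumptions for this example only.

```r
## Hypothetical users with the same preferences but different rating habits
harsh    <- c(1, 2, 3)
generous <- c(3, 4, 5)

## Mean normalization: subtract the user's own mean
meanCentre <- function(r) r - mean(r, na.rm = TRUE)

## Z-score normalization: subtract the mean, then divide by the standard deviation
zScore <- function(r) (r - mean(r, na.rm = TRUE)) / sd(r, na.rm = TRUE)

## After normalization the two users look identical
centredSame <- isTRUE(all.equal(meanCentre(harsh), meanCentre(generous)))
zScoreSame  <- isTRUE(all.equal(zScore(harsh), zScore(generous)))

## Both methods are invertible as long as the mean (and sd) are kept
restored <- zScore(generous) * sd(generous) + mean(generous)

print(c(centredSame = centredSame, zScoreSame = zScoreSame))
print(restored)
```

    Keeping each user's mean and standard deviation alongside the normalized data is what makes it possible to map a predicted value back to that user's original 1 to 5 scale.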

    The code for mean normalization was given in Recommender Systems (Part 2); the z-score normalization code is in the listing earlier in this article.

  • Original article: https://www.cnblogs.com/activeshj/p/3973918.html