  • Recommender Systems (4)

    (Original article; please credit the source when reposting!)

    A user-item rating dataset usually forms a huge matrix, and there are typically more users than items. SVD (singular value decomposition) can be used to factor this matrix, reducing the amount of data involved in the computation and lowering its complexity. Suppose the rating data R is an m x n matrix with m users and n items; SVD gives R = U Σ V^T. R can then be projected into a lower-dimensional space of dimension k (k < min(m, n)): R_k = R^T U_k Σ_k, where R^T is the n x m transpose of R, U_k is an m x k matrix, and Σ_k is a k x k diagonal matrix, so the projected matrix R_k is n x k, with each row representing one item.
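    As a quick illustration, the projection above can be computed in R as follows (a toy sketch with made-up data; the matrix size and the value of k are arbitrary assumptions):

        ## toy example: m = 6 users, n = 4 items
        set.seed(1)
        R <- matrix(runif(6 * 4), nrow = 6, ncol = 4)
        s <- svd(R)                                    # R = U %*% diag(d) %*% t(V)
        k <- 2                                         # assumed k < min(m, n)
        Rk <- t(R) %*% s$u[, 1:k] %*% diag(s$d[1:k])   # n x k, one row per item
        dim(Rk)                                        # 4 2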

    Computation steps:

    1. Normalize the raw data (z-score).

    2. Apply SVD to the normalized data matrix and project it into the new lower-dimensional space.

    3. Use IBCF (item-based collaborative filtering) to compute the recommendations.

    The implementation is as follows:

     1 ## Decompose the rating matrix with SVD and project the rating matrix 
     2 ## to lower dimension space. Find the top n items as the item recommendation 
     3 ## list with the Item-Based Collaborative Filtering algorithm over the 
     4 ## lower dimension data.
     5 ## Args :
     6 ##      x  -  a matrix, contains all rating results. 
     7 ##            Each column is the rating by one user, each row is the rating of one movie.
     8 ##            If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
     9 ##      userI - index of the specified user
    10 ##      k  -  k nearest neighbours of userI
    11 ##      n  -  top n items that will be recommended to userI
    12 ##      pc_threshold  -  principal component threshold
    13 ## Returns :
    14 ##      a list, contains recommendation result
    15 svdRecommendationIBCF <- function(x, userI, k, n, pc_threshold=0.9)
    16 {
    17     # Note: Pearson correlation coefficient between two vectors:
    18     #       sum((x - u_x)*(y - u_y)) / ((n-1)*sd_x*sd_y)
    19     
    20     x[which(is.na(x))] <- 0
    21     ## normalize the data
    22     normlizedResult <- zScoreNormalization( x )
    23     x <- t( normlizedResult$xNormalized )
    24     
    25     ## svd decomposition
    26     svd_x <- svd(x)
    27     # find how many of the largest singular values are needed to reach pc_threshold
    28     numTopSV <- 0
    29     for(sv in svd_x$d) {
    30         numTopSV <- numTopSV + 1
    31         if ( (sum(svd_x$d[1:numTopSV]) / sum(svd_x$d)) >= pc_threshold ) {
    32             break
    33         }
    34     }
    35     # project the rating data to lower dimension
    36     # x_lowDim is an n-by-numTopSV matrix
    37     # n is the number of items
    38     # numTopSV is less than or equal to min(m , n)
    39     x_lowDim <- t(x) %*% svd_x$u[,1:numTopSV] %*% diag(svd_x$d[1:numTopSV])
    40     
    41     
    42     ## predicting the rating of user-I's un-rated items
    43     unRatedIdx <- which(x[,userI] == 0)
    44     ratedIdx <- which(x[,userI] != 0)
    45     ratingOfUnRatedItems <- numeric( dim(x)[1] )
    46     for (i in unRatedIdx) {        
    47         # calculate the Pearson correlation coefficient to each item
    48         itemSim <- cor( x = x_lowDim[i,], y = t(x_lowDim[ratedIdx,]), use = "everything", method = "pearson" )
    49         itemSim <- 0.5 + 0.5*itemSim # keep the similarity in [0,1]
    50         # find the k nearest items to item-i
    51         KSimilarItemIdx <- apply( matrix(itemSim,nrow=1), 
    52                                   MARGIN=1,  # apply the function to each row
    53                                   FUN=function(x) head(  order(x, decreasing=TRUE, na.last=TRUE), k)
    54                                 )
    55         KSimilarItemIdx <- as.vector(KSimilarItemIdx)                              
    56 
    57         r <- x[ratedIdx,]
    58         ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx,userI] * itemSim[KSimilarItemIdx] ) /
    59                                    sum( itemSim[KSimilarItemIdx] )
    60         if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
    61             next
    62         }
    63         ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i], 
    64                                                             normlizedResult$meanOfcol[i], 
    65                                                             normlizedResult$sdOfcol[i] )
    66     }
    67     
    68     ## find the Top-N items
    69     topnIdx <- apply( matrix(ratingOfUnRatedItems,nrow=1), MARGIN=1, 
    70                      FUN=function(x) head(  order(x, decreasing=TRUE, na.last=TRUE), n )  )
    71     topnIdx <- as.vector(topnIdx)
    72     recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    73     return( recommendList )
    74 }

    The zScoreNormalization and zScoreNormalizationInverse functions used in the code above were given in Recommender Systems (3).
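    Since that article is not reproduced here, the following is only a minimal sketch of what those two helpers might look like, inferred from how they are called above (per-column z-score, returning the normalized matrix together with each column's mean and standard deviation); the actual implementation in Recommender Systems (3) may differ:

        ## sketch only: z-score each column and keep the statistics for the inverse transform
        zScoreNormalization <- function(x)
        {
            meanOfcol <- apply(x, MARGIN = 2, FUN = mean, na.rm = TRUE)
            sdOfcol   <- apply(x, MARGIN = 2, FUN = sd,   na.rm = TRUE)
            xNormalized <- sweep(x, MARGIN = 2, STATS = meanOfcol, FUN = "-")
            xNormalized <- sweep(xNormalized, MARGIN = 2, STATS = sdOfcol, FUN = "/")
            return( list(xNormalized = xNormalized, meanOfcol = meanOfcol, sdOfcol = sdOfcol) )
        }

        ## sketch only: undo the z-score transform for a single predicted value
        zScoreNormalizationInverse <- function(value, colMean, colSd)
        {
            return( value * colSd + colMean )
        }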

    The main difference from the IBCF code given in Recommender Systems (3) is that lines 24-38 use SVD to decompose the rating matrix and project the original ratings into a lower-dimensional space, and lines 47-48 compute the item-to-item similarities on that lower-dimensional matrix, which reduces the computational complexity to some extent.
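    A hypothetical call could look like the following (toy data only; the matrix size and parameter values are assumptions, and the normalization helpers above must already be loaded):

        ## toy rating matrix: rows are movies, columns are users, NA = not rated
        set.seed(2)
        ratings <- matrix(sample(c(NA, 1:5), 50 * 10, replace = TRUE), nrow = 50, ncol = 10)
        ## recommend 5 items to user 3, using the 10 most similar rated items
        result <- svdRecommendationIBCF(ratings, userI = 3, k = 10, n = 5, pc_threshold = 0.9)
        result$topnIndex      # indices of the recommended items
        result$ratingResult   # predicted ratings for those items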

  • Original post: https://www.cnblogs.com/activeshj/p/4012637.html