(Original article; please credit the source when reposting.)
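User-item rating data usually forms a huge matrix, and there are usually more users than items. SVD (singular value decomposition) can factor this matrix, which reduces the amount of data used in the computation and lowers its complexity. Suppose the data $R$ is an $m \times n$ matrix for $m$ users and $n$ items. Its singular value decomposition is $R = U \Sigma V^T$. $R$ is then projected into a lower-dimensional space of dimension $k$ ($k < \min(m, n)$): $R_k = R^T U_k \Sigma_k$, where $R^T$ is the $n \times m$ transpose of $R$, $U_k$ is an $m \times k$ matrix (the first $k$ columns of $U$), and $\Sigma_k$ is a $k \times k$ diagonal matrix (the $k$ largest singular values). The projected matrix $R_k$ is therefore $n \times k$, and each row represents one item.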
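As a quick check of the dimensions above, the projection can be sketched on a toy matrix. The sizes, the random ratings, and the variable names are illustrative assumptions, not data from this article. Since the columns of U are orthonormal, the projection should also equal V_k Σ_k², which the last line verifies.

```r
set.seed(1)
m <- 6   # number of users (illustrative)
n <- 4   # number of items (illustrative)
k <- 2   # target dimension, k < min(m, n)

# Toy rating matrix R: one row per user, one column per item
R <- matrix(sample(1:5, m * n, replace = TRUE), nrow = m, ncol = n)

s  <- svd(R)                        # R = U %*% diag(d) %*% t(V)
Uk <- s$u[, 1:k, drop = FALSE]      # m x k, first k left singular vectors
Sk <- diag(s$d[1:k], nrow = k)      # k x k, k largest singular values

Rk <- t(R) %*% Uk %*% Sk            # n x k, one row per item
dim(Rk)                             # 4 2

# t(U) %*% U is the identity, so Rk equals V_k %*% Sk %*% Sk
Vk <- s$v[, 1:k, drop = FALSE]
all.equal(Rk, Vk %*% Sk %*% Sk)
```

Passing `nrow = k` to `diag` keeps the call correct when k is 1, where `diag` applied to a single number would otherwise build an identity matrix.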
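Computation steps:
1. Normalize the raw data (z-score).
2. Apply SVD to the normalized data matrix and project it into the new low-dimensional space.
3. Use IBCF (item-based collaborative filtering) to compute the recommendation result.
The implementation is as follows: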
## Decompose the rating matrix with SVD and project the rating matrix
## to a lower-dimensional space. Find the top n items as the item recommendation
## list with the Item-Based Collaborative Filtering algorithm over the
## lower-dimensional data.
## Args :
##    x  -  a matrix containing all rating results.
##          Each column holds the ratings by one user, each row the ratings of one movie.
##          If a movie hasn't been rated by a user, the corresponding position in the matrix is NA.
##    userI  -  index of the specified user
##    k  -  number of nearest neighbour items used for each prediction
##    n  -  top n items that will be recommended to user-I
##    pc_threshold - principal component threshold (share of the singular value sum to keep)
## Returns :
##    a list containing the recommendation result
svdRecommendationIBCF <- function(x, userI, k, n, pc_threshold=0.9)
{
    # note: the Pearson correlation coefficient between two vectors is
    #       sum((x - u_x)*(y - u_y)) / ((length(x) - 1)*sd_x*sd_y)

    x[which(is.na(x))] <- 0
    ## normalize the data
    normlizedResult <- zScoreNormalization( x )
    x <- t( normlizedResult$xNormalized )

    ## svd decomposition
    svd_x <- svd(x)
    # find the number of top singular values to keep
    numTopSV <- 0
    for(sv in svd_x$d) {
        numTopSV <- numTopSV + 1
        if ( (sum(svd_x$d[1:numTopSV]) / sum(svd_x$d)) >= pc_threshold ) {
            break
        }
    }
    # project the rating data to lower dimension: t(R) %*% U_k %*% S_k with R = t(x).
    # x is assumed to be items-by-users here (the code below reads x[, userI] as
    # user-I's ratings), so the U of R is svd_x$v. x_lowDim has one row per item
    # and numTopSV columns; numTopSV is less than or equal to min(m, n).
    x_lowDim <- x %*% svd_x$v[, 1:numTopSV, drop=FALSE] %*% diag(svd_x$d[1:numTopSV], nrow=numTopSV)


    ## predicting the rating of user-I's un-rated items
    unRatedIdx <- which(x[,userI] == 0)
    ratedIdx <- which(x[,userI] != 0)
    ratingOfUnRatedItems <- numeric( dim(x)[1] )
    for (i in unRatedIdx) {
        # calculate the Pearson correlation coefficient to each rated item
        itemSim <- cor( x = x_lowDim[i,], y = t(x_lowDim[ratedIdx, , drop=FALSE]), use = "everything", method = "pearson" )
        itemSim <- 0.5 + 0.5*itemSim  # keep the similarity in [0,1]
        # find the k nearest items to item-i
        KSimilarItemIdx <- apply( matrix(itemSim,nrow=1),
                                  MARGIN=1,  # apply the function to each row
                                  FUN=function(x) head( order(x, decreasing=TRUE, na.last=TRUE), k)
                                )
        KSimilarItemIdx <- as.vector(KSimilarItemIdx)

        r <- x[ratedIdx, , drop=FALSE]
        ratingOfUnRatedItems[i] <- sum( r[KSimilarItemIdx,userI] * itemSim[KSimilarItemIdx] ) /
                                   sum( itemSim[KSimilarItemIdx] )
        if ( is.na(normlizedResult$meanOfcol[i]) || is.na(normlizedResult$sdOfcol[i]) ) {
            next
        }
        ratingOfUnRatedItems[i] <- zScoreNormalizationInverse( ratingOfUnRatedItems[i],
                                                               normlizedResult$meanOfcol[i],
                                                               normlizedResult$sdOfcol[i] )
    }

    ## find the Top-N items
    topnIdx <- apply( matrix(ratingOfUnRatedItems,nrow=1), MARGIN=1,
                      FUN=function(x) head( order(x, decreasing=TRUE, na.last=TRUE), n ) )
    topnIdx <- as.vector(topnIdx)
    recommendList <- list(ratingResult = ratingOfUnRatedItems[topnIdx], topnIndex = topnIdx)
    return( recommendList )
}
The functions zScoreNormalization and zScoreNormalizationInverse used in the code above are given in the article Recommender System (3) (推荐系统(三)).
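For readers without that article at hand, the idea behind z-score normalization and its inverse can be sketched on a single rating vector. The helpers `zScore` and `zScoreInverse` below are hypothetical names made up for this illustration. They are not the functions from Recommender System (3), which work on a whole matrix and return the fields xNormalized, meanOfcol and sdOfcol.

```r
# Hypothetical helpers for illustration only; not the functions from part (3)
zScore <- function(v) {
    mu <- mean(v)
    s  <- sd(v)
    # Centre on the mean and scale by the standard deviation
    list(z = (v - mu) / s, mean = mu, sd = s)
}

# Undo the normalization, given the mean and standard deviation used
zScoreInverse <- function(z, mu, s) {
    z * s + mu
}

ratings  <- c(5, 3, 4, 1, 2)       # made-up ratings
res      <- zScore(ratings)
restored <- zScoreInverse(res$z, res$mean, res$sd)
all.equal(restored, ratings)
```

A constant vector has a standard deviation of 0, so the division yields NaN. Any code that consumes such statistics needs a guard for missing values.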
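The main difference from the IBCF code given in Recommender System (3) is the SVD step, which runs from the "## svd decomposition" comment through the computation of x_lowDim: it decomposes the rating matrix and projects the original rating matrix into a low-dimensional space. The item-to-item similarity in the prediction loop (the cor call) is then computed on the low-dimensional matrix, so each correlation runs over numTopSV components rather than over the full rating vectors, which can reduce the computational cost to some extent.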
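The number of singular values kept by the SVD step depends on pc_threshold. A minimal sketch of that selection rule, written in vectorized form, is shown below. The function name `pickNumTopSV` and the singular values are made up for illustration, and the sketch assumes pc_threshold is at most 1.

```r
# Smallest number of leading singular values whose share of the total sum
# reaches pc_threshold (the same rule as the loop in svdRecommendationIBCF)
pickNumTopSV <- function(d, pc_threshold = 0.9) {
    which(cumsum(d) / sum(d) >= pc_threshold)[1]
}

d <- c(10, 5, 3, 1, 1)      # made-up singular values, sorted in decreasing order
pickNumTopSV(d, 0.8)        # 3: (10 + 5) / 20 = 0.75 < 0.8, (10 + 5 + 3) / 20 = 0.9 >= 0.8
```

A lower threshold keeps fewer components, so the similarity computation gets cheaper but discards more of the rating structure.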