  • A Personalized Recommender System Based on Neighborhood Models (Item-Based)


    This article covers Section 2.2, "Neighborhood models", of Koren's 2008 paper [1] (the remaining sections will be added over time).

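    Koren's paper uses the Netflix data set, which is so large that a run on an ordinary PC takes a very long time. Since the main purpose of this article is to introduce and summarize the method, the MovieLens data set is used instead.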

    Notation (for the other variables involved, see the related articles mentioned above):


    The Pearson correlation coefficient is used to measure the correlation between items i and j.
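The formula image from the original post is not available here. As a reconstruction that matches what the code in this article computes (not necessarily the exact notation of the paper), write $U_{ij}$ for the set of users who rated both items and $\bar{r}_i$ for the mean training rating of item $i$; both symbols are assumed here:

```latex
\rho_{ij} =
\frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}
     {\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2 \; \sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}
```

In the code, the numerator corresponds to `part1` and the two sums under the square root to `part2` and `part3`.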


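    The paper introduces the shrunk correlation coefficient: the Pearson correlation after shrinkage is used as the similarity between i and j. The experiments at the end of this article show that shrinkage gives better results.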
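As a minimal sketch of the shrinkage step: the similarity is the correlation scaled by n_ij / (n_ij + lambda2), where n_ij is the number of users who rated both items. The constant 100 is the value used in the code of this article; the function name `shrunk_similarity` and the parameter name `lambda2` are hypothetical and chosen only for illustration.

```python
def shrunk_similarity(rho_ij, n_ij, lambda2=100.0):
    """Shrink a correlation toward zero when few users rated both items.

    lambda2=100.0 mirrors the constant in this article's code; the
    function and parameter names are illustrative assumptions.
    """
    return n_ij / (n_ij + lambda2) * rho_ij


# The same raw correlation counts for less when the support n_ij is small.
for n_ij in (5, 100, 1000):
    print(n_ij, shrunk_similarity(0.8, n_ij))
```

The point of the design is that a correlation computed from a handful of common raters is unreliable, so it is pulled toward zero, while a well-supported correlation is left almost unchanged. Note that the article's code additionally takes the absolute value of the result.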

    Predicted rating:
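The formula image is missing here as well. A reconstruction consistent with the code in this article, where $R(u)$ (assumed notation) is the set of items user $u$ rated in the training set, $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $s_{ij}$ is the shrunk similarity:

```latex
b_{ui} = \mu + b_u + b_i, \qquad
\hat{r}_{ui} = b_{ui} +
\frac{\sum_{j \in R(u)} s_{ij}\,(r_{uj} - b_{uj})}
     {\sum_{j \in R(u)} s_{ij}}
```

The paper's version restricts both sums to the k items most similar to i among those the user rated; the code here sums over every rated item that has a similarity with i. When no such item exists, the baseline $b_{ui}$ alone is the prediction.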

    Evaluation metrics: RMSE and MAE
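Both metrics are computed over the test ratings: RMSE is the square root of the mean squared error, and MAE is the mean absolute error; lower is better for both. A small self-contained sketch (the helper name `rmse_mae` and the sample numbers are made up for illustration):

```python
from math import sqrt


def rmse_mae(predicted, actual):
    """Return (RMSE, MAE) for two equally long sequences of ratings."""
    errors = [p - a for p, a in zip(predicted, actual)]
    n = float(len(errors))
    rmse = sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse, mae


# Three made-up predictions against three made-up true ratings.
rmse, mae = rmse_mae([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
print(rmse, mae)
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does.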



    The system uses 5-fold cross-validation (the MovieLens data set already ships with the five train/test splits).

     

    Note: training the optimal user and item biases with SGD will be added in a later update.
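The SGD training mentioned in the note is not part of the original post. As a rough sketch of what it could look like, the function below minimizes the regularized squared error of the baseline mean + bu + bi. The function name `train_biases_sgd` and the values of `epochs`, `lr` and `reg` are assumptions made here for illustration, not settings taken from the article or the paper. The input uses the same `{userId: {itemId: rating}}` layout as the `train` dictionary in the code below.

```python
import random


def train_biases_sgd(train, mean, epochs=20, lr=0.005, reg=0.02, seed=0):
    """Fit user/item biases by SGD on
    sum (r - mean - bu - bi)^2 + reg * (bu^2 + bi^2).

    epochs, lr and reg are illustrative assumptions, not tuned values.
    """
    samples = [(u, i, r) for u, items in train.items() for i, r in items.items()]
    bu = dict((u, 0.0) for u, _, _ in samples)
    bi = dict((i, 0.0) for _, i, _ in samples)
    rng = random.Random(seed)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the ratings in a new random order each epoch
        for u, i, r in samples:
            err = r - (mean + bu[u] + bi[i])
            # Gradient step on both biases, with L2 shrinkage toward zero.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
    return bu, bi


# Toy data: item '10' is rated higher than item '11', user '1' higher than user '2'.
toy = {'1': {'10': 5.0, '11': 3.0}, '2': {'10': 4.0, '11': 2.0}}
toy_bu, toy_bi = train_biases_sgd(toy, 3.5, epochs=200)
print(toy_bu, toy_bi)
```

The returned dictionaries could be passed on in place of the `bu` and `bi` produced by `initialBias`, provided every user and item that appears in the test set also has an entry.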

     

    Full implementation:

    '''
    Created on Dec 16, 2012 
     
    @Author: Dennis Wu 
    @E-mail: hansel.zh@gmail.com 
    @Homepage: http://blog.csdn.net/wuzh670 
    @Weibo: http://weibo.com/hansel 
     
    Data set download from : http://www.grouplens.org/system/files/ml-100k.zip 
    '''  
    from math import sqrt, fabs
      
    def load_ratings(filename):
        # Parse one tab-separated MovieLens file: userId, itemId, rating, timestamp.
        ratings = {}
        with open(filename) as f:
            for line in f:
                userId, itemId, rating, timestamp = line.strip().split('\t')
                ratings.setdefault(userId, {})[itemId] = float(rating)
        return ratings

    def load_data(filename_train, filename_test):
        # Both results have the layout {userId: {itemId: rating}}.
        return load_ratings(filename_train), load_ratings(filename_test)
      
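    def initialBias(train, userNum, movieNum, mean):
        # Shrunken-average biases:
        #   bi = sum(r - mean) / (n_i + 25)
        #   bu = sum(r - mean - bi) / (n_u + 10)
        # User and item ids are assumed to be the strings '1'..'userNum' / '1'..'movieNum'.
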
        bu = {}  
        bi = {}  
        biNum = {}  
        buNum = {}  
          
        u = 1  
        while u < (userNum+1):  
            su = str(u)  
            for i in train[su].keys():  
                bi.setdefault(i,0)  
                biNum.setdefault(i,0)  
                bi[i] += (train[su][i] - mean)  
                biNum[i] += 1  
            u += 1  
              
        i = 1  
        while i < (movieNum+1):  
            si = str(i)  
            biNum.setdefault(si,0)  
            if biNum[si] >= 1:  
                bi[si] = bi[si]*1.0/(biNum[si]+25)  
            else:  
                bi[si] = 0.0  
            i += 1  
      
        u = 1  
        while u < (userNum+1):  
            su = str(u)  
            for i in train[su].keys():  
                bu.setdefault(su,0)  
                buNum.setdefault(su,0)  
                bu[su] += (train[su][i] - mean - bi[i])  
                buNum[su] += 1  
            u += 1  
              
        u = 1  
        while u < (userNum+1):  
            su = str(u)  
            buNum.setdefault(su,0)  
            if buNum[su] >= 1:  
                bu[su] = bu[su]*1.0/(buNum[su]+10)  
            else:  
                bu[su] = 0.0  
            u += 1  
        return bu, bi  
      
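    def initial(train, userNum, movieNum):
        # Computes the global mean, the per-item averages, the shrunk item-item
        # similarities and the biases. Sij[i][j] lists the users who rated both i and j.
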
        average = {}  
        Sij = {}  
        mean = 0  
        num = 0  
        N = {}  
        for u in train.keys():  
            for i in train[u].keys():  
                mean += train[u][i]  
                num += 1  
                average.setdefault(i,0)  
                average[i] += train[u][i]  
                N.setdefault(i,0)  
                N[i] += 1  
                Sij.setdefault(i,{})  
                for j in train[u].keys():  
                    if i == j:  
                        continue  
                    Sij[i].setdefault(j,[])  
                    Sij[i][j].append(u)  
      
        mean = mean / num  
        for i in average.keys():  
            average[i] = average[i] / N[i]  
              
        pearson = {}  
        itemSim = {}  
        for i in Sij.keys():  
            pearson.setdefault(i,{})  
            itemSim.setdefault(i,{})  
            for j in Sij[i].keys():  
                # Zero covariance means no linear correlation, so default to 0.
                pearson[i][j] = 0.0
                part1 = 0  
                part2 = 0  
                part3 = 0  
                for u in Sij[i][j]:  
                    part1 += (train[u][i] - average[i]) * (train[u][j] - average[j])  
                    part2 += pow(train[u][i] - average[i], 2)  
                    part3 += pow(train[u][j] - average[j], 2)  
                if part1 != 0:  
                    pearson[i][j] = part1 / sqrt(part2 * part3)  
                # Shrink by n_ij / (n_ij + 100); fabs() keeps the similarity non-negative.
                n_ij = len(Sij[i][j])
                itemSim[i][j] = fabs(pearson[i][j] * n_ij / (n_ij + 100))
      
        # initialize the user and item biases, respectively
        bu, bi = initialBias(train, userNum, movieNum, mean)  
      
        return itemSim, mean, average, bu, bi      
        
    def neighborhoodModels(train, test, itemSim, mean, average, bu, bi):  
          
        pui = {}  
        rmse = 0.0  
        mae = 0.0  
        num = 0  
        for u in test.keys():  
            pui.setdefault(u,{})  
            for i in test[u].keys():  
                pui[u][i] = mean + bu[u] + bi[i]  
                stat = 0  
                stat2 = 0  
                for j in train[u].keys():  
                    # dict.has_key() no longer exists in Python 3; "in" works in both versions
                    if i in itemSim and j in itemSim[i]:
                        stat += (train[u][j] - mean - bu[u] - bi[j]) * itemSim[i][j]  
                        stat2 += itemSim[i][j]  
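                # Guard against division by zero on the denominator, so that
                # negative adjustments are applied as well as positive ones.
                if stat2 > 0:
                    pui[u][i] += stat * 1.0 / stat2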
                rmse += pow((pui[u][i] - test[u][i]), 2)  
                mae += fabs(pui[u][i] - test[u][i])  
                num += 1  
        rmse = sqrt(rmse*1.0 / num)  
        mae = mae * 1.0 / num  
          
        return rmse, mae  
      
    if __name__ == "__main__":  
      
        i = 1  
        sumRmse = 0.0  
        sumMae = 0.0  
        while i <= 5:  
      
            # load data  
            filename_train = 'data/u' + str(i) + '.base'  
            filename_test = 'data/u' + str(i) + '.test'  
            train, test = load_data(filename_train, filename_test)  
      
            # initial variables  
            itemSim, mean, average, bu, bi = initial(train, 943, 1682)  
      
            # neighborhoodModels  
            rmse, mae = neighborhoodModels(train, test, itemSim, mean, average, bu, bi)  
            print('cross-validation %d:  rmse: %s     mae: %s' % (i, rmse, mae))
              
            sumRmse += rmse  
            sumMae += mae  
            i += 1  
              
        print('neighborhood models final results:  Rmse: %s      Mae: %s' % (sumRmse/5, sumMae/5))
     

    Experimental results:

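    Note: the first result was obtained without shrinking the Pearson correlation coefficient; the second was obtained with the shrunk correlation coefficient.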



  • Original article: https://www.cnblogs.com/gt123/p/3451793.html