zoukankan      html  css  js  c++  java
  • 基于用户的最近邻协同过滤算法(MovieLens数据集)

     

    基于用户的最近邻算法(User-Based Neighbor Algorithms),是一种非概率性的协同过滤算法,也是推荐系统中最最古老,最著名的算法。

    我们称那些兴趣相似的用户为邻居,如果用户n相似于用户u,我们就说n是u的一个邻居。起初算法,对于未知目标的预测是根据该用户的相似用户的评分作出预测的。

    本文中运用的是MovieLens数据集,关于这个数据集的介绍可以参看http://www.grouplens.org/node/73

    算法主要包括两个步骤:

    (1). 找到与用户兴趣相似的用户(邻居)集合。

    (2). 根据这个邻居集合,计算出该用户对未曾评分的物品的预测评分。并列出获得最高的预测评分N项物品,推荐给该用户。

     

    本文,用皮尔逊相关系数(pearon correlation coefficient)计算用户之间的相似性。如formula1

    计算用户u对物品i的预测值,使用的formula2 

     

    formula1:

     

     

    formula-2:

     

     

    具体实现代码如下:

     

    ''''' 
    Created on Nov 17, 2012 
     
     
    @Author: Dennis Wu 
    @E-mail: hansel.zh@gmail.com 
    @Homepage: http://blog.csdn.net/wuzh670 
     
     
    Data set download from : http://www.grouplens.org/system/files/ml-100k.zip 
     
     
    MovieLens data sets were collected by the GroupLens Research Project 
    at the University of Minnesota.The data was collected through the MovieLens web site 
    (movielens.umn.edu) during the seven-month period from September 19th, 
    1997 through April 22nd, 1998. 
     
     
    This data set consists of: 
        * 100,000 ratings (1-5) from 943 users on 1682 movies. 
        * Each user has rated at least 20 movies. 
        * Simple demographic info for the users  
     
     
    u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items. 
                  Each user has rated at least 20 movies.  Users and items are 
                  numbered consecutively from 1.  The data is randomly 
                  ordered. This is a tab separated list of 
                  user id | item id | rating | timestamp. 
                  The time stamps are unix seconds since 1/1/1970 UTC 
    u.item     -- Information about the items (movies); this is a tab separated 
                  list of 
                  movie id | movie title | release date | video release date | 
                  IMDb URL | unknown | Action | Adventure | Animation | 
                  Children's | Comedy | Crime | Documentary | Drama | Fantasy | 
                  Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | 
                  Thriller | War | Western | 
                  The last 19 fields are the genres, a 1 indicates the movie 
                  is of that genre, a 0 indicates it is not; movies can be in 
                  several genres at once. 
                  The movie ids are the ones used in the u.data data set. 
    '''  
      
      
    from operator import itemgetter, attrgetter  
    from math import sqrt  
      
      
    def load_data():  
          
        filename_user_movie = 'data/u.data'  
        filename_movieInfo = 'data/u.item'  
      
      
        user_movie = {}  
        for line in open(filename_user_movie):  
            (userId, itemId, rating, timestamp) = line.strip().split('	')  
            user_movie.setdefault(userId,{})  
            user_movie[userId][itemId] = float(rating)  
              
        movies = {}  
        for line in open(filename_movieInfo):  
            (movieId, movieTitle) = line.split('|')[0:2]  
            movies[movieId] = movieTitle  
          
        return user_movie, movies  
      
      
    def average_rating(user):  
        average = 0  
        for u in user_movie[user].keys():  
            average += user_movie[user][u]  
        average = average * 1.0 / len(user_movie[user].keys())  
        return average  
      
      
    def calUserSim(user_movie):  
      
      
        # build inverse table for movie_user  
        movie_user = {}  
        for ukey in user_movie.keys():  
            for mkey in user_movie[ukey].keys():  
                if mkey not in movie_user:  
                    movie_user[mkey] = []  
                movie_user[mkey].append(ukey)  
      
      
        # calculated co-rated movies between users  
        C = {}  
        for movie, users in movie_user.items():  
            for u in users:  
                C.setdefault(u,{})  
                for n in users:  
                    if u == n:  
                        continue  
                    C[u].setdefault(n,[])  
                    C[u][n].append(movie)  
                      
      
      
        # calculate user similarity (perason correlation)  
        userSim = {}  
        for u in C.keys():  
              
            for n in C[u].keys():  
                  
                userSim.setdefault(u,{})  
                userSim[u].setdefault(n,0)  
              
                average_u_rate = average_rating(u)  
                average_n_rate = average_rating(n)  
                  
                part1 = 0  
                part2 = 0  
                part3 = 0  
                for m in C[u][n]:  
      
      
                    part1 += (user_movie[u][m]-average_u_rate)*(user_movie[n][m]-average_n_rate)*1.0  
                    part2 += pow(user_movie[u][m]-average_u_rate, 2)*1.0  
                    part3 += pow(user_movie[n][m]-average_n_rate, 2)*1.0  
                      
                part2 = sqrt(part2)  
                part3 = sqrt(part3)  
                if part2 == 0:  
                    part2 = 0.001  
                if part3 == 0:  
                    part3 = 0.001   
                userSim[u][n] = part1 / (part2 * part3)     
        return userSim  
      
      
    def getRecommendations(user, user_movie, movies, userSim, N):  
        pred = {}  
        interacted_items = user_movie[user].keys()  
        average_u_rate = average_rating(user)  
        sumUserSim = 0  
        for n, nuw in sorted(userSim[user].items(),key=itemgetter(1),reverse=True)[0:N]:  
            average_n_rate = average_rating(n)  
            for i, nrating in user_movie[n].items():  
                # filter movies user interacted before  
                if i in interacted_items:  
                    continue  
                pred.setdefault(i,0)  
                pred[i] += nuw * (nrating - average_n_rate)  
            sumUserSim += nuw  
      
      
        for i, rating in pred.items():  
            pred[i] = average_u_rate + (pred[i]*1.0) / sumUserSim  
              
        # top-10 pred  
        pred = sorted(pred.items(), key=itemgetter(1), reverse=True)[0:10]  
        return pred    
      
      
    if __name__ == "__main__":  
      
      
      
      
        # load data  
        user_movie, movies = load_data()  
      
      
        # Calculate user similarity  
        userSim = calUserSim(user_movie)  
      
      
        # Recommend  
        pred = getRecommendations('182', user_movie, movies, userSim, 20)  
      
      
        # display recommend result (top-10 results)  
        for i, rating in pred:  
            print 'film: %s,  rating: %s' % (movies[i], rating)  
          
    References

     

    1. J.Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen : Collaborative Filtering Recommender System 

    2. 项亮: 推荐系统实践 2012





  • 相关阅读:
    所有时间测试函数
    时间函数应用 time
    50个c/c++源代码网站
    ASN.1详解
    SNMP协议
    SNMP协议详解
    大数据需要建立规则和标准
    常用的三层架构设计
    构建大型网站架构十步骤
    iOS 应用程序内部国际化,不跟随系统语言
  • 原文地址:https://www.cnblogs.com/gt123/p/3451561.html
Copyright © 2011-2022 走看看