  • Intro to Recommender Systems: Collaborative Filtering

    Prerequisites for this walkthrough:

    1. The MovieLens ml-100k dataset
    2. Jupyter Notebook
    3. A themoviedb.org API key

    This walkthrough is translated from: http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/

    1. Import the Python libraries
      import numpy as np
      import pandas as pd
    2. Change into the directory holding the MovieLens ml-100k data
      cd F:\Master\MachineLearning\kNN\ml-100k
    3. Read the data: each line of u.data has four fields: user_id, item_id, rating, and timestamp
      names = ['user_id', 'item_id', 'rating', 'timestamp']
      df = pd.read_csv('u.data', sep='\t', names=names)
      df.head()
         user_id  item_id  rating  timestamp
      0      196      242       3  881250949
      1      186      302       3  891717742
      2       22      377       1  878887116
      3      244       51       2  880606923
      4      166      346       1  886397596
    4. Count the number of distinct users and movies in the file
      n_users = df.user_id.unique().shape[0]
      n_items = df.item_id.unique().shape[0]
      print str(n_users) + ' users'
      print str(n_items) + ' items'
      943 users
      1682 items
    5. Build the user-movie rating matrix (a pandas-based sanity check follows the output)
      ratings = np.zeros((n_users, n_items))
      for row in df.itertuples():
          ratings[row[1]-1, row[2]-1] = row[3]
      ratings
      array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
             [ 4.,  0.,  0., ...,  0.,  0.,  0.],
             [ 0.,  0.,  0., ...,  0.,  0.,  0.],
             ..., 
             [ 5.,  0.,  0., ...,  0.,  0.,  0.],
             [ 0.,  0.,  0., ...,  0.,  0.,  0.],
             [ 0.,  5.,  0., ...,  0.,  0.,  0.]])
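      As a sanity check (not in the original post), the same matrix can be built with a pandas pivot. A minimal sketch, assuming every item id from 1 to 1682 occurs at least once (true for ml-100k) so the columns line up with the loop-built matrix:

      # Hypothetical alternative: pivot the DataFrame into a user x item matrix.
      pivoted = df.pivot(index='user_id', columns='item_id', values='rating')
      ratings_alt = pivoted.fillna(0).values
      print np.array_equal(ratings, ratings_alt)  # expect True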
    6. Compute the sparsity of the data
      sparsity = float(len(ratings.nonzero()[0]))
      sparsity /= (ratings.shape[0] * ratings.shape[1])
      sparsity *= 100
      print 'Sparsity: {:4.2f}%'.format(sparsity)

      Sparsity: 6.30%

      That is, only 100,000 of the 943 x 1682 = 1,586,126 possible user-movie pairs have a rating, about 6.3%.

    7. With 6.3% sparsity, 943 users and 1682 movies, the average user has rated roughly 100 movies. Randomly hold out 10 ratings per user (about 10% of the data) to split it into a training set and a test set; a quick sanity check of the split follows the code
      def train_test_split(ratings):
          test = np.zeros(ratings.shape)
          train = ratings.copy()
          for user in xrange(ratings.shape[0]):
              test_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                              size=10, 
                                              replace=False)
              train[user, test_ratings] = 0.
              test[user, test_ratings] = ratings[user, test_ratings]
              
          # Test and training are truly disjoint
          assert(np.all((train * test) == 0)) 
          return train, test
      train, test = train_test_split(ratings)
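      A quick sanity check of the split (a sketch, not in the original post): it only works because every user has at least 10 ratings, and train and test are disjoint by construction.

      # np.random.choice(..., size=10, replace=False) needs >= 10 rated items per user.
      assert np.all((ratings != 0).sum(axis=1) >= 10)
      print len(test.nonzero()[0])  # 10 held-out ratings per user -> 9430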
    8. The cosine similarity between users (or items) can be computed with plain Python for loops, but those run very slowly; NumPy array operations express the same formula much faster (the cosine formula is reproduced at the end of this post, and a scikit-learn cross-check follows the timing)
      def slow_similarity(ratings, kind='user'):
          if kind == 'user':
              axmax = 0
              axmin = 1
          elif kind == 'item':
              axmax = 1
              axmin = 0
          sim = np.zeros((ratings.shape[axmax], ratings.shape[axmax]))
          for u in xrange(ratings.shape[axmax]):
              for uprime in xrange(ratings.shape[axmax]):
                  rui_sqrd = 0.
                  ruprimei_sqrd = 0.
                  for i in xrange(ratings.shape[axmin]):
                      sim[u, uprime] += ratings[u, i] * ratings[uprime, i]  # accumulate the dot product
                      rui_sqrd += ratings[u, i] ** 2
                      ruprimei_sqrd += ratings[uprime, i] ** 2
                  sim[u, uprime] /= np.sqrt(rui_sqrd * ruprimei_sqrd)  # normalize by both vector norms
          return sim
      
      def fast_similarity(ratings, kind='user', epsilon=1e-9):
          # epsilon -> small number for handling divide-by-zero errors
          if kind == 'user':
              sim = ratings.dot(ratings.T) + epsilon
          elif kind == 'item':
              sim = ratings.T.dot(ratings) + epsilon
          norms = np.array([np.sqrt(np.diagonal(sim))])
          return (sim / norms / norms.T)
      %timeit fast_similarity(train, kind='user')
      1 loop, best of 3: 171 ms per loop
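      As a cross-check (a sketch, not in the original post), scikit-learn's cosine_similarity should agree with fast_similarity up to the epsilon regularization:

      from sklearn.metrics.pairwise import cosine_similarity

      sk_sim = cosine_similarity(train)  # user-user cosine similarity
      print np.allclose(sk_sim, fast_similarity(train, kind='user'), atol=1e-6)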
    9. Compute the user similarity and item similarity matrices, and print the top-left 4x4 block of the item similarity matrix

      user_similarity = fast_similarity(train, kind='user')
      item_similarity = fast_similarity(train, kind='item')
      print item_similarity[:4, :4]
      [[ 1.          0.42176871  0.3440934   0.4551558 ]
       [ 0.42176871  1.          0.2889324   0.48827863]
       [ 0.3440934   0.2889324   1.          0.33718518]
       [ 0.4551558   0.48827863  0.33718518  1.        ]]
    10. Predict ratings: predict_fast_simple expresses the same computation with NumPy array operations and runs much faster (the formula both functions implement is shown after the timings)

      def predict_slow_simple(ratings, similarity, kind='user'):
          pred = np.zeros(ratings.shape)
          if kind == 'user':
              for i in xrange(ratings.shape[0]):
                  for j in xrange(ratings.shape[1]):
                    pred[i, j] = similarity[i, :].dot(ratings[:, j]) \
                                 / np.sum(np.abs(similarity[i, :]))
              return pred
          elif kind == 'item':
              for i in xrange(ratings.shape[0]):
                  for j in xrange(ratings.shape[1]):
                    pred[i, j] = similarity[j, :].dot(ratings[i, :].T) \
                                 / np.sum(np.abs(similarity[j, :]))
      
              return pred
      
      def predict_fast_simple(ratings, similarity, kind='user'):
          if kind == 'user':
              return similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
          elif kind == 'item':
              return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
      %timeit predict_slow_simple(train, user_similarity, kind='user')
      1 loop, best of 3: 1min 52s per loop
      %timeit predict_fast_simple(train, user_similarity, kind='user')
      1 loop, best of 3: 279 ms per loop 
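      Both functions implement the same similarity-weighted average of everyone else's ratings; for the user-based case:

      $$ \hat{r}_{ui} = \frac{\sum_{u'} \mathrm{sim}(u, u')\, r_{u'i}}{\sum_{u'} \lvert \mathrm{sim}(u, u') \rvert} $$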
    11. Compute the MSE with sklearn: drop the unobserved (zero) entries from both matrices, then call sklearn's mean_squared_error on what remains

      from sklearn.metrics import mean_squared_error
      
      def get_mse(pred, actual):
          # Keep only the observed (nonzero) ratings.
          pred = pred[actual.nonzero()].flatten()
          actual = actual[actual.nonzero()].flatten()
          return mean_squared_error(pred, actual)
      item_prediction = predict_fast_simple(train, item_similarity, kind='item')
      user_prediction = predict_fast_simple(train, user_similarity, kind='user')
      
      print 'User-based CF MSE: ' + str(get_mse(user_prediction, test))
      print 'Item-based CF MSE: ' + str(get_mse(item_prediction, test))
      User-based CF MSE: 8.44170489251
      Item-based CF MSE: 11.5717812485
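      For scale: an MSE of 8.44 corresponds to an RMSE of about sqrt(8.44) ≈ 2.9 stars on the 1-5 rating scale, so there is plenty of room for improvement.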
    12. To improve the MSE, use only the k users (or items) most similar to the target when predicting; implement this top-k prediction and compute its MSE

      def predict_topk(ratings, similarity, kind='user', k=40):
          pred = np.zeros(ratings.shape)
          if kind == 'user':
              for i in xrange(ratings.shape[0]):
                  top_k_users = [np.argsort(similarity[:,i])[:-k-1:-1]]
                  for j in xrange(ratings.shape[1]):
                      pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users]) 
                      pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
          if kind == 'item':
              for j in xrange(ratings.shape[1]):
                  top_k_items = [np.argsort(similarity[:,j])[:-k-1:-1]]
                  for i in xrange(ratings.shape[0]):
                      pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T) 
                      pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))        
          
          return pred
      pred = predict_topk(train, user_similarity, kind='user', k=40)
      print 'Top-k User-based CF MSE: ' + str(get_mse(pred, test))
      
      pred = predict_topk(train, item_similarity, kind='item', k=40)
      print 'Top-k Item-based CF MSE: ' + str(get_mse(pred, test))

      The results:

      Top-k User-based CF MSE: 6.47059807493
      Top-k Item-based CF MSE: 7.75559095568

      Compared with the plain predictions, the MSE has dropped considerably.
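      The double for loop in predict_topk is slow. A vectorized sketch of the user-based branch (my own rewrite, not from the original post) computes each user's whole prediction row at once:

      def predict_topk_fast(ratings, similarity, kind='user', k=40):
          # Sketch: vectorized over items; only the user-based case is shown.
          pred = np.zeros(ratings.shape)
          if kind == 'user':
              for i in xrange(ratings.shape[0]):
                  top_k = np.argsort(similarity[:, i])[:-k-1:-1]  # k most similar users
                  sims = similarity[i, top_k]
                  pred[i, :] = sims.dot(ratings[top_k, :]) / np.sum(np.abs(sims))
          return pred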

    13. To lower the MSE further, try different values of k in search of the minimum, and visualize the results with matplotlib
      k_array = [5, 15, 30, 50, 100, 200]
      user_train_mse = []
      user_test_mse = []
      item_test_mse = []
      item_train_mse = []
      
      for k in k_array:
          user_pred = predict_topk(train, user_similarity, kind='user', k=k)
          item_pred = predict_topk(train, item_similarity, kind='item', k=k)
          
          user_train_mse += [get_mse(user_pred, train)]
          user_test_mse += [get_mse(user_pred, test)]
          
          item_train_mse += [get_mse(item_pred, train)]
          item_test_mse += [get_mse(item_pred, test)]  
      %matplotlib inline
      import matplotlib.pyplot as plt
      import seaborn as sns
      sns.set()
      
      pal = sns.color_palette("Set2", 2)
      
      plt.figure(figsize=(8, 8))
      plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
      plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
      plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
      plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
      plt.legend(loc='best', fontsize=20)
      plt.xticks(fontsize=16);
      plt.yticks(fontsize=16);
      plt.xlabel('k', fontsize=30);
      plt.ylabel('MSE', fontsize=30);
       
      As the plot shows, the test error reaches its minimum at roughly k = 50 for user-based and k = 15 for item-based collaborative filtering.
    14. Compute the MSE with the per-user (or per-item) mean rating subtracted as a bias term (the formula follows the output)
      def predict_nobias(ratings, similarity, kind='user'):
          if kind == 'user':
              user_bias = ratings.mean(axis=1)
              ratings = (ratings - user_bias[:, np.newaxis]).copy()
              pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
              pred += user_bias[:, np.newaxis]
          elif kind == 'item':
              item_bias = ratings.mean(axis=0)
              ratings = (ratings - item_bias[np.newaxis, :]).copy()
              pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
              pred += item_bias[np.newaxis, :]
              
          return pred
      user_pred = predict_nobias(train, user_similarity, kind='user')
      print 'Bias-subtracted User-based CF MSE: ' + str(get_mse(user_pred, test))
      
      item_pred = predict_nobias(train, item_similarity, kind='item')
      print 'Bias-subtracted Item-based CF MSE: ' + str(get_mse(item_pred, test))
      Bias-subtracted User-based CF MSE: 8.67647634245
      Bias-subtracted Item-based CF MSE: 9.71148412222
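      In the user-based case this computes (the item-based case is symmetric):

      $$ \hat{r}_{ui} = \bar{r}_u + \frac{\sum_{u'} \mathrm{sim}(u, u')\,(r_{u'i} - \bar{r}_{u'})}{\sum_{u'} \lvert \mathrm{sim}(u, u') \rvert} $$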



    15. Combine top-k with bias subtraction: compute the user-based and item-based MSE for k = 5, 15, 30, 50, 100, 200, and plot the results with matplotlib
      def predict_topk_nobias(ratings, similarity, kind='user', k=40):
          pred = np.zeros(ratings.shape)
          if kind == 'user':
              user_bias = ratings.mean(axis=1)
              ratings = (ratings - user_bias[:, np.newaxis]).copy()
              for i in xrange(ratings.shape[0]):
                  top_k_users = [np.argsort(similarity[:,i])[:-k-1:-1]]
                  for j in xrange(ratings.shape[1]):
                      pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users]) 
                      pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
              pred += user_bias[:, np.newaxis]
          if kind == 'item':
              item_bias = ratings.mean(axis=0)
              ratings = (ratings - item_bias[np.newaxis, :]).copy()
              for j in xrange(ratings.shape[1]):
                  top_k_items = [np.argsort(similarity[:,j])[:-k-1:-1]]
                  for i in xrange(ratings.shape[0]):
                      pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T) 
                      pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items])) 
              pred += item_bias[np.newaxis, :]
              
          return pred
      k_array = [5, 15, 30, 50, 100, 200]
      user_train_mse = []
      user_test_mse = []
      item_test_mse = []
      item_train_mse = []
      
      for k in k_array:
          user_pred = predict_topk_nobias(train, user_similarity, kind='user', k=k)
          item_pred = predict_topk_nobias(train, item_similarity, kind='item', k=k)
          
          user_train_mse += [get_mse(user_pred, train)]
          user_test_mse += [get_mse(user_pred, test)]
          
          item_train_mse += [get_mse(item_pred, train)]
          item_test_mse += [get_mse(item_pred, test)]  
      pal = sns.color_palette("Set2", 2)
      
      plt.figure(figsize=(8, 8))
      plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
      plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
      plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
      plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
      plt.legend(loc='best', fontsize=20)
      plt.xticks(fontsize=16);
      plt.yticks(fontsize=16);
      plt.xlabel('k', fontsize=30);
      plt.ylabel('MSE', fontsize=30);



    16. Import requests and fetch the IMDB link; the movie id can be read from the final redirected URL
      import requests
      import json
      
      response = requests.get('http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)')
      print response.url.split('/')[-2]
      Output (the IMDB movie id): tt0114709
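      The movie id is the last path segment of the final redirected URL. A small sketch (assuming the old us.imdb.com redirect still resolves) to inspect the redirect chain requests followed:

      for hop in response.history:  # each intermediate 3xx redirect
          print hop.status_code, hop.url
      print response.url  # final URL ends in .../tt0114709/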
    17. This step needs a themoviedb API key: query the themoviedb.org API for the poster file path of a given movie id
      # Get base url filepath structure. w185 corresponds to size of movie poster.
      headers = {'Accept': 'application/json'}
      payload = {'api_key': 'your_api_key_here'}  # fill in your themoviedb.org API key
      response = requests.get("http://api.themoviedb.org/3/configuration", params=payload, headers=headers)
      response = json.loads(response.text)
      base_url = response['images']['base_url'] + 'w185'
      
      def get_poster(imdb_url, base_url):
          # Get IMDB movie ID
          response = requests.get(imdb_url)
          movie_id = response.url.split('/')[-2]
          
          # Query themoviedb.org API for movie poster path.
          movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
          headers = {'Accept': 'application/json'}
          payload = {'api_key': 'your_api_key_here'}  # fill in your themoviedb.org API key
          response = requests.get(movie_url, params=payload, headers=headers)
          try:
              file_path = json.loads(response.text)['posters'][0]['file_path']
          except (KeyError, IndexError):
              # IMDB movie ID is sometimes no good. Need to get correct one.
              movie_title = imdb_url.split('?')[-1].split('(')[0]
              payload['query'] = movie_title
              response = requests.get('http://api.themoviedb.org/3/search/movie', params=payload, headers=headers)
              movie_id = json.loads(response.text)['results'][0]['id']
              payload.pop('query', None)
              movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
              response = requests.get(movie_url, params=payload, headers=headers)
              file_path = json.loads(response.text)['posters'][0]['file_path']
              
          return base_url + file_path
      from IPython.display import Image
      from IPython.display import display
      
      toy_story = 'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)'
      Image(url=get_poster(toy_story, base_url))

      The movie's poster image is displayed inline.

    18. Load the movie info from the MovieLens u.item file; given a movie, compute its k most similar movies and display their posters (a title-printing variant follows the code)

      # Load in movie data
      idx_to_movie = {}
      with open('u.item', 'r') as f:
          for line in f.readlines():
              info = line.split('|')
              idx_to_movie[int(info[0])-1] = info[4]
              
      def top_k_movies(similarity, mapper, movie_idx, k=6):
          return [mapper[x] for x in np.argsort(similarity[movie_idx,:])[:-k-1:-1]]
      idx = 0 # Toy Story
      movies = top_k_movies(item_similarity, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
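      It can help to print titles instead of posters. A sketch (not in the original post) that reuses top_k_movies with a title map; the second pipe-separated field of u.item is the movie title:

      idx_to_title = {}
      with open('u.item', 'r') as f:
          for line in f.readlines():
              info = line.split('|')
              idx_to_title[int(info[0]) - 1] = info[1]  # field 1 is the title
      print top_k_movies(item_similarity, idx_to_title, 0)  # titles most similar to Toy Story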


    19. Show the k (default 6) most similar movie posters for the movie with id 1 (GoldenEye)
      idx = 1 # GoldenEye
      movies = top_k_movies(item_similarity, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
    20. Show the k (default 6) most similar movie posters for the movie with id 20 (Muppet Treasure Island)
      idx = 20 # Muppet Treasure Island
      movies = top_k_movies(item_similarity, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
    21. Show the k (default 6) most similar movie posters for the movie with id 40 (Billy Madison)
      idx = 40 # Billy Madison
      movies = top_k_movies(item_similarity, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
    22. Sometimes these recommendations are not so good: the most similar movie to Star Wars is Toy Story? Very popular movies like Star Wars get high predicted ratings across the board, so try a different similarity measure, the Pearson correlation, to remove some of this bias (its formula is given after the code)
      from sklearn.metrics import pairwise_distances
      # Convert from distance to similarity
      item_correlation = 1 - pairwise_distances(train.T, metric='correlation')
      item_correlation[np.isnan(item_correlation)] = 0.
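      With metric='correlation', pairwise_distances returns one minus the Pearson correlation between item columns, so the similarity being used is:

      $$ \mathrm{sim}(i, j) = \frac{\sum_u (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_u (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_u (r_{uj} - \bar{r}_j)^2}} $$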
    23. Once more, compute the most similar movies for the movies with ids 0, 1, 20, and 40
      idx = 0 # Toy Story
      movies = top_k_movies(item_correlation, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
      idx = 1 # GoldenEye
      movies = top_k_movies(item_correlation, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
      idx = 20 # Muppet Treasure Island
      movies = top_k_movies(item_correlation, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)
      idx = 40 # Billy Madison
      movies = top_k_movies(item_correlation, idx_to_movie, idx)
      posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
      display(*posters)

    The cosine similarity used in step 8:

    $$ \mathrm{sim}(u, u') = \cos(\theta) = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\lVert \mathbf{r}_u \rVert\, \lVert \mathbf{r}_{u'} \rVert} = \frac{\sum_i r_{ui}\, r_{u'i}}{\sqrt{\sum_i r_{ui}^2}\,\sqrt{\sum_i r_{u'i}^2}} $$
