Prerequisites for this experiment:
- MovieLens ml-100k dataset
- Jupyter notebook
- themoviedb.org API key
This walkthrough is translated from: http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/
- Import the Python libraries
    import numpy as np
    import pandas as pd
- Change into the directory that holds the MovieLens ml-100k data
cd F:MasterMachineLearningkNNml-100k
- Read the data: each row of u.data has four tab-separated fields: user id, item id, rating, and timestamp
    names = ['user_id', 'item_id', 'rating', 'timestamp']
    # u.data is tab-separated, so the separator must be '\t'
    df = pd.read_csv('u.data', sep='\t', names=names)
    df.head()

       user_id  item_id  rating  timestamp
    0      196      242       3  881250949
    1      186      302       3  891717742
    2       22      377       1  878887116
    3      244       51       2  880606923
    4      166      346       1  886397596
- Count the distinct users and movies in the file
    n_users = df.user_id.unique().shape[0]
    n_items = df.item_id.unique().shape[0]
    print(str(n_users) + ' users')
    print(str(n_items) + ' items')
943 users
1682 items
- Build the user-movie rating matrix
    ratings = np.zeros((n_users, n_items))
    for row in df.itertuples():
        # row[0] is the DataFrame index; IDs in the file are 1-based
        ratings[row[1]-1, row[2]-1] = row[3]
    ratings

    array([[ 5.,  3.,  4., ...,  0.,  0.,  0.],
           [ 4.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           ...,
           [ 5.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ...,  0.,  0.,  0.],
           [ 0.,  5.,  0., ...,  0.,  0.,  0.]])
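The index shift in the loop above is easy to get wrong, so here is a minimal sketch of the same construction on made-up data. The triples and the 3x3 size are illustrative assumptions, not values from u.data.

```python
import numpy as np

# (user_id, item_id, rating) triples with 1-based IDs, as in u.data.
triples = [(1, 2, 3), (2, 1, 5), (3, 3, 4)]   # made-up values

toy = np.zeros((3, 3))
for user_id, item_id, rating in triples:
    toy[user_id - 1, item_id - 1] = rating    # shift to 0-based indices
print(toy)
```

Every cell that no triple touches stays at 0, which is how the matrix encodes "not rated".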
- Compute the sparsity of the data (the percentage of cells that hold a rating)
    sparsity = float(len(ratings.nonzero()[0]))
    sparsity /= (ratings.shape[0] * ratings.shape[1])
    sparsity *= 100
    print('Sparsity: {:4.2f}%'.format(sparsity))
Sparsity: 6.30%
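As a sanity check, the same percentage follows directly from the counts. The sketch below assumes the standard ml-100k total of 100,000 ratings, a figure that is not printed anywhere in this walkthrough.

```python
# Share of rated cells: number of ratings divided by all cells in the matrix.
n_users, n_items = 943, 1682      # counts printed earlier
n_ratings = 100000                # assumption: size of the ml-100k dataset

density = 100.0 * n_ratings / (n_users * n_items)
avg_per_user = n_ratings / float(n_users)
print('{:4.2f}% of cells are rated'.format(density))
print('{:.1f} ratings per user on average'.format(avg_per_user))
```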
- The sparsity is 6.3%: with 943 users and 1682 items, each user has rated a little over 100 movies on average. Hold out 10 randomly chosen ratings per user (roughly 10% of the data) to split the data into a training set and a test set
    def train_test_split(ratings):
        test = np.zeros(ratings.shape)
        train = ratings.copy()
        for user in range(ratings.shape[0]):
            # Pick 10 of this user's rated items for the test set
            test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                            size=10,
                                            replace=False)
            train[user, test_ratings] = 0.
            test[user, test_ratings] = ratings[user, test_ratings]

        # Test and training are truly disjoint
        assert(np.all((train * test) == 0))
        return train, test

    train, test = train_test_split(ratings)
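To make the hold-out step concrete, here is a small sketch on a toy matrix. The matrix, the fixed seed, and the hold-out size of 2 are assumptions chosen for illustration; the walkthrough itself holds out 10 ratings per user.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable

# Toy matrix: 3 users x 6 items, 0 means "not rated".
toy = np.array([[5, 3, 0, 1, 4, 0],
                [4, 0, 0, 1, 2, 5],
                [1, 1, 5, 0, 3, 4]], dtype=float)

n_holdout = 2                    # hypothetical; the walkthrough holds out 10
toy_train = toy.copy()
toy_test = np.zeros(toy.shape)
for user in range(toy.shape[0]):
    rated = toy[user, :].nonzero()[0]
    picked = rng.choice(rated, size=n_holdout, replace=False)
    toy_train[user, picked] = 0.
    toy_test[user, picked] = toy[user, picked]

# No cell is non-zero in both matrices, and together they rebuild the input.
print(np.all(toy_train * toy_test == 0))
print(np.array_equal(toy_train + toy_test, toy))
```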
- The cosine similarity between users or items can be computed with explicit for loops, but that runs very slowly in Python. Expressing the formula with NumPy's vectorized operations is much faster
    def slow_similarity(ratings, kind='user'):
        # Work on rows: users as-is, items via the transpose
        if kind == 'item':
            ratings = ratings.T
        n = ratings.shape[0]
        sim = np.zeros((n, n))
        for u in range(n):
            for uprime in range(n):
                rui_sqrd = 0.
                ruprimei_sqrd = 0.
                for i in range(ratings.shape[1]):
                    # Accumulate the dot product and both squared norms
                    sim[u, uprime] += ratings[u, i] * ratings[uprime, i]
                    rui_sqrd += ratings[u, i] ** 2
                    ruprimei_sqrd += ratings[uprime, i] ** 2
                # Cosine similarity divides by the product of the two norms
                sim[u, uprime] /= np.sqrt(rui_sqrd) * np.sqrt(ruprimei_sqrd)
        return sim

    def fast_similarity(ratings, kind='user', epsilon=1e-9):
        # epsilon -> small number for handling divide-by-zero errors
        if kind == 'user':
            sim = ratings.dot(ratings.T) + epsilon
        elif kind == 'item':
            sim = ratings.T.dot(ratings) + epsilon
        norms = np.array([np.sqrt(np.diagonal(sim))])
        return (sim / norms / norms.T)
%timeit fast_similarity(train, kind='user')
1 loop, best of 3: 171 ms per loop
- Compute the user similarity and the item similarity, then print the top-left 4x4 block of the item similarity matrix
    user_similarity = fast_similarity(train, kind='user')
    item_similarity = fast_similarity(train, kind='item')
    print(item_similarity[:4, :4])

    [[ 1.          0.42176871  0.3440934   0.4551558 ]
     [ 0.42176871  1.          0.2889324   0.48827863]
     [ 0.3440934   0.2889324   1.          0.33718518]
     [ 0.4551558   0.48827863  0.33718518  1.        ]]
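The ones on the diagonal and the symmetry of the printed block follow from the cosine definition. As a small check, the sketch below compares the matrix form with the definition applied to a single pair of rows; the toy matrix is made up for illustration.

```python
import numpy as np

toy = np.array([[5., 3., 0., 1.],
                [4., 0., 0., 1.],
                [0., 1., 5., 4.]])

# Matrix form: all pairwise dot products, then divide by the row norms.
dots = toy.dot(toy.T)
norms = np.sqrt(np.diagonal(dots))
sim = dots / norms[:, np.newaxis] / norms[np.newaxis, :]

# Definition applied to users 0 and 1 only.
direct = toy[0].dot(toy[1]) / (np.linalg.norm(toy[0]) * np.linalg.norm(toy[1]))

print(np.isclose(sim[0, 1], direct))
print(np.allclose(np.diagonal(sim), 1.0))
```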
- Predict the ratings. predict_fast_simple uses NumPy's vectorized operations and is much faster
    def predict_slow_simple(ratings, similarity, kind='user'):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            for i in range(ratings.shape[0]):
                for j in range(ratings.shape[1]):
                    # Similarity-weighted average of all users' ratings of item j
                    pred[i, j] = similarity[i, :].dot(ratings[:, j]) \
                                 / np.sum(np.abs(similarity[i, :]))
            return pred
        elif kind == 'item':
            for i in range(ratings.shape[0]):
                for j in range(ratings.shape[1]):
                    # Similarity-weighted average of user i's ratings of all items
                    pred[i, j] = similarity[j, :].dot(ratings[i, :].T) \
                                 / np.sum(np.abs(similarity[j, :]))
            return pred

    def predict_fast_simple(ratings, similarity, kind='user'):
        if kind == 'user':
            return similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        elif kind == 'item':
            return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
%timeit predict_slow_simple(train, user_similarity, kind='user')
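Both versions compute a similarity-weighted average of ratings. The sketch below works one cell by hand for the user-based case and compares it with the vectorized expression; the ratings and similarity values are made up for illustration.

```python
import numpy as np

toy = np.array([[5., 3., 0.],
                [4., 0., 1.],
                [0., 1., 5.]])
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])   # made-up user-user similarities

# Vectorized user-based prediction: similarity-weighted average over users.
pred = sim.dot(toy) / np.abs(sim).sum(axis=1)[:, np.newaxis]

# Same quantity for user 0, item 1, written out term by term.
by_hand = (1.0 * 3. + 0.8 * 0. + 0.1 * 1.) / (1.0 + 0.8 + 0.1)
print(np.isclose(pred[0, 1], by_hand))
```

Note that unrated cells enter the average as zeros, which pulls every prediction downward.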
- Compute the MSE with sklearn: first keep only the entries where the actual matrix is non-zero (a 0 means "no rating"), then call sklearn's mean_squared_error function
    from sklearn.metrics import mean_squared_error

    def get_mse(pred, actual):
        # Ignore zero terms: keep only the positions that hold an actual rating.
        pred = pred[actual.nonzero()].flatten()
        actual = actual[actual.nonzero()].flatten()
        return mean_squared_error(pred, actual)

    item_prediction = predict_fast_simple(train, item_similarity, kind='item')
    user_prediction = predict_fast_simple(train, user_similarity, kind='user')
    print('User-based CF MSE: ' + str(get_mse(user_prediction, test)))
    print('Item-based CF MSE: ' + str(get_mse(item_prediction, test)))
User-based CF MSE: 8.44170489251
Item-based CF MSE: 11.5717812485
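The masking step is the important part of get_mse: only cells that hold a real rating take part in the error. The sketch below reproduces that step with plain NumPy on made-up numbers, so the arithmetic can be followed by hand.

```python
import numpy as np

actual = np.array([[0., 4., 0.],
                   [5., 0., 1.]])
pred = np.array([[2.5, 3.0, 1.0],
                 [4.0, 2.0, 3.0]])

mask = actual.nonzero()             # positions of the actual ratings
errors = pred[mask] - actual[mask]  # [3-4, 4-5, 3-1]
mse = np.mean(errors ** 2)          # (1 + 1 + 4) / 3
print(mse)
```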
- To lower the prediction MSE, use only the k users most similar to the target user (or the k most similar items), make a top-k prediction, and recompute the MSE
    def predict_topk(ratings, similarity, kind='user', k=40):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            for i in range(ratings.shape[0]):
                # Indices of the k most similar users, most similar first
                top_k_users = np.argsort(similarity[:, i])[:-k-1:-1]
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users])
                    pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
        if kind == 'item':
            for j in range(ratings.shape[1]):
                # Indices of the k most similar items, most similar first
                top_k_items = np.argsort(similarity[:, j])[:-k-1:-1]
                for i in range(ratings.shape[0]):
                    pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T)
                    pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))
        return pred

    pred = predict_topk(train, user_similarity, kind='user', k=40)
    print('Top-k User-based CF MSE: ' + str(get_mse(pred, test)))
    pred = predict_topk(train, item_similarity, kind='item', k=40)
    print('Top-k Item-based CF MSE: ' + str(get_mse(pred, test)))
Result:
Top-k User-based CF MSE: 6.47059807493
Top-k Item-based CF MSE: 7.75559095568
Compared with the previous method, the MSE is considerably lower.
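The top-k selection relies on the slice `[:-k-1:-1]` applied to the result of `np.argsort`, which is compact but not obvious. A minimal sketch on a made-up score vector shows what it returns.

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5])
k = 3

# argsort is ascending; stepping backwards from the end for k elements
# yields the indices of the k largest scores, largest first.
top_k = np.argsort(scores)[:-k-1:-1]
print(top_k)            # indices of 0.9, 0.7, 0.5
print(scores[top_k])
```

Because a user's similarity to themselves is the largest value in their row, the target user is always among their own k neighbors.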
- To lower the MSE further, try several values of k to find the one with the smallest MSE, and plot the results with matplotlib
    k_array = [5, 15, 30, 50, 100, 200]
    user_train_mse = []
    user_test_mse = []
    item_test_mse = []
    item_train_mse = []

    for k in k_array:
        user_pred = predict_topk(train, user_similarity, kind='user', k=k)
        item_pred = predict_topk(train, item_similarity, kind='item', k=k)

        user_train_mse.append(get_mse(user_pred, train))
        user_test_mse.append(get_mse(user_pred, test))

        item_train_mse.append(get_mse(item_pred, train))
        item_test_mse.append(get_mse(item_pred, test))

    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()

    pal = sns.color_palette("Set2", 2)

    plt.figure(figsize=(8, 8))
    # Faded lines are training error, solid lines are test error
    plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
    plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
    plt.legend(loc='best', fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel('k', fontsize=30)
    plt.ylabel('MSE', fontsize=30);
The plot shows that, on the test set, the MSE reaches a minimum at k = 15 and k = 50 for user-based and item-based collaborative filtering, respectively.
- Compute the mean squared error (MSE) after removing the bias
    def predict_nobias(ratings, similarity, kind='user'):
        if kind == 'user':
            # Subtract each user's mean rating, predict, then add it back
            user_bias = ratings.mean(axis=1)
            ratings = (ratings - user_bias[:, np.newaxis]).copy()
            pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
            pred += user_bias[:, np.newaxis]
        elif kind == 'item':
            # Subtract each item's mean rating, predict, then add it back
            item_bias = ratings.mean(axis=0)
            ratings = (ratings - item_bias[np.newaxis, :]).copy()
            pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
            pred += item_bias[np.newaxis, :]
        return pred

    user_pred = predict_nobias(train, user_similarity, kind='user')
    print('Bias-subtracted User-based CF MSE: ' + str(get_mse(user_pred, test)))
    item_pred = predict_nobias(train, item_similarity, kind='item')
    print('Bias-subtracted Item-based CF MSE: ' + str(get_mse(item_pred, test)))
Bias-subtracted User-based CF MSE: 8.67647634245
Bias-subtracted Item-based CF MSE: 9.71148412222
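The idea behind bias subtraction is that two users can share the same preferences while rating on different scales. A minimal sketch with two made-up users (one generous, one harsh) shows that centering each row makes them identical, and that adding the mean back restores each user's own scale.

```python
import numpy as np

ratings = np.array([[5., 4., 3.],    # generous rater
                    [3., 2., 1.]])   # harsh rater, same preferences

user_bias = ratings.mean(axis=1)                 # [4., 2.]
centered = ratings - user_bias[:, np.newaxis]
print(centered)                                  # both rows: [ 1.  0. -1.]

# Adding the bias back recovers ratings on each user's own scale.
restored = centered + user_bias[:, np.newaxis]
print(np.array_equal(restored, ratings))
```

In the walkthrough's functions the mean is taken over the whole row or column, so unrated zeros are included in the bias; this toy example has no unrated cells and sidesteps that detail.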
- Combine top-k with bias subtraction, compute the user-based and item-based MSE for k = 5, 15, 30, 50, 100, 200, and plot the results with matplotlib
    def predict_topk_nobias(ratings, similarity, kind='user', k=40):
        pred = np.zeros(ratings.shape)
        if kind == 'user':
            user_bias = ratings.mean(axis=1)
            ratings = (ratings - user_bias[:, np.newaxis]).copy()
            for i in range(ratings.shape[0]):
                # Indices of the k most similar users, most similar first
                top_k_users = np.argsort(similarity[:, i])[:-k-1:-1]
                for j in range(ratings.shape[1]):
                    pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users])
                    pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
            pred += user_bias[:, np.newaxis]
        if kind == 'item':
            item_bias = ratings.mean(axis=0)
            ratings = (ratings - item_bias[np.newaxis, :]).copy()
            for j in range(ratings.shape[1]):
                # Indices of the k most similar items, most similar first
                top_k_items = np.argsort(similarity[:, j])[:-k-1:-1]
                for i in range(ratings.shape[0]):
                    pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T)
                    pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))
            pred += item_bias[np.newaxis, :]
        return pred

    k_array = [5, 15, 30, 50, 100, 200]
    user_train_mse = []
    user_test_mse = []
    item_test_mse = []
    item_train_mse = []

    for k in k_array:
        user_pred = predict_topk_nobias(train, user_similarity, kind='user', k=k)
        item_pred = predict_topk_nobias(train, item_similarity, kind='item', k=k)

        user_train_mse.append(get_mse(user_pred, train))
        user_test_mse.append(get_mse(user_pred, test))

        item_train_mse.append(get_mse(item_pred, train))
        item_test_mse.append(get_mse(item_pred, test))

    pal = sns.color_palette("Set2", 2)

    plt.figure(figsize=(8, 8))
    # Faded lines are training error, solid lines are test error
    plt.plot(k_array, user_train_mse, c=pal[0], label='User-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, user_test_mse, c=pal[0], label='User-based test', linewidth=5)
    plt.plot(k_array, item_train_mse, c=pal[1], label='Item-based train', alpha=0.5, linewidth=5)
    plt.plot(k_array, item_test_mse, c=pal[1], label='Item-based test', linewidth=5)
    plt.legend(loc='best', fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel('k', fontsize=30)
    plt.ylabel('MSE', fontsize=30);
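- Import requests and use requests.get to follow the MovieLens IMDb link; the IMDb movie ID is the second-to-last path segment of the final URL
    import requests
    import json

    response = requests.get('http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)')
    print(response.url.split('/')[-2])
Movie ID output: tt0114709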
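Why the index is -2 rather than -1 can be seen without any network call. The sketch below assumes the request ends up at a URL of the form shown; that form is a guess consistent with the ID printed above, and IMDb's actual routing may differ.

```python
# Hypothetical final URL after redirects; the real one depends on IMDb.
final_url = 'http://www.imdb.com/title/tt0114709/'

# Splitting on '/' leaves an empty string after the trailing slash,
# so the ID is the second-to-last piece.
parts = final_url.split('/')
movie_id = parts[-2]
print(movie_id)
```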
- This step needs the themoviedb API: query the themoviedb.org API for the poster file path of a given movie ID
    # Get base url filepath structure. w185 corresponds to size of movie poster.
    headers = {'Accept': 'application/json'}
    payload = {'api_key': 'YOUR_API_KEY'}  # fill in your own API key
    response = requests.get("http://api.themoviedb.org/3/configuration",
                            params=payload, headers=headers)
    response = json.loads(response.text)
    base_url = response['images']['base_url'] + 'w185'

    def get_poster(imdb_url, base_url):
        # Get IMDB movie ID
        response = requests.get(imdb_url)
        movie_id = response.url.split('/')[-2]

        # Query themoviedb.org API for movie poster path.
        movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
        headers = {'Accept': 'application/json'}
        payload = {'api_key': 'YOUR_API_KEY'}  # fill in your own API key
        response = requests.get(movie_url, params=payload, headers=headers)
        try:
            file_path = json.loads(response.text)['posters'][0]['file_path']
        except Exception:
            # IMDB movie ID is sometimes no good. Need to get correct one.
            movie_title = imdb_url.split('?')[-1].split('(')[0]
            payload['query'] = movie_title
            response = requests.get('http://api.themoviedb.org/3/search/movie',
                                    params=payload, headers=headers)
            movie_id = json.loads(response.text)['results'][0]['id']
            payload.pop('query', None)
            movie_url = 'http://api.themoviedb.org/3/movie/{:}/images'.format(movie_id)
            response = requests.get(movie_url, params=payload, headers=headers)
            file_path = json.loads(response.text)['posters'][0]['file_path']

        return base_url + file_path

    from IPython.display import Image
    from IPython.display import display

    toy_story = 'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)'
    Image(url=get_poster(toy_story, base_url))
This displays the movie's poster image directly.
- Load the movie information from the MovieLens u.item file, then, for a given movie, find the k most similar movies and display their posters
    import io

    # Load in movie data: map 0-based movie index -> IMDb URL (field 4)
    idx_to_movie = {}
    # u.item is Latin-1 encoded, so the encoding is set explicitly
    with io.open('u.item', 'r', encoding='latin-1') as f:
        for line in f.readlines():
            info = line.split('|')
            idx_to_movie[int(info[0])-1] = info[4]

    def top_k_movies(similarity, mapper, movie_idx, k=6):
        return [mapper[x] for x in np.argsort(similarity[movie_idx, :])[:-k-1:-1]]

    idx = 0 # Toy Story
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
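The loader depends on the field positions in each u.item record. The sketch below parses a single illustrative record in the pipe-delimited layout that the loader expects; the record is an assumption written out by hand (with the trailing genre flags shortened), not a line copied from the dataset.

```python
# Illustrative record in the pipe-delimited layout the loader expects
# (genre flags shortened; an assumption, not copied from the dataset).
line = ('1|Toy Story (1995)|01-Jan-1995||'
        'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1')

info = line.split('|')
movie_idx = int(info[0]) - 1     # 1-based file ID -> 0-based matrix column
imdb_url = info[4]               # field 3 is empty, so the URL is field 4

# Same title extraction that get_poster uses in its fallback search.
title = imdb_url.split('?')[-1].split('(')[0]
print(movie_idx)
print(imdb_url)
print(title)
```

The extracted title is still URL-encoded, which is the form get_poster passes on as its search query.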
- Display the posters of the k movies (k defaults to 6) most similar to the movie at index 1 (GoldenEye)
    idx = 1 # GoldenEye
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
- Display the posters of the k movies (k defaults to 6) most similar to the movie at index 20 (Muppet Treasure Island)
    idx = 20 # Muppet Treasure Island
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
- Display the posters of the k movies (k defaults to 6) most similar to the movie at index 40 (Billy Madison)
    idx = 40 # Billy Madison
    movies = top_k_movies(item_similarity, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
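- These recommendations do not always look good: is Toy Story really the movie most similar to Star Wars? Very popular movies such as Star Wars get high predicted ratings across the system. A different similarity measure, the Pearson correlation, can be used to remove some of this bias
    from sklearn.metrics import pairwise_distances

    # Convert from distance to similarity
    item_correlation = 1 - pairwise_distances(train.T, metric='correlation')
    # Columns with zero variance produce NaN; treat them as uncorrelated
    item_correlation[np.isnan(item_correlation)] = 0.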
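The reason Pearson correlation removes some bias is that it equals the cosine similarity of mean-centered vectors. The sketch below checks that identity on two made-up rating vectors using only NumPy; `cosine` is a helper defined here for illustration.

```python
import numpy as np

a = np.array([5., 3., 0., 1., 4.])
b = np.array([4., 2., 1., 0., 5.])

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson correlation is the cosine similarity of the mean-centered vectors.
pearson = cosine(a - a.mean(), b - b.mean())
print(np.isclose(pearson, np.corrcoef(a, b)[0, 1]))
print(cosine(a, b), pearson)
```

For these two vectors the plain cosine is higher than the correlation, because non-negative ratings always point in broadly the same direction before centering.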
- Again find the k most similar movies for the movies at indices 0, 1, 20, and 40, this time using the correlation-based similarity
    idx = 0 # Toy Story
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

    idx = 1 # GoldenEye
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

    idx = 20 # Muppet Treasure Island
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)

    idx = 40 # Billy Madison
    movies = top_k_movies(item_correlation, idx_to_movie, idx)
    posters = tuple(Image(url=get_poster(movie, base_url)) for movie in movies)
    display(*posters)
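$$
sim(u, u') = \cos(\theta) = \frac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\|\mathbf{r}_u\| \, \|\mathbf{r}_{u'}\|} = \frac{\sum_i r_{ui} \, r_{u'i}}{\sqrt{\sum_i r_{ui}^2} \, \sqrt{\sum_i r_{u'i}^2}}
$$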