  • Getting Started with Collaborative Filtering (CF)

    Data Wrangling

    First, read the rating data from ratings.dat into a DataFrame:

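    >>> import numpy as np
    >>> import pandas as pd
    >>> # Assumption: ratings.dat uses the MovieLens '::'-separated layout; column names are chosen to match the pivot call below
    >>> ratings = pd.read_csv('ratings.dat', sep='::', engine='python', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])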

    In [2]: import pandas as pd

    In [3]: df = pd.read_csv('2014-12-18.csv')

    In [4]: df.head()
    Out[4]:
         user_id    item_id  behavior_type user_geohash  item_category  hour
    0  100268421  284019855              1      95ridd7           1863    19
    1  109802727   56489946              1          NaN           8291    10
    2  109802727   56489946              1          NaN           8291    10
    3  109802727  266907147              1      99ctk96           9117

     

    >>> data = ratings.pivot(index='user_id',columns='movie_id',values='rating')

    >>> data[:5]
    movie_id  1   2   3   4   5   6 
    user_id                                                                       
    1          5 NaN NaN NaN NaN NaN ...
    2        NaN NaN NaN NaN NaN NaN ...
    3        NaN NaN NaN NaN NaN NaN ...
    4        NaN NaN NaN NaN NaN NaN ...
    5        NaN NaN NaN NaN NaN   2 ...
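The pivot step above can be tried on a small example. This is only a sketch: the toy ratings below are made up for illustration and are not taken from ratings.dat. It shows how long-format rows become a user-by-movie matrix with NaN for unrated pairs.

```python
import pandas as pd

# Toy long-format ratings; the values are invented for illustration only.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 1],
})

# One row per user, one column per movie; unrated pairs become NaN.
data = ratings.pivot(index="user_id", columns="movie_id", values="rating")
print(data)

# Number of movies each user has rated (NaN cells are not counted).
print(data.count(axis=1))
```

Note that `pivot` raises an error if the same (user_id, movie_id) pair appears more than once; in that case `pivot_table` with an aggregation function is one alternative.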
     

    >>> check_size = 1000

    >>> check = {}
    >>> check_data = data.copy()  # work on a copy so the original data is left untouched
    >>> check_data = check_data.loc[check_data.count(axis=1) > 200]  # keep only users with more than 200 ratings
    >>> for user in np.random.permutation(check_data.index):
            movie = np.random.permutation(check_data.loc[user].dropna().index)[0]  # pick one rated movie at random
            check[(user, movie)] = check_data.loc[user, movie]  # remember the real rating
            check_data.loc[user, movie] = np.nan  # hide it from the data used for correlation
            check_size -= 1
            if not check_size:
                break
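The hold-out loop above can also be wrapped in a function so that it can be run on a small matrix and checked. This is a minimal sketch, not the original author's code: the helper name `hold_out`, its parameters, the seeded generator, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def hold_out(data, n_holdout, min_ratings, seed=0):
    """Hide one known rating for each of up to n_holdout users (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    masked = data.copy()  # never modify the caller's matrix
    # Only users with more than min_ratings ratings are candidates.
    eligible = masked.index[masked.count(axis=1) > min_ratings].to_numpy()
    held = {}
    for user in rng.permutation(eligible)[:n_holdout]:
        rated = masked.loc[user].dropna().index.to_numpy()
        movie = rng.choice(rated)                 # one random rated movie
        held[(user, movie)] = masked.loc[user, movie]
        masked.loc[user, movie] = np.nan          # hide the real rating
    return masked, pd.Series(held)

# Toy user-by-movie matrix; the values are invented for illustration only.
data = pd.DataFrame(
    [[5, 3, np.nan, 1],
     [4, np.nan, np.nan, 1],
     [1, 1, 5, 4],
     [np.nan, 1, 5, 4]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
    dtype=float,
)

masked, held = hold_out(data, n_holdout=2, min_ratings=2)
print(held)
```

Each held-out entry is keyed by a (user, movie) pair, so the hidden ratings can later be compared against whatever a recommender predicts for those same cells.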
     
    >>> corr = check_data.T.corr(min_periods=200)
    >>> corr_clean = corr.dropna(how='all')
    >>> corr_clean = corr_clean.dropna(axis=1, how='all')  # drop rows and columns that are entirely NaN
    >>> check_ser = pd.Series(check)  # the 1000 real ratings that were held out
    >>> check_ser[:5]
    (15593)     4
    (23555)     3
    (333363)    4
    (362355)    5
    (533605)    4
    dtype: float64
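The correlation step above depends on `min_periods`: a pair of users only gets a correlation value if they have at least that many movies in common. The sketch below illustrates this on a toy matrix with `min_periods=3` instead of 200; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix; the values are invented for illustration only.
data = pd.DataFrame(
    [[5, 4, 1, np.nan],
     [4, 5, 2, 1],
     [1, 2, 5, np.nan],
     [np.nan, np.nan, 3, 4]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
    dtype=float,
)

# DataFrame.corr correlates columns, so transpose to put users in columns.
# Pairs with fewer than 3 movies rated in common are left as NaN.
corr = data.T.corr(min_periods=3)

# Drop users whose row and column are entirely NaN.
corr_clean = corr.dropna(how="all").dropna(axis=1, how="all")
print(corr_clean)
```

Here users 1 and 3 rate their three shared movies in opposite order, so their correlation is negative, while user 4 shares too few movies with anyone and is dropped from `corr_clean`.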
     

    References:

    Python: Recommendation Based on Collaborative Filtering

    Climbing the Kaggle MNIST Leaderboard with Python's theano Library

  • Original article: https://www.cnblogs.com/jkmiao/p/4443968.html