
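Programming Collective Intelligence -- Chapter 2: Making Recommendations

This chapter shows how to build a system that finds people with similar tastes and automatically makes recommendations based on what other people like.

Collaborative Filtering

A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set whose tastes are close to ours. It then looks at the other things those people like and combines them into a ranked list of recommendations.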

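Collecting Preferences

The first thing we need is a way to represent different people and their preferences. In Python, a simple way to do this is to use a nested dictionary.
For example:

# A dictionary of movie critics and their ratings of a small set of movies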
critics={'Lisa Rose':{'Lady in the water':2.5,'Snakes on a Plane':3.5,'Just My Luck':3.0,'Superman Returns':3.5,'You,Me and Dupree':2.5,'The Night Listener':3.0},
         'Gene Seymour':{'Lady in the water':3.0,'Snakes on a Plane':3.5,'Just My Luck':1.5,'Superman Returns':5.0,'You,Me and Dupree':3.5,'The Night Listener':3.0},
         'Michael Phillips':{'Lady in the water':2.5,'Snakes on a Plane':3.0,'Superman Returns':3.5,'The Night Listener':4.0},
         'Claudia Puig':{'Snakes on a Plane':3.5,'Just My Luck':3.0,'Superman Returns':4.0,'You,Me and Dupree':2.5,'The Night Listener':4.5},
         'Mick LaSalle':{'Lady in the water':3.0,'Snakes on a Plane':4.0,'Just My Luck':2.0,'Superman Returns':3.0,'You,Me and Dupree':2.0,'The Night Listener':3.0},
         'Jack Matthews':{'Lady in the water':3.0,'Snakes on a Plane':4.0,'Superman Returns':5.0,'You,Me and Dupree':3.5,'The Night Listener':3.0},
         'Toby':{'Snakes on a Plane':4.5,'Superman Returns':4.0,'You,Me and Dupree':1.5}}

Dictionaries are convenient: they are easy to query and to modify.

>>> critics['Lisa Rose']['Lady in the water']
2.5
>>> critics['Toby']['Snakes on a Plane']=4.5

Finding Similar Users

Two systems for computing similarity scores are introduced here:
1. Euclidean Distance Score

from math import sqrt

# Return a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Build the list of shared items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1

    # If the two have nothing in common, return 0
    if len(si)==0:
        return 0
    # Sum the squares of all the differences
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                        for item in prefs[person1] if item in prefs[person2]])

    return 1/(1+sqrt(sum_of_squares))
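
As a quick check of the distance-based score, here is a minimal sketch on a made-up data set. The names `toy_prefs`, `Ann`, `Bob` and the film titles are hypothetical and not part of the original data; the function is restated only so that the snippet runs on its own. The two shared ratings differ by 3 and 4, so the Euclidean distance is 5 and the score should be 1/(1+5).

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Items rated by both people
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    # Sum of squared rating differences over the shared items
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))

# Hypothetical ratings, chosen so the distance is easy to check by hand
toy_prefs = {
    'Ann': {'Film X': 1.0, 'Film Y': 4.0},
    'Bob': {'Film X': 4.0, 'Film Y': 8.0, 'Film Z': 2.0},
}

score = sim_distance(toy_prefs, 'Ann', 'Bob')
print(score)  # 1/6, since sqrt(3**2 + 4**2) == 5
```

'Film Z' is ignored because only Bob rated it. Identical ratings would give a score of 1, and the score shrinks toward 0 as the ratings move apart.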

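2. Pearson Correlation Score
The Pearson correlation coefficient measures how well two sets of data fit on a straight line. It is a more sophisticated way than the Euclidean distance to judge how similar people's interests are, and it tends to give better results when the data is not well normalized.

For example, plot each film with Mick LaSalle's rating on one axis and Gene Seymour's on the other. Mick LaSalle gave Superman Returns a 3 and Gene Seymour gave it a 5, so that film is placed at (3,5). A straight line can also be drawn on such a chart so that it comes as close as possible to all the points; this is called the best-fit line. If the two critics gave every film the same rating, this line would be the diagonal and would pass through every point on the chart, giving a perfect correlation score of 1.

The Pearson score algorithm first finds the items rated by both critics, then computes the sums and the sums of squares of their ratings, as well as the sum of the products of their ratings. Finally, it uses these results to compute the Pearson correlation coefficient. The code is as follows: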

from math import sqrt,pow

def sim_pearson(prefs, p1, p2):  
    # Get the list of mutually rated items  
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1  
  
    # If there are no ratings in common, return 0
    if len(si) == 0:  
        return 0  
      
    # Sum calculations  
    n = len(si)  
      
    # Sums of all the preferences  
    sum1 = sum([prefs[p1][it] for it in si])  
    sum2 = sum([prefs[p2][it] for it in si])  
      
    # Sums of the squares  
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])  
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])  
      
    # Sum of the products  
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])  
      
    # Calculate r (Pearson score)  
    num = pSum - (sum1 * sum2 / n)  
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))  
    if den == 0:  
        return 0  
      
    r = num/den  
      
    return r  
        
print(sim_pearson(critics,'Lisa Rose','Gene Seymour'))  
# Output
0.39605901719066977
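
To illustrate why the Pearson score copes better with data that is not well normalized, the sketch below compares it with the distance-based score on two made-up rating lists. The helper names `pearson` and `distance_score` and the lists `strict` and `generous` are assumptions made for this example only. The second critic ranks the films in the same order but rates everything 2 points higher.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = len(x)
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
               (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den if den != 0 else 0

def distance_score(x, y):
    """Distance-based similarity: 1 / (1 + Euclidean distance)."""
    return 1 / (1 + sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))

# Hypothetical ratings of the same three films by two critics
strict = [1.0, 2.0, 3.0]
generous = [3.0, 4.0, 5.0]  # same ordering, every rating 2 points higher

r = pearson(strict, generous)
d = distance_score(strict, generous)
print(r, d)
```

The Pearson score treats the two critics as perfectly correlated, while the distance-based score is low because every rating is 2 points apart.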

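Many other functions, such as the Jaccard coefficient or the Manhattan distance, can also serve as the similarity function, as long as they meet the following conditions: they have the same signature and return a float, where a larger value means greater similarity.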
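
As a rough illustration of these conditions, here are two sketches that follow the same `(prefs, person1, person2)` signature. The function names `sim_manhattan` and `sim_jaccard`, the `1/(1+distance)` conversion, and the toy data are assumptions made for this example; they do not appear in the original post.

```python
def sim_manhattan(prefs, person1, person2):
    """Similarity from the Manhattan distance over shared items."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    total = sum(abs(prefs[person1][item] - prefs[person2][item])
                for item in shared)
    # Invert so that a smaller distance yields a larger similarity
    return 1 / (1 + total)

def sim_jaccard(prefs, person1, person2):
    """Jaccard coefficient of the two sets of rated items.

    Only looks at which items were rated, not at the rating values.
    """
    items1, items2 = set(prefs[person1]), set(prefs[person2])
    union = items1 | items2
    if not union:
        return 0.0
    return len(items1 & items2) / len(union)

# Hypothetical ratings
toy_prefs = {
    'Ann': {'Film X': 1.0, 'Film Y': 4.0},
    'Bob': {'Film X': 4.0, 'Film Y': 8.0, 'Film Z': 2.0},
}

print(sim_manhattan(toy_prefs, 'Ann', 'Bob'))  # 1 / (1 + 3 + 4) = 0.125
print(sim_jaccard(toy_prefs, 'Ann', 'Bob'))    # 2 shared items out of 3
```

Because both functions share the signature of sim_pearson, either one could be passed wherever a `similarity` argument is accepted.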

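Ranking the Critics

In this example we want to find the critics whose tastes are similar to our own, so that we know whose advice to take when choosing a film.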

def topMatches(prefs,person,n=5,similarity=sim_pearson):
    scores=[(similarity(prefs,person,other),other) for other in prefs if other!=person]
    
    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

print(topMatches(critics,'Toby',n=3))
# Output
[(0.9878291611472606, 'Lisa Rose'), (0.933256525257383, 'Mick LaSalle'), (0.88249750329277, 'Claudia Puig')]

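Creating recommendations for Toby (each S.x column is the critic's similarity to Toby multiplied by that critic's rating):
Critic	Similarity	Night	S.xNight	Lady	S.xLady	Luck	S.xLuck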
Rose	0.99	3.0	2.97	2.5	2.48	3.0	2.97
Seymour	0.38	3.0	1.14	3.0	1.14	1.5	0.57
Puig	0.89	4.5	4.02			3.0	2.68
LaSalle	0.92	3.0	2.77	3.0	2.77	2.0	1.85
Matthews	0.66	3.0	1.99	3.0	1.99
Total			12.89		8.38		8.07
Sim.Sum			3.84		2.95		3.18
Total/Sim.Sum			3.35		2.83		2.53
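
The calculation in the table above (sum the similarity-weighted ratings for each film, then divide by the sum of the similarities of the critics who rated it) can be sketched as follows. This is a minimal version written for illustration, not code taken from the original post; the compact `pearson` helper and the toy data with the names `Ann`, `Bob` and `Cat` are assumptions.

```python
from math import sqrt

def pearson(prefs, p1, p2):
    """Compact Pearson correlation over the items both people rated."""
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    x = [prefs[p1][it] for it in shared]
    y = [prefs[p2][it] for it in shared]
    num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    den = sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
               (sum(b * b for b in y) - sum(y) ** 2 / n))
    return num / den if den != 0 else 0

def getRecommendations(prefs, person, similarity=pearson):
    totals = {}    # item -> sum of (similarity * rating)
    sim_sums = {}  # item -> sum of similarities of those who rated it
    for other in prefs:
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore people with zero or negative similarity
        if sim <= 0:
            continue
        for item, rating in prefs[other].items():
            # Only score items this person has not rated yet
            if item not in prefs[person]:
                totals[item] = totals.get(item, 0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0) + sim
    # Normalize: weighted total divided by the similarity sum
    rankings = [(total / sim_sums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings

# Hypothetical data: Bob agrees with Ann, Cat disagrees
toy_prefs = {
    'Ann': {'a': 1.0, 'b': 2.0},
    'Bob': {'a': 2.0, 'b': 4.0, 'c': 5.0},
    'Cat': {'a': 3.0, 'b': 1.0, 'c': 1.0},
}
print(getRecommendations(toy_prefs, 'Ann'))  # only Bob's rating of 'c' counts
```

Dividing by the similarity sum keeps a film from scoring higher merely because more critics have rated it.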

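Matching Products

We now know how to find people with similar tastes for a given person and how to recommend items to that person. But what if we want to know which items are similar to each other?
We can determine similarity by looking at who likes a particular item and which other items those people like. This is the same method we used earlier to determine similarity between people; we only need to swap the people and the items.
The following code does this: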

def transformPrefs(prefs):
    result={}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item,{})
            
            # Swap item and person
            result[item][person]=prefs[person][item]
    return result

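movies=transformPrefs(critics)
print(topMatches(movies,'Superman Returns'))
# getRecommendations(prefs, person): user-based recommendations, a similarity-weighted average of other people's ratings
print(getRecommendations(movies,'Just My Luck'))

# Output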
[(0.7662337662337673, 'You,Me and Dupree'), (0.4879500364742689, 'Lady in the water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.1798471947990544, 'The Night Listener'), (-0.42289003161103106, 'Just My Luck')]

[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]
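
To make the swap of people and items concrete, here is a small sketch on made-up data. The dictionary `toy` and its names are hypothetical; the function is restated in a compact form only so that the snippet runs on its own.

```python
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            # Flip person -> item into item -> person
            result.setdefault(item, {})[person] = prefs[person][item]
    return result

# Hypothetical person-centric ratings
toy = {'Ann': {'Film X': 1.0}, 'Bob': {'Film X': 4.0, 'Film Y': 2.0}}

flipped = transformPrefs(toy)
print(flipped)
# {'Film X': {'Ann': 1.0, 'Bob': 4.0}, 'Film Y': {'Bob': 2.0}}
```

Applying the function twice returns the original person-centric dictionary, which is a handy sanity check.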

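With this, a retailer can find potential buyers for a particular product. Another use: on a site dedicated to recommending links, it ensures that new links are found by the users most likely to be interested in them.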

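Item-Based Filtering

For a large site like Amazon, with millions of customers and products, comparing one user with every other user and then comparing every product each of those users has rated may be unacceptably slow.
The technique we have used so far is user-based collaborative filtering; an alternative is item-based collaborative filtering.
The general idea of item-based collaborative filtering is to precompute the most similar items for each item. Then, when we want to make recommendations for a user, we look at the items that user has rated, pick the top-rated ones, and build a weighted list of the items most similar to them. The important difference is that, although the first step requires examining all the data, comparisons between items do not change as often as comparisons between users.

Building the Item Comparison Dataset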

def calculateSimilarItems(prefs,n=10):
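    # Create a dictionary of items showing which other items they are most similar to
    result={}
    
    # Invert the preference matrix to be item-centric
    itemPrefs=transformPrefs(prefs)
    c=0
    for item in itemPrefs:
        # Status updates for large datasets
        c+=1
        if c%100==0: print("%d/%d" % (c,len(itemPrefs)))
        # Find the most similar items to this one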
        scores=topMatches(itemPrefs,item,n=n,similarity=sim_pearson)
        result[item]=scores
    return result
           
print(calculateSimilarItems(critics))

# Output
{'Superman Returns': [(0.7662337662337673, 'You,Me and Dupree'), (0.4879500364742689, 'Lady in the water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.1798471947990544, 'The Night Listener'), (-0.42289003161103106, 'Just My Luck')], 'Snakes on a Plane': [(0.7637626158259785, 'Lady in the water'), (0.11180339887498941, 'Superman Returns'), (-0.3333333333333333, 'Just My Luck'), (-0.560611910581388, 'You,Me and Dupree'), (-0.5663521139548527, 'The Night Listener')], 'Just My Luck': [(0.5555555555555556, 'The Night Listener'), (-0.3333333333333333, 'Snakes on a Plane'), (-0.42289003161103106, 'Superman Returns'), (-0.4856618642571827, 'You,Me and Dupree'), (-0.9449111825230676, 'Lady in the water')], 'The Night Listener': [(0.5555555555555556, 'Just My Luck'), (-0.1798471947990544, 'Superman Returns'), (-0.250000000000002, 'You,Me and Dupree'), (-0.5663521139548527, 'Snakes on a Plane'), (-0.6123724356957927, 'Lady in the water')], 'You,Me and Dupree': [(0.7662337662337673, 'Superman Returns'), (0.3333333333333333, 'Lady in the water'), (-0.250000000000002, 'The Night Listener'), (-0.4856618642571827, 'Just My Luck'), (-0.560611910581388, 'Snakes on a Plane')], 'Lady in the water': [(0.7637626158259785, 'Snakes on a Plane'), (0.4879500364742689, 'Superman Returns'), (0.3333333333333333, 'You,Me and Dupree'), (-0.6123724356957927, 'The Night Listener'), (-0.9449111825230676, 'Just My Luck')]}

Getting Recommendations

Item-based recommendations for Toby

Movie	Rating	Night	R.xNight	Lady	R.xLady	Luck	R.xLuck
Snakes	4.5	0.182	0.818	0.222	0.999	0.105	0.474
Superman	4.0	0.103	0.412	0.091	0.363	0.065	0.258
Dupree	1.0	0.148	0.148	0.4	0.4	0.182	0.182
Total		0.433	1.378	0.713	1.762	0.352	0.914
Normalized			3.183		2.473		2.598

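Each row lists a film we have already seen, along with our rating for it. For every film we have not yet seen, there is a column showing how similar it is to each film we have seen. The columns starting with R.x give our rating multiplied by that similarity. The Total row gives, for each unseen film, the sum of its similarity scores and the sum of its R.x column. The Normalized row is the R.x total divided by the similarity total. Note that the table uses a rating of 1.0 for Dupree, while the critics dictionary above gives Toby 1.5 for that film, and its similarity values differ from the Pearson-based ones computed above, so the code output below does not match the table.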
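
The arithmetic in the table can be checked with a few lines of Python. The sketch below copies the rounded ratings and similarities from the table; the variable names `ratings`, `similarity_to` and `normalized` are chosen for this example. Because the inputs are already rounded to three decimals, the results agree with the Normalized row only to about two decimal places.

```python
# Toby's ratings as used in the table
ratings = {'Snakes': 4.5, 'Superman': 4.0, 'Dupree': 1.0}

# Similarity of each unseen film to each film Toby has rated (from the table)
similarity_to = {
    'Night': {'Snakes': 0.182, 'Superman': 0.103, 'Dupree': 0.148},
    'Lady':  {'Snakes': 0.222, 'Superman': 0.091, 'Dupree': 0.4},
    'Luck':  {'Snakes': 0.105, 'Superman': 0.065, 'Dupree': 0.182},
}

normalized = {}
for candidate, sims in similarity_to.items():
    # Sum of the R.x column: rating times similarity
    weighted_total = sum(ratings[movie] * sim for movie, sim in sims.items())
    # Sum of the similarity column
    sim_total = sum(sims.values())
    normalized[candidate] = weighted_total / sim_total

for candidate, score in normalized.items():
    print(candidate, score)
```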
The implementation of getRecommendedItems is as follows.

def getRecommendedItems(prefs,itemMatch,user):
    userRatings=prefs[user]
    scores={}
    totalSim={}
    
    # Loop over the items rated by this user
    for (item,rating) in userRatings.items():
        
        # Loop over the items similar to this one
        for (similarity,item2) in itemMatch[item]:
            # Skip items this user has already rated
            if item2 in userRatings:continue
             
            # Weighted sum of rating times similarity
            scores.setdefault(item2,0)
            scores[item2]+=similarity*rating

            # Sum of all the similarities
            totalSim.setdefault(item2,0)
            totalSim[item2]+=similarity

    # Divide each total score by the total similarity to get a weighted average
    rankings=[(score/totalSim[item],item) for item,score in scores.items()]

    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings
      
itemsim=calculateSimilarItems(critics)      
print(getRecommendedItems(critics,itemsim,'Toby'))

# Output
[(3.7151804871832232, 'Lady in the water'), (3.656871933137409, 'The Night Listener'), (3.15653397806353, 'Just My Luck')]

Original article: https://www.cnblogs.com/bbn0111/p/6994573.html
