  • A Parallel Approach to Probability-Based Item Similarity

    Recommender systems are a great thing, and the more data they get, the better they generally perform. The challenge is the computational load. Parallel processing has been a booming topic in recent years, so this post implements the algorithm from the previous post from a parallel angle.

    Platform: Hadoop-1.0.3; Hive-0.8.1; Eclipse SDK Version 3.3.2

    Resources: a 14-node Hadoop cluster

    =====================================Steps=======================================

    Step 1: Build the base data

    create external table dm_fan_prob_basic(user string, item int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    stored as textfile
    location '/user/hive/fan/kuaixiu/prob/basic';

    The base data consists of just two fields, a user field and an item field; this is probably the simplest and most widely applicable base data possible.
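
    For illustration, here is a hedged sketch of populating the table; the HDFS path and the sample rows are made up, and any tab-delimited user-item log would do:

    -- basic.txt, tab-delimited: user<TAB>item
    --   u001    101
    --   u001    102
    --   u002    101
    LOAD DATA INPATH '/tmp/juefan/sample/basic.txt' INTO TABLE dm_fan_prob_basic;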

    Step 2: Count how many times each pair of items appears under the same user

    create external table dm_fan_prob_co(item_a int, item_b int, co int COMMENT 'The known number of users that like both') 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    stored as textfile
    location '/user/hive/fan/kuaixiu/prob/co';
    insert overwrite table dm_fan_prob_co
    select r.item_a, r.item_b, count(*) as co
    from(
    select a.user, a.item as item_a, b.item as item_b
    from dm_fan_prob_basic a
    join dm_fan_prob_basic b
    on a.user = b.user
    )r
    group by r.item_a, r.item_b;
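
    Note that the self-join also keeps the pair (x, x), so the diagonal rows record each item's total audience. A hedged sanity check (item id 101 is a placeholder):

    -- co(x, x) equals the number of users who interacted with item x
    select co from dm_fan_prob_co where item_a = 101 and item_b = 101;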

    Step 3: Compute the probabilistic similarity between items

    This is the hardest step. It cannot be done with Hive SQL alone, so the corresponding custom functions have to be written in Eclipse, and it cannot be finished in a single pass either, so it is broken down into several small tables. First, the total number of users (U), the total number of items (A), and the total number of events (V) have to be counted; these three totals are needed to derive the Pi coefficient in the item similarity, which here came out to Pi = 0.003.
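
    As a hedged sketch (the aliases U, A, V are simply the names used above), the three totals can be obtained in one scan:

    select count(distinct user) as U,    -- total number of users
           count(distinct item) as A,    -- total number of items
           count(*)             as V     -- total number of events
    from dm_fan_prob_basic;

    How Pi is derived from these totals follows the previous post; the value used downstream here is simply Pi = 0.003.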

    Step 3.1: Compute each user's total event count L(u)

    create table dm_fan_prob_fu as
    select user, (${hiveconf:A} - count(*))/count(*) as fu
    from dm_fan_prob_basic
    group by user;
    Here ${hiveconf:A} is the total number of items in the whole system, so fu = (A - L(u)) / L(u).
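
    A hedged sketch of supplying the constant before running the statement above (the value 120000 and the script name are placeholders):

    set A=120000;    -- makes ${hiveconf:A} resolve; equivalently: hive --hiveconf A=120000 -f step3_1.sql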

    Step 3.2: Compute how many users have interacted with each item, L(x)

    create table dm_fan_prob_fi as
    select item, (${hiveconf:U} - count(*))/count(*) as fi
    from dm_fan_prob_basic
    group by item;
    Here ${hiveconf:U} is the total number of users, so fi = (U - L(x)) / L(x).

    Step 3.3: Join table dm_fan_prob_co with table dm_fan_prob_fi

    The purpose here is to reduce the amount of computation when the next step performs another join.

    add jar /tmp/juefan/function/Prob_Sim.jar;
    create temporary function udafitem as 'Item.UDAFItem';
    create table dm_fan_prob_iteminfo as
    select a.item_a, udafitem(concat(a.item_b, ',', a.co, ',', b.fi) )as iteminfo
    from dm_fan_prob_co a
    join dm_fan_prob_fi b
    on a.item_b = b.item
    group by a.item_a;

    The table produced by this step has the following structure:

    item   Array[Struct{related item, co-occurrence count, the related item's fi statistic from Step 3.2}]

    Concretely, iteminfo is one ";"-joined string per item, e.g. 102,3,45.0;103,1,12.0 (made-up values).

    The UDAF above is just about the simplest possible custom aggregate function; its code is as follows:

    package Item;
    import org.apache.hadoop.hive.ql.exec.UDAF;
    import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
    public class UDAFItem extends UDAF{
        public static class ArrayEvaluator implements UDAFEvaluator{
            //The final concatenated result
            String value;
            
            public ArrayEvaluator(){
                super();
            }
            public void init(){
                value = null;
            }
            //Join the incoming strings with ";"
            //this is effectively the map-side operation
            public boolean iterate(String o){
                if(o != null){
                    if(value == null){
                        value = o;
                    }else {
                        value = value + ";" + o;
                    }            
                }
                return true;
            }
            //Equivalent to a do-nothing combine step
            public String terminatePartial(){
                return value;
            }
            //Equivalent to the reduce operation
            public boolean merge(String o){
                if(o != null){
                    if(value != null){
                        value = value + ";" + o;
                    }else {
                        value = o;
                    }
                }
                return true;        
            }
            //Return the fully concatenated data
            public String terminate(){
                return value;
            }
        }
    }
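
    As a quick check of what the aggregation produces, the packed string can be inspected directly (a hedged example; the item id and the values in the expected output are made up):

    select iteminfo from dm_fan_prob_iteminfo where item_a = 101;
    -- expected shape: 102,3,45.0;103,1,12.0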

    Step 3.4: Join table dm_fan_prob_basic with table dm_fan_prob_fu

    Again, this is to cut down the computation. Both Step 3.3 and Step 3.4 are keyed on the item, so when the job runs in parallel on the cluster it is easy to ship identical key values to the same node.

    add jar /tmp/juefan/function/Prob_Sim.jar;
    create temporary function udafitem as 'Item.UDAFItem';
    create table dm_fan_prob_userinfo as
    select a.item, udafitem(concat(a.user, ',', b.fu)) as userinfo
    from dm_fan_prob_basic a
    join dm_fan_prob_fu b
    on a.user = b.user
    group by a.item;

    This step uses the same UDAF as Step 3.3; the resulting table structure is:

    item   Array[Struct{user, the user's fu statistic}]

    With the results of these two steps, the similarity between items can now be computed.

    Step 3.5: Compute the item similarity

    create external table dm_fan_prob_sim(item_a int, item_b int, sim double COMMENT 'How much more than expected are you to like item_a if you like item_b ') 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    stored as textfile
    location '/user/hive/fan/kuaixiu/prob/sim';
    add jar /tmp/juefan/function/Prob_Sim.jar;
    create temporary function udfitem as 'Item.UDFItem';
    create temporary function udtfitem as 'Item.UDTFItem';
    insert overwrite table dm_fan_prob_sim
    select a.item as item_a, b.item_b, cast(b.sim as double) as sim
    from(
    select a.item, udfitem(b.iteminfo, a.userinfo, 0.003) as simstring
    from dm_fan_prob_userinfo a
    join dm_fan_prob_iteminfo b
    on a.item = b.item_a
    )a
    lateral view udtfitem(a.simstring)b as item_b, sim;

    This step uses two custom functions.

    First, the outputs of Step 3.3 and Step 3.4 are joined with the item as the key.

    The joined data has the following format:

    item   Array[Struct{related item, co-occurrence count, the related item's fi statistic}]   Array[Struct{related user, the related user's fu statistic}]
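
    Written out (notation reconstructed from the UDF code below, not taken from the original post), the similarity the UDF computes is

    \[
    \mathrm{sim}(a,b) = \frac{co(a,b)}{\sum_{u \in U_a} \frac{1}{1 + \pi \, f_i(b) \, f_u(u)}},
    \qquad
    f_u(u) = \frac{A - L(u)}{L(u)},
    \qquad
    f_i(b) = \frac{U - L(b)}{L(b)}
    \]

    where U_a is the set of users who interacted with item a. Each term 1/(1 + pi*fi*fu) is the modeled probability that user u chooses item b, so the denominator is the expected number of common users, and sim measures how far the observed count co exceeds that expectation.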

    ===UDF source code===

    package Item;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public class UDFItem extends UDF{
        //Holds one related item's information
        public class Item{
            public String item_b;
            public int co;
            public double iteminfo;
        }
        //Holds one related user's information
        public class User{
            public String user;
            public double userinfo;
        }
        /**
         * Compute the probabilistic similarity of an item
         * @param iteminfo the item-side information
         * @param userinfo the user-side information
         * @param pi the item-user coefficient
         * @return the chain of item-item probabilistic similarities
         */
        public String evaluate(String iteminfo, String userinfo, double pi){
            StringBuilder resultBuilder = new StringBuilder("");
            ArrayList<Item> itemArrayList = new ArrayList<Item>();
            ArrayList<User> userArrayList = new ArrayList<User>();
            String[] itemvalue = iteminfo.split(";");
            String[] uservalue = userinfo.split(";");
            int isize = itemvalue.length;
            int usize = uservalue.length;
            //Parse the ";"-joined item strings: item_b,co,fi
            for(int i = 0; i < isize; i++){
                Item tmpItem = new Item();
                tmpItem.item_b = itemvalue[i].split(",")[0];
                tmpItem.co = Integer.parseInt(itemvalue[i].split(",")[1]);
                tmpItem.iteminfo = Double.parseDouble(itemvalue[i].split(",")[2]);
                itemArrayList.add(tmpItem);
            }
            //Parse the ";"-joined user strings: user,fu
            for(int i = 0; i < usize; i++){
                User tmpUser = new User();
                tmpUser.user = uservalue[i].split(",")[0];
                tmpUser.userinfo = Double.parseDouble(uservalue[i].split(",")[1]);
                userArrayList.add(tmpUser);
            }
            /**Compute the probabilistic similarity between the given item and every related item**/
            for(int i = 0; i < isize; i++){
                double sum = 0.0;
                for(int j = 0; j < usize; j++){
                    //Each term is the probability that user(j) chooses item(i)
                    sum = sum + 1/(1 + pi*itemArrayList.get(i).iteminfo*userArrayList.get(j).userinfo);
                }
                resultBuilder.append(itemArrayList.get(i).item_b).append(",").append(itemArrayList.get(i).co/sum).append(";");
            }
            return resultBuilder.toString();
        }
    }
    ===UDTF source code===

    package Item;

    import java.util.ArrayList;

    import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
    import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class UDTFItem extends GenericUDTF{

        public void close() throws HiveException{
        }

        //Declare the two output columns: item and sim
        public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException{
            if(args.length != 1){
                throw new UDFArgumentLengthException("Test takes only one argument!");
            }
            if(args[0].getCategory() != ObjectInspector.Category.PRIMITIVE){
                throw new UDFArgumentException("Test takes string as a parameter");
            }
            ArrayList<String> fieldNames = new ArrayList<String>();
            ArrayList<ObjectInspector> fieldOis = new ArrayList<ObjectInspector>();
            fieldNames.add("item");
            fieldOis.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
            fieldNames.add("sim");
            fieldOis.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
            return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOis);
        }

        //Split the ";"-joined similarity chain and emit one (item, sim) row per entry
        public void process(Object[] args) throws HiveException{
            String inputString = args[0].toString();
            String[] teStrings = inputString.split(";");
            int size = teStrings.length;
            for(int i = 0; i < size; i++){
                try{
                    String[] resultStrings = new String[2];
                    resultStrings[0] = teStrings[i].split(",")[0];
                    resultStrings[1] = teStrings[i].split(",")[1];
                    forward(resultStrings);
                }catch (Exception e) {
                    continue;
                }
            }
        }
    }

    ======================================The End====================================

    The steps above yield the final item-similarity result.
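
    For example, an item's nearest neighbours can then be read straight off the result table (a hedged sketch; item id 101 is a placeholder):

    select item_b, sim
    from dm_fan_prob_sim
    where item_a = 101
    order by sim desc
    limit 10;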

    The TopN recommendation part will be written up another day...

  • Related reading:
    It turns out I haven't written anything for four or five years
    Is Spark really slower than Oracle? Performance tests on million-row data
    Websites programmers frequent (repost)
    A summary of dip, px, and conversion between them in Android
    The difference between padding and margin
    Exploring Python decorators: decorator arguments
    The puzzle of Python decorator execution order
    A brief introduction to the Python CSV module
    Understanding threads 3: basic thread operations with C examples
    A brief introduction to ctypes, Python's foreign function library
  • Original post: https://www.cnblogs.com/juefan/p/3023544.html