  • [Instructor Zhao Qiang] Computing Aggregations with MapReduce in MongoDB

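    MapReduce can express very complex aggregation logic and is highly flexible. It is also slow, so it should not be used for real-time data analysis. MapReduce can run in parallel across multiple servers: each server handles only part of the workload and sends its partial result to the master server, which merges the partial results, computes the final result set, and returns it to the client.
    The basic idea of MapReduce is shown in the following figure: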

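    This example computes a sum. The Map phase runs first: one large task is split into several smaller tasks, and each small task runs on a different node, which is what makes distributed computation possible (the blue box in the figure). The results of the small tasks are then combined in a second computation that produces the final result, 55. This phase is called Reduce (the red box in the figure).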
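
The split-then-combine idea can be sketched in plain JavaScript. This is only an illustration: the inputs are assumed to be the integers 1 through 10 (which sum to 55), the chunk size of 3 is arbitrary, and `chunk`, `partialSums`, and `total` are hypothetical names. The chunks are processed sequentially here rather than on separate nodes.

```javascript
// Assumed input: the integers 1..10, whose sum is 55.
const numbers = Array.from({ length: 10 }, (_, i) => i + 1);

// Split the big task into smaller chunks (hypothetical chunk size of 3).
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// "Map" phase: each small task sums its own chunk independently.
const partialSums = chunk(numbers, 3).map(part => part.reduce((a, b) => a + b, 0));

// "Reduce" phase: combine the partial results into the final answer.
const total = partialSums.reduce((a, b) => a + b, 0);

console.log(partialSums, total);
```

Because addition is associative, the final total does not depend on how the input is split into chunks.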

    Computing an aggregation with MapReduce involves three steps: Map, Shuffle, and Reduce. Map and Reduce must be defined explicitly; Shuffle is performed by MongoDB.

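    • Map: apply an operation to each document, producing a key and a value
    • Shuffle: group by key, collecting the values that share a key into an array
    • Reduce: reduce each array of values to a single value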
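
The three steps can be imitated in plain JavaScript to see how they fit together. This is a simplified in-memory sketch, not how MongoDB implements them: `runMapReduce`, `mapDoc`, and `reduceValues` are hypothetical names, the sample documents are made up for illustration, and the map function receives the document and an `emit` callback as arguments, whereas in MongoDB it reads the document from `this` and calls a global `emit`.

```javascript
// Hypothetical sample documents (a subset of fields, for illustration only).
const docs = [
  { job: 'CLERK', sal: 800 },
  { job: 'SALESMAN', sal: 1600 },
  { job: 'CLERK', sal: 1100 },
  { job: 'MANAGER', sal: 2975 },
];

// Map: called once per document; emits a (key, value) pair.
function mapDoc(doc, emit) {
  emit(doc.job, doc.sal);
}

// Reduce: collapses the array of values for one key into a single value.
function reduceValues(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Simplified in-memory stand-in for the engine: map, shuffle, reduce.
function runMapReduce(documents, mapFn, reduceFn) {
  const groups = new Map(); // Shuffle target: key -> array of values
  for (const doc of documents) {
    mapFn(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const salaryByJob = runMapReduce(docs, mapDoc, reduceValues);
console.log(salaryByJob);
```

One difference worth noting: this sketch calls the reduce function for every key, while MongoDB does not call it for a key that has only a single value.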

    The demonstrations below use the following test data (employee records).

    db.emp.insert(
    [
    {_id:7369,ename:'SMITH' ,job:'CLERK'    ,mgr:7902,hiredate:'17-12-80',sal:800,comm:0,deptno:20},
    {_id:7499,ename:'ALLEN' ,job:'SALESMAN' ,mgr:7698,hiredate:'20-02-81',sal:1600,comm:300 ,deptno:30},
    {_id:7521,ename:'WARD'  ,job:'SALESMAN' ,mgr:7698,hiredate:'22-02-81',sal:1250,comm:500 ,deptno:30},
    {_id:7566,ename:'JONES' ,job:'MANAGER'  ,mgr:7839,hiredate:'02-04-81',sal:2975,comm:0,deptno:20},
    {_id:7654,ename:'MARTIN',job:'SALESMAN' ,mgr:7698,hiredate:'28-09-81',sal:1250,comm:1400,deptno:30},
    {_id:7698,ename:'BLAKE' ,job:'MANAGER'  ,mgr:7839,hiredate:'01-05-81',sal:2850,comm:0,deptno:30},
    {_id:7782,ename:'CLARK' ,job:'MANAGER'  ,mgr:7839,hiredate:'09-06-81',sal:2450,comm:0,deptno:10},
    {_id:7788,ename:'SCOTT' ,job:'ANALYST'  ,mgr:7566,hiredate:'19-04-87',sal:3000,comm:0,deptno:20},
    {_id:7839,ename:'KING'  ,job:'PRESIDENT',mgr:0,hiredate:'17-11-81',sal:5000,comm:0,deptno:10},
    {_id:7844,ename:'TURNER',job:'SALESMAN' ,mgr:7698,hiredate:'08-09-81',sal:1500,comm:0,deptno:30},
    {_id:7876,ename:'ADAMS' ,job:'CLERK'    ,mgr:7788,hiredate:'23-05-87',sal:1100,comm:0,deptno:20},
    {_id:7900,ename:'JAMES' ,job:'CLERK'    ,mgr:7698,hiredate:'03-12-81',sal:950,comm:0,deptno:30},
    {_id:7902,ename:'FORD'  ,job:'ANALYST'  ,mgr:7566,hiredate:'03-12-81',sal:3000,comm:0,deptno:20},
    {_id:7934,ename:'MILLER',job:'CLERK'    ,mgr:7782,hiredate:'23-01-82',sal:1300,comm:0,deptno:10}
    ]
    );
    

    (Case 1) Count the number of employees in each job

    // Emit (job, 1) for every employee document.
    var map1=function(){emit(this.job,1)}
    // Sum the 1s emitted for each job.
    var reduce1=function(job,counts){return Array.sum(counts)}
    db.emp.mapReduce(map1,reduce1,{out:"mrdemo1"})
    

    (Case 2) Compute the total salary of each department

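    // Emit (deptno, sal) for every employee document.
    var map2=function(){emit(this.deptno,this.sal)}
    // Sum the salaries emitted for each department.
    var reduce2=function(deptno,sals){return Array.sum(sals)}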
    db.emp.mapReduce(map2,reduce2,{out:"mrdemo2"})
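
To sanity-check the output collections (for example with `db.mrdemo1.find()` and `db.mrdemo2.find()`), the expected values can be worked out independently in plain JavaScript. The sketch below copies only the job, deptno, and sal fields from the test data; `employees`, `sumBy`, `countByJob`, and `salByDept` are hypothetical names introduced for this check and are not part of the MongoDB API.

```javascript
// [job, deptno, sal] for the 14 employees in the test data.
const employees = [
  ['CLERK', 20, 800], ['SALESMAN', 30, 1600], ['SALESMAN', 30, 1250],
  ['MANAGER', 20, 2975], ['SALESMAN', 30, 1250], ['MANAGER', 30, 2850],
  ['MANAGER', 10, 2450], ['ANALYST', 20, 3000], ['PRESIDENT', 10, 5000],
  ['SALESMAN', 30, 1500], ['CLERK', 20, 1100], ['CLERK', 30, 950],
  ['ANALYST', 20, 3000], ['CLERK', 10, 1300],
];

// Group rows by a key and sum a value, mirroring emit(key, value) + Array.sum.
function sumBy(rows, keyOf, valueOf) {
  const totals = {};
  for (const row of rows) {
    const key = keyOf(row);
    totals[key] = (totals[key] || 0) + valueOf(row);
  }
  return totals;
}

// Case 1: every employee contributes 1 to its job.
const countByJob = sumBy(employees, r => r[0], () => 1);
// Case 2: every employee contributes its salary to its department.
const salByDept = sumBy(employees, r => r[1], r => r[2]);

console.log(countByJob);
console.log(salByDept);
```

The per-key values in mrdemo1 and mrdemo2 should agree with these totals.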
    

    (Case 3) Troubleshoot the Map Function

    Define your own emit function, which prints each key and value instead of passing them to MongoDB:
    var emit = function(key, value) {
    print("emit");
    print("key: " + key + "  value: " + tojson(value));
    }
    
    Test with a single document:
    var emp7839=db.emp.findOne({_id:7839})
    map2.apply(emp7839)
    This prints:
    emit
    key: 10  value: 5000
    
    Test with multiple documents:
    var myCursor=db.emp.find()
    while (myCursor.hasNext()) {
        var doc = myCursor.next();
        print ("document _id= " + tojson(doc._id));
        map2.apply(doc);
        print();
    }
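
The same technique works outside the shell if the stand-in `emit` collects pairs instead of printing them, so the output can be checked programmatically. This is a minimal sketch in plain JavaScript: the `emitted` array and `sampleDocs` are hypothetical, and the two documents carry only the fields that map2 reads.

```javascript
// Collect emitted pairs so they can be inspected or asserted on.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Same map function as in case 2; `this` is the current document.
const map2 = function () { emit(this.deptno, this.sal); };

// Two documents taken from the test data (only the fields map2 reads).
const sampleDocs = [
  { _id: 7839, deptno: 10, sal: 5000 },
  { _id: 7369, deptno: 20, sal: 800 },
];

for (const doc of sampleDocs) {
  map2.apply(doc); // bind `this` to the document, as the shell test does
}

console.log(emitted);
```

Each document should produce exactly one pair here; a map function that emits zero or several pairs per document would show up as a different length of `emitted`.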
    

    (Case 4) Troubleshoot the Reduce Function

    A simple test case:
    var myTestValues = [ 5, 5, 10 ];
    var reduce1=function(key,values){return Array.sum(values)}
    reduce1("mykey",myTestValues)
    
    Test: the values passed to reduce are documents with several fields
    Test data (salary and commission):
    var myTestObjects = [
                          { sal: 1000, comm: 5 },
                          { sal: 2000, comm: 10 },
                          { sal: 3000, comm: 15 }
                        ];
    Write the reduce function:
    var reduce2=function(key,values) {
       // Declare with var so the accumulator does not leak into the global scope.
       var reducedValue = { sal: 0, comm: 0 };
       for(var i=0;i<values.length;i++) {
         reducedValue.sal += values[i].sal;
         reducedValue.comm += values[i].comm;
       }  
       return reducedValue;
    }
    
    Test:
    reduce2("aa",myTestObjects)
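
MongoDB may call the reduce function more than once for the same key, passing the output of an earlier call back in as one of the values. The function therefore has to return a value of the same shape as the emitted values and give the same answer regardless of how the values are grouped. The sketch below checks this for reduce2 in plain JavaScript; `oneShot`, `partial`, and `reReduced` are hypothetical names, and the check is an extra test beyond the steps shown above.

```javascript
// Same logic as reduce2 above: sum the sal and comm fields.
function reduce2(key, values) {
  const reducedValue = { sal: 0, comm: 0 };
  for (let i = 0; i < values.length; i++) {
    reducedValue.sal += values[i].sal;
    reducedValue.comm += values[i].comm;
  }
  return reducedValue;
}

const myTestObjects = [
  { sal: 1000, comm: 5 },
  { sal: 2000, comm: 10 },
  { sal: 3000, comm: 15 },
];

// Reduce everything in one call.
const oneShot = reduce2('aa', myTestObjects);

// Reduce the first two values, then feed that result back in with the third.
const partial = reduce2('aa', myTestObjects.slice(0, 2));
const reReduced = reduce2('aa', [partial, myTestObjects[2]]);

console.log(oneShot, reReduced);
```

If the two results differ, the reduce function is not safe to use with mapReduce even though a single direct call looks correct.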
    

  • Original article: https://www.cnblogs.com/collen7788/p/13662639.html