  • A walkthrough of the wordcount process in Hadoop MapReduce

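    Split the input files into records

    File 1:                                  Split result:

    hello world                              <0, "hello world">

    this is wordcount                        <12, "this is wordcount">

    File 2:

    hello china                              <0, "hello china">

    hello IT                                 <12, "hello IT">

    Each key is the byte offset at which the line starts in its file, and each value is the text of that line. The test files are small, so each file normally forms a single split.

    The MapReduce framework performs this splitting itself.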
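To make the offsets concrete, here is a minimal plain-Python sketch of how <offset, line> records like the ones above could be produced. It is only a simulation: the helper name `line_records` is hypothetical, it does not use any Hadoop API, and it assumes each line ends with a single `\n`.

```python
def line_records(data: bytes):
    """Return a list of (byte_offset, line_text) pairs, one per input line."""
    records = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is the byte position where the line starts;
        # the value is the line without its trailing newline.
        records.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
    return records


file1 = b"hello world\nthis is wordcount\n"
file2 = b"hello china\nhello IT\n"

print(line_records(file1))
print(line_records(file2))
```

"hello world" is 11 bytes plus one newline byte, which is why the second record of each file starts at offset 12.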

     

    Next, each <key,value> pair from the split is passed to the user-defined map method, which emits new <key,value> pairs:

    <0, "hello world">              map()        <hello,1> <world,1>

    <12, "this is wordcount">       map()        <this,1> <is,1> <wordcount,1>

    <0, "hello china">              map()        <hello,1> <china,1>

    <12, "hello IT">                map()        <hello,1> <IT,1>
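The map step can be sketched in plain Python as follows. This is an illustration under assumptions, not Hadoop's Mapper API: the function name `wc_map` is hypothetical, and words are assumed to be separated by whitespace.

```python
def wc_map(offset, line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line.

    The offset key is ignored: wordcount only needs the line text.
    """
    return [(word, 1) for word in line.split()]


records = [(0, "hello world"), (12, "this is wordcount")]
for offset, line in records:
    print((offset, line), "->", wc_map(offset, line))
```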

    Between map() and reduce() there is a shuffle phase, which orders each map task's output by key:

    <hello,1> <world,1>                 shuffle()        <hello,1>

    <this,1> <is,1> <wordcount,1>       shuffle()        <is,1>

                                                         <this,1>

                                                         <wordcount,1>

                                                         <world,1>

    <hello,1> <china,1>                 shuffle()        <china,1>

    <hello,1> <IT,1>                    shuffle()        <hello,1>

                                                         <hello,1>

                                                         <IT,1>
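The ordering part of the shuffle can be approximated with an ordinary sort. The sketch below is a simplification (the helper name `sort_by_key` is hypothetical, and partitioning and network transfer are left out). It assumes keys are compared byte by byte, so the uppercase key "IT" sorts before the lowercase words rather than last as in the listing above.

```python
def sort_by_key(pairs):
    """Sort (key, value) pairs by the UTF-8 bytes of the key."""
    return sorted(pairs, key=lambda kv: kv[0].encode("utf-8"))


# Map output of file 1 and file 2, in emission order.
map1 = [("hello", 1), ("world", 1), ("this", 1), ("is", 1), ("wordcount", 1)]
map2 = [("hello", 1), ("china", 1), ("hello", 1), ("IT", 1)]

print(sort_by_key(map1))
print(sort_by_key(map2))
```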

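    Grouping: values that share a key are merged. This happens first within each map task's output, where the two <hello,1> pairs from file 2 are combined locally into a single value of 2:

    <hello,1>                        <hello,list(1)>

    <is,1>                           <is,list(1)>

    <this,1>                         <this,list(1)>

    <wordcount,1>                    <wordcount,list(1)>

    <world,1>                        <world,list(1)>

    <china,1>                        <china,list(1)>

    <hello,1>

    <hello,1>                        <hello,list(2)>

    <IT,1>                           <IT,list(1)>

    The outputs of the two map tasks are then merged by key:

    <china,list(1)>

    <hello,list(1,2)>

    <is,list(1)>

    <this,list(1)>

    <wordcount,list(1)>

    <world,list(1)>

    <IT,list(1)>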
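The two grouping stages can be sketched like this. The helper names `combine` and `merge` are hypothetical, and the local summing is an assumption about what the listing depicts (a combiner-style step); it is not a call into Hadoop.

```python
from itertools import groupby


def combine(sorted_pairs):
    """Within one map task's sorted output, sum the counts of each key."""
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])
    ]


def merge(*map_outputs):
    """Collect the values of each key across all map outputs into a list."""
    grouped = {}
    for output in map_outputs:
        for key, value in output:
            grouped.setdefault(key, []).append(value)
    return grouped


# Sorted map outputs of file 1 and file 2.
out1 = [("hello", 1), ("is", 1), ("this", 1), ("wordcount", 1), ("world", 1)]
out2 = [("china", 1), ("hello", 1), ("hello", 1), ("IT", 1)]

merged = merge(combine(out1), combine(out2))
print(merged)
```

`groupby` only merges adjacent items, which is why the pairs need to be ordered by key before `combine` is applied.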

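    These grouped <key,value> pairs are then passed to the user's reduce() method, which produces the final <key,value> pairs. Together they form the output of wordcount:

    <china,1>

    <hello,3>

    <is,1>

    <this,1>

    <wordcount,1>

    <world,1>

    <IT,1>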
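The reduce step for wordcount amounts to summing each key's value list. A minimal sketch, assuming the grouped input shown earlier (the name `wc_reduce` is hypothetical and this is not Hadoop's Reducer API):

```python
def wc_reduce(key, values):
    """Sum all counts collected for one word."""
    return (key, sum(values))


grouped = {
    "china": [1],
    "hello": [1, 2],
    "is": [1],
    "this": [1],
    "wordcount": [1],
    "world": [1],
    "IT": [1],
}

result = dict(wc_reduce(key, values) for key, values in grouped.items())
print(result)
```

For "hello" the list (1, 2) sums to 3: one occurrence in file 1 and two in file 2.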

  • Original article: https://www.cnblogs.com/pickKnow/p/10767222.html