  • Python & MapReduce

    Implementing a Hadoop MapReduce program in Python

     

    For the original article, see:

    http://blog.csdn.net/zhaoyl03/article/details/8657031/

    The steps below only run mapper.py and reducer.py on Windows; they were not tested in a Hadoop environment.

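    Environment setup:

    1. Windows 7 (32-bit)
    2. Install GnuWin32 so that Linux commands can be run in cmd
    3. Install IDLE (Python GUI) so that Python scripts can be run
    4. Add the Python installation directory to the Windows PATH environment variable, so that after changing to the script's directory in a cmd window, a Python script can be run directly by typing its name

    My Python is installed at: C:\Python27\python.exe

    The test scripts are in: E:\PythonTest

    Added to the Windows environment variables: C:\Python27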
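
To confirm that step 4 took effect, the PATH value can be inspected from Python itself. The following is a minimal sketch, not part of the original post; the helper name `path_contains` and its parameters are hypothetical. It compares entries case-insensitively and ignores trailing slashes, which is how Windows treats them.

```python
import os


def path_contains(directory, path_value=None, sep=os.pathsep):
    """Return True if `directory` is one of the entries in a PATH-style string."""
    if path_value is None:
        # Fall back to the PATH of the current process.
        path_value = os.environ.get("PATH", "")

    def normalize(entry):
        # Ignore surrounding quotes, trailing slashes, and letter case.
        return entry.strip().strip('"').rstrip("\\/").lower()

    target = normalize(directory)
    entries = [e for e in path_value.split(sep) if e.strip()]
    return any(normalize(e) == target for e in entries)


if __name__ == "__main__":
    # C:\Python27 is the install directory named in the setup notes above.
    print(path_contains(r"C:\Python27"))
```

Run it from a newly opened cmd window, since an already open window keeps the old PATH.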

    mapper.py :

     

    #!/usr/bin/env python

    import sys

    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            # (print with a single argument works on Python 2 and 3)
            print('%s\t%s' % (word, 1))

     

     

    reducer.py :

     

    #!/usr/bin/env python

    import sys

    current_word = None
    current_count = 0

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()

        try:
            # parse the input we got from mapper.py
            word, count = line.split('\t', 1)
            # convert count (currently a string) to int
            count = int(count)
        except ValueError:
            # the line had no tab or count was not a number,
            # so silently ignore/discard this line
            continue

        # this IF-switch only works because Hadoop sorts map output
        # by key (here: word) before it is passed to the reducer
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                # write result to STDOUT
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # do not forget to output the last word if needed!
    # (skipped when there was no valid input at all)
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
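
The reducer relies on its input being sorted by word, which means the same grouping can also be expressed with `itertools.groupby`. The following is a sketch of that variant, not the code from the original post; the function names `parse_pairs` and `reduce_sorted` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines, separator="\t"):
    """Yield (word, count) tuples from mapper-style 'word<TAB>count' lines."""
    for line in lines:
        word, sep, count = line.strip().partition(separator)
        if not sep:
            continue  # no separator on this line: discard it
        try:
            number = int(count)
        except ValueError:
            continue  # count was not a number: discard the line
        yield word, number


def reduce_sorted(lines, separator="\t"):
    """Sum the counts per word; `lines` must already be sorted by word."""
    pairs = parse_pairs(lines, separator)
    # groupby only merges *consecutive* equal keys, hence the sort requirement.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    import sys
    for word, total in reduce_sorted(sys.stdin):
        print("%s\t%s" % (word, total))
```

Like the loop version, this holds only one word's running total in memory at a time, so it does not need to load the whole input.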

    Output:
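
The shape of the result can be checked without cmd or Hadoop by mimicking the mapper, sort, reducer chain inside one Python process. This is a minimal sketch under stated assumptions: the sample sentence is made up for illustration, and `mapper`, `reducer`, and `run_pipeline` are hypothetical in-memory stand-ins for the two scripts and the sort step between them.

```python
def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as mapper.py does."""
    for line in lines:
        for word in line.split():
            yield "%s\t%s" % (word, 1)


def reducer(sorted_lines):
    """Sum the counts of consecutive identical words, as reducer.py does."""
    current_word, current_count = None, 0
    for line in sorted_lines:
        word, _, count = line.partition("\t")
        if not count.isdigit():
            continue  # malformed line: discard it
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%s" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%s" % (current_word, current_count)


def run_pipeline(lines):
    # sorted() plays the role of the sort step between the two scripts.
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["foo foo quux labs foo bar quux"]  # made-up sample input
    for result in run_pipeline(sample):
        print(result)
```

For this sample line the script prints four tab-separated rows: `bar 1`, `foo 3`, `labs 1`, `quux 2`.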

  • Original URL: https://www.cnblogs.com/kevin-yuan/p/4485143.html