zoukankan      html  css  js  c++  java
  • MR hadoop streaming job的学习 combiner

    代码已经拷贝到了公司电脑的:

    /Users/baidu/Documents/Data/Work/Code/Self/hadoop_mr_streaming_jobs

    首先是主控脚本 main.sh

    调用的是 extract.py

    然后发现写的不太好。其中有一个combiner,可以看这里:

    https://blog.csdn.net/u010700335/article/details/72649186

    streaming 脚本的时候,是以管道为基础的:

    (5)  Python脚本

    1
    2
    3
    import sys
    for line in sys.stdin:
    .......

     

    #!/usr/bin/env python
     
    import sys
     
    # maps words to their counts
    word2count = {}
     
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words while removing any empty strings
        words = filter(lambda word: word, line.split())
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial word count is 1
            print '%s	%s' % (word, 1)
    #---------------------------------------------------------------------------------------------------------
    #!/usr/bin/env python
     
    from operator import itemgetter
    import sys
     
    # maps words to their counts
    word2count = {}
     
    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
     
        # parse the input we got from mapper.py
        word, count = line.split()
        # convert count (currently a string) to int
        try:
            count = int(count)
            word2count[word] = word2count.get(word, 0) + count
        except ValueError:
            # count was not a number, so silently
            # ignore/discard this line
            pass
     
    # sort the words lexigraphically;
    #
    # this step is NOT required, we just do it so that our
    # final output will look more like the official Hadoop
    # word count examples
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
     
    # write the results to STDOUT (standard output)
    for word, count in sorted_word2count:
        print '%s	%s'% (word, count)
  • 相关阅读:
    VS.NET提示"试图运行项目时出错:无法启动调试。绑定句柄无效"解决办法
    鼠标移动之hook学习
    今天完成任务之SQL中len的使用
    继承(2)方法《.NET 2.0面向对象编程揭秘 》
    框架设计:CLR Via C# 第二章
    启动IIS时提示“服务没有及时响应启动或控制请求”几种解决方法
    C#中处理字符串和数字
    TreeView实现权限管理
    鼠标单击右击双击简单功能的实现
    richTextBox 中插入图片
  • 原文地址:https://www.cnblogs.com/charlesblc/p/8831287.html
Copyright © 2011-2022 走看看