  • Python MapReduce (MR) program example

    hadoop jar hadoop-streaming-2.6.4.jar \
    -D mapreduce.job.name='test' \
    -files /local/path/to/mapper.py,/local/path/to/reducer.py \
    -input /test/data/* \
    -output /test/output/ \
    -mapper 'python mapper.py' \
    -reducer 'python reducer.py'

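    1. The Python files must be available on every node; the -files option ships them to each task's working directory.
    2. -mapper and -reducer must invoke the script through python (unless the script is executable and its shebang resolves on every node); otherwise the job fails with:
    Caused by: java.io.IOException: Cannot run program "mapper.py": error=2, No such file or directory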
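
    The command and the two notes above can also be assembled programmatically, which avoids shell quoting mistakes. This is a minimal sketch under stated assumptions: build_streaming_command is a hypothetical helper (not part of Hadoop), it assumes the hadoop executable is on PATH, and it assumes that scripts shipped with -files are run by base name from the task's working directory.

```python
import os


def build_streaming_command(jar, job_name, input_path, output_path,
                            mapper, reducer, interpreter="python"):
    """Return the argv list for a Hadoop Streaming job (hypothetical helper).

    Generic options (-D, -files) are placed before the streaming options
    (-input, -output, -mapper, -reducer).
    """
    def run(script):
        # Assumption: -files puts the script in the task's working directory,
        # so it is invoked by base name through the interpreter.
        return "{0} {1}".format(interpreter, os.path.basename(script))

    return [
        "hadoop", "jar", jar,
        "-D", "mapreduce.job.name={0}".format(job_name),
        "-files", ",".join([mapper, reducer]),
        "-input", input_path,
        "-output", output_path,
        "-mapper", run(mapper),
        "-reducer", run(reducer),
    ]


if __name__ == "__main__":
    command = build_streaming_command(
        jar="hadoop-streaming-2.6.4.jar",
        job_name="test",
        input_path="/test/data/*",
        output_path="/test/output/",
        mapper="/local/path/to/mapper.py",
        reducer="/local/path/to/reducer.py",
    )
    # Print the command; pass the list to subprocess.run(command, check=True)
    # on a machine with Hadoop installed to actually submit the job.
    print(" ".join(command))
```

    Passing the list to subprocess.run keeps each argument intact, so the space inside 'python mapper.py' needs no extra quoting.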

    mapper.py

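    #!/usr/bin/python3
    # -*- coding: utf-8 -*-

    import re
    import sys

    # Emit "<word>\t1" for every word read from standard input.
    for line in sys.stdin:
        line = line.strip()
        # Split on commas, periods, question marks, whitespace and double quotes.
        words = re.split(r'[,.?\s"]', line)
        for word in words:
            # Trim any leftover punctuation from both ends of the word.
            word = word.strip(',.?|')
            if word:
                print("{0}\t{1}".format(word, 1))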


    reducer.py

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import sys

    current_word = None
    current_count = 0

    # Input arrives sorted by key, so identical words are adjacent.
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.split('\t', 1)
        count = int(count)
        if current_word == word:
            current_count += count
        else:
            # The key changed: emit the finished word before starting the next.
            if current_word:
                print("{0}\t{1}".format(current_word, current_count))
            current_word = word
            current_count = count

    # Emit the last word.
    if current_word:
        print("{0}\t{1}".format(current_word, current_count))
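
    Before submitting the job, the word-count logic can be checked locally. The following is a minimal sketch, not part of the original scripts: the function names map_words, reduce_counts and word_count are hypothetical, and an in-memory sort stands in for Hadoop's shuffle phase, which delivers the mapper output to the reducer grouped by key.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Yield (word, 1) pairs, mirroring what mapper.py prints."""
    for line in lines:
        for word in re.split(r'[,.?\s"]', line.strip()):
            if word:
                yield word, 1


def reduce_counts(pairs):
    """Sum the counts per word; sorting stands in for the shuffle phase."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run the map and reduce steps in memory and return {word: count}."""
    return dict(reduce_counts(map_words(lines)))


if __name__ == "__main__":
    sample = ["hello world, hello hadoop.", 'is "streaming" simple?']
    for word, total in sorted(word_count(sample).items()):
        print("{0}\t{1}".format(word, total))
```

    The same check can be done with the real scripts by piping a sample file through mapper.py, a sort, and reducer.py on the local machine.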

    See the official documentation: https://hadoop.apache.org/docs/r2.7.7/hadoop-streaming/HadoopStreaming.html

  • Original article: https://www.cnblogs.com/zhaohz/p/12342777.html