  • python + hadoop (example)

    This article walks through a simple example of how to connect Python to Hadoop and use Hadoop's resources.

    I. The Python map/reduce code

    This assumes you are already familiar with Hadoop. We need a mapper and a reducer; the code for each is as follows:

    1. mapper.py

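    #!/usr/bin/env python
    import sys

    # Read lines from stdin and emit one "word<TAB>1" pair per word.
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            # Parenthesized print works under both Python 2 and Python 3.
            print('%s\t%s' % (word, 1))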

    2. reducer.py

    #!/usr/bin/env python
    import sys

    current_word = None
    current_count = 0

    # Input arrives sorted by word, so equal words are adjacent.
    for line in sys.stdin:
        line = line.strip()
        try:
            # Split on the first tab only, then parse the count.
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            # Skip blank or malformed lines.
            continue

        if current_word == word:
            current_count += count
        else:
            # A new word starts: emit the finished total first.
            if current_word is not None:
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # Emit the total for the last word.
    if current_word is not None:
        print('%s\t%s' % (current_word, current_count))
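
    The same reduce step can also be written with itertools.groupby, which relies on the same guarantee that the input arrives sorted by word. The following is a minimal sketch, not part of the original scripts; the function names parse_pairs and reduce_counts are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) tuples from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        try:
            word, count = line.strip().split('\t', 1)
            count = int(count)
        except ValueError:
            continue  # blank or malformed line
        yield word, count


def reduce_counts(lines):
    """Sum the counts of adjacent equal words; the input must be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Hypothetical sample input, already sorted by word.
sample = ['good\t1', 'hadoop\t1', 'hadoop\t1', 'like\t1']
for word, total in reduce_counts(sample):
    print('%s\t%s' % (word, total))
```

    In a streaming job, sys.stdin would be passed to reduce_counts instead of the sample list.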

    With both scripts in place (and marked executable, e.g. with chmod +x mapper.py reducer.py), test them locally:

    [qiu.li@l-tdata5.tkt.cn6 /export/python]$ echo "I like python hadoop , hadoop very good" | ./mapper.py | sort -k 1,1 | ./reducer.py
    ,    1
    good    1
    hadoop    2
    I    1
    like    1
    python    1
    very    1
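
    As a cross-check of the test above, the same counts can be computed in a single process with collections.Counter. This is only a sanity check for small inputs and an illustration, not something the original post ran.

```python
from collections import Counter

text = "I like python hadoop , hadoop very good"

# str.split() with no arguments splits on whitespace, like the mapper does.
counts = Counter(text.split())

# Print in plain code-point order; `sort` may order lines differently
# depending on the locale.
for word in sorted(counts):
    print('%s\t%s' % (word, counts[word]))
```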

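    II. Uploading the files

    That looks fine, so we are halfway there. Next, upload a few files to Hadoop for a further test. I downloaded several files from the web with the following commands: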

    wget http://www.gutenberg.org/ebooks/20417.txt.utf-8
    wget http://www.gutenberg.org/files/5000/5000-8.txt
    wget http://www.gutenberg.org/ebooks/4300.txt.utf-8

    List the downloaded files:

    [qiu.li@l-tdata5.tkt.cn6 /export/python]$ ls
    20417.txt.utf-8  4300.txt.utf-8  5000-8.txt  mapper.py  reducer.py  run.sh

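    Upload the files to Hadoop with the following command: hadoop dfs -put ./*.txt /user/ticketdev/tmp (Hadoop is already configured and the target directory already exists). Note that, with the file names listed above, the glob ./*.txt matches only 5000-8.txt; the two files ending in .utf-8 would have to be renamed or uploaded by name.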

    Create run.sh ($STREAM is assumed to hold the path to the Hadoop Streaming jar):

    hadoop jar $STREAM \
     -files ./mapper.py,./reducer.py \
     -mapper ./mapper.py \
     -reducer ./reducer.py \
     -input /user/ticketdev/tmp/*.txt \
     -output /user/ticketdev/tmp/output
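
    If the job should be launched from Python rather than from a shell script, the same command can be assembled as an argument list. This is a minimal sketch, assuming the hadoop executable is on PATH; the function name build_streaming_command and the jar path in the example are hypothetical placeholders, and only the options already used in run.sh appear.

```python
def build_streaming_command(stream_jar, input_path, output_path):
    """Return the argument list equivalent to run.sh (nothing is executed here)."""
    return [
        'hadoop', 'jar', stream_jar,
        '-files', './mapper.py,./reducer.py',
        '-mapper', './mapper.py',
        '-reducer', './reducer.py',
        '-input', input_path,
        '-output', output_path,
    ]


# The jar path below is a placeholder, not a real location.
cmd = build_streaming_command(
    '/path/to/hadoop-streaming.jar',
    '/user/ticketdev/tmp/*.txt',
    '/user/ticketdev/tmp/output',
)
print(' '.join(cmd))
```

    Passing the list to subprocess.run(cmd, check=True) would start the job; because no shell is involved, the * in the input path reaches Hadoop unexpanded.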

    Check the result; the command below lists the ten most frequent words:

    [qiu.li@l-tdata5.tkt.cn6 /export/python]$ hadoop dfs -cat /user/ticketdev/tmp/output/part-00000 | sort -nk 2 | tail 
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.
    
    it    2387
    which    2387
    that    2668
    a    3797
    is    4097
    to    5079
    in    5226
    and    7611
    of    10388
    the    20583
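
    The sort -nk 2 | tail step can also be done in Python once the lines of the part file have been read. This is a minimal sketch using heapq.nlargest; the function name top_words is hypothetical, the sample lines are taken from the output above, and the result is ordered largest first (the reverse of the shell output).

```python
import heapq


def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    pairs = []
    for line in lines:
        try:
            word, count = line.rstrip('\n').split('\t', 1)
            pairs.append((word, int(count)))
        except ValueError:
            continue  # skip blank or malformed lines
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


sample = ['is\t4097', 'to\t5079', 'in\t5226', 'and\t7611', 'of\t10388', 'the\t20583']
for word, count in top_words(sample, n=3):
    print('%s\t%s' % (word, count))
```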

    III. Reference:

    http://www.cnblogs.com/wing1995/p/hadoop.html?utm_source=tuicool&utm_medium=referral

  • Original article: https://www.cnblogs.com/liqiu/p/6243043.html