    Hadoop Streaming Made Simple using Joins and Keys with Python « All Things Hadoop

    There are a lot of different ways to write MapReduce jobs!

    I find streaming scripts a good way to interrogate data sets (especially ones I have not worked with yet or am creating from scratch), and I enjoy the lifecycle in which the initial exploration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs, as is often the case).

    When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo or the more recently released mrjob.

    I like working under the hood myself and getting down and dirty with the data, and here is how you can too.

    Let's start by defining two simple sample data sets.

    Data set 1:  countries.dat

    name|key
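
    For illustration (these rows are hypothetical, not preserved from the original post), countries.dat maps a full country name to its two-digit code:

        United States|US
        Canada|CA
        United Kingdom|UK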

    Data set 2: customers.dat

    name|type|country
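
    And some hypothetical customers, keyed by the same two-digit codes:

        Alice Bob|not so good|US
        Sam Sneed|valued|CA
        Jon Sneed|valued|CA
        Arnold Wesise|not so good|UK
        Henry Bob|not so good|US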

    The requirement: find out, grouped by type of customer, how many of each type are in each country, with the country name as listed in countries.dat (and not the two-digit country code) in the final result.

    To do this you need to:

    1) Join the data sets
    2) Key on country
    3) Count type of customer per country
    4) Output the results

    So first let's code up a quick mapper called smplMapper.py (you can decide if smpl is short for simple or sample).
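
    The script itself did not survive in this copy of the post, so here is a minimal sketch consistent with the behavior described below. The caret delimiter and the "-1" placeholders are assumptions, chosen so that after sorting, the country-name row lands first within its country group:

        #!/usr/bin/env python
        # smplMapper.py - a minimal sketch, not the original script.
        # Emits: country2digit^personType^personName^countryName
        # The "-1" placeholders make the country row sort ahead of the
        # customer rows that share the same country code.
        import sys

        for line in sys.stdin:
            try:
                person_name = "-1"    # defaults sort first
                person_type = "-1"
                country_name = "-1"
                country_code = "-1"
                fields = line.strip().split("|")
                if len(fields) == 2:    # countries.dat: name|key
                    country_name, country_code = fields
                else:                   # customers.dat: name|type|country
                    person_name, person_type, country_code = fields
                print("%s^%s^%s^%s" % (country_code, person_type, person_name, country_name))
            except Exception:
                pass  # bad records would otherwise fail the job; handle them as you see fit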

    Now, the basics of coding the mapper and reducer in Python are explained nicely at http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/, but I am going to dive a bit deeper to tackle our example with some more tactics.

    Don’t forget:
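
    The snippet itself is missing from this copy of the post; presumably it is the usual step of making the script executable:

        chmod a+x smplMapper.py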

    Great! We just took care of #1, but it's time to test and see what is going to the reducer.

    From the command line run:
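
    The exact command did not survive in this copy, but simulating the map and shuffle phases locally would presumably look like:

        cat countries.dat customers.dat | python smplMapper.py | sort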

    Which will result in:
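
    With the hypothetical sample data above, the sorted mapper output would be:

        CA^-1^-1^Canada
        CA^valued^Jon Sneed^-1
        CA^valued^Sam Sneed^-1
        UK^-1^-1^United Kingdom
        UK^not so good^Arnold Wesise^-1
        US^-1^-1^United States
        US^not so good^Alice Bob^-1
        US^not so good^Henry Bob^-1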

    Notice how this is sorted: the country comes first and the people in that country come after it (so we can grab the correct country name as we loop), and the type of customer is also sorted, but within the country, so we can properly count the types per country. =8^)

    Let us hold off on #2 for a moment (just hang in there, it will all come together soon, I promise) and get smplReducer.py working first.
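
    The reducer script is also missing from this copy; here is a minimal sketch that pairs with the mapper sketch above (the field layout and the "UNKNOWN" fallback for countries missing from countries.dat are assumptions). It walks the sorted stream, remembers the country name from each country row, and emits a count whenever the customer type or the country changes:

        #!/usr/bin/env python
        # smplReducer.py - a minimal sketch, not the original script.
        # Expects the sorted mapper output:
        #   country2digit^personType^personName^countryName
        import sys

        current_code = None       # two-digit code of the current group
        current_name = "UNKNOWN"  # full name, filled in by the country row
        current_type = None
        count = 0

        def emit(name, ptype, n):
            if ptype is not None and n > 0:
                print("%s\t%s\t%d" % (name, ptype, n))

        for line in sys.stdin:
            code, ptype, pname, cname = line.strip().split("^")
            if code != current_code:        # entering a new country group
                emit(current_name, current_type, count)
                current_code, current_type, count = code, None, 0
                current_name = "UNKNOWN"    # until a country row says otherwise
            if pname == "-1":               # country row: capture the full name
                current_name = cname
                continue
            if ptype != current_type:       # new customer type within the country
                emit(current_name, current_type, count)
                current_type, count = ptype, 0
            count += 1

        emit(current_name, current_type, count)  # flush the final group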

    Don’t forget:
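
    As before, the missing snippet is presumably the permission step:

        chmod a+x smplReducer.py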

    And then run:
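
    The command is missing here as well; locally it would presumably be the full pipeline:

        cat countries.dat customers.dat | python smplMapper.py | sort | python smplReducer.py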

    And voila!
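
    With the hypothetical sample data, that yields:

        Canada          valued        2
        United Kingdom  not so good   1
        United States   not so good   2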

    So now #3 and #4 are done, but what about #2?

    First put the files into Hadoop:
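
    The commands are missing from this copy; a plausible reconstruction (the HDFS paths are assumptions):

        hadoop fs -mkdir /countries /customers
        hadoop fs -put countries.dat /countries/
        hadoop fs -put customers.dat /customers/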

    And now run it like this (assuming you are running as hadoop in the bin directory):
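
    The original command is not preserved here, but a sketch using the standard streaming options would look like the following (the jar path, the output directory /smplJoin, and the caret separator are assumptions; the -D property names and the KeyFieldBasedPartitioner class are standard Hadoop streaming):

        hadoop jar contrib/streaming/hadoop-streaming-*.jar \
          -D stream.map.output.field.separator=^ \
          -D stream.num.map.output.key.fields=4 \
          -D map.output.key.field.separator=^ \
          -D num.key.fields.for.partition=1 \
          -input /countries/countries.dat \
          -input /customers/customers.dat \
          -output /smplJoin \
          -mapper smplMapper.py \
          -reducer smplReducer.py \
          -file smplMapper.py \
          -file smplReducer.py \
          -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

    The four -D settings make the whole caret-delimited line the sort key while partitioning on only the first field, so everything for one country lands in the same reducer, already sorted.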

    Let us look at what we did:
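
    Presumably this is where the job output gets inspected, with the output directory assumed above:

        hadoop fs -cat /smplJoin/part-*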

    Which results in:
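
    With the hypothetical data, the same result as the local run:

        Canada          valued        2
        United Kingdom  not so good   1
        United States   not so good   2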

    So #2 is the partitioner, KeyFieldBasedPartitioner (explained further in the Hadoop Wiki on streaming), which allows the key to be whatever set of columns you output (in our case the country), configurable by the command line options; the rest of the values are sorted within that key and sent to the reducer together by key.

    And there you go … Simple Python Scripting Implementing Streaming in Hadoop.  

    Grab the tar here and give it a spin.
