  • A simple way to sort on secondary keys using Hadoop

    A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option)

    Hadoop has a library class, KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D stream.map.output.field.separator=. \
        -D stream.num.map.output.key.fields=4 \
        -D map.output.key.field.separator=. \
        -D mapred.text.key.partitioner.options=-k1,2 \
        -D mapred.reduce.tasks=12 \
        -input myInputDirs \
        -output myOutputDir \
        -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
    

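    Here, -D stream.map.output.field.separator=. tells streaming that "." separates the fields of each map output line, and -D stream.num.map.output.key.fields=4 tells it that the first four fields form the key and the rest of the line is the value. Streaming uses these two settings to split each line of mapper output into a key/value pair.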
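The key/value split described above can be sketched in a few lines of Python. This is only an illustration of the rule, not streaming's actual code; the function name `split_key_value` is made up for this example, and the handling of lines with too few separators (whole line becomes the key, empty value) is my reading of the streaming documentation.

```python
def split_key_value(line, separator=".", num_key_fields=4):
    """Split one map output line into (key, value).

    Hypothetical helper that mimics stream.map.output.field.separator
    and stream.num.map.output.key.fields; not part of Hadoop.
    """
    fields = line.split(separator)
    if len(fields) <= num_key_fields:
        # Assumption: with fewer than num_key_fields separators,
        # the whole line is the key and the value is empty.
        return line, ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


if __name__ == "__main__":
    print(split_key_value("11.12.1.2.some.value"))
    print(split_key_value("11.12.1"))
```

With the settings from the command, the line `11.12.1.2.some.value` would be read as key `11.12.1.2` and value `some.value`.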

    The map output keys of the above Map/Reduce job have four fields separated by ".". However, because of the -D mapred.text.key.partitioner.options=-k1,2 option, the Map/Reduce framework partitions the map outputs by the first two fields of the keys only. Here, -D map.output.key.field.separator=. specifies the separator the partitioner uses to split the key into fields. This guarantees that all key/value pairs whose keys share the same first two fields are sent to the same reducer.
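A rough sketch of what partitioning on the first two fields means is shown below. The hash function is an assumption: `zlib.crc32` stands in for whatever hash KeyFieldBasedPartitioner uses internally, and `partition_for` is a hypothetical name. The point is only that the reducer index depends on fields 1 to 2, not on the whole key.

```python
import zlib


def partition_for(key, num_reducers, separator=".", first_field=1, last_field=2):
    """Pick a reducer index from fields first_field..last_field (1-based, inclusive).

    Hypothetical stand-in for the effect of -k1,2; the real partitioner
    uses its own hash, so the actual reducer numbers will differ.
    """
    fields = key.split(separator)
    partition_key = separator.join(fields[first_field - 1:last_field])
    return zlib.crc32(partition_key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    for key in ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]:
        print(key, "-> reducer", partition_for(key, num_reducers=12))
```

Keys such as `11.12.1.2` and `11.12.1.1` always get the same reducer index here, because only the `11.12` prefix is hashed.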

    This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary key. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting. A simple illustration is shown here:

    Output of map (the keys)

    11.12.1.2
    11.14.2.3
    11.11.4.1
    11.12.1.1
    11.14.2.2
    
    

    Partition into 3 reducers (the first 2 fields are used as keys for partition)

    11.11.4.1
    -----------
    11.12.1.2
    11.12.1.1
    -----------
    11.14.2.3
    11.14.2.2
    

    Sorting within each partition for the reducer (all 4 fields are used for sorting)

    11.11.4.1
    -----------
    11.12.1.1
    11.12.1.2
    -----------
    11.14.2.2
    11.14.2.3
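The illustration above can be reproduced with a small Python sketch. It makes two simplifying assumptions: each distinct two-field prefix gets its own partition (in a real job the prefix is hashed, so several prefixes may share one reducer), and keys are compared as plain strings, which matches the example data. The name `partition_and_sort` is hypothetical.

```python
from collections import defaultdict


def partition_and_sort(keys, separator=".", num_primary_fields=2):
    """Group keys by their primary key, then sort each group by the whole key."""
    partitions = defaultdict(list)
    for key in keys:
        # Primary key: the first num_primary_fields fields, used for partitioning.
        primary = separator.join(key.split(separator)[:num_primary_fields])
        partitions[primary].append(key)
    # Within each partition, sort on the full key (primary + secondary fields).
    return {primary: sorted(members) for primary, members in sorted(partitions.items())}


if __name__ == "__main__":
    map_output_keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]
    for primary, members in partition_and_sort(map_output_keys).items():
        print(primary, members)
```

Each printed group corresponds to one block between the dashed lines in the illustration, with the keys already in the order the reducer would see them.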
  • Original article: https://www.cnblogs.com/harveyaot/p/3342833.html