  • On Custom Partitioner Classes in MapReduce (Part 4)

    The MapTask class

    In the MapTask class, find the run method:
    if (useNewApi) {
      runNewMapper(job, splitMetaInfo, umbilical, reporter);
    }
    Then find runNewMapper:
    @SuppressWarnings("unchecked")
    private <INKEY,INVALUE,OUTKEY,OUTVALUE>
    void runNewMapper(final JobConf job,
                      final TaskSplitIndex splitIndex,
                      final TaskUmbilicalProtocol umbilical,
                      TaskReporter reporter
                      ) throws IOException, ClassNotFoundException,
                               InterruptedException {
      // make a task context so we can get the classes
      org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
        new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
                                                                    getTaskID(),
                                                                    reporter);
      // make a mapper
      org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
        (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
          ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
      // make the input format
      org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
        (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
          ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
      // rebuild the input split
      org.apache.hadoop.mapreduce.InputSplit split = null;
      split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
          splitIndex.getStartOffset());
      LOG.info("Processing split: " + split);

      org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
        new NewTrackingRecordReader<INKEY,INVALUE>
          (split, inputFormat, reporter, taskContext);

      job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
      org.apache.hadoop.mapreduce.RecordWriter output = null;

      // get an output object
      if (job.getNumReduceTasks() == 0) {
        // if the number of reduce tasks is 0, this branch is taken
        output =
          new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
      } else {
        // if the number of reduce tasks is greater than 0, this branch is taken
        output = new NewOutputCollector(taskContext, job, umbilical, reporter);
      }

      org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
      mapContext =
        new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
            input, output,
            committer,
            reporter, split);

      org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
          mapperContext =
            new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
                mapContext);

      try {
        input.initialize(split, mapperContext);
        mapper.run(mapperContext);
        mapPhase.complete();
        setPhase(TaskStatus.Phase.SORT);
        statusUpdate(umbilical);
        input.close();
        input = null;
        output.close(mapperContext);
        output = null;
      } finally {
        closeQuietly(input);
        closeQuietly(output, mapperContext);
      }
    }
    We know that partitioning happens when the map function emits its output, so the part that matters here is where the output object is obtained:
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      // if the number of reduce tasks is 0, this branch is taken
      output =
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      // if the number of reduce tasks is greater than 0, this branch is taken
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }
    If there are no reduce tasks, a new NewDirectOutputCollector() is created
    (I have not explored that collection path yet).
    If there are reduce tasks, a new NewOutputCollector() is created.
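    Which branch is taken is decided entirely by the number of reduce tasks set in the driver. A minimal driver sketch of that (the class name and job name below are made up for illustration, not from the original post):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceCountDemo {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "partition-demo");

        // With 0 reduce tasks, runNewMapper() uses NewDirectOutputCollector:
        // map output is written out directly, with no partitioning or sorting.
        job.setNumReduceTasks(0);

        // With one or more reduce tasks it uses NewOutputCollector instead:
        // job.setNumReduceTasks(3);
      }
    }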
     
    The inner class NewOutputCollector
    In the inner class NewOutputCollector, find its constructor:
    NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
      collector = createSortingCollector(job, reporter);

      partitions = jobContext.getNumReduceTasks();

      if (partitions > 1) {
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job);
      } else {
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1;
          }
        };
      }
    }
    The statement partitions = jobContext.getNumReduceTasks(); obtains the number of reduce tasks.
    If the number of reduce tasks is at most 1, a new anonymous Partitioner is created whose overridden getPartition method always returns partitions - 1 (that is, 0 when there is exactly one reduce task), so every record ends up in the same partition.
    If the number of reduce tasks is greater than 1, the class returned by jobContext.getPartitionerClass() is instantiated via reflection.
    So let's look at:
    The JobContext interface
    In the JobContext interface:
    /**
     * Get the {@link Partitioner} class for the job.
     *
     * @return the {@link Partitioner} class for the job.
     */
    public Class<? extends Partitioner<?,?>> getPartitionerClass()
       throws ClassNotFoundException;
    Let's look at its implementation class, JobContextImpl.
    The JobContextImpl class
    Note that this is the one in the mapreduce package, not the mapred package.
    /**
     * Get the {@link Partitioner} class for the job.
     *
     * @return the {@link Partitioner} class for the job.
     */
    @SuppressWarnings("unchecked")
    public Class<? extends Partitioner<?,?>> getPartitionerClass()
       throws ClassNotFoundException {
      return (Class<? extends Partitioner<?,?>>)
        conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
    }
    conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
    means: read the value of the PARTITIONER_CLASS_ATTR property and return it as a class; if the property is not set, fall back to the default, HashPartitioner.class.
    In other words, when the number of reduce tasks is greater than 1 and no partitioner has been configured, HashPartitioner.class is used by default.
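    The PARTITIONER_CLASS_ATTR property is what the Job API fills in when a custom partitioner is registered in the driver. A minimal sketch (DemoPartitioner and the job name are placeholders, defined only so the snippet is self-contained):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class PartitionerConfigDemo {

      // Placeholder partitioner, only here so the sketch compiles on its own.
      public static class DemoPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
          return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "partitioner-config-demo");

        // setPartitionerClass() stores the class under PARTITIONER_CLASS_ATTR,
        // which getPartitionerClass() later reads back and instantiates via reflection.
        job.setPartitionerClass(DemoPartitioner.class);

        // A custom partitioner only takes effect when partitions > 1.
        job.setNumReduceTasks(3);
      }
    }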
    public class HashPartitioner<K, V> extends Partitioner<K, V> {

      /** Use {@link Object#hashCode()} to partition. */
      public int getPartition(K key, V value,
                              int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }
    So HashPartitioner's getPartition method ultimately relies on the key object's hashCode method.
    The hashCode we generated with Eclipse (Alt+Shift+S, then H) takes both fields into account (the account and the amount).
    Sure enough, modifying hashCode in the custom key class and testing it works, as long as the hash code is computed from the account field only:
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((account == null) ? 0 : account.hashCode());
        // result = prime * result + ((amount == null) ? 0 : amount.hashCode());
        return result;
    }
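    A quick sanity check of that idea as a standalone sketch (the stripped-down SelfKey below is an assumption for illustration, not the full key class from the post): two keys with the same account but different amounts now produce the same hash, so HashPartitioner sends them to the same partition.

    public class SelfKeyHashDemo {

      // Stripped-down stand-in for the custom key class, with the account-only hashCode.
      static class SelfKey {
        String account;
        Double amount;

        SelfKey(String account, Double amount) {
          this.account = account;
          this.amount = amount;
        }

        @Override
        public int hashCode() {
          final int prime = 31;
          int result = 1;
          result = prime * result + ((account == null) ? 0 : account.hashCode());
          // amount is deliberately left out, exactly as in the modified key class
          return result;
        }
      }

      public static void main(String[] args) {
        int numReduceTasks = 3;
        SelfKey a = new SelfKey("S1", 100.0);
        SelfKey b = new SelfKey("S1", 999.0);

        // Same formula HashPartitioner uses; both lines print the same partition index.
        System.out.println((a.hashCode() & Integer.MAX_VALUE) % numReduceTasks);
        System.out.println((b.hashCode() & Integer.MAX_VALUE) % numReduceTasks);
      }
    }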
     
    Another, more mainstream approach:
    Why does the custom Partitioner class have to be an inner class of Group? I changed it to a top-level class and tested it, and it works just as well.
    The concrete form:
    public static class KeyPartitioner extends Partitioner<SelfKey, DoubleWritable> {

        @Override
        public int getPartition(SelfKey key, DoubleWritable value, int numPartitions) {
            /**
             * To keep the overall output ordered we need our own business logic here,
             * and we have to know the number of reduce tasks in advance.
             * \w matches word characters: [a-zA-Z_0-9]
             */
            String account = key.getAccount();
            // accounts look like 0xxaaabbb, last digit 0-9
            // split into [0-2], [3-6], [7-9]
            if (account.matches("\\w*[0-2]")) {
                return 0;
            } else if (account.matches("\\w*[3-6]")) {
                return 1;
            } else if (account.matches("\\w*[7-9]")) {
                return 2;
            }
            return 0;
        }
    }
    This is done so that all of S1's records (and all of S2's) land in the same partition, rather than some of S1's records going to one partition and the rest to another.
    Because the key at this point is account plus amount, records that all belong to account S1 could otherwise be assigned to different partitions, which would leave the final ordering scrambled.
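    For completeness, a minimal driver sketch of how this KeyPartitioner could be wired up (the driver class and job name are made up, the mapper/reducer and input/output settings a real job needs are omitted, and it assumes the KeyPartitioner and SelfKey classes above are on the classpath): with three reduce tasks, accounts ending in 0-2, 3-6 and 7-9 go to reducers 0, 1 and 2 respectively.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class GroupDriverSketch {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "group-by-account");

        // Three partitions, matching the three return values in KeyPartitioner.getPartition().
        job.setNumReduceTasks(3);
        job.setPartitionerClass(KeyPartitioner.class); // or Group.KeyPartitioner.class if it stays nested

        // Mapper, reducer, key/value types and input/output paths are omitted from this sketch.
      }
    }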
     
     
     




