A Study of the OutputFormat Output Process

Reposted from: http://blog.csdn.net/androidlushangderen/article/details/41278351

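It took me about a week to finish reading the source code for the five major stages of MapReduce. I learned a lot, and I count it as a milestone in my Hadoop studies. Today I spent a little time analyzing the last stage of MapReduce, the output stage handled by OutputFormat. This process is the mirror image of InputFormat, and the two are highly symmetric. Why do I say so? Let me walk through it.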

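The job of OutputFormat is to define the output format of the key-value data: once the data has been processed, in what form should it be written so that whoever picks up the file next can extract the data accurately? Setting MapReduce aside for a moment, here are the ways of laying out such data that I know of. For example, Redis has a design like this: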

[key-length][key][value-length][value][key-length][key][value-length][value]...

Or, if saving space is not a must, simply use a delimiter:

[key] [value]

[key] [value]

[key] [value]

.....

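Read the file line by line, split each line on the delimiter (a space here), and you get the key-value pairs back. This is simple, but it is not compact and wastes a fair amount of space. One of the formats in MapReduce's OutputFormat family uses exactly this approach.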
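To make the two layouts concrete, here is a minimal sketch in plain Java. The class name RecordEncodings and both method names are made up for illustration, and the 4-byte big-endian length fields are an assumption of this sketch, not a statement about how Redis or Hadoop encode lengths.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class RecordEncodings {

    /** Layout 1: [key-length][key][value-length][value], lengths as 4-byte ints. */
    static byte[] lengthPrefixed(String key, String value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            byte[] k = key.getBytes(StandardCharsets.UTF_8);
            byte[] v = value.getBytes(StandardCharsets.UTF_8);
            out.writeInt(k.length);   // key-length
            out.write(k);             // key
            out.writeInt(v.length);   // value-length
            out.write(v);             // value
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Layout 2: one record per line, key and value split by a separator. */
    static String delimited(String key, String value, String separator) {
        return key + separator + value + "\n";
    }

    public static void main(String[] args) {
        System.out.println(lengthPrefixed("id", "42").length + " bytes, length-prefixed");
        System.out.print(delimited("id", "42", " "));
    }
}
```

The length-prefixed layout can carry keys and values that themselves contain the separator or a newline, while the delimited layout stays readable with ordinary text tools; that trade-off is the main reason both styles exist.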

First we need to know what OutputFormat actually contains:
public interface OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given job.
   * (Returns the writer used to emit the output key-value records.)
   *
   * @param ignored
   * @param job configuration for the job whose output is being written.
   * @param name the unique name for this part of the output.
   * @param progress mechanism for reporting progress while writing to file.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                     String name, Progressable progress)
  throws IOException;

  /**
   * Check for validity of the output-specification for the job.
   *
   * <p>This is to validate the output specification for the job when it is
   * a job is submitted.  Typically checks that it does not already exist,
   * throwing an exception when it already exists, so that output is not
   * overwritten.</p>
   * (Checks done before the job runs, e.g. whether the configured output
   * directory already exists.)
   *
   * @param ignored
   * @param job job configuration.
   * @throws IOException when output should not be attempted
   */
  void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;
}
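Just two simple methods. RecordWriter is the important part, since all later key-value writes go through it. OutputFormat itself is only an interface, though; in MapReduce, the implementation we use most often is FileOutputFormat: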
/** A base class for {@link OutputFormat}. */
public abstract class FileOutputFormat<K, V> implements OutputFormat<K, V> {
It is an abstract class, but it does implement the second method of the interface, checkOutputSpecs():
public void checkOutputSpecs(FileSystem ignored, JobConf job)
    throws FileAlreadyExistsException,
           InvalidJobConfException, IOException {
    // Ensure that the output directory is set and not already there
    Path outDir = getOutputPath(job);
    if (outDir == null && job.getNumReduceTasks() != 0) {
      throw new InvalidJobConfException("Output directory not set in JobConf.");
    }
    if (outDir != null) {
      FileSystem fs = outDir.getFileSystem(job);
      // normalize the output directory
      outDir = fs.makeQualified(outDir);
      setOutputPath(job, outDir);

      // get delegation token for the outDir's file system
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                                          new Path[] {outDir}, job);

      // check its existence
      if (fs.exists(outDir)) {
        // throw if the output directory already exists
        throw new FileAlreadyExistsException("Output directory " + outDir +
                                             " already exists");
      }
    }
  }
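The same two checks can be tried out without a Hadoop cluster. The sketch below is an approximation under stated assumptions: it uses java.nio.file instead of Hadoop's FileSystem, IllegalStateException stands in for InvalidJobConfException, and the class OutputSpecCheck and its accepts() helper are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the checks in FileOutputFormat.checkOutputSpecs().
final class OutputSpecCheck {

    /** The directory must be configured (unless map-only) and must not exist yet. */
    static void checkOutputSpecs(Path outDir, int numReduceTasks) throws IOException {
        if (outDir == null && numReduceTasks != 0) {
            throw new IllegalStateException("Output directory not set.");
        }
        if (outDir != null && Files.exists(outDir)) {
            throw new FileAlreadyExistsException(
                "Output directory " + outDir + " already exists");
        }
    }

    /** Returns true if the check passes, false if the directory already exists. */
    static boolean accepts(Path outDir, int numReduceTasks) {
        try {
            checkOutputSpecs(outDir, numReduceTasks);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path existing = Files.createTempDirectory("job-output");
        Path fresh = existing.resolve("run-1");
        System.out.println("fresh directory accepted: " + accepts(fresh, 1));
        System.out.println("existing directory accepted: " + accepts(existing, 1));
    }
}
```

Refusing to start when the directory exists is what protects the results of an earlier run from being overwritten by accident.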
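In short, checkOutputSpecs() verifies that the output directory is set and does not already exist. The same class also has a helper method that relies on an auxiliary class: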
public static Path getTaskOutputPath(JobConf conf, String name)
  throws IOException {
    // ${mapred.out.dir}
    Path outputPath = getOutputPath(conf);
    if (outputPath == null) {
      throw new IOException("Undefined job output-path");
    }

    // resolve the work path through the OutputCommitter
    OutputCommitter committer = conf.getOutputCommitter();
    Path workPath = outputPath;
    TaskAttemptContext context = new TaskAttemptContext(conf,
                TaskAttemptID.forName(conf.get("mapred.task.id")));
    if (committer instanceof FileOutputCommitter) {
      workPath = ((FileOutputCommitter)committer).getWorkPath(context,
                                                              outputPath);
    }

    // ${mapred.out.dir}/_temporary/_${taskid}/${name}
    return new Path(workPath, name);
  }
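That auxiliary class is the OutputCommitter used above. It defines many methods tied to the task and job lifecycle, and it usually works hand in hand with an OutputFormat. It has its own subclass as well, FileOutputCommitter: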
public class FileOutputCommitter extends OutputCommitter {

  public static final Log LOG = LogFactory.getLog(
      "org.apache.hadoop.mapred.FileOutputCommitter");

  /**
   * Temporary directory name
   */
  public static final String TEMP_DIR_NAME = "_temporary";
  public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
  static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER =
    "mapreduce.fileoutputcommitter.marksuccessfuljobs";

  public void setupJob(JobContext context) throws IOException {
    JobConf conf = context.getJobConf();
    Path outputPath = FileOutputFormat.getOutputPath(conf);
    if (outputPath != null) {
      Path tmpDir = new Path(outputPath, FileOutputCommitter.TEMP_DIR_NAME);
      FileSystem fileSys = tmpDir.getFileSystem(conf);
      if (!fileSys.mkdirs(tmpDir)) {
        LOG.error("Mkdirs failed to create " + tmpDir.toString());
      }
    }
  }
  ....
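The excerpts above only show the temporary directory being created and the path ${mapred.out.dir}/_temporary/_${taskid}/${name} being built. As a rough sketch of why that directory exists, the plain-Java example below writes a part file under _temporary, then moves it into the output directory and drops a _SUCCESS marker. The commit and cleanup steps are my simplified reading of what a file-based committer does, not code from the excerpt, and TempDirCommitSketch with all of its methods is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of "write to a temporary location, then commit".
final class TempDirCommitSketch {
    static final String TEMP_DIR_NAME = "_temporary";
    static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

    /** Builds ${output}/_temporary/_${taskAttemptId}/${name}. */
    static Path taskOutputPath(Path outputDir, String taskAttemptId, String name) {
        return outputDir.resolve(TEMP_DIR_NAME).resolve("_" + taskAttemptId).resolve(name);
    }

    /** Writes one part file through the temporary directory, then promotes it. */
    static Path writeAndCommit(Path outputDir, String taskAttemptId,
                               String name, String content) {
        try {
            Path work = taskOutputPath(outputDir, taskAttemptId, name);
            Files.createDirectories(work.getParent());                    // setup
            Files.write(work, content.getBytes(StandardCharsets.UTF_8));  // task writes
            Path committed = outputDir.resolve(name);
            Files.move(work, committed);                                  // commit
            Files.delete(work.getParent());                               // drop attempt dir
            Files.delete(outputDir.resolve(TEMP_DIR_NAME));               // cleanup
            Files.createFile(outputDir.resolve(SUCCEEDED_FILE_NAME));     // success marker
            return committed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the flow in a fresh temp directory and checks the final layout. */
    static boolean demo() {
        try {
            Path out = Files.createTempDirectory("job-output");
            Path part = writeAndCommit(out, "attempt_0001_r_000000_0",
                                       "part-00000", "key\tvalue\n");
            return Files.exists(part)
                && Files.exists(out.resolve(SUCCEEDED_FILE_NAME))
                && !Files.exists(out.resolve(TEMP_DIR_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("final layout as expected: " + demo());
    }
}
```

Because a task only ever writes under _temporary, a failed or killed attempt leaves nothing half-written in the final output directory.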
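In the write phase at the end of the Reduce stage, file-based output (FileOutputFormat) is the default output type; when no format is configured, JobConf falls back to TextOutputFormat, which is a FileOutputFormat subclass: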
    // obtain the writer for the output key-value pairs
    final RecordWriter<OUTKEY, OUTVALUE> out = new OldTrackingRecordWriter<OUTKEY, OUTVALUE>(
        reduceOutputCounter, job, reporter, finalName);

    OutputCollector<OUTKEY,OUTVALUE> collector =
      new OutputCollector<OUTKEY,OUTVALUE>() {
        public void collect(OUTKEY key, OUTVALUE value)
          throws IOException {
          // write the processed key-value pair to the output stream;
          // it ends up in HDFS as the final result
          out.write(key, value);
          // indicate that progress update needs to be sent
          reporter.progress();
        }
      };
out is the object that does the actual work, but which OutputFormat returned it? Let's step into OldTrackingRecordWriter and see:
public OldTrackingRecordWriter(
        org.apache.hadoop.mapred.Counters.Counter outputRecordCounter,
        JobConf job, TaskReporter reporter, String finalName)
        throws IOException {
      this.outputRecordCounter = outputRecordCounter;
      // file-based output (FileOutputFormat) is the default
      this.fileOutputByteCounter = reporter
          .getCounter(FileOutputFormat.Counter.BYTES_WRITTEN);
      Statistics matchedStats = null;
      if (job.getOutputFormat() instanceof FileOutputFormat) {
        matchedStats = getFsStatistics(FileOutputFormat.getOutputPath(job), job);
      }
      fsStats = matchedStats;

      FileSystem fs = FileSystem.get(job);
      long bytesOutPrev = getOutputBytes(fsStats);
      // obtain the job's output format from the configuration
      this.real = job.getOutputFormat().getRecordWriter(fs, job, finalName,
          reporter);
      long bytesOutCurr = getOutputBytes(fsStats);
      fileOutputByteCounter.increment(bytesOutCurr - bytesOutPrev);
    }
As expected, it is FileOutputFormat. But do not forget that its getRecordWriter() is an abstract method, so the call has to be served by one of its subclasses:
public abstract RecordWriter<K, V> getRecordWriter(FileSystem ignored,
                                              JobConf job, String name,
                                              Progressable progress)
   throws IOException;
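Here we take the counterparts of the examples used for InputFormat: TextOutputFormat and SequenceFileOutputFormat, which correspond to TextInputFormat and SequenceFileInputFormat. These two subclasses define two completely different output formats, and so they return different RecordWriter implementations. TextOutputFormat defines one called LineRecordWriter: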
public RecordWriter<K, V> getRecordWriter(FileSystem ignored,
                                                 JobConf job,
                                                 String name,
                                                 Progressable progress)
   throws IOException {
   // decide from the configuration whether the output is compressed
   boolean isCompressed = getCompressOutput(job);
   // read the key-value separator from the configuration (a tab by default)
   String keyValueSeparator = job.get("mapred.textoutputformat.separator",
                                      "\t");
   // build the LineRecordWriter that matches the compression setting
   if (!isCompressed) {
     Path file = FileOutputFormat.getTaskOutputPath(job, name);
     FileSystem fs = file.getFileSystem(job);
     FSDataOutputStream fileOut = fs.create(file, progress);
     return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
   } else {
     Class<? extends CompressionCodec> codecClass =
       getOutputCompressorClass(job, GzipCodec.class);
     // create the named codec
     CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);
     // build the filename including the extension
     Path file =
       FileOutputFormat.getTaskOutputPath(job,
                                          name + codec.getDefaultExtension());
     FileSystem fs = file.getFileSystem(job);
     FSDataOutputStream fileOut = fs.create(file, progress);
     return new LineRecordWriter<K, V>(new DataOutputStream
                                       (codec.createOutputStream(fileOut)),
                                       keyValueSeparator);
   }
}
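The branch on isCompressed boils down to two decisions: which stream the records are written to, and whether the file name gets the codec's extension. Below is a small stand-alone sketch of that idea, assuming java.util.zip's GZIP streams in place of Hadoop's GzipCodec and an in-memory buffer in place of HDFS; CompressedOutputSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of "plain stream vs. codec-wrapped stream".
final class CompressedOutputSketch {

    /** Appends the codec extension only when compression is on. */
    static String fileName(String name, boolean compressed) {
        return compressed ? name + ".gz" : name;
    }

    /** Writes the text either directly or through a GZIP stream. */
    static byte[] write(String text, boolean compressed) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = compressed ? new GZIPOutputStream(sink) : sink) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decompresses GZIP data back to text, to show the round trip. */
    static String readGzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String record = "key\tvalue\n";
        System.out.println(fileName("part-00000", true));
        System.out.print(readGzip(write(record, true)));
    }
}
```

The writer code itself does not change between the two cases; only the stream it is handed differs, which is the same arrangement as in the excerpt.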
LineRecordWriter exists as an inner class of TextOutputFormat. What does the equivalent look like in SequenceFileOutputFormat?
public RecordWriter<K, V> getRecordWriter(
                                        FileSystem ignored, JobConf job,
                                        String name, Progressable progress)
  throws IOException {
  // get the path of the temporary output file
  Path file = FileOutputFormat.getTaskOutputPath(job, name);

  FileSystem fs = file.getFileSystem(job);
  CompressionCodec codec = null;
  CompressionType compressionType = CompressionType.NONE;
  if (getCompressOutput(job)) {
    // find the kind of compression to do
    compressionType = getOutputCompressionType(job);

    // find the right codec
    Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job,
        DefaultCodec.class);
    codec = ReflectionUtils.newInstance(codecClass, job);
  }
  final SequenceFile.Writer out =
    SequenceFile.createWriter(fs, job, file,
                              job.getOutputKeyClass(),
                              job.getOutputValueClass(),
                              compressionType,
                              codec,
                              progress);

  return new RecordWriter<K, V>() {

      public void write(K key, V value)
        throws IOException {

        out.append(key, value);
      }

      public void close(Reporter reporter) throws IOException { out.close();}
    };
}
The key operations all live in SequenceFile.Writer. Different RecordWriters write data in different ways; here we take LineRecordWriter as the example and look at its write method:
    // write a key-value pair to the output stream
    public synchronized void write(K key, V value)
      throws IOException {

      // check whether the key or the value is null
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;

      // if both key and value are null, write nothing and return
      if (nullKey && nullValue) {
        return;
      }

      // if the key is not null, write the key
      if (!nullKey) {
        writeObject(key);
      }

      // if both key and value are present, write the separator
      // (mapred.textoutputformat.separator) between them
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }

      // finally write the value
      if (!nullValue) {
        writeObject(value);
      }
      // end the record with a line break
      out.write(newline);
    }
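The null-handling rules in this method are easy to verify with a stripped-down version. The sketch below is a simplification under stated assumptions: it uses toString() for every object, treats only Java null as "absent" (there is no NullWritable here), and LineWriterSketch with its render() helper is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical, Hadoop-free imitation of LineRecordWriter's write logic.
final class LineWriterSketch {
    private final DataOutputStream out;
    private final byte[] separator;

    LineWriterSketch(DataOutputStream out, String separator) {
        this.out = out;
        this.separator = separator.getBytes(StandardCharsets.UTF_8);
    }

    void write(Object key, Object value) throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return;                      // nothing to write, not even a line break
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!nullKey && !nullValue) {
            out.write(separator);        // separator only between two present parts
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write('\n');                 // one record per line
    }

    private void writeObject(Object o) throws IOException {
        out.write(o.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** Writes all records (each a {key, value} pair) and returns the text produced. */
    static String render(String separator, Object[][] records) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            LineWriterSketch writer =
                new LineWriterSketch(new DataOutputStream(buffer), separator);
            for (Object[] record : records) {
                writer.write(record[0], record[1]);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render("\t", new Object[][] {
            {"apple", 3}, {null, "value-only"}, {"key-only", null}, {null, null}}));
    }
}
```

A record with only a key or only a value still gets its own line but no separator, and a record with neither is skipped entirely.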
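LineRecordWriter.write() shows that its storage layout is the second one described earlier: key and value separated by a delimiter, one record per line. This method ends up being executed from the following code: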
OutputCollector<OUTKEY,OUTVALUE> collector =
      new OutputCollector<OUTKEY,OUTVALUE>() {
        public void collect(OUTKEY key, OUTVALUE value)
          throws IOException {
          // write the processed key-value pair to the output stream;
          // it ends up in HDFS as the final result
          out.write(key, value);
          // indicate that progress update needs to be sent
          reporter.progress();
        }
      };
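The flow can be understood like this:

collector.collect(key, value) -------> out.write(key, value) -------> LineRecordWriter.write(key, value) -------> DataOutputStream.write(bytes)

The DataOutputStream is held inside LineRecordWriter as a member variable. With that, the path from the end of the Reduce stage to the output is fully connected. The complete class relationships involved are shown in the figure below.

[Figure: class relationship diagram for the classes discussed above]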
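The delegation chain can also be sketched in a few lines of plain Java. Everything here is a simplified stand-in: the two interfaces are cut-down local versions without IOException, TrackingWriter only imitates the "wrap the real writer and count" idea of OldTrackingRecordWriter, and ListWriter replaces the format-specific writer with an in-memory list. All of these names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of collector -> tracking writer -> real writer.
final class WriterChainSketch {

    interface RecordWriter<K, V> {
        void write(K key, V value);
    }

    interface OutputCollector<K, V> {
        void collect(K key, V value);
    }

    /** Stand-in for the format-specific writer; keeps lines in memory. */
    static final class ListWriter implements RecordWriter<String, Integer> {
        final List<String> lines = new ArrayList<>();

        @Override
        public void write(String key, Integer value) {
            lines.add(key + "\t" + value);
        }
    }

    /** Wraps the real writer and counts the records passing through. */
    static final class TrackingWriter<K, V> implements RecordWriter<K, V> {
        private final RecordWriter<K, V> real;
        long records;

        TrackingWriter(RecordWriter<K, V> real) {
            this.real = real;
        }

        @Override
        public void write(K key, V value) {
            real.write(key, value);   // delegate to the format-specific writer
            records++;                // then update the counter
        }
    }

    /** Pushes two records through the chain and returns the record count. */
    static long run(ListWriter sink) {
        TrackingWriter<String, Integer> out = new TrackingWriter<>(sink);
        OutputCollector<String, Integer> collector =
            (key, value) -> out.write(key, value);
        collector.collect("a", 1);
        collector.collect("b", 2);
        return out.records;
    }

    public static void main(String[] args) {
        ListWriter sink = new ListWriter();
        System.out.println(run(sink) + " records, lines = " + sink.lines);
    }
}
```

The reducer only ever sees the collector, so the output format can be swapped by configuration without touching the reduce code.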

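The next phase of my study will probably lean toward the Task and Job stages, a more macro-level analysis of the process. I may also analyze how a particular functional block is implemented, such as Hadoop's IPC, which reportedly makes heavy use of Java NIO.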

Original URL: https://www.cnblogs.com/cxzdy/p/5043998.html