  • Explained: the MultipleOutputs class

    The MultipleOutputs class simplifies writing output data to multiple outputs.

    Case one: writing to additional outputs other than the job's default output. Each additional output, or named output, may be configured with its own OutputFormat, its own key class, and its own value class.

    Case two: writing data to different files whose paths are supplied by the user.

    MultipleOutputs supports counters, which are disabled by default; call MultipleOutputs.setCountersEnabled(job, true) to turn them on. The counter group is the MultipleOutputs class name, and each counter carries the name of its output. Each counter records the number of records written to that output.
    // Job configuration template
    Usage pattern for job submission:

        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job, inDir);
        FileOutputFormat.setOutputPath(job, outDir);
        job.setMapperClass(MOMap.class);
        job.setReducerClass(MOReduce.class);
        ...
        // Define the named output 'text' with TextOutputFormat
        MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, LongWritable.class, Text.class);
        // Define the named output 'seq' with SequenceFileOutputFormat
        MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, LongWritable.class, Text.class);
        ...
        job.waitForCompletion(true);
        ...

    // Use inside the reducer
    Usage in Reducer:

        String generateFileName(K k, V v) {
            return k.toString() + "_" + v.toString();
        }

        public class MOReduce extends Reducer<WritableComparable, Writable, WritableComparable, Writable> {

            // 1. Declare a field of type MultipleOutputs
            private MultipleOutputs mos;

            public void setup(Context context) {
                ...
                // 2. Initialize it in setup()
                mos = new MultipleOutputs(context);
            }

            public void reduce(WritableComparable key, Iterable<Writable> values, Context context)
                    throws IOException, InterruptedException {
                ...
                // 3. Emit records in reduce() through MultipleOutputs.write()
                mos.write("text", key, new Text("Hello"));

                /**
                 * Parameters:
                 * - the named output
                 * - the output key
                 * - the output value
                 * - the base output path
                 */
                mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
                mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");

                // Form without a named output: key, value, base output path
                mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
                ...
            }

            public void cleanup(Context context) throws IOException, InterruptedException {
                // 4. Close the MultipleOutputs streams
                mos.close();
                ...
            }
        }
    When used in conjunction with org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat, MultipleOutputs can mimic the behaviour of MultipleTextOutputFormat and MultipleSequenceFileOutputFormat from the old Hadoop API, i.e., output can be written from the Reducer to more than one location.
    // The following method needs no named output
    Use MultipleOutputs.write(KEYOUT key, VALUEOUT value, String baseOutputPath) to write key and value to a path specified by baseOutputPath, with no need to specify a named output:

        // Declare the field
        private MultipleOutputs<Text, Text> out;

        public void setup(Context context) {
            // Initialize the field
            out = new MultipleOutputs<Text, Text>(context);
            ...
        }

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text t : values) {
                /**
                 * Call write() with:
                 * - the output key
                 * - the output value
                 * - the base output path
                 */
                out.write(key, t, generateFileName(<parameter list...>));
            }
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Close the output streams
            out.close();
        }
    // Custom method that builds the base path; whether the path contains "/" decides whether subdirectories are created
    Use your own code in generateFileName() to create a custom path to your results. '/' characters in baseOutputPath will be translated into directory levels in your file system. Also, append "part" or similar to your custom-generated path, otherwise your output files will be named -00000, -00001, etc. No call to context.write() is necessary. See the example generateFileName() code below.

        private String generateFileName(Text k) {
            // Expect Text k in the format "Surname|Forename"
            String[] kStr = k.toString().split("\\|");
            String sName = kStr[0];
            String fName = kStr[1];

            // Example for k = Smith|John:
            // output is written to /user/hadoop/path/to/output/Smith/John-r-00000 (etc.)
            return sName + "/" + fName;
        }
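
    As a standalone illustration of the path mapping described above, the following sketch splits a "Surname|Forename" key and shows the file name it would lead to. It uses plain String keys instead of Hadoop's Text so that it runs without a Hadoop classpath. The class name BaseOutputPathSketch and the helper reducerFile() are hypothetical; the -r-00000 suffix only mirrors the example in the text and does not call any Hadoop API.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Sketch: how a "Surname|Forename" key could map to an output file path. */
final class BaseOutputPathSketch {

    // Builds the relative base path "Surname/Forename" from a key.
    // Pattern.quote escapes '|', which is otherwise the regex alternation operator.
    static String generateFileName(String key) {
        String[] parts = key.split(Pattern.quote("|"));
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected Surname|Forename, got: " + key);
        }
        return parts[0] + "/" + parts[1];
    }

    // Hypothetical helper: reproduces the "<base>-r-<5-digit partition>" naming
    // used in the example comment; it is an assumption, not a Hadoop call.
    static String reducerFile(String outputDir, String basePath, int partition) {
        return String.format(Locale.ROOT, "%s/%s-r-%05d", outputDir, basePath, partition);
    }

    public static void main(String[] args) {
        String base = generateFileName("Smith|John");
        System.out.println(base);
        System.out.println(reducerFile("/user/hadoop/path/to/output", base, 0));
    }
}
```

    Because the base path contains a '/', the Smith part becomes a directory and John becomes the file-name prefix inside it.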
    Using MultipleOutputs in this way will still create zero-sized default output, e.g. part-00000. To prevent this, use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); instead of job.setOutputFormatClass(TextOutputFormat.class); in your Hadoop job configuration.
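
    The effect of creating output lazily can be illustrated without Hadoop. The following is a conceptual sketch only, assuming the empty default file comes from a writer that opens its file before any record arrives. LazyWriterSketch and fileCreatedAfter() are hypothetical names, and this is not the actual implementation of LazyOutputFormat.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Conceptual sketch: a writer that creates its file only on the first write. */
final class LazyWriterSketch implements AutoCloseable {
    private final Path target;
    private BufferedWriter writer; // stays null until the first record arrives

    LazyWriterSketch(Path target) {
        this.target = target;
    }

    void write(String key, String value) throws IOException {
        if (writer == null) {
            // The file is created here, on first write, not in the constructor.
            writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
        }
        writer.write(key + "\t" + value);
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        if (writer != null) {
            writer.close();
        }
    }

    // Writes `records` records through a lazy writer in a fresh temp directory
    // and reports whether the target file exists afterwards.
    static boolean fileCreatedAfter(int records) {
        try {
            Path dir = Files.createTempDirectory("lazy-output");
            Path target = dir.resolve("part-r-00000");
            try (LazyWriterSketch w = new LazyWriterSketch(target)) {
                for (int i = 0; i < records; i++) {
                    w.write("key" + i, "value" + i);
                }
            }
            return Files.exists(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file created with 0 records: " + fileCreatedAfter(0));
        System.out.println("file created with 2 records: " + fileCreatedAfter(2));
    }
}
```

    In a job where every record goes through MultipleOutputs and nothing is written with context.write(), the default output corresponds to the zero-record case in this sketch.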

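    Summary (MR example: multi-file output with MultipleOutputs)

    • The write() methods that take a named output require a matching call in the job configuration: MultipleOutputs.addNamedOutput(Job job, String namedOutput, Class<? extends OutputFormat> outputFormatClass, Class<?> keyClass, Class<?> valueClass);
    • The write() method that takes no named output needs no such call.
    • To keep the job from producing the empty default file [part-*-00000], use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); in the job configuration.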
  • Original article: https://www.cnblogs.com/skyl/p/4753696.html