[hadoop] Implementing Compression and Decompression in MapReduce

When reposting, please credit the source: http://www.cnblogs.com/zhengrunjian/p/4527269.html

1 Compressed files as input

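When a compressed file is used as MapReduce input, MapReduce automatically finds the matching codec from the file extension and decompresses the data.

This works as long as the compressed file carries the extension of its compression format (for example .lzo, .gz, or .bz2): Hadoop selects the decoder based on that extension.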
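
The extension-based selection described above can be pictured as a simple suffix lookup. In Hadoop the lookup is performed by CompressionCodecFactory; the class below is only a minimal JDK-only sketch of the idea, not Hadoop's implementation. The class name ExtensionCodecLookup and the method codecFor are hypothetical, and the .lzo entry assumes the third-party hadoop-lzo library is installed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative stand-in for extension-based codec lookup; not Hadoop's own code. */
class ExtensionCodecLookup {

    // File-name suffix -> codec class name. The .lzo entry assumes the
    // third-party hadoop-lzo library; the others ship with Hadoop.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
    }

    /** Returns the codec class name whose suffix matches the file name, if any. */
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "lzotest" has no extension, so no codec can be inferred from its name.
        for (String name : new String[] {"input/data.gz", "logs/part.bz2", "lzotest"}) {
            System.out.println(name + " -> " + codecFor(name).orElse("(no codec inferred)"));
        }
    }
}
```

A file name without a recognised suffix yields no codec, which is why such input needs an explicitly configured input format.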

Hadoop's support for each compression format is detailed in the table below:

If the compressed file has no extension, you must specify the input format explicitly when you run the MapReduce job:

hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
    -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py \
    -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py \
    -input lzotest -output result4 \
    -jobconf mapred.reduce.tasks=1 \
    -inputformat org.apache.hadoop.mapred.LzoTextInputFormat

2 Compressed output

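To compress the output files of a MapReduce job, set mapred.output.compress to true and mapred.output.compression.codec to the class name of the codec you want to use. You can also set both properties in code by calling static methods on FileOutputFormat, as in the following listing: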

package com.sweetop.styhadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Max-temperature job that writes gzip-compressed output.
 * User: lastsweetop
 * Date: 13-6-27, 7:48 PM
 */
public class MaxTemperatureWithCompression {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCompression <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        // Locate the job jar through the driver class itself.
        job.setJarByClass(MaxTemperatureWithCompression.class);
        job.setJobName("Max Temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The mapper and reducer classes are defined elsewhere in this package.
        job.setMapperClass(MaxTemperatrueMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Turn on output compression and select gzip as the codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The input here is also a compressed file:

~/hadoop/bin/hadoop com.sweetop.styhadoop.MaxTemperatureWithCompression input/data.gz output/

Every output part is compressed. This job produces only one part; here is the compressed output:

[hadoop@namenode test]$ hadoop fs -get output/part-r-00000.gz .
[hadoop@namenode test]$ ls
1901  1902  ch2  ch3  ch4  data.gz  news.gz  news.txt  part-r-00000.gz
[hadoop@namenode test]$ gunzip -c part-r-00000.gz
1901	317
1902	244
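
The gunzip step works because each compressed part is an ordinary gzip stream. As a rough illustration, the same check can be done from Java with the JDK's GZIPInputStream. This is a minimal sketch: the class GzipPartReader and its methods are hypothetical, and a temporary local file stands in for a part file already fetched from HDFS.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Reads a local gzip text file; Path here is java.nio.file.Path, not Hadoop's Path. */
class GzipPartReader {

    /** Returns all lines of a gzip-compressed text file, much like gunzip -c. */
    static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Writes lines to a gzip-compressed text file (stand-in for a job's output part). */
    static void writeLines(Path gzFile, List<String> lines) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(gzFile)), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    /** Writes the lines to a temporary .gz file, reads them back, and cleans up. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path part = Files.createTempFile("part-r-00000", ".gz");
            try {
                writeLines(part, lines);
                return readLines(part);
            } finally {
                Files.deleteIfExists(part);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : roundTrip(Arrays.asList("1901\t317", "1902\t244"))) {
            System.out.println(line);
        }
    }
}
```

To read a real part file, call readLines with the path of the local copy instead of going through roundTrip.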

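If you write the output as a sequence file, set the mapred.output.compression.type property to choose the compression type. The default is RECORD, which compresses each record individually. BLOCK compresses a group of records together and therefore achieves a better compression ratio.

You can also set this in code by calling the setOutputCompressionType method of SequenceFileOutputFormat: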

SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
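
To see why BLOCK tends to compress better than RECORD, it helps to compare compressing many small records one by one against compressing them as a single group. The sketch below uses only the JDK's Deflater and deliberately ignores the real sequence file layout; the class RecordVsBlockCompression, its methods, and the sample records are all made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

/** Compares per-record and grouped DEFLATE compression on synthetic records. */
class RecordVsBlockCompression {

    /** Compresses one byte array with DEFLATE and returns the compressed size in bytes. */
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** RECORD-style: every record is compressed on its own. */
    static int recordStyleSize(List<String> records) {
        int total = 0;
        for (String record : records) {
            total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** BLOCK-style: the records are concatenated and compressed together. */
    static int blockStyleSize(List<String> records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            block.write(bytes, 0, bytes.length);
        }
        return deflatedSize(block.toByteArray());
    }

    /** Builds n short, repetitive records (synthetic data, not from a real job). */
    static List<String> sampleRecords(int n) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add("station-" + (i % 10) + "\ttemperature=" + (200 + i % 50));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(1000);
        System.out.println("record-style bytes: " + recordStyleSize(records));
        System.out.println("block-style bytes:  " + blockStyleSize(records));
    }
}
```

Short records give the compressor almost no redundancy to exploit and each one pays a fixed stream overhead, whereas a group of records exposes the repetition between them.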

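If you run the MapReduce job through the Tool interface, you can set these properties on the command line, which is much better than hard-coding them.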
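
With the Tool interface, generic options of the form -D property=value are parsed into the job configuration before the job sees its remaining arguments. The sketch below imitates only that one behaviour with plain JDK classes to show the idea; it is not Hadoop's parser, and the class GenericOptionsSketch and its members are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Imitates how "-D key=value" pairs are split from the application arguments. */
class GenericOptionsSketch {

    /** Holds the parsed -D properties and whatever arguments are left over. */
    static final class Parsed {
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<String> remainingArgs = new ArrayList<>();
    }

    /** Separates "-D key=value" pairs from the other arguments. */
    static Parsed parse(String... args) {
        Parsed parsed = new Parsed();
        for (int i = 0; i < args.length; i++) {
            boolean isProperty = "-D".equals(args[i])
                    && i + 1 < args.length
                    && args[i + 1].contains("=");
            if (isProperty) {
                String[] keyValue = args[++i].split("=", 2);
                parsed.properties.put(keyValue[0], keyValue[1]);
            } else {
                parsed.remainingArgs.add(args[i]);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed parsed = parse(
                "-D", "mapred.output.compress=true",
                "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
                "input/data.gz", "output/");
        System.out.println("properties: " + parsed.properties);
        System.out.println("job args:   " + parsed.remainingArgs);
    }
}
```

The benefit is that the same jar can be run with or without compression by changing only the command line.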

3 Compressing map output

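Even when the input and output of a MapReduce job are uncompressed files, you can still compress the intermediate output of the map tasks. That output is written to local disk and transferred over the network to the reduce nodes, so compressing it can improve performance considerably. Again, only two properties need to be set. In code it looks like this: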

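// Enable compression of the intermediate map output and pick the codec.
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);
// Build the job from this configuration so that the settings take effect.
Job job = new Job(conf);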

Reposted from: http://blog.csdn.net/lastsweetop/article/details/9187721
Original article: https://www.cnblogs.com/zhengrunjian/p/4527269.html
