  • MapReduce: Weather Dataset

    Analyzing a weather dataset with a MapReduce program is a good way to understand how the computation works.

    Environment: Hadoop 1.2.1 and CentOS 6.5 x64

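    1. Preparing the weather dataset

    Download link: ftp://ftp3.ncdc.noaa.gov/pub/data

    The complete dataset is very large; downloading a subset is enough for day-to-day experiments.

    2. Uploading the weather data to HDFS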

    [huser@master 1971]$ ls
    034700-99999-1971.gz  273730-99999-1971.gz  338850-99999-1971.gz  943290-99999-1971.gz
    035623-99999-1971.gz  273930-99999-1971.gz  338870-99999-1971.gz  943320-99999-1971.gz
    035833-99999-1971.gz  274020-99999-1971.gz  338890-99999-1971.gz  943330-99999-1971.gz
    035963-99999-1971.gz  274120-99999-1971.gz  338930-99999-1971.gz  943350-99999-1971.gz
    036880-99999-1971.gz  274280-99999-1971.gz  338960-99999-1971.gz  943400-99999-1971.gz
    040180-16201-1971.gz  274790-99999-1971.gz  338980-99999-1971.gz  943430-99999-1971.gz
    041650-99999-1971.gz  274850-99999-1971.gz  339020-99999-1971.gz  943549-99999-1971.gz
    041750-99999-1971.gz  275020-99999-1971.gz  339070-99999-1971.gz  943550-99999-1971.gz
    042350-99999-1971.gz  275090-99999-1971.gz  339100-99999-1971.gz  943660-99999-1971.gz
    061800-99999-1971.gz  275320-99999-1971.gz  339150-99999-1971.gz  943670-99999-1971.gz
    [huser@master 1971]$ zcat *.gz > sample.txt
    [huser@master hadoop-1.2.1]$ bin/hadoop fs -put /home/huser/hadoop/1971/sample.txt /user/huser/in/
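
    For readers who would rather do the decompress-and-concatenate step in Java than with zcat, here is a minimal sketch. It is not part of the original walkthrough: the class name GzConcat and its arguments are made up for illustration, and it only uses the JDK (java.util.zip and java.nio.file), not Hadoop.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: roughly what `zcat *.gz > sample.txt` does.
public class GzConcat {
    /** Decompresses every *.gz file in dir (sorted by name) into one output file. */
    public static long concat(Path dir, Path out) throws IOException {
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.gz")) {
            for (Path p : stream) {
                inputs.add(p);
            }
        }
        Collections.sort(inputs); // the shell also expands *.gz in sorted order

        long total = 0;
        byte[] buffer = new byte[8192];
        try (OutputStream os = Files.newOutputStream(out)) {
            for (Path p : inputs) {
                try (InputStream in = new GZIPInputStream(Files.newInputStream(p))) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        os.write(buffer, 0, n);
                        total += n;
                    }
                }
            }
        }
        return total; // number of uncompressed bytes written
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: GzConcat <directory with .gz files> <output file>");
            System.exit(-1);
        }
        long bytes = concat(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Wrote " + bytes + " bytes");
    }
}
```

    Running it as "java GzConcat /home/huser/hadoop/1971 /home/huser/hadoop/1971/sample.txt" should produce the same sample.txt that is then uploaded with "hadoop fs -put".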

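    3. Writing the MapReduce program

    Following Hadoop: The Definitive Guide, the program below (excerpted from the book) computes the highest temperature recorded in each year.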

    // MaxTemperatureMapper.java: emits (year, temperature) for every valid reading
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class MaxTemperatureMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;
    
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            // parseInt doesn't like leading plus signs
            if (line.charAt(87) == '+') {
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // MaxTemperatureReducer.java: keeps the largest temperature seen for each year
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    public class MaxTemperatureReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    // MaxTemperature.java: driver that configures and submits the job
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class MaxTemperature {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err
                        .println("Usage: MaxTemperature <input path> <output path>");
                System.exit(-1);
            }
            Job job = new Job();
            job.setJarByClass(MaxTemperature.class);
            job.setJobName("Max temperature");
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
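
    To see what the mapper and reducer compute without starting a cluster, the same logic can be tried in plain Java. The sketch below is only an illustration: the class LocalMaxTemperature and the helper record() are hypothetical, the records are synthetic (only the columns the mapper reads are filled in), and grouping and reducing are folded into a single running maximum instead of a real shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the mapper and reducer; no Hadoop required.
public class LocalMaxTemperature {
    private static final int MISSING = 9999;

    /** Builds a synthetic fixed-width record; every other column is filled with '0'. */
    public static String record(String year, String signedTemp, char quality) {
        char[] line = new char[93];
        Arrays.fill(line, '0');
        year.getChars(0, 4, line, 15);       // columns 15-18: year
        signedTemp.getChars(0, 5, line, 87); // columns 87-91: sign plus four digits
        line[92] = quality;                  // column 92: quality code
        return new String(line);
    }

    /** Same parsing rules as the mapper; returns null when the reading is rejected. */
    public static Integer temperature(String line) {
        int start = line.charAt(87) == '+' ? 88 : 87; // skip a leading plus sign
        int value = Integer.parseInt(line.substring(start, 92));
        String quality = line.substring(92, 93);
        if (value != MISSING && quality.matches("[01459]")) {
            return value;
        }
        return null;
    }

    /** Map, group and reduce in one pass: keeps the largest reading per year. */
    public static Map<String, Integer> maxByYear(List<String> lines) {
        Map<String, Integer> max = new TreeMap<>();
        for (String line : lines) {
            Integer t = temperature(line);
            if (t != null) {
                max.merge(line.substring(15, 19), t, Integer::max);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                record("1971", "+0478", '1'),
                record("1971", "-0011", '1'),
                record("1971", "+9999", '1'),  // missing reading, dropped
                record("1971", "+0500", '2')); // quality code 2, dropped
        System.out.println(maxByYear(lines)); // prints {1971=478}
    }
}
```

    The two dropped records show why the job emits fewer map output records than it reads: missing readings and readings with an unaccepted quality code never reach the reducer.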

    4. Compiling the program

    [huser@master bin]$ javac -classpath ../hadoop-core-1.2.1.jar *.java

    5. Running the program

    [huser@master bin]$ ../bin/hadoop MaxTemperature ./in/sample.txt ./out6
    Warning: $HADOOP_HOME is deprecated.
    
    14/04/18 15:31:15 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/04/18 15:31:16 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    14/04/18 15:31:16 INFO input.FileInputFormat: Total input paths to process : 1
    14/04/18 15:31:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    14/04/18 15:31:16 WARN snappy.LoadSnappy: Snappy native library not loaded
    14/04/18 15:31:17 INFO mapred.JobClient: Running job: job_201404181009_0003
    14/04/18 15:31:18 INFO mapred.JobClient:  map 0% reduce 0%
    14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000002_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
            at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:270)
            at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
            ... 8 more
    
    14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stdout
    14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000002_0&filter=stderr
    14/04/18 15:31:33 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000003_0, Status : FAILED
    14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stdout
    14/04/18 15:31:33 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave1:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000003_0&filter=stderr
    14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000000_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
            at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:270)
            at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
            ... 8 more
    
    14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stdout
    14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000000_0&filter=stderr
    14/04/18 15:31:37 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000001_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
            at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:270)
            at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
            ... 8 more
    
    14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stdout
    14/04/18 15:31:37 WARN mapred.JobClient: Error reading task outputServer returned HTTP response code: 401 for URL: http://slave2:50060/tasklog?plaintext=true&attemptid=attempt_201404181009_0003_m_000001_0&filter=stderr
    14/04/18 15:31:41 INFO mapred.JobClient: Task Id : attempt_201404181009_0003_m_000006_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
            at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
            at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
            at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: java.lang.ClassNotFoundException: MaxTemperatureMapper
            at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
            at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
            at java.security.AccessController.doPrivileged(Native Method)
            at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:270)
            at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
            ... 8 more

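    The job fails because the Java program consists of three classes and, when started this way, no job JAR is shipped to the task nodes (see the "No job jar file set" warning above), so the tasks cannot find MaxTemperatureMapper. The classes need to be packaged into a JAR.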

    [huser@master bin]$ jar cvf MaxTemperature.jar *.class
    added manifest
    adding: MaxTemperature.class(in = 1418) (out= 801)(deflated 43%)
    adding: MaxTemperatureMapper.class(in = 1876) (out= 804)(deflated 57%)
    adding: MaxTemperatureReducer.class(in = 1664) (out= 707)(deflated 57%)
    
    [huser@master bin]$ ls
    hadoop                      MaxTemperatureMapper.java    start-jobhistoryserver.sh
    hadoop-config.sh            MaxTemperatureReducer.class  start-mapred.sh
    hadoop-daemon.sh            MaxTemperatureReducer.java   stop-all.sh
    hadoop-daemons.sh           rcc                          stop-balancer.sh
    MaxTemperature.class        slaves.sh                    stop-dfs.sh
    MaxTemperature.jar          start-all.sh                 stop-jobhistoryserver.sh
    MaxTemperature.java         start-balancer.sh            stop-mapred.sh
    MaxTemperatureMapper.class  start-dfs.sh                 task-controller
    
    [huser@master bin]$ rm -rf *.class
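
    The "No job jar file set" warning in the failed run suggests that job.setJarByClass() found no JAR containing MaxTemperature, which is what one would expect when the class is loaded from a loose .class file. The sketch below is plain JDK code, not Hadoop's own implementation; the class name WhereIsMyClass is hypothetical. It reports where a class was loaded from, which is a quick way to check this before submitting a job.

```java
import java.net.URL;

// Hypothetical diagnostic: tells whether a class comes from a JAR or a loose .class file.
public class WhereIsMyClass {
    /** Returns the URL of the .class resource backing the class, or null if it cannot be found. */
    public static URL locate(Class<?> c) {
        String resource = c.getName().replace('.', '/') + ".class";
        ClassLoader loader = c.getClassLoader();
        // Classes loaded by the bootstrap loader report a null class loader.
        return loader != null ? loader.getResource(resource)
                              : ClassLoader.getSystemResource(resource);
    }

    /** True when the class resource is addressed with a jar: URL. */
    public static boolean isInsideJar(Class<?> c) {
        URL url = locate(c);
        return url != null && "jar".equals(url.getProtocol());
    }

    public static void main(String[] args) {
        URL url = locate(WhereIsMyClass.class);
        System.out.println("Loaded from: " + url);
        System.out.println("Inside a JAR: " + isInsideJar(WhereIsMyClass.class));
    }
}
```

    Compiled into a directory and run from there, it would typically report a file: URL; run from a JAR on the classpath, it would report a jar: URL.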

    Running the program from the JAR

    [huser@master bin]$ ../bin/hadoop jar MaxTemperature.jar MaxTemperature ./in/sample.txt ./out7
    Warning: $HADOOP_HOME is deprecated.

    14/04/18 15:42:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/04/18 15:42:48 INFO input.FileInputFormat: Total input paths to process : 1
    14/04/18 15:42:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    14/04/18 15:42:48 WARN snappy.LoadSnappy: Snappy native library not loaded
    14/04/18 15:43:50 INFO mapred.JobClient: Running job: job_201404181009_0005
    14/04/18 15:43:52 INFO mapred.JobClient: map 0% reduce 0%
    14/04/18 15:51:04 INFO mapred.JobClient: map 1% reduce 0%
    14/04/18 15:51:42 INFO mapred.JobClient: map 2% reduce 0%
    14/04/18 15:51:43 INFO mapred.JobClient: map 10% reduce 0%
    14/04/18 15:52:46 INFO mapred.JobClient: map 11% reduce 0%
    14/04/18 15:53:03 INFO mapred.JobClient: map 12% reduce 0%
    14/04/18 15:53:14 INFO mapred.JobClient: map 13% reduce 0%
    14/04/18 15:53:16 INFO mapred.JobClient: map 14% reduce 0%
    14/04/18 15:53:19 INFO mapred.JobClient: map 15% reduce 0%
    14/04/18 15:53:22 INFO mapred.JobClient: map 16% reduce 0%
    14/04/18 15:53:32 INFO mapred.JobClient: map 18% reduce 0%
    14/04/18 15:54:09 INFO mapred.JobClient: map 19% reduce 0%
    14/04/18 16:00:36 INFO mapred.JobClient: map 98% reduce 26%
    14/04/18 16:00:41 INFO mapred.JobClient: map 98% reduce 30%
    14/04/18 16:00:45 INFO mapred.JobClient: map 100% reduce 30%
    14/04/18 16:00:56 INFO mapred.JobClient: map 100% reduce 33%
    14/04/18 16:01:13 INFO mapred.JobClient: map 100% reduce 100%
    14/04/18 16:01:25 INFO mapred.JobClient: Job complete: job_201404181009_0005
    14/04/18 16:01:25 INFO mapred.JobClient: Counters: 30
    14/04/18 16:01:25 INFO mapred.JobClient: Job Counters
    14/04/18 16:01:25 INFO mapred.JobClient: Launched reduce tasks=1
    14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=2001708
    14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
    14/04/18 16:01:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
    14/04/18 16:01:25 INFO mapred.JobClient: Rack-local map tasks=3
    14/04/18 16:01:25 INFO mapred.JobClient: Launched map tasks=11
    14/04/18 16:01:25 INFO mapred.JobClient: Data-local map tasks=8
    14/04/18 16:01:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=638749
    14/04/18 16:01:25 INFO mapred.JobClient: File Output Format Counters
    14/04/18 16:01:25 INFO mapred.JobClient: Bytes Written=9
    14/04/18 16:01:25 INFO mapred.JobClient: FileSystemCounters
    14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_READ=111429430
    14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_READ=1311937676
    14/04/18 16:01:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=167764543
    14/04/18 16:01:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=9
    14/04/18 16:01:25 INFO mapred.JobClient: File Input Format Counters
    14/04/18 16:01:25 INFO mapred.JobClient: Bytes Read=1311936596
    14/04/18 16:01:25 INFO mapred.JobClient: Map-Reduce Framework
    14/04/18 16:01:25 INFO mapred.JobClient: Map output materialized bytes=55714697
    14/04/18 16:01:25 INFO mapred.JobClient: Map input records=5140229
    14/04/18 16:01:25 INFO mapred.JobClient: Reduce shuffle bytes=55714697
    14/04/18 16:01:25 INFO mapred.JobClient: Spilled Records=15194901
    14/04/18 16:01:25 INFO mapred.JobClient: Map output bytes=45584703
    14/04/18 16:01:25 INFO mapred.JobClient: Total committed heap usage (bytes)=2127904768
    14/04/18 16:01:25 INFO mapred.JobClient: CPU time spent (ms)=118580
    14/04/18 16:01:25 INFO mapred.JobClient: Combine input records=0
    14/04/18 16:01:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=1080
    14/04/18 16:01:25 INFO mapred.JobClient: Reduce input records=5064967
    14/04/18 16:01:25 INFO mapred.JobClient: Reduce input groups=1
    14/04/18 16:01:25 INFO mapred.JobClient: Combine output records=0
    14/04/18 16:01:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=1685221376
    14/04/18 16:01:25 INFO mapred.JobClient: Reduce output records=1
    14/04/18 16:01:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7951810560
    14/04/18 16:01:25 INFO mapred.JobClient: Map output records=5064967
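
    The counters can be tied back to the code. "Map input records" is the number of lines read, and "Map output records" is the number of (year, temperature) pairs that passed the mapper's MISSING and quality checks; with "Combine input records=0" every one of those pairs reaches the single reducer, and "Reduce input groups=1" reflects the single year in the sample. A small sketch of that arithmetic, with a hypothetical class name and the two counter lines copied from the log above:

```java
// Hypothetical helper: derives how many records the mapper filtered out.
public class DroppedRecords {
    /** Extracts the number after the last '=' in a counter line. */
    public static long counterValue(String line) {
        return Long.parseLong(line.substring(line.lastIndexOf('=') + 1).trim());
    }

    public static void main(String[] args) {
        long input = counterValue(
                "14/04/18 16:01:25 INFO mapred.JobClient: Map input records=5140229");
        long output = counterValue(
                "14/04/18 16:01:25 INFO mapred.JobClient: Map output records=5064967");
        // Records with a missing temperature or an unaccepted quality code
        System.out.println("Dropped by the mapper: " + (input - output)); // prints Dropped by the mapper: 75262
    }
}
```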

    Viewing the result

    [huser@master bin]$ ../bin/hadoop fs -cat ./out7/part-r-00000
    Warning: $HADOOP_HOME is deprecated.
    
    1971    478
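
    The NCDC records store air temperature in tenths of a degree Celsius, so 478 means 47.8 degrees Celsius. A minimal sketch for reading such an output line follows; the class name ReadResult is hypothetical, and it assumes the year and the value are separated by whitespace, as in the output above.

```java
// Hypothetical helper: converts a job output line into degrees Celsius.
public class ReadResult {
    /** Returns the year field of a "<year> <temperature>" line. */
    public static String year(String outputLine) {
        return outputLine.trim().split("\\s+")[0];
    }

    /** Returns the temperature in degrees Celsius (the file stores tenths of a degree). */
    public static double celsius(String outputLine) {
        String[] fields = outputLine.trim().split("\\s+");
        return Integer.parseInt(fields[1]) / 10.0;
    }

    public static void main(String[] args) {
        String line = "1971    478";
        System.out.println(year(line) + ": " + celsius(line) + " degrees Celsius");
        // prints 1971: 47.8 degrees Celsius
    }
}
```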
  • Original article: https://www.cnblogs.com/guarder/p/3744766.html