  • MapReduce, a distributed processing framework

    1. Overview

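    • MapReduce originates from Google's MapReduce paper, published in December 2004.
    • Strengths: offline (batch) processing of massive data sets; easy to develop; easy to run.
    • Weakness: poorly suited to real-time stream processing.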

    2. Getting started: the WordCount example

      

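      Input is read through an InputFormat, and each line read is passed to map. The shuffle phase then sorts and groups the map output by key and hands it to the reducers, and finally an OutputFormat writes the records to the file system (HDFS).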
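To make the map, shuffle, and reduce stages concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop at all: the class name `WordCountFlowSketch` and its methods are hypothetical and exist only for illustration, and a `TreeMap` stands in for the sort-and-group behaviour of the shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates the three stages in one JVM, without Hadoop.
public class WordCountFlowSketch {

    // Map stage: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return pairs;
    }

    // Shuffle stage: group the values by key; the TreeMap keeps the keys sorted.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum all values collected for one key.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the stages in order over a list of input lines.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

In a real job the framework runs these stages across many machines; the sketch only mirrors the order in which data flows through them.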

      Java source:

      

    package com.cracker.hadoop.mapreduce;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    import java.io.IOException;
    
    /**
     * A WordCount application built with MapReduce
     */
    public class WordCountApp {
    
        /**
         * Map: reads the input file line by line and emits (word, 1) pairs
         */
        public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    
            LongWritable one = new LongWritable(1);
    
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    
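                // One line of input
                String line = value.toString();
    
                // Split the line on the chosen delimiter
                String[] words = line.split(" ");
    
                for (String word : words) {
                    // Consecutive spaces produce empty tokens; do not count them as words
                    if (word.isEmpty()) {
                        continue;
                    }
                    // Emit the map result through the context
                    context.write(new Text(word), one);
                }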
    
            }
        }
    
        /**
         * Reduce: merges the values emitted for each key
         */
        public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException,
                    InterruptedException {
    
                long sum = 0;
                for (LongWritable value : values) {
                    // Sum the occurrences of this key
                    sum += value.get();
                }
    
                // Emit the final count
                context.write(key, new LongWritable(sum));
            }
        }
    
        /**
         * Driver: bundles all the settings of the MapReduce job
         */
        public static void main(String[] args) throws Exception {
    
            // Two arguments are required: the input path and the output path
            if (args.length < 2) {
                System.err.println("Usage: WordCountApp <input path> <output path>");
                System.exit(2);
            }
    
            // Create the Configuration
            Configuration configuration = new Configuration();
    
            // Create the Job
            Job job = Job.getInstance(configuration, "wordcount");
    
            // Set the class used to locate the job's jar
            job.setJarByClass(WordCountApp.class);
    
            // Set the job's input path
            FileInputFormat.setInputPaths(job, new Path(args[0]));
    
            // Map settings
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
    
            // Reduce settings
            job.setReducerClass(MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
    
            // Set the job's output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
            // Submit the job, wait for it to finish, and exit with 0 on success
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
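How a line is tokenized directly affects the counts. The following plain-Java sketch (the class `TokenizeSketch` and its method names are hypothetical, not part of the original program) compares splitting on a single space with splitting on runs of whitespace; the second form is one possible alternative if the input may contain tabs or repeated spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: compares two ways of tokenizing a line.
public class TokenizeSketch {

    // Split on a single space: repeated spaces yield empty-string tokens.
    static List<String> splitOnSingleSpace(String line) {
        return Arrays.asList(line.split(" "));
    }

    // Split on runs of whitespace (spaces, tabs) and drop empty tokens.
    static List<String> splitOnWhitespace(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "hello  world"; // two spaces between the words
        System.out.println(splitOnSingleSpace(line)); // prints [hello, , world]
        System.out.println(splitOnWhitespace(line));  // prints [hello, world]
    }
}
```

With the single-space split, the empty token would be emitted as a key of its own unless the mapper filters it out.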

      Related commands

      Build locally:

      mvn clean package -DskipTests

      Run on the server (the two arguments are the input path and the output path):

      hadoop jar /root/app/hadoop-train-1.0.jar com.cracker.hadoop.mapreduce.WordCountApp hdfs://localhost:8020/hello.txt  hdfs://localhost:8020/output/wc
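Because the driver does not set an output format, the job uses the default text output, in which each result line has the form key, a tab, then the value; the reducer output lands in files named like part-r-00000 under the output directory and can be viewed with `hadoop fs -cat`. As a rough sketch of post-processing such lines, the plain-Java helper below parses them back into a map. The class `OutputLineSketch` and the sample lines are hypothetical and only illustrate the format.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: parses "word<TAB>count" lines as written by the default text output.
public class OutputLineSketch {

    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that are not in key<TAB>value form
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines made up for this sketch, not real job output
        List<String> sample = Arrays.asList("hadoop\t1", "hello\t2");
        System.out.println(parse(sample)); // prints {hadoop=1, hello=2}
    }
}
```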

      

     

  • Original article: https://www.cnblogs.com/cracker13/p/10084098.html