  • Hadoop MapReduce Program Applications, Part 2

    Abstract: Counting words with a MapReduce program.

    Keywords: MapReduce program, word count

    Data source: two hand-written English documents, file1.txt and file2.txt.

    Contents of file1.txt

    Hello   Hadoop

    I   am  studying   the   Hadoop  technology

    Contents of file2.txt

    Hello  world

    The  world  is  very  beautiful

    I   love    the   Hadoop    and    world

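    Problem description:

    Count how often each word occurs in the hand-written English documents. The output must be sorted by word in alphabetical order.

    Solution:

    1  Development environment: VM10 + Ubuntu 12.04 + Hadoop 1.1.2

    2  Design: split the contents of the documents into words, group all identical words together, and finally compute the frequency of each word.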
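
    The three design steps (split, group, count) can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: the class name WordCountSketch and the method countWords are hypothetical and not part of the program listing, and a sorted in-memory map stands in for the grouping and sorting that the framework performs between the map and reduce phases.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper for illustration; it does not use any Hadoop API.
final class WordCountSketch {

    // Splits every line on whitespace, groups identical words in a sorted
    // map, and adds 1 to a word's count for each occurrence.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The lines of file1.txt and file2.txt from the data source above.
        List<String> lines = Arrays.asList(
                "Hello   Hadoop",
                "I   am  studying   the   Hadoop  technology",
                "Hello  world",
                "The  world  is  very  beautiful",
                "I   love    the   Hadoop    and    world");
        for (Map.Entry<String, Integer> entry : countWords(lines).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

    In the real job the splitting happens in the mapper and the summing in the reducer; the sketch collapses both into one loop only to make the data flow visible.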

    Program listing:

    package com.wangluqing;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        // Mapper: emits (word, 1) for every whitespace-separated token in a line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Strip generic Hadoop options; what remains is <in> <out>.
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
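
    The listing registers IntSumReducer as both the combiner and the reducer. Reusing it is safe here because addition is associative and commutative: summing per-mapper partial sums gives the same total as summing all the individual 1s. The plain-Java sketch below illustrates that property; the class CombinerSketch, its methods, and the way the values are split across two mappers are assumptions made for illustration, not Hadoop behaviour captured from a run.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; no Hadoop classes are involved.
final class CombinerSketch {

    // The same logic as IntSumReducer.reduce: add up all values for one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Reduce directly over every value emitted by both mappers.
    static int withoutCombiner(List<Integer> mapperA, List<Integer> mapperB) {
        int total = 0;
        total += sum(mapperA);
        total += sum(mapperB);
        return total;
    }

    // First combine each mapper's values locally, then reduce the partial sums.
    static int withCombiner(List<Integer> mapperA, List<Integer> mapperB) {
        List<Integer> partialSums = Arrays.asList(sum(mapperA), sum(mapperB));
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // Assumed split: "Hadoop" occurs twice in file1.txt and once in file2.txt.
        List<Integer> fromFile1 = Arrays.asList(1, 1);
        List<Integer> fromFile2 = Arrays.asList(1);
        System.out.println(withoutCombiner(fromFile1, fromFile2)); // prints 3
        System.out.println(withCombiner(fromFile1, fromFile2));    // prints 3
    }
}
```

    A reducer that computes something non-associative, such as an average, could not be reused as a combiner in this way.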

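    3  Running the program

    1) Create the input directory

    hadoop  fs  -mkdir   wordcount_input

    2) Upload the local English documents

    hadoop  fs -put  /usr/local/datasource/article/*   wordcount_input

    3) Compile WordCount.java and place the class files in the WordCount directory under the current directory (create that directory first if it does not exist).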

    root@hadoop:/usr/local/program/hadoop# javac -classpath hadoop-core-1.1.2.jar:lib/commons-cli-1.2.jar -d WordCount WordCount.java

    4) Package the compiled classes into a JAR file

    jar -cvf  wordcount.jar  -C  WordCount/  .

    5) Run the WordCount program with wordcount_input as the input directory and wordcount_output as the output directory.

    hadoop jar wordcount.jar  com.wangluqing.WordCount  wordcount_input  wordcount_output

    6) View the resulting word frequencies

    root@hadoop:/usr/local/program/hadoop# hadoop fs -cat wordcount_output/part-r-00000

    Hadoop 3
    Hello 2
    I 2
    The 1
    am 1
    and 1
    beautiful 1
    is 1
    love 1
    studying 1
    technology 1
    the 2
    very 1
    world 3
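
    In this output "The" is listed before "am", and "The" and "the" are counted separately. Hadoop compares Text keys byte by byte, so uppercase ASCII letters sort before lowercase ones. The sketch below reproduces that ordering with plain Java strings and contrasts it with a case-insensitive sort; the class SortOrderSketch and its method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the key ordering; no Hadoop classes are used.
final class SortOrderSketch {

    // Returns a sorted copy of the words. Natural String order matches the
    // byte order of ASCII keys; ignoreCase switches to dictionary-style order.
    static List<String> sorted(List<String> words, boolean ignoreCase) {
        List<String> copy = new ArrayList<String>(words);
        if (ignoreCase) {
            Collections.sort(copy, String.CASE_INSENSITIVE_ORDER);
        } else {
            Collections.sort(copy);
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "am", "The", "I", "world", "Hadoop");
        System.out.println(sorted(words, false)); // [Hadoop, I, The, am, the, world]
        System.out.println(sorted(words, true));  // [am, Hadoop, I, the, The, world]
    }
}
```

    If a strictly alphabetical, case-insensitive listing is wanted, one option is to lowercase each token in the mapper before emitting it, at the cost of merging "The" and "the" into one count.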

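    Summary:

    WordCount is the simplest and also the most representative MapReduce program. To some extent it reflects the original purpose behind the design of MapReduce, namely the analysis of log files.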

    Resources:

    1  http://www.wangluqing.com/2014/03/hadoop-mapreduce-programapp2/

    2  《Hadoop实战 第二版》 (Hadoop in Action, 2nd edition), by 陆嘉恒 (Lu Jiaheng), Chapter 5: MapReduce Application Cases

     

  • Original article: https://www.cnblogs.com/zfyouxi/p/4206285.html