
    WordCount Example

    package org.myorg;
            
    import java.io.IOException;
    import java.util.*;
            
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
            
    public class WordCount {
            
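     // Mapper: tokenizes each input line and emits a (word, 1) pair per token.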
     public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
            
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
     } 
            
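     // Reducer: sums all the 1s emitted for a given word into its total count.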
     public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context) 
          throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
     }
            
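     // Driver: configures the job and submits it to the cluster.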
     public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
            
        Job job = new Job(conf, "wordcount");
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
            
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class); // reuse the reducer as a combiner, as described below
        job.setReducerClass(Reduce.class);
            
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
            
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
            
        System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
            
    }
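
    To build the job jar before running it, a typical sequence looks like this (the core jar name and ${HADOOP_HOME} layout vary by release, so treat these paths as illustrative):

    mkdir wordcount_classes
    javac -classpath ${HADOOP_HOME}/hadoop-core.jar -d wordcount_classes WordCount.java
    jar -cvf wordcount.jar -C wordcount_classes/ .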

    The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.

    As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining each word into a single record.
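
    For example, suppose one mapper receives the single line "to be or not to be". The map phase emits (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); the combiner running on that mapper's output collapses these to (be,2), (not,1), (or,1), (to,2) before anything crosses the network, and the reducers then produce the final per-word totals.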

    To run the example, the command syntax is
    bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
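
    For example (the jar version, task counts, and paths below are hypothetical):

    bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -m 4 -r 2 /usr/joe/wordcount/input /usr/joe/wordcount/output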

    All of the files in the input directory (called in-dir in the command line above) are read and the counts of words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS (see ImportantConcepts). If your input is not already in HDFS, but is rather in a local file system somewhere, you need to copy the data into HDFS using a command like this:

    bin/hadoop dfs -mkdir <hdfs-dir>
    bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>

    As of version 0.17.2.1, the mkdir step is no longer required; you only need to run a command like this:
    bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
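
    Once the job finishes, the counts can be read back from HDFS (the exact part-file name depends on the version, e.g. part-00000 or part-r-00000):

    bin/hadoop dfs -cat <out-dir>/part-*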

    WordCount supports generic options: see DevelopmentCommandLineOptions.
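
    A minimal sketch of how generic options are typically wired in, assuming the standard org.apache.hadoop.util.Tool / ToolRunner API (the class name WordCountTool is hypothetical; Map and Reduce are the classes defined above):

    package org.myorg;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountTool extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            // By the time run() is called, ToolRunner has already folded generic
            // options (-D key=value, -files, -libjars, ...) into getConf();
            // args holds only the remaining application arguments.
            Job job = new Job(getConf(), "wordcount");
            job.setJarByClass(WordCountTool.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(WordCount.Map.class);
            job.setCombinerClass(WordCount.Reduce.class);
            job.setReducerClass(WordCount.Reduce.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new WordCountTool(), args));
        }
    }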
    Original article: https://www.cnblogs.com/chenli0513/p/2290742.html