  • WordCount Analysis

    1. Create a new Java project, then copy the examples folder from /home/hadoop/hadoop-1.0.4/src.

    Create a new folder named src in the project and paste the copied sources into it.

    If you hit "Error: Could not find or load main class":

    right-click the src folder --> Build Path --> Use as Source Folder

    2. Copy hadoop-1.0.4-eclipse-plugin.jar into Eclipse's plugins directory, then restart Eclipse.

    3. Set the Hadoop installation directory in the plugin's preferences and configure the Hadoop location.


    4. Attach the Hadoop source code to the project so you can browse the Hadoop sources freely.

    5. Java heap space error

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    
    The failing allocation is in MapTask$MapOutputBuffer, whose buffer size is derived from io.sort.mb (sortmb):

    int maxMemUsage = sortmb << 20;
    int recordCapacity = (int)(maxMemUsage * recper);
    recordCapacity -= recordCapacity % RECSIZE;
    kvbuffer = new byte[maxMemUsage - recordCapacity];

    So we should lower the value of io.sort.mb to avoid this.

    The machines I run on have a fairly low configuration: three nodes, each with only 512 MB of memory.

    I did not set this parameter in the cluster's configuration files; for this job I set it directly in the job's driver code:

    conf.set("io.sort.mb", "10");
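    As a rough illustration of the arithmetic above (a sketch with assumed values; 100 is the Hadoop 1.x default for io.sort.mb), sortmb << 20 simply converts megabytes to bytes:

    // Illustrative only: compare the default sort buffer size with the reduced one.
    public class SortBufferMath {
        public static void main(String[] args) {
            int defaultSortMb = 100;                 // Hadoop 1.x default for io.sort.mb
            int smallSortMb = 10;                    // the value set above for this job
            System.out.println(defaultSortMb << 20); // 104857600 bytes (~100 MB) per map task
            System.out.println(smallSortMb << 20);   // 10485760 bytes (~10 MB) per map task
            // On 512 MB nodes, a ~100 MB kvbuffer allocation can easily exceed the map
            // task's JVM heap and trigger the OutOfMemoryError above; ~10 MB fits comfortably.
        }
    }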

    6. Sample test data for WordCount:

    10
    9
    8
    7
    6
    5
    4
    3
    2
    1
    line1
    line3
    line2
    line5
    Line4
    
    The output file of the run is:

    1        1
    10        1
    2        1
    3        1
    4        1
    5        1
    6        1
    7        1
    8        1
    9        1
    line1        2
    line2        2
    line3        2
    line4        2
    line5        2
    line6        1

    There is also a file named _SUCCESS, which indicates that the job completed successfully.

    You can see that the output file is sorted. Sorting is done according to the key's type; in the WordCount example the key is a string (Text), so keys are compared lexicographically, which is why "10" appears before "2".
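    A small illustrative sketch (not from the original post) of this ordering: Text keys compare lexicographically byte by byte, whereas IntWritable keys would compare numerically.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class KeyOrderDemo {
        public static void main(String[] args) {
            // Negative: "10" sorts before "2" because '1' < '2' as bytes.
            System.out.println(new Text("10").compareTo(new Text("2")));
            // Positive: 10 sorts after 2 when compared numerically.
            System.out.println(new IntWritable(10).compareTo(new IntWritable(2)));
        }
    }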

    7. The WordCount example has no special handling for the case where the output directory already exists. To make repeated test runs convenient, we add the following code to delete it first:

    Path outPath = new Path(args[1]);
    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
    if (dfs.exists(outPath)) {
        dfs.delete(outPath, true);   // true = delete recursively
    }

    8. Why are the WordCount demo's mapper and reducer classes both static?

    (Why are the mapper and reducer in the WordCount example declared static? Is that really necessary?)

    If we remove the static keyword from the mapper class and run the job, we get the following exception:

    java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
    Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
        at java.lang.Class.getConstructor0(Class.java:2730)

    At this point the mapper has become a non-static inner class of the WordCount class, so the reflection helper cannot find its no-argument constructor and cannot instantiate it.

    The fix is to make the mapper class not an inner class: move it out of WordCount, either to the top level of the same file or into its own file. The job then runs fine again.

    As you can see, the example is kept as simple as possible by putting everything in one class; declaring the nested classes static is enough to make that work correctly. If your mapper and reducer are not particularly complex, this design is perfectly reasonable. If they are complex, it is better to pull each one out into its own class.
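    A minimal sketch (not from the original post; the class names are made up) of why the framework needs a top-level or static nested class: ReflectionUtils.newInstance() looks up a no-argument constructor via reflection, and a non-static inner class has none, because its implicit constructor takes the enclosing instance as a parameter.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StaticNestedDemo {

        // Static nested class: has an implicit no-arg constructor, so reflection can create it.
        public static class WorksLikeStaticMapper { }

        // Non-static inner class: its only constructor is WorksLikeInnerMapper(StaticNestedDemo),
        // so looking up a no-argument constructor fails.
        public class WorksLikeInnerMapper { }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Succeeds.
            ReflectionUtils.newInstance(WorksLikeStaticMapper.class, conf);
            // Throws RuntimeException caused by NoSuchMethodException: <init>().
            ReflectionUtils.newInstance(WorksLikeInnerMapper.class, conf);
        }
    }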

    9. By default, when we run or debug a job directly inside Eclipse, it is not executed on the Hadoop cluster at all; it is simulated inside the local process, which is convenient for debugging. You can see output mentioning LocalJobRunner in the console, rather than the JobTracker executing the job.

    This is also why, even when we set the number of reduce tasks to more than 1, we still see only a single output file (part-r-00000 or similar) in the output directory: LocalJobRunner supports only one reducer.

    It is still very useful to be able to write the code here and then have it execute on the real cluster, because some mistakes only show up when the code actually runs distributed (there are many things to watch out for when writing distributed applications).

    Submitting to the cluster essentially means packaging your code into a jar file and shipping it to the cluster, so that is what needs to be done here.

    I use spork's EJob class to do this; if you prefer, you can write your own, following http://www.cnblogs.com/spork/archive/2010/04/21/1717592.html.

    Follow that article and then make a few small adjustments in the driver code, as sketched below.
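    The key lines inside main(), extracted from the full listing in section 11 (paths and host names are the ones used there; adapt them to your environment):

    // Package the compiled classes (the Eclipse "bin" folder) into a temporary jar
    // and make sure the job is submitted to the cluster instead of LocalJobRunner.
    File jarFile = EJob.createTempJar("bin");
    EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf");   // pick up the cluster config files
    Thread.currentThread().setContextClassLoader(EJob.getClassLoader());

    Configuration conf = new Configuration();
    // Either rely on the config files added above, or point at the JobTracker explicitly:
    conf.set("mapred.job.tracker", "namenode:9001");

    Job job = new Job(conf, "word count");
    ((JobConf) job.getConfiguration()).setJar(jarFile.toString());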

    10.

    What if I want all words whose first letter is less than 'N' to be handled by the first reduce task, and everything else to be output by the second reduce task?

    Write your own partitioner class. The default partitioner is HashPartitioner; we implement a simple one ourselves and register it on the job, as sketched below.
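    A minimal version of that partitioner together with the driver lines that enable it (essentially the MyPartitioner class and job setup from the full listing in section 11):

    // Route keys whose first letter is before 'N' to reducer 0, everything else to reducer 1.
    public static class MyPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            if (key.toString().toUpperCase().charAt(0) < 'N') return 0;
            else return 1;
        }
    }

    // In the driver:
    job.setNumReduceTasks(2);                      // two reduce tasks, hence two output files
    job.setPartitionerClass(MyPartitioner.class);

    With this in place, a cluster run should produce two output files: part-r-00000 holding the keys that start with a character before 'N', and part-r-00001 holding the rest.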

    11. The complete, modified WordCount source code:

    package org.apache.hadoop.examples;
    
    import java.io.File;
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
     
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @SuppressWarnings("unused")
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            if (false) {
                // Original tokenizer-based implementation, kept for reference.
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            } else {
                String s = value.toString();
                String[] words = s.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    words[i] = words[i].replaceAll("[^\\w]", "");
                    // System.out.println(words[i]);
                    word.set(words[i].toUpperCase());
                    if (words[i].length() > 0)
                        context.write(word, one);
                }
            }
        }
    }
    
    public class WordCount {
    
    public static class MyPartitioner<K, V> extends Partitioner<K, V> {

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            if (key.toString().toUpperCase().charAt(0) < 'N') return 0;
            else return 1;
        }
    }
    
      public static class IntSumReducer 
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
    
        public void reduce(Text key, Iterable<IntWritable> values, 
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    
      public static void main(String[] args) throws Exception {
    	args= "hdfs://namenode:9000/user/hadoop/englishwords hdfs://namenode:9000/user/hadoop/out".split(" ");
    	
    	File jarFile = EJob.createTempJar("bin");
    	EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf");
    	//conf.set("mapred.job.tracker","namenode:9001");
    	ClassLoader classLoader = EJob.getClassLoader();
    	Thread.currentThread().setContextClassLoader(classLoader);
    	
    	
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        //drop output directory if exists
        Path outPath = new Path(args[1]);
    	FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
    	if (dfs.exists(outPath)) {
    		dfs.delete(outPath, true);
    	}
    	
    	conf.set("io.sort.mb","10");
        Job job = new Job(conf, "word count");
        
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
    job.setNumReduceTasks(2); // use two reduce tasks to process the work
        job.setPartitionerClass(MyPartitioner.class);
        
        
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }
  • Original article: https://www.cnblogs.com/huaxiaoyao/p/4295982.html