  • Hadoop -- Writing MapReduce code in Eclipse and running a word count on Hadoop

    1. Required jar files

    hadoop-2.4.1\share\hadoop\hdfs\hadoop-hdfs-2.4.1.jar
    all jars under hadoop-2.4.1\share\hadoop\hdfs\lib

    hadoop-2.4.1\share\hadoop\common\hadoop-common-2.4.1.jar
    all jars under hadoop-2.4.1\share\hadoop\common\lib

    all jars under hadoop-2.4.1\share\hadoop\mapreduce except hadoop-mapreduce-examples-2.4.1.jar
    all jars under hadoop-2.4.1\share\hadoop\mapreduce\lib
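
    If you compile outside Eclipse instead, the same jars can go on a plain javac classpath. A minimal sketch, assuming Hadoop is installed at $HADOOP_HOME on the Linux VM and the sources sit under kgc/mapred/:

    javac -cp "$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*" kgc/mapred/*.java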

    2. Code

    Mapper class

    package kgc.mapred;
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
    
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer words = new StringTokenizer(value.toString());
            // Alternatively: String[] words = value.toString().split("\\s+");
            while(words.hasMoreTokens()) {
                word.set(words.nextToken());
                context.write(word, one);
            }
        }
    }
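
    To see what the map step emits without a cluster, the tokenization can be run in plain Java. A minimal sketch (the class name MapSketch and the sample line are made up for illustration):

    import java.util.StringTokenizer;

    public class MapSketch {
        public static void main(String[] args) {
            // Mimics one call to map(): tokenize a line and emit (word, 1) pairs
            StringTokenizer words = new StringTokenizer("hello world hello");
            while (words.hasMoreTokens()) {
                System.out.println(words.nextToken() + "\t1");
            }
        }
    }

    This prints hello 1, world 1, hello 1; the framework then groups these pairs by key before they reach the reducer.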

    Reducer class

    package kgc.mapred;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable num : values) {
                count = count + num.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
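
    The grouping-plus-summing that the framework and this reducer perform together can also be mimicked locally. A minimal sketch in plain Java (ReduceSketch and the sample data are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class ReduceSketch {
        public static void main(String[] args) {
            // The keys the mappers emitted, before shuffling
            String[] mapOutputKeys = {"hello", "world", "hello"};
            Map<String, Integer> counts = new HashMap<>();
            for (String word : mapOutputKeys) {
                // Equivalent of count = count + num.get() in the reducer
                counts.merge(word, 1, Integer::sum);
            }
            // Prints hello 2 and world 1 (order not guaranteed)
            counts.forEach((k, v) -> System.out.println(k + "\t" + v));
        }
    }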

    Main class (job submission)

    package kgc.mapred;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    
    public class WordCount 
    {
        public static void main( String[] args ) throws Exception
        {
            // Hadoop configuration
            Configuration cfg = new Configuration();
            // The MapReduce job
            Job job = Job.getInstance(cfg, "WordCountMR");
    
        // Jar that contains the Mapper and Reducer definitions
            job.setJar("wordcount-0.0.1.jar");
            //job.setJarByClass(WordCount.class);
    
        // Class handling the map tasks
            job.setMapperClass(WordCountMapper.class);
        // Class handling the reduce tasks
            job.setReducerClass(WordCountReducer.class);
    
        // Input format
            job.setInputFormatClass(TextInputFormat.class);
        // TextInputFormat reads the input line by line (records end at a
        // return/newline); key = byte offset, value = the line's contents
        // Input path
            FileInputFormat.addInputPath(job,  new Path(args[0]));
    
        // Output format
            job.setOutputFormatClass(TextOutputFormat.class);
        // Output path (must not exist yet)
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
        // Output key/value types, shared by map and reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
    
        // Run the job and report success or failure via the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    
        }
    }

    3. Uploading the jar and running it on Hadoop

      1. If it is a Maven project, you can generate the jar directly with Run As --> Maven install. The remaining steps are the same.

     2. For an ordinary Java project, go to File --> Export --> Runnable JAR file, choose a save path, select the first radio button, and generate the jar (a command-line alternative is sketched below).

    Then upload the jar file to the virtual machine with Xshell.
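
    If you prefer to package from the command line, the jar can also be built with the JDK's jar tool. A minimal sketch, assuming the compiled .class files sit under a bin/ directory:

    jar cf wordcount-0.0.1.jar -C bin .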

    3. Upload the file to be processed to the virtual machine with Xshell as well, then put it into HDFS with the following command:

    hadoop fs -put <current path of the file> <target path in HDFS>
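
    For example (paths are hypothetical), to create /input on HDFS and upload words.txt from the VM:

    hadoop fs -mkdir -p /input
    hadoop fs -put /home/hadoop/words.txt /input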

    4. Run the jar against the input file with the following command:

    yarn jar <jar file> <input file in HDFS> <output path in HDFS (make sure it does not exist yet)>
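
    Continuing the hypothetical paths above, with the main class kgc.mapred.WordCount:

    yarn jar wordcount-0.0.1.jar kgc.mapred.WordCount /input/words.txt /output

    If the jar's manifest already names WordCount as its main class (as with a runnable jar export), omit the class name from the command.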

    5. The console then lists the job, reports its progress, and finally prints a completion summary.
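
    Once the job reports success, the result can be read back from HDFS (using the hypothetical output path from the example above):

    hadoop fs -cat /output/part-r-00000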

    This is visible not only on the console but also in Hadoop's web UIs: open IP:8088 in a browser to follow job progress in the YARN ResourceManager, or IP:50070 to inspect the generated files in the NameNode UI.
