  • InAction: top K with MapReduce


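    I only meant to practice on the Sogou data, but ended up stumbling into the top-K problem in MapReduce (MR). It looks simple in hindsight, but I learned a good deal from the trial and error along the way.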


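    The data is the search log published by Sogou Labs. Each line holds the access time, the user ID, the query in square brackets, the result rank and click order, and the clicked URL. The format is roughly: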

    00:00:00    2982199073774412    [360安全卫士]    8 3    download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
    00:00:00    07594220010824798    [哄抢救灾物资]    1 1    news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
    00:00:00    5228056822071097    [75810部队]    14 5    www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
    00:00:00    6140463203615646    [绳艺]    62 36    www.jd-cd.com/jd_opus/xx/200607/706.html
    00:00:00    8561366108033201    [汶川地震原因]    3 2    www.big38.net/
    00:00:00    23908140386148713    [莫衷一是的意思]    1 2    www.chinabaike.com/article/81/82/110/2007/2007020724490.html
    00:00:00    1797943298449139    [星梦缘全集在线观看]    8 5    www.6wei.net/dianshiju/????xa1xe9|????do=index
    00:00:00    00717725924582846    [闪字吧]    1 2    www.shanziba.com/

    I only need the search terms and ignore every other field. MR then computes the N most-searched terms (N is configurable).
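
    Before involving MapReduce, the intended computation can be sketched in plain Java on a handful of in-memory lines. This is only an illustration of the goal, not the author's code: the class LocalTopN, its methods, and the sample lines are made up, and it assumes tab-separated fields with the bracketed query in the third field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original project.
class LocalTopN {

    // Returns the query of one log line, or null if the line is malformed.
    // Assumes tab-separated fields with the query in the third field, wrapped in [ ].
    static String query(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3 || fields[2].length() < 3) {
            return null;
        }
        return fields[2].substring(1, fields[2].length() - 1);
    }

    // Counts queries and returns the n most frequent as "query<TAB>count" strings.
    static List<String> topN(List<String> lines, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            String q = query(line);
            if (q != null) {
                Integer old = counts.get(q);
                counts.put(q, old == null ? 1 : old + 1);
            }
        }
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the order is deterministic.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey() + "\t" + entries.get(i).getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample lines in the same field layout as the log.
        List<String> lines = new ArrayList<String>();
        lines.add("00:00:00\t1\t[foo]\t1 1\texample.com/a");
        lines.add("00:00:01\t2\t[bar]\t2 1\texample.com/b");
        lines.add("00:00:02\t3\t[foo]\t1 2\texample.com/a");
        System.out.println(topN(lines, 2));
    }
}
```

    Sorting everything in memory is fine for a small sample; the MR job exists because the full log does not fit this approach comfortably.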

    Overall project structure: (screenshot not preserved)


    First, a class that extracts the search term from a log line:

    SEA.java

    package org.admln.topK;

    /**
     * Holds the search term parsed from one line of the Sogou log.
     *
     * @author admln
     */
    public class SEA {

        private String seaWord;

        private boolean isValid;

        /**
         * Parses one tab-separated log line. The third field is the query
         * wrapped in square brackets, e.g. [360安全卫士].
         */
        public static SEA parser(String line) {
            SEA sea = new SEA();
            String[] fields = line.split("\t");
            // Reject lines with too few fields and empty queries such as "[]".
            if (fields.length < 3 || fields[2].length() < 3) {
                sea.setValid(false);
            } else {
                String str = fields[2];
                sea.setValid(true);
                // Strip the surrounding brackets.
                sea.setSeaWord(str.substring(1, str.length() - 1));
            }
            return sea;
        }

        public String getSeaWord() {
            return seaWord;
        }

        public void setSeaWord(String seaWord) {
            this.seaWord = seaWord;
        }

        public boolean isValid() {
            return isValid;
        }

        public void setValid(boolean isValid) {
            this.isValid = isValid;
        }

    }

    Then the MR job:

    package org.admln.topK;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map.Entry;
    import java.util.TreeMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /**
     * Counts search terms and emits the K most frequent ones.
     *
     * @author admln
     */
    public class TopK {

        /** Emits (search term, 1) for every valid log line. */
        public static class TopKMapper extends
                Mapper<Object, Text, Text, IntWritable> {
            private final Text word = new Text();
            private final IntWritable ONE = new IntWritable(1);

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                SEA sea = SEA.parser(value.toString());
                if (sea.isValid()) {
                    word.set(sea.getSeaWord());
                    context.write(word, ONE);
                }
            }
        }

        /** Sums the counts per term and keeps only the K largest in memory. */
        public static class TopKReducer extends
                Reducer<Text, IntWritable, Text, IntWritable> {
            private int max;   // K, read from the job configuration
            private int size;  // number of terms currently held in the tree
            // count -> terms with that count, largest count first. A list per
            // count keeps terms with equal counts from overwriting each other.
            private final TreeMap<Integer, List<String>> tree =
                    new TreeMap<Integer, List<String>>(Collections.reverseOrder());

            @Override
            protected void setup(Context context) {
                max = context.getConfiguration().getInt("topK", 10);
            }

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                List<String> words = tree.get(sum);
                if (words == null) {
                    words = new ArrayList<String>();
                    tree.put(sum, words);
                }
                words.add(key.toString());
                size++;
                if (size > max) {
                    // In reverse order the last key is the smallest count.
                    List<String> smallest = tree.get(tree.lastKey());
                    smallest.remove(smallest.size() - 1);
                    if (smallest.isEmpty()) {
                        tree.remove(tree.lastKey());
                    }
                    size--;
                }
            }

            // Runs once after the last reduce() call, so the tree is complete here.
            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                for (Entry<Integer, List<String>> entry : tree.entrySet()) {
                    for (String term : entry.getValue()) {
                        context.write(new Text(term), new IntWritable(entry.getKey()));
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Path input = new Path("hdfs://hadoop:8020/input/topK/");
            Path output = new Path("hdfs://hadoop:8020/output/topK/");

            Configuration conf = new Configuration();

            // K comes from the second command-line argument; defaults to 10 when absent.
            int k = args.length > 1 ? Integer.parseInt(args[1]) : 10;
            conf.setInt("topK", k);

            // Job.getInstance replaces the deprecated Job constructor.
            Job job = Job.getInstance(conf, "topK");

            job.setJarByClass(TopK.class);

            job.setMapperClass(TopKMapper.class);
            job.setReducerClass(TopKReducer.class);
            // A single reducer is required; with several, each would emit its own top K.
            job.setNumReduceTasks(1);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }

    }
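
    The reducer leans on a TreeMap kept in descending order and trimmed to K entries. The standalone sketch below isolates that behaviour; the class TreeMapTopK and its inputs are made up for illustration. It also shows a property worth keeping in mind: a map keyed by the count alone holds one word per count, so equal counts need a list per key or a composite key.

```java
import java.util.Collections;
import java.util.TreeMap;

// Hypothetical demo class; not part of the original project.
class TreeMapTopK {

    // Keeps at most k entries with the largest counts.
    static TreeMap<Integer, String> keepTop(int[] counts, String[] words, int k) {
        TreeMap<Integer, String> tree =
                new TreeMap<Integer, String>(Collections.reverseOrder());
        for (int i = 0; i < counts.length; i++) {
            tree.put(counts[i], words[i]);    // an equal count replaces the earlier word
            if (tree.size() > k) {
                tree.remove(tree.lastKey());  // last key = smallest count in reverse order
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        int[] counts = {5, 9, 2, 9, 7};
        String[] words = {"a", "b", "c", "d", "e"};
        // prints {9=d, 7=e, 5=a}: "b" was replaced by "d", and 2=c was trimmed
        System.out.println(keepTop(counts, words, 3));
    }
}
```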

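    Then upload the data. Note that the file must first be converted from GB2312 to UTF-8: Hadoop's Text type assumes UTF-8, so without the conversion the Chinese text in the results comes out garbled.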
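
    One way to do the conversion is a few lines of Java run before the upload. This is a minimal sketch, not the author's procedure: the class Gb2312ToUtf8 and its arguments are hypothetical, and it assumes the JDK in use ships the GB2312 charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical helper; not part of the original project.
class Gb2312ToUtf8 {

    // Reads the input file as GB2312 and rewrites it as UTF-8.
    static void convert(String in, String out) throws IOException {
        Charset from = Charset.forName("GB2312");
        Charset to = Charset.forName("UTF-8");
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(in), from));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(out), to))) {
            char[] buf = new char[8192];
            int n;
            // Copy decoded characters in chunks; the writer re-encodes them.
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] = GB2312 input file, args[1] = UTF-8 output file
        convert(args[0], args[1]);
    }
}
```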

    Debug locally, or upload the jar to Hadoop and run it.

    Environment: CentOS 6.4, Hadoop 2.2.0, JDK 1.7.

    Run results: (screenshot not preserved)


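    Key points:

      1. TreeMap: plain Java rather than Hadoop, but worth reviewing here;

      2. cleanup: know when this overridable API method runs (once per task, after the last reduce call).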
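
    The timing of cleanup can be seen in a simplified model of the reducer lifecycle. The sketch below is not Hadoop code: MiniReducer, MaxReducer, and demo are invented stand-ins, modelled on the order in which Hadoop's Reducer.run() calls setup, reduce, and cleanup. It shows why results buffered across reduce calls can only be written in cleanup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Hadoop's Reducer; all names here are illustrative only.
class LifecycleSketch {

    static abstract class MiniReducer {
        final List<String> output = new ArrayList<String>();

        void setup() { }

        abstract void reduce(String key, List<Integer> values);

        void cleanup() { }

        // setup once, reduce once per key, cleanup once at the end.
        void run(Map<String, List<Integer>> grouped) {
            setup();
            try {
                for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                    reduce(e.getKey(), e.getValue());
                }
            } finally {
                cleanup();
            }
        }
    }

    // Remembers the key with the largest sum and emits it only in cleanup(),
    // because the winner is unknown until every key has been seen.
    static class MaxReducer extends MiniReducer {
        private String bestKey;
        private int bestSum = -1;

        @Override
        void reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            if (sum > bestSum) {
                bestSum = sum;
                bestKey = key;
            }
        }

        @Override
        void cleanup() {
            output.add(bestKey + "\t" + bestSum);
        }
    }

    static List<String> demo() {
        Map<String, List<Integer>> grouped = new LinkedHashMap<String, List<Integer>>();
        grouped.put("a", Arrays.asList(1, 1));
        grouped.put("b", Arrays.asList(1, 1, 1));
        grouped.put("c", Arrays.asList(1));
        MaxReducer reducer = new MaxReducer();
        reducer.run(grouped);
        return reducer.output;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```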


     Source code: http://pan.baidu.com/s/1i3y0rwL


  • Original post: https://www.cnblogs.com/admln/p/InAction-MRtopK.html