  • MapReduce Inverted Index

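    An inverted index is mainly used for full-text search: it maps each word to the documents that contain it, giving a word-to-URL structure. Suppose we have three input files: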
    file1:
    MapReduce is simple
    file2:
    MapReduce is powerful is simple
    file3:
    Hello MapReduce bye MapReduce
    After building the inverted index, the result is:
    Hello     file3.txt:1;
    MapReduce     file3.txt:2;file1.txt:1;file2.txt:1;
    bye     file3.txt:1;
    is     file1.txt:1;file2.txt:2;
    powerful     file2.txt:1;
    simple     file2.txt:1;file1.txt:1;
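    To make the target structure concrete, here is a minimal single-process sketch in plain Java (no Hadoop) that builds the same word to file:count listing from the three sample files. The class name `InvertedIndexSketch` and its method names are hypothetical, chosen only for illustration, and the `.txt` suffix on the file names is assumed from the listing above. Entries are kept in sorted maps so the output is deterministic, which means the order of files within a line can differ from the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class InvertedIndexSketch {

    // word -> (file name -> occurrences of the word in that file)
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            StringTokenizer st = new StringTokenizer(file.getValue());
            while (st.hasMoreTokens()) {
                index.computeIfAbsent(st.nextToken(), w -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Render the entries of one word as "file:count;file:count;"
    static String postings(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        counts.forEach((name, n) -> sb.append(name).append(':').append(n).append(';'));
        return sb.toString();
    }

    // The three sample files from this post (file names are assumed)
    static Map<String, String> sampleFiles() {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        return files;
    }

    public static void main(String[] args) {
        build(sampleFiles()).forEach((word, counts) ->
                System.out.println(word + "\t" + postings(counts)));
    }
}
```

    The nested map is the whole idea: the outer key is the word, and the inner map holds one count per file. The MapReduce job distributes the construction of exactly this structure.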
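    The approach is simple. The mapper walks through the contents of each file and emits key = word + filename with value = 1. The combiner then adds up the values that share a key, which gives the number of times a word occurs in one file, and moves the file name out of the key and into the value. Finally, the reducer joins the per-file entries for each word. There is a small bug here, though. A combiner only performs a local reduce over the output of a single map task on one node, yet I am relying on it to produce the count for a whole file. If a file is large, its contents may well be split and handled on different nodes, and then the same file shows up several times with partial counts. Hadoop also treats the combiner as an optional optimization and may run it zero or more times, so the job depends on it actually running. For these reasons this version is only suitable for small files.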
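    One possible way around the split problem is to treat the values arriving at the reduce step as partial counts and sum them per file there. The sketch below shows only that merge logic in plain Java, outside Hadoop. `PostingMerger` and `merge` are hypothetical names, and the code assumes every value has the form `filename:count`, as emitted by the combiner in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingMerger {

    // Sum partial "filename:count" values that belong to the same word.
    static String merge(Iterable<String> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String p : partials) {
            int sep = p.lastIndexOf(':');                 // the count follows the last ':'
            String file = p.substring(0, sep);
            int count = Integer.parseInt(p.substring(sep + 1));
            totals.merge(file, count, Integer::sum);      // add to any earlier partial count
        }
        StringBuilder sb = new StringBuilder();
        totals.forEach((file, n) -> sb.append(file).append(':').append(n).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two splits of the same large file each reported a partial count.
        List<String> values = Arrays.asList("big.txt:3", "small.txt:1", "big.txt:4");
        System.out.println(merge(values)); // prints big.txt:7;small.txt:1;
    }
}
```

    This only handles a file whose contents were processed by several map tasks. It still assumes the key has already been rewritten to the bare word, so it does not by itself cover the case where the combiner never runs.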
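    PS: to get the name of the file being processed, use String filename = ((FileSplit) context.getInputSplit()).getPath().getName(); Other than that, there does not seem to be anything else worth noting.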
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable ikey, Text ivalue, Context context)
                throws IOException, InterruptedException {
            // Name of the file that the current input split belongs to
            String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer st = new StringTokenizer(ivalue.toString());
            while (st.hasMoreTokens()) {
                // Emit "word:filename" -> "1" for every token on the line
                String key = st.nextToken() + ":" + filename;
                context.write(new Text(key), new Text("1"));
            }
        }
    }
     
     
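    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyCombiner extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text _key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Add up the "1"s emitted by the mapper for this "word:filename" key
            int sum = 0;
            for (Text val : values) {
                sum += Integer.parseInt(val.toString());
            }
            // Split at the last ':' so that a word containing ':' stays intact
            String composite = _key.toString();
            int sep = composite.lastIndexOf(':');
            String word = composite.substring(0, sep);
            String filename = composite.substring(sep + 1);
            // Re-key by the word alone; the value becomes "filename:count"
            context.write(new Text(word), new Text(filename + ":" + sum));
        }
    }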
     
     
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text _key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Join every "filename:count" entry for this word into one line
            StringBuilder filelist = new StringBuilder();
            for (Text val : values) {
                filelist.append(val.toString()).append(";");
            }
            context.write(_key, new Text(filelist.toString()));
        }
    }
  • Original article: https://www.cnblogs.com/sunrye/p/4543365.html