  • MapReduce: A Simple Full-Text Search, Part 2

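    The previous full-text search implemented fuzzy lookup; this one implements exact lookup. For example, if you search for "mapreduce is simple", it only finds documents that contain that sentence, rather than documents that merely contain those three words.
    This version needs a rewritten inverted index. Because we are searching for sentences, we have to locate each word's position within the document, so the inverted index must also record where each word occurs. The output we want has this format:
    MapReduce file1.txt:<1,2,3>;file2.txt:<5,3,1>;
    This step is fairly simple. In the map phase we emit a <key,value> pair of the form
    "filename:word" position
    "file1.txt:MapReduce" 1
    The local combiner then turns this into:
    "filename" "word:<position>"
    "file1.txt" "MapReduce:<1,2,3>"
    Finally, the reducer gathers all the words of the same file and outputs
    "filename" "word1:<position>;word2:<position>...."
    "file1.txt" "MapReduce:<1,2,3>;simple:<5,6,7>"
    PS: Because the input is read from the file one line at a time, the position here is only the position within each line, not the word's position in the whole document. If a sentence spans two lines, this program cannot recognize it. It looks like the Input would have to be rewritten; I will change that in the next version.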
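    To make the target format concrete, here is a minimal plain-Java sketch (no Hadoop) that produces word-keyed lines like the desired output shown first, e.g. MapReduce file1.txt:<1,2,3>;. The class name PositionalIndexSketch, the whitespace tokenization, the tab between word and postings, and the 1-based positions are all assumptions made for illustration; this is not the original indexing job. Like the job described above, it numbers words within one piece of text only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
class PositionalIndexSketch {
    // word -> (filename -> positions of the word in that file's text)
    private final Map<String, Map<String, List<Integer>>> index = new TreeMap<>();

    // Record every word of `text` together with its 1-based position.
    void add(String filename, String text) {
        String[] words = text.trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) {
                continue; // happens only when the text is empty
            }
            index.computeIfAbsent(words[pos], w -> new LinkedHashMap<>())
                 .computeIfAbsent(filename, f -> new ArrayList<>())
                 .add(pos + 1);
        }
    }

    // Render one line per word: "word<TAB>file:<p1,p2>;file:<p1>;"
    List<String> lines() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<Integer>>> e : index.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append('\t');
            for (Map.Entry<String, List<Integer>> f : e.getValue().entrySet()) {
                sb.append(f.getKey()).append(":<");
                List<Integer> ps = f.getValue();
                for (int i = 0; i < ps.size(); i++) {
                    if (i > 0) {
                        sb.append(',');
                    }
                    sb.append(ps.get(i));
                }
                sb.append(">;");
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        PositionalIndexSketch idx = new PositionalIndexSketch();
        idx.add("file1.txt", "MapReduce is simple MapReduce");
        idx.add("file2.txt", "simple MapReduce");
        for (String line : idx.lines()) {
            System.out.println(line);
        }
    }
}
```

    In a real job the same grouping would be spread over the map, combine and reduce steps; the sketch only shows the shape of the data.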
     
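    Next comes the search itself, which works from the index.
    The rough idea is as follows.
    In the map phase, the sentence being searched for, e.g. "MapReduce is simple", is used to filter the words in the inverted index, so that after the map only the words that occur in the query sentence remain. The output is:
    "filename" "word<position>"
    "file1.txt" "MapReduce<1,2,3>"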
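    The reduce phase then puts all the words of the same file together. Because the words arrive at the reducer in no particular order, I added a class here to help recognize the sentence: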
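    class Address implements Comparable<Address> {
        public String word;  // the word itself
        public int index;    // the word's position in the text

        Address(String word, int index) {
            this.word = word;
            this.index = index;
        }

        @Override
        public String toString() {
            return word + " " + index;
        }

        // Order by position. Integer.compare returns 0 for equal indexes,
        // as the Comparable contract requires.
        @Override
        public int compareTo(Address a) {
            return Integer.compare(index, a.index);
        }
    }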
     
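    The word field holds the word and index holds its position. By splitting the values under the same file into Address objects and sorting them by index, we get something like
    M 1
    M 2
    M 3
    i 4
    s 5
    i 6
    M 7
    (M stands for MapReduce, i for is, and s for simple)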
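    As a small illustration of this step, the sketch below expands values of the form word<p1,p2,...> into (word, position) entries and sorts them by position. Entry is a hypothetical stand-in for the Address class, and the input values are made up to reproduce the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, for illustration only.
class SortedPositionsSketch {
    // One (word, position) entry; plays the role of Address.
    static final class Entry {
        final String word;
        final int index;

        Entry(String word, int index) {
            this.word = word;
            this.index = index;
        }

        @Override
        public String toString() {
            return word + " " + index;
        }
    }

    // Expand values such as "is<4,6>" and sort all entries by position.
    static List<Entry> expand(List<String> values) {
        List<Entry> list = new ArrayList<>();
        for (String v : values) {
            // "is<4,6>" splits into ["is", "4", "6"]
            String[] parts = v.split("<|>|,");
            for (int j = 1; j < parts.length; j++) {
                list.add(new Entry(parts[0], Integer.parseInt(parts[j].trim())));
            }
        }
        list.sort(Comparator.comparingInt((Entry e) -> e.index));
        return list;
    }

    public static void main(String[] args) {
        // The values arrive in arbitrary order, as they would in a reducer.
        List<String> values = Arrays.asList("is<4,6>", "MapReduce<1,2,3,7>", "simple<5>");
        for (Entry e : expand(values)) {
            System.out.println(e);
        }
    }
}
```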
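    So how do we recognize the sentence here? First, the indexes must be adjacent, and the adjacent words must appear in the order M i s. To check the order of adjacent words, I create a list that holds the input arguments, that is, the sentence I want to find: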
    ArrayList<String> sentence = new ArrayList<String>();
    // The query words are read from the configuration keys "args2", "args3", ...
    for (i = 2; i < wordnum + 2; i++) {
        String arg = conf.get("args" + i);
        sentence.add(arg);
    }
     
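    Next we set up two cursors: one points at the previous (word, position) entry and one at the current entry. If the previous word and the current word are adjacent in sentence, and their two indexes are adjacent as well, then n++. Both cursors then move one step forward and the check is repeated, until n equals the number of words in the sentence, which means a complete sentence has been matched. Then n is reset to 1 and the scan continues until the whole list has been traversed.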
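    The scan just described can be tried out on its own with a sketch like the following. It uses two parallel arrays instead of Address objects; the class and method names are hypothetical, and the input is assumed to be sorted by position already.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class PhraseMatchSketch {
    // words[i] occurs at positions[i]; both arrays are sorted by position.
    static int countMatches(String[] words, int[] positions, List<String> sentence) {
        int sum = 0; // complete sentences found so far
        int n = 0;   // length of the current run of correctly ordered words
        for (int i = 0; i < words.length; i++) {
            // The run continues only if the current word follows the previous
            // one both in the query sentence and in the text.
            boolean continues = i > 0
                    && sentence.indexOf(words[i]) - sentence.indexOf(words[i - 1]) == 1
                    && positions[i] - positions[i - 1] == 1;
            n = continues ? n + 1 : 1;
            if (n == sentence.size()) {
                sum++;
                n = 1;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("MapReduce", "is", "simple");
        String[] words = {"MapReduce", "MapReduce", "MapReduce", "is", "simple", "is", "MapReduce"};
        int[] positions = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(countMatches(words, positions, sentence)); // prints 1
    }
}
```

    Note that indexOf returns the first occurrence of a word, so this approach does not handle a query sentence in which the same word appears twice.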
    The full code:
     
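    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable ikey, Text ivalue, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // The query words are stored under "args2" .. "args<argsnum-1>".
            ArrayList<String> contents = new ArrayList<String>();
            int argsnum = Integer.parseInt(conf.get("argsnum"));
            for (int i = 2; i < argsnum; i++) {
                contents.add(conf.get("args" + i));
            }

            // Each index line looks like "word<TAB>file1:<1,2,3>;file2:<5,3,1>;".
            // Splitting on a tab assumes the index was written with the default
            // key/value separator of TextOutputFormat.
            String[] parts = ivalue.toString().split("\t", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            String key = parts[0];
            String value = parts[1];

            // Only words that occur in the query sentence are passed on.
            if (!contents.contains(key)) {
                return;
            }
            StringTokenizer st = new StringTokenizer(value, ";");
            while (st.hasMoreTokens()) {
                String[] entry = st.nextToken().split(":"); // "file1.txt:<1,2,3>"
                if (entry.length < 2) {
                    continue;
                }
                // Emit filename -> "word<1,2,3>".
                context.write(new Text(entry[0]), new Text(key + entry[1]));
            }
        }

    }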
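    The parsing done in the mapper can also be tried without a cluster. The following Hadoop-free sketch applies the same filtering to a single index line; the class name IndexFilterSketch and the tab between the word and its postings are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class IndexFilterSketch {
    // Turn one index line into "filename<TAB>word<positions>" records.
    // Returns an empty list if the word is not part of the query.
    static List<String> filter(String indexLine, List<String> query) {
        List<String> out = new ArrayList<>();
        String[] parts = indexLine.split("\t", 2);
        if (parts.length < 2 || !query.contains(parts[0])) {
            return out;
        }
        for (String posting : parts[1].split(";")) {
            String[] p = posting.split(":", 2); // "file1.txt:<1,2,3>"
            if (p.length == 2) {
                out.add(p[0] + "\t" + parts[0] + p[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("MapReduce", "is", "simple");
        // A word from the query: one record per file.
        System.out.println(filter("MapReduce\tfile1.txt:<1,2,3>;file2.txt:<5,3,1>;", query));
        // A word outside the query: nothing is emitted.
        System.out.println(filter("hadoop\tfile1.txt:<9>;", query)); // prints []
    }
}
```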
     
     
     
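    class Address implements Comparable<Address> {
        public String word;  // the word itself
        public int index;    // the word's position in the text

        Address(String word, int index) {
            this.word = word;
            this.index = index;
        }

        @Override
        public String toString() {
            return word + " " + index;
        }

        // Order by position. Integer.compare returns 0 for equal indexes,
        // as the Comparable contract requires.
        @Override
        public int compareTo(Address a) {
            return Integer.compare(index, a.index);
        }
    }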
     
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MyReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text _key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // The first two arguments are not query words, hence the "- 2".
            int wordnum = Integer.parseInt(conf.get("argsnum")) - 2;
            ArrayList<String> sentence = new ArrayList<String>();
            for (int i = 2; i < wordnum + 2; i++) {
                sentence.add(conf.get("args" + i));
            }

            // Expand every "word<1,2,3>" value into one Address per position.
            ArrayList<Address> list = new ArrayList<Address>();
            for (Text val : values) {
                String[] line = val.toString().split("<|>|,");
                for (int j = 1; j < line.length; j++) {
                    list.add(new Address(line[0], Integer.parseInt(line[j].trim())));
                }
            }
            Collections.sort(list);

            int sum = 0;         // number of complete sentences found in this file
            int n = 0;           // length of the current run of matching words
            Address prev = null; // cursor on the previous entry
            for (Address now : list) {
                // The run continues only if the current word follows the previous
                // one both in the query sentence and in the text.
                if (prev != null
                        && sentence.indexOf(now.word) - sentence.indexOf(prev.word) == 1
                        && now.index - prev.index == 1) {
                    n++;
                } else {
                    n = 1;
                }
                prev = now;

                if (n == wordnum) {
                    System.out.println("match is " + now);
                    sum++;
                    n = 1;
                }
            }

            // Emit the file only if it contains the sentence at least once.
            if (sum > 0) {
                context.write(_key, new Text(String.valueOf(sum)));
            }
        }

    }
     
     
     
  • Original article: https://www.cnblogs.com/sunrye/p/4543370.html