MapReduce Inverted Index

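An inverted index is mainly used for full-text search: it builds a word -> URL structure that maps each word to the documents containing it. For example, given three files:
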
file1:

MapReduce is simple

file2:

MapReduce is powerful is simple

file3:

Hello MapReduce bye MapReduce

After inverted indexing, the result is:

Hello     file3.txt:1;
MapReduce     file3.txt:2;file1.txt:1;file2.txt:1;
bye     file3.txt:1;
is     file1.txt:1;file2.txt:2;
powerful     file2.txt:1;
simple     file2.txt:1;file1.txt:1;
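
To make the target structure concrete before moving to MapReduce, here is a minimal single-machine sketch in plain Java that builds the same word -> filename:count index in memory. The class name InvertedIndexSketch and its build and format helpers are made up for illustration and are not part of the original job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class InvertedIndexSketch {

    /** Builds word -> (file name -> occurrence count) from file name -> content. */
    static Map<String, Map<String, Integer>> build(Map<String, String> files) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            // Split on whitespace, like the mapper in the post does.
            StringTokenizer st = new StringTokenizer(file.getValue());
            while (st.hasMoreTokens()) {
                index.computeIfAbsent(st.nextToken(), w -> new TreeMap<>())
                     .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    /** Renders one posting list as "file:count;file:count;". */
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> p : postings.entrySet()) {
            sb.append(p.getKey()).append(':').append(p.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple");
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");

        // One line per word, e.g. "is" followed by a tab and "file1.txt:1;file2.txt:2;".
        build(files).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

Because this sketch keeps postings in a TreeMap, the files within each line come out sorted by name. The MapReduce job makes no such promise, which is why the listing above shows them in a different order.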

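The approach: the mapper walks through each file's contents and emits a key of the form word:filename with a value of 1. The combiner then sums the values that share a key, which gives the number of times the word occurs in that file, and re-emits the record with the word alone as the key and filename:count as the value. Finally, the reducer concatenates the filename:count entries for each word.

There is a small bug here, though. A combiner is normally just a local reduce over the map output on a single node, but I am relying on it to produce the count for a whole file. If a file is large, its contents may well be divided into splits that are processed on different nodes, and then the same file appears more than once for a word, each time with only a partial count. Hadoop also treats the combiner as an optional optimization and does not guarantee how many times it runs. So this version is only suitable for small files.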
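
The split problem can be reproduced without a cluster. The following is a plain-Java simulation, not Hadoop code: the hypothetical mapAndCombine stands in for the mapper plus combiner running over one input split, and reduce stands in for the reducer's concatenation. It assumes, for the sake of the example, that file2.txt is cut into two splits between "powerful" and "is".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class CombinerSplitDemo {

    /** Map + combine for ONE input split: returns word -> "filename:count". */
    static Map<String, String> mapAndCombine(String filename, String splitText) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(splitText);
        while (st.hasMoreTokens()) {
            counts.merge(st.nextToken(), 1, Integer::sum);
        }
        Map<String, String> out = new TreeMap<>();
        counts.forEach((word, n) -> out.put(word, filename + ":" + n));
        return out;
    }

    /** Reduce step: concatenates whatever values arrive for each word. */
    static Map<String, String> reduce(List<Map<String, String>> combinedSplits) {
        Map<String, String> result = new TreeMap<>();
        for (Map<String, String> split : combinedSplits) {
            split.forEach((word, posting) ->
                    result.merge(word, posting + ";", String::concat));
        }
        return result;
    }

    public static void main(String[] args) {
        // Case 1: the whole file is a single split.
        List<Map<String, String>> whole = new ArrayList<>();
        whole.add(mapAndCombine("file2.txt", "MapReduce is powerful is simple"));
        System.out.println(reduce(whole).get("is"));   // file2.txt:2;

        // Case 2: the same file cut into two splits, combined separately.
        List<Map<String, String>> halves = new ArrayList<>();
        halves.add(mapAndCombine("file2.txt", "MapReduce is powerful"));
        halves.add(mapAndCombine("file2.txt", "is simple"));
        System.out.println(reduce(halves).get("is"));  // file2.txt:1;file2.txt:1;
    }
}
```

In the second case the word "is" ends up with two partial entries for file2.txt instead of one entry with a count of 2, which is exactly the failure described above.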

PS: to get the name of the file being processed, use String filename = ((FileSplit) context.getInputSplit()).getPath().getName(); There does not seem to be much else to note.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        // Name of the file that the current input split belongs to.
        String filename = ((FileSplit) context.getInputSplit()).getPath().getName();

        StringTokenizer st = new StringTokenizer(ivalue.toString());
        while (st.hasMoreTokens()) {
            // Composite key "word:filename", so counts are kept per file.
            String key = st.nextToken() + ":" + filename;
            context.write(new Text(key), new Text("1"));
        }
    }
}

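import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyCombiner extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Every incoming value is the "1" emitted by the mapper, so counting
        // the values gives the occurrences of this word in this file.
        int sum = 0;
        for (Text val : values) {
            sum++;
        }

        // Split "word:filename" at the last ':' so a word containing ':' survives.
        String composite = _key.toString();
        int sep = composite.lastIndexOf(':');
        String word = composite.substring(0, sep);
        String filename = composite.substring(sep + 1);

        // Re-key by the word alone; the value becomes "filename:count".
        context.write(new Text(word), new Text(filename + ":" + sum));
    }
}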

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate the "filename:count" entries for this word.
        StringBuilder filelist = new StringBuilder();
        for (Text val : values) {
            filelist.append(val.toString()).append(';');
        }
        context.write(_key, new Text(filelist.toString()));
    }
}
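
One possible way to make the reduce step tolerate partial counts is to sum the counts per file name instead of concatenating the values blindly. This is only a sketch in plain Java: PostingMerger and its merge method are hypothetical names, not part of the original job, and it assumes every value arrives in the form filename:count. Inside MyReducer the same logic would be fed val.toString() for each Text value.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class PostingMerger {

    /** Merges values of the form "filename:count", summing the counts per file name. */
    static String merge(Iterable<String> values) {
        Map<String, Integer> perFile = new TreeMap<>();
        for (String value : values) {
            // Split at the last ':' so only the trailing count is parsed as a number.
            int sep = value.lastIndexOf(':');
            String filename = value.substring(0, sep);
            int count = Integer.parseInt(value.substring(sep + 1));
            perFile.merge(filename, count, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        perFile.forEach((file, count) ->
                sb.append(file).append(':').append(count).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two partial counts for file2.txt, as produced when the file is split.
        System.out.println(merge(Arrays.asList("file2.txt:1", "file1.txt:1", "file2.txt:1")));
        // file1.txt:1;file2.txt:2;
    }
}
```

This only repairs the case where the combiner has already run on every split. It does not cover records on which the combiner never ran, since those still carry the word:filename key.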

Original article: https://www.cnblogs.com/sunrye/p/4543365.html