The new Nutch uses Solr as its back-end indexing service. Nutch is working toward a tighter, more convenient integration with Solr while keeping the two loosely coupled: Solr is treated as a standalone service, and Nutch only needs to call the Solr client to communicate with it.
1. bin/nutch solrindex
This command builds an index over the fetched content. Its usage message is as follows:
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)
From the usage we can see that the first argument, <solr url>, is the address of the Solr service; the second is the crawl database (crawldb); the third is the inverted-link database (linkdb); and the remaining arguments are segment directory names (or -dir followed by the parent segments directory).
The prerequisite for using this command is that you already have a corresponding Solr service running.
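For example, assuming Solr is running locally on its default port and the crawl data lives under crawl/ (both paths are assumptions, adjust them to your own setup), an invocation might look like:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb -dir crawl/segments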
2. What the SolrIndexer class does
The bin/nutch solrindex command ultimately calls SolrIndexer's main method, whose core is the indexSolr method. Let's see what this method does:
final JobConf job = new NutchJob(getConf());
job.setJobName("index-solr " + solrUrl);

// Initialize the job here, setting its Map and Reduce classes
IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

job.set(SolrConstants.SERVER_URL, solrUrl);

// Configure the class used by the OutputFormat
NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);

job.setReduceSpeculativeExecution(false);

final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
                          new Random().nextInt());
// Configure the (temporary) output path
FileOutputFormat.setOutputPath(job, tmp);
try {
  // Submit the job
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  SolrServer solr = new CommonsHttpSolrServer(solrUrl);
  solr.commit();
  long end = System.currentTimeMillis();
  LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
catch (Exception e){
  LOG.error(e);
} finally {
  FileSystem.get(job).delete(tmp, true);
}
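The snippet above uses a start timestamp and an sdf date formatter that are set up earlier in indexSolr, before the job is configured; presumably something along these lines (the date pattern and log wording here are assumptions, not copied from the Nutch source):

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
LOG.info("SolrIndexer: starting at " + sdf.format(start));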
Next, let's look at what IndexerMapReduce.initMRJob does:
public static void initMRJob(Path crawlDb, Path linkDb,
                             Collection<Path> segments,
                             JobConf job) {

  LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
  LOG.info("IndexerMapReduce: linkdb: " + linkDb);

  // Add the directories from each segment that need to be indexed
  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME)); // crawl_fetch
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME)); // crawl_parse
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));        // parse_data
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));        // parse_text
  }

  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME)); // crawldb/current
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));   // linkdb/current
  job.setInputFormat(SequenceFileInputFormat.class); // Set the input format; the files in all of the directories above are SequenceFiles

  // Set the Map and Reduce classes
  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  // Set the output types
  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class); // Value type of the map output; the key type is still the Text set above
  job.setOutputValueClass(NutchWritable.class);
}
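Putting those input paths together, the on-disk layout the job reads from looks roughly like this (the directory names come from the comments above; the "crawl/" prefix and the timestamped segment name are just placeholders for illustration):

crawl/
  crawldb/current/            <- CrawlDatum entries from the crawl database
  linkdb/current/             <- Inlinks entries from the link database
  segments/20110610XXXXXX/
    crawl_fetch/              <- CrawlDatum fetch status
    crawl_parse/              <- CrawlDatum entries produced during parsing
    parse_data/               <- ParseData (metadata, outlinks, ...)
    parse_text/               <- ParseText (extracted plain text)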
The map method in IndexerMapReduce simply reads each <key, value> pair, wraps the value in a NutchWritable, and emits it again; all the real work happens on the reduce side.
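A minimal sketch of that map method, assuming the old Hadoop mapred API used throughout this code (the exact parameter types here are an assumption, not copied from the Nutch source):

public void map(Text key, Writable value,
                OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  // simply wrap whatever value comes in (CrawlDatum, Inlinks, ParseData, ParseText, ...)
  output.collect(key, new NutchWritable(value));
}

Now let's look at what the reduce method in IndexerMapReduce does: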
public void reduce(Text key, Iterator<NutchWritable> values,
                   OutputCollector<Text, NutchDocument> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  ParseData parseData = null;
  ParseText parseText = null;
  // This block inspects the type of each value that shares the same key and, depending
  // on that type, assigns it to inlinks, dbDatum, fetchDatum, parseData or parseText
  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    if (value instanceof Inlinks) {
      inlinks = (Inlinks)value;
    } else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum)value;
      if (CrawlDatum.hasDbStatus(datum))
        dbDatum = datum;
      else if (CrawlDatum.hasFetchStatus(datum)) {
        // don't index unmodified (empty) pages
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
          fetchDatum = datum;
      } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                 CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                 CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      } else {
        throw new RuntimeException("Unexpected status: "+datum.getStatus());
      }
    } else if (value instanceof ParseData) {
      parseData = (ParseData)value;
    } else if (value instanceof ParseText) {
      parseText = (ParseText)value;
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: "+value.getClass());
    }
  }

  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return; // only have inlinks
  }

  if (!parseData.getStatus().isSuccess() ||
      fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
    return;
  }

  // Create an indexable document object. In Lucene, a Document is an abstract
  // document made up of Fields, and each Field in turn consists of Terms.
  NutchDocument doc = new NutchDocument();
  final Metadata metadata = parseData.getContentMeta();

  // add segment, used to map from merged index back to segment files
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

  // add digest, used by dedup
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

  final Parse parse = new ParseImpl(parseText, parseData);
  try {
    // extract information from dbDatum and pass it to
    // fetchDatum so that indexing filters can use it
    final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
    if (url != null) {
      fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
    }
    // run indexing filters
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  } catch (final IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
    return;
  }

  // skip documents discarded by indexing filters
  if (doc == null) return;

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum,
                                        fetchDatum, parse, inlinks, boost);
  } catch (final ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error calculating score " + key + ": " + e);
    }
    return;
  }
  // apply boost to all indexed fields.
  doc.setWeight(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  // Collect the output; the IndexerOutputFormat below writes it out to Solr
  output.collect(key, doc);
}
Next, let's look at how getRecordWriter is implemented in IndexerOutputFormat:
@Override
public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) throws IOException {

  // populate JobConf with field indexing options
  IndexingFilters filters = new IndexingFilters(job);

  // Several output destinations can be written to here
  final NutchIndexWriter[] writers =
    NutchIndexWriterFactory.getNutchIndexWriters(job);

  for (final NutchIndexWriter writer : writers) {
    writer.open(job, name);
  }
  // An anonymous inner class is returned as the RecordWriter,
  // used to write out the <key, value> pairs collected by reduce
  return new RecordWriter<Text, NutchDocument>() {

    public void close(Reporter reporter) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.close();
      }
    }

    public void write(Text key, NutchDocument doc) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.write(doc);
      }
    }
  };
}
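Judging purely from how the writers are used above and in SolrWriter below, the NutchIndexWriter interface looks roughly like this (a sketch inferred from the calls shown in this article, not copied from the Nutch source):

public interface NutchIndexWriter {
  void open(JobConf job, String name) throws IOException;
  void write(NutchDocument doc) throws IOException;
  void close() throws IOException;
}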
Several NutchIndexWriters can be plugged in here, but at the moment there is only one concrete subclass, SolrWriter. Let's analyze what its write method does:
public void write(NutchDocument doc) throws IOException {
  final SolrInputDocument inputDoc = new SolrInputDocument();
  // Build the SolrInputDocument for Solr
  for(final Entry<String, NutchField> e : doc) {
    for (final Object val : e.getValue().getValues()) {
      // normalise the string representation for a Date
      Object val2 = val;
      if (val instanceof Date){
        val2 = DateUtil.getThreadLocalDateFormat().format(val);
      }
      inputDoc.addField(solrMapping.mapKey(e.getKey()), val2, e.getValue().getWeight());
      String sCopy = solrMapping.mapCopyKey(e.getKey());
      if (sCopy != e.getKey()) {
        inputDoc.addField(sCopy, val);
      }
    }
  }
  inputDoc.setDocumentBoost(doc.getWeight());
  inputDocs.add(inputDoc); // add the document to the buffer
  if (inputDocs.size() >= commitSize) { // once the buffer reaches commitSize, push the buffered docs to the Solr server via the client's add method
    try {
      solr.add(inputDocs);
    } catch (final SolrServerException e) {
      throw makeIOException(e);
    }
    inputDocs.clear();
  }
}
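Note that write only flushes full batches; any documents still sitting in the buffer when the task finishes have to be sent from SolrWriter's close method (invoked via the RecordWriter's close above). That close method presumably does something like the following (a sketch, not copied from the Nutch source):

public void close() throws IOException {
  try {
    // flush whatever is left in the buffer before shutting down
    if (!inputDocs.isEmpty()) {
      solr.add(inputDocs);
      inputDocs.clear();
    }
  } catch (final SolrServerException e) {
    throw makeIOException(e);
  }
}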
3. Summary
This section has given an overview of how Nutch builds an index over its fetched content. It again uses a MapReduce job: on the reduce side the fields to be indexed are assembled into a NutchDocument object, which is then written out to the Solr server by SolrWriter. SolrWriter wraps the Solr client object and converts the Nutch document into a Solr document, because NutchDocument is a Writable type and therefore must be serializable, whereas SolrInputDocument is not.
Author: http://blog.csdn.net/amuseme_lu