The new Nutch uses Solr as its back-end indexing service. Nutch is working toward a tighter, more convenient integration with Solr while keeping the two loosely coupled: Solr is treated as a standalone service, and Nutch only needs to call the Solr client to communicate with it.
1. bin/nutch solrindex
This command builds an index over the fetched content. Its usage message is as follows:
Usage: SolrIndexer <solr url> <crawldb> <linkdb> (<segment> ... | -dir <segments>)
From the usage we can see that the first argument, <solr url>, is the address of the Solr service; the second is the crawl database (crawldb); the third is the inverted-link database (linkdb); and the remaining arguments are segment directory names (or -dir followed by the parent segments directory).
The prerequisite for using this command is that you already have a corresponding Solr service running.
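For example, assuming Solr is running locally on its default port and the crawl data lives under crawl/ (both paths are assumptions, adjust them to your own setup), an invocation might look like:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb -dir crawl/segments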
2. What the SolrIndexer class does
The bin/nutch solrindex command ultimately calls SolrIndexer's main method, whose core is the indexSolr method. Let's see what this method does:
final JobConf job = new NutchJob(getConf());
job.setJobName("index-solr " + solrUrl);

// Initialize the job here, setting its Map and Reduce classes
IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

job.set(SolrConstants.SERVER_URL, solrUrl);

// Configure the class used by the OutputFormat
NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);

job.setReduceSpeculativeExecution(false);

final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
                          new Random().nextInt());
// Configure the (temporary) output path
FileOutputFormat.setOutputPath(job, tmp);
try {
  // Submit the job
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  SolrServer solr = new CommonsHttpSolrServer(solrUrl);
  solr.commit();
  long end = System.currentTimeMillis();
  LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
catch (Exception e){
  LOG.error(e);
} finally {
  FileSystem.get(job).delete(tmp, true);
}
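The snippet above uses a start timestamp and an sdf date formatter that are set up earlier in indexSolr, before the job is configured; presumably something along these lines (the date pattern and log wording here are assumptions, not copied from the Nutch source):

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
LOG.info("SolrIndexer: starting at " + sdf.format(start));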
Next, let's look at what IndexerMapReduce.initMRJob does:
public static void initMRJob(Path crawlDb, Path linkDb,
                             Collection<Path> segments,
                             JobConf job) {

  LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
  LOG.info("IndexerMapReduce: linkdb: " + linkDb);

  // Add the directories from each segment that need to be indexed
  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME)); // crawl_fetch
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME)); // crawl_parse
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));        // parse_data
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));        // parse_text
  }

  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME)); // crawldb/current
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));   // linkdb/current
  job.setInputFormat(SequenceFileInputFormat.class); // Set the input format; the files in all of the directories above are SequenceFiles

  // Set the Map and Reduce classes
  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  // Set the output types
  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class); // Value type of the map output; the key type is still the Text set above
  job.setOutputValueClass(NutchWritable.class);
}
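Putting those input paths together, the on-disk layout the job reads from looks roughly like this (the directory names come from the comments above; the "crawl/" prefix and the timestamped segment name are just placeholders for illustration):

crawl/
  crawldb/current/            <- CrawlDatum entries from the crawl database
  linkdb/current/             <- Inlinks entries from the link database
  segments/20110610XXXXXX/
    crawl_fetch/              <- CrawlDatum fetch status
    crawl_parse/              <- CrawlDatum entries produced during parsing
    parse_data/               <- ParseData (metadata, outlinks, ...)
    parse_text/               <- ParseText (extracted plain text)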
The map method in IndexerMapReduce simply reads each <key, value> pair, wraps the value in a NutchWritable, and emits it again; all the real work happens on the reduce side.
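A minimal sketch of that map method, assuming the old Hadoop mapred API used throughout this code (the exact parameter types here are an assumption, not copied from the Nutch source):

public void map(Text key, Writable value,
                OutputCollector<Text, NutchWritable> output, Reporter reporter)
    throws IOException {
  // simply wrap whatever value comes in (CrawlDatum, Inlinks, ParseData, ParseText, ...)
  output.collect(key, new NutchWritable(value));
}

Now let's look at what the reduce method in IndexerMapReduce does: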
public void reduce(Text key, Iterator<NutchWritable> values,
                   OutputCollector<Text, NutchDocument> output, Reporter reporter)
    throws IOException {
  Inlinks inlinks = null;
  CrawlDatum dbDatum = null;
  CrawlDatum fetchDatum = null;
  ParseData parseData = null;
  ParseText parseText = null;
  // This block inspects the type of each value that shares the same key and, depending
  // on that type, assigns it to inlinks, dbDatum, fetchDatum, parseData or parseText
  while (values.hasNext()) {
    final Writable value = values.next().get(); // unwrap
    if (value instanceof Inlinks) {
      inlinks = (Inlinks)value;
    } else if (value instanceof CrawlDatum) {
      final CrawlDatum datum = (CrawlDatum)value;
      if (CrawlDatum.hasDbStatus(datum))
        dbDatum = datum;
      else if (CrawlDatum.hasFetchStatus(datum)) {
        // don't index unmodified (empty) pages
        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
          fetchDatum = datum;
      } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                 CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                 CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {
        continue;
      } else {
        throw new RuntimeException("Unexpected status: "+datum.getStatus());
      }
    } else if (value instanceof ParseData) {
      parseData = (ParseData)value;
    } else if (value instanceof ParseText) {
      parseText = (ParseText)value;
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: "+value.getClass());
    }
  }

  if (fetchDatum == null || dbDatum == null
      || parseText == null || parseData == null) {
    return; // only have inlinks
  }

  if (!parseData.getStatus().isSuccess() ||
      fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
    return;
  }

  // Create an indexable document object. In Lucene, a Document is an abstract
  // document made up of Fields, and each Field in turn consists of Terms.
  NutchDocument doc = new NutchDocument();
  final Metadata metadata = parseData.getContentMeta();

  // add segment, used to map from merged index back to segment files
  doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));

  // add digest, used by dedup
  doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));

  final Parse parse = new ParseImpl(parseText, parseData);
  try {
    // extract information from dbDatum and pass it to
    // fetchDatum so that indexing filters can use it
    final Text url = (Text) dbDatum.getMetaData().get(Nutch.WRITABLE_REPR_URL_KEY);
    if (url != null) {
      fetchDatum.getMetaData().put(Nutch.WRITABLE_REPR_URL_KEY, url);
    }
    // run indexing filters
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
  } catch (final IndexingException e) {
    if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
    return;
  }

  // skip documents discarded by indexing filters
  if (doc == null) return;

  float boost = 1.0f;
  // run scoring filters
  try {
    boost = this.scfilters.indexerScore(key, doc, dbDatum,
                                        fetchDatum, parse, inlinks, boost);
  } catch (final ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error calculating score " + key + ": " + e);
    }
    return;
  }
  // apply boost to all indexed fields.
  doc.setWeight(boost);
  // store boost for use by explain and dedup
  doc.add("boost", Float.toString(boost));

  // Collect the output; the IndexerOutputFormat below writes it out to Solr
  output.collect(key, doc);
}
Next, let's look at how getRecordWriter is implemented in IndexerOutputFormat:
@Override
public RecordWriter<Text, NutchDocument> getRecordWriter(FileSystem ignored,
    JobConf job, String name, Progressable progress) throws IOException {

  // populate JobConf with field indexing options
  IndexingFilters filters = new IndexingFilters(job);

  // Several output destinations can be written to here
  final NutchIndexWriter[] writers =
    NutchIndexWriterFactory.getNutchIndexWriters(job);

  for (final NutchIndexWriter writer : writers) {
    writer.open(job, name);
  }
  // An anonymous inner class is returned as the RecordWriter,
  // used to write out the <key, value> pairs collected by reduce
  return new RecordWriter<Text, NutchDocument>() {

    public void close(Reporter reporter) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.close();
      }
    }

    public void write(Text key, NutchDocument doc) throws IOException {
      for (final NutchIndexWriter writer : writers) {
        writer.write(doc);
      }
    }
  };
}
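Judging purely from how the writers are used above and in SolrWriter below, the NutchIndexWriter interface looks roughly like this (a sketch inferred from the calls shown in this article, not copied from the Nutch source):

public interface NutchIndexWriter {
  void open(JobConf job, String name) throws IOException;
  void write(NutchDocument doc) throws IOException;
  void close() throws IOException;
}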
Several NutchIndexWriters can be plugged in here, but at the moment there is only one concrete subclass, SolrWriter. Let's analyze what its write method does:
public void write(NutchDocument doc) throws IOException {
  final SolrInputDocument inputDoc = new SolrInputDocument();
  // Build the SolrInputDocument for Solr
  for(final Entry<String, NutchField> e : doc) {
    for (final Object val : e.getValue().getValues()) {
      // normalise the string representation for a Date
      Object val2 = val;
      if (val instanceof Date){
        val2 = DateUtil.getThreadLocalDateFormat().format(val);
      }
      inputDoc.addField(solrMapping.mapKey(e.getKey()), val2, e.getValue().getWeight());
      String sCopy = solrMapping.mapCopyKey(e.getKey());
      if (sCopy != e.getKey()) {
        inputDoc.addField(sCopy, val);
      }
    }
  }
  inputDoc.setDocumentBoost(doc.getWeight());
  inputDocs.add(inputDoc); // add the document to the buffer
  if (inputDocs.size() >= commitSize) { // once the buffer reaches commitSize, push the buffered docs to the Solr server via the client's add method
    try {
      solr.add(inputDocs);
    } catch (final SolrServerException e) {
      throw makeIOException(e);
    }
    inputDocs.clear();
  }
}
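Note that write only flushes full batches; any documents still sitting in the buffer when the task finishes have to be sent from SolrWriter's close method (invoked via the RecordWriter's close above). That close method presumably does something like the following (a sketch, not copied from the Nutch source):

public void close() throws IOException {
  try {
    // flush whatever is left in the buffer before shutting down
    if (!inputDocs.isEmpty()) {
      solr.add(inputDocs);
      inputDocs.clear();
    }
  } catch (final SolrServerException e) {
    throw makeIOException(e);
  }
}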
3. Summary
This section has given an overview of how Nutch builds an index over its fetched content. It again uses a MapReduce job: on the reduce side the fields to be indexed are assembled into a NutchDocument object, which is then written out to the Solr server by SolrWriter. SolrWriter wraps the Solr client object and converts the Nutch document into a Solr document, because NutchDocument is a Writable type and therefore must be serializable, whereas SolrInputDocument is not.
Author: http://blog.csdn.net/amuseme_lu