An ad hoc requirement came up: PDF documents (text-based, not scanned) needed to be indexed.
schema.xml:
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="content" type="text_general" indexed="true" stored="true" required="true" />
<field name="size" type="slong" indexed="true" stored="true" required="true" />
<dynamicField name="ignored_*" type="ignored" multiValued="true" indexed="false" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
The part of solrconfig.xml that needs to be configured:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "content"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">content</str>
<str name="fmap.stream_size">size</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<!--<str name="fmap.a">links</str> -->
<!--<str name="fmap.div">ignored_div</str> -->
</lst>
</requestHandler>
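Parameter reference:
fmap.source=target : mapping rule; maps a field extracted from the PDF file (source) to a field in Solr (target)
uprefix : if set, any extracted field that is not defined in the schema gets this value prepended to its field name
defaultField : if uprefix is not set and a field cannot be found in the schema, the field name given by defaultField is used instead
captureAttr : (true|false) capture attributes, i.e. index the attributes of Tika XHTML elements
literal : custom metadata, i.e. assigns a value to a field defined in the schema (for example, literal.id=doc2 sets the id field)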
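
To make the interaction between lowernames, fmap, uprefix and defaultField easier to follow, here is a minimal Python sketch of the field-name resolution. It is a simplified model, not Solr's actual code: the order of steps (lowercase first, then fmap, then uprefix, and defaultField only when uprefix is unset) reflects my understanding of the handler, and the names `SCHEMA_FIELDS`, `FMAP`, `in_schema` and `resolve_field` are hypothetical helpers made up for this illustration.

```python
from fnmatch import fnmatchcase

# Fields from the schema.xml above; "ignored_*" is the dynamicField pattern.
SCHEMA_FIELDS = ["id", "content", "size", "ignored_*"]

# The fmap.* entries from the solrconfig.xml above.
FMAP = {"content": "content", "stream_size": "size"}


def in_schema(name):
    """Return True if the name matches a declared or dynamic field."""
    return any(fnmatchcase(name, pattern) for pattern in SCHEMA_FIELDS)


def resolve_field(name, fmap, lowernames=False, uprefix="", default_field=""):
    """Model how an extracted field name ends up as a Solr field name."""
    if lowernames:
        # Lowercase, and replace anything that is not a letter or digit with "_".
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    # Apply the fmap.source=target rule if one exists for this name.
    name = fmap.get(name, name)
    if not in_schema(name):
        if uprefix:
            # Unknown field: prepend the prefix (here it then matches ignored_*).
            name = uprefix + name
        elif default_field:
            # No prefix configured: fall back to the default field.
            name = default_field
    return name
```

In this model, a Tika metadata name such as `Content-Type` becomes `ignored_content_type` under the configuration above and is therefore dropped by the `ignored_*` dynamic field, while `stream_size` lands in `size`.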
Submitting a document for indexing:
curl "http://localhost:8983/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=ignored_undefined" -F "commit=true" -F "file=@t2.pdf"
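
If the upload has to be done from a script rather than from the shell, the same request can be assembled with the Python standard library. This is only a sketch under assumptions: `build_extract_request` is a hypothetical helper, and instead of the multipart form upload that `curl -F` performs, it sends the raw PDF bytes as the request body with a matching Content-Type and passes commit as a query parameter, which I assume the handler accepts in the same way.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, doc_id, pdf_bytes, commit=True):
    """Build (but do not send) a POST request to the /update/extract handler."""
    params = {
        "literal.id": doc_id,  # value for the schema's uniqueKey field
        "captureAttr": "true",
        "defaultField": "ignored_undefined",
        "commit": "true" if commit else "false",
    }
    url = base_url.rstrip("/") + "/update/extract?" + urlencode(params)
    # Assumption: the raw file is sent as the body, not as multipart form data.
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


# Usage (needs a running Solr instance and a local t2.pdf):
#   from urllib.request import urlopen
#   with open("t2.pdf", "rb") as fh:
#       req = build_extract_request("http://localhost:8983/solr", "doc2", fh.read())
#   with urlopen(req) as resp:
#       print(resp.status)
```

Building the request separately from sending it keeps the URL and parameters easy to inspect before anything is posted to the index.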
For details, see the reference documentation:
Note: Word documents are handled the same way as PDFs.