1.
<dataConfig>
  <dataSource type="BinFileDataSource" />
  <script><![CDATA[
    function setIdType(row) {
      row.put('id', 'file::' + row.get('fileAbsolutePath'));
      row.put('type', 'file');
      return row;
    }
  ]]></script>
  <document>
    <entity name="tika-test" processor="TikaEntityProcessor"
            url="C:UsersAdministratorDesktop测试素材URL URI.pdf"
            format="text" transformer="script:setIdType">
      <field name="file_author" column="Author" meta="true" />
      <field name="file_title" column="title" meta="true" />
      <field name="file_text" column="text" />
    </entity>
  </document>
</dataConfig>
2.
<dataConfig>
  <script><![CDATA[
    id = 1;
    function GenerateId(row) {
      row.put('id', (id ++).toFixed());
      return row;
    }
  ]]></script>
  <dataSource type="BinURLDataSource" name="data" />
  <dataSource type="URLDataSource" baseUrl="http://localhost/tmp/bin/" name="main" />
  <document>
    <entity name="rec" processor="XPathEntityProcessor" url="data.xml"
            forEach="/albums/album" dataSource="main" transformer="script:GenerateId">
      <field column="title" xpath="//title" />
      <field column="description" xpath="//description" />
      <entity processor="TikaEntityProcessor"
              url="http://localhost/tmp/bin/${rec.description}" dataSource="data">
        <field column="text" name="content" />
        <field column="Author" name="author" meta="true" />
        <field column="title" name="title" meta="true" />
      </entity>
    </entity>
  </document>
</dataConfig>
3.
Configuring a Clob field in Solr
<document name="bulletin">
  <entity name="item" pk="uuid" transformer="ClobTransformer" query="select * from no_bulletin">
    <field column="UUID" name="id" />
    <field column="CONTENT" name="content" clob="true" />
  </entity>
</document>
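Note: the transformer="ClobTransformer" attribute on the entity and clob="true" on the field are required to configure a Clob field. The column name CONTENT must be uppercase; otherwise ClobTransformer will not be applied to it. (Replace the SQL statement in query with your own.)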
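The fragment above leaves out the enclosing dataConfig element and the JDBC data source. A minimal sketch of a complete file might look like the following. The data source name db and the HOST, SID, USER, and PASSWORD placeholders are assumptions for illustration, not values from the original, and the Oracle driver is assumed only because the Blob example uses one.

```xml
<dataConfig>
  <!-- Assumed JDBC data source: "db" is a hypothetical name; replace the
       HOST, SID, USER, and PASSWORD placeholders with your own values. -->
  <dataSource name="db" type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@HOST:1521:SID"
              user="USER" password="PASSWORD" />
  <document name="bulletin">
    <!-- ClobTransformer only touches fields marked clob="true" -->
    <entity name="item" pk="uuid" dataSource="db"
            transformer="ClobTransformer"
            query="select * from no_bulletin">
      <field column="UUID" name="id" />
      <!-- column is written in uppercase, as the note requires -->
      <field column="CONTENT" name="content" clob="true" />
    </entity>
  </document>
</dataConfig>
```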
Configuring a Blob field in Solr
<dataSource name="f1" type="FieldStreamDataSource" />
<dataSource name="orcle" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@192.168.196.253:1521:orcl" user="sample_bus" password="sample_bus" />
<document>
  <entity dataSource="orcle" name="attach" query="select att_id,content from no_bul_attcontent where att_id='645cf16b40d4472ca649084c6aa099fe'">
    <field column="ATT_ID" name="id" />
    <entity dataSource="f1" processor="TikaEntityProcessor" url="content" dataField="attach.CONTENT">
      <field column="text" name="docContent" />
    </entity>
  </entity>
</document>
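Note: the url attribute has no effect here and can be removed. (If the dataSource is a local file rather than a database, url is the file path, for example url="d:/path ${f.fileAbsolutePath}", where f is the name of the parent entity.)
If the url is wrong, an "invalid SQL statement" error is reported.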
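For the local-file case mentioned in the note on url, a rough sketch using FileListEntityProcessor as the parent entity could look like this. The base directory d:/path, the fileName pattern, the data source name bin, and the entity name tika are assumptions for illustration; fileAbsolutePath is the implicit field that FileListEntityProcessor provides for each file.

```xml
<dataConfig>
  <!-- Binary file data source read by the Tika entity ("bin" is a hypothetical name) -->
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <!-- Parent entity "f" lists files; it emits no documents itself (rootEntity="false") -->
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="d:/path" fileName=".*\.pdf" recursive="true"
            rootEntity="false" dataSource="null">
      <field column="fileAbsolutePath" name="id" />
      <!-- url resolves to the absolute path of the current file from entity "f" -->
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${f.fileAbsolutePath}" format="text">
        <field column="text" name="docContent" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Here url does matter, because BinFileDataSource opens whatever path it is given, unlike FieldStreamDataSource, which reads from the column named in dataField.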
In dataField, attach is the name of the parent entity. attach.CONTENT must be uppercase; otherwise the import fails with: No field available for name : attach.content Processing Document # 1.
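Important: the Blob column name in the database must not be the same as the name of the corresponding field in schema.xml. Otherwise, the imported value of the Blob field is <str name="abc">oracle.sql.BLOB@1042c25</str>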
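One way to avoid such a name clash, sketched here under assumptions, is to alias the Blob column in the query so that it no longer matches any schema field. The alias ATT_BLOB is hypothetical and is assumed not to exist in schema.xml; Oracle returns unquoted aliases in uppercase, so dataField refers to it in uppercase as well. The data sources orcle and f1 are the ones declared in the Blob configuration above.

```xml
<document>
  <!-- "content as ATT_BLOB" renames the Blob column (hypothetical alias) -->
  <entity dataSource="orcle" name="attach"
          query="select att_id, content as ATT_BLOB from no_bul_attcontent where att_id='645cf16b40d4472ca649084c6aa099fe'">
    <field column="ATT_ID" name="id" />
    <!-- dataField points at the aliased column of the parent entity -->
    <entity dataSource="f1" processor="TikaEntityProcessor"
            url="content" dataField="attach.ATT_BLOB">
      <field column="text" name="docContent" />
    </entity>
  </entity>
</document>
```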