  • Enterprise Search Engine Development: the Connector (Part 18)

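    Once a connector instance has been created and started, it sends data in xmlfeed format over HTTP to the configured feed-receiving server. To inspect that traffic, we can route the request through an HTTP proxy server and capture it there (any other packet-capture tool works as well).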

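    // Configure the proxy: replace "proxy-host" and proxyPort (an int) with the
    // address of your proxy tool. Needs java.net.Proxy and java.net.InetSocketAddress.
    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy-host", proxyPort));
    synchronized (this) {
        // Pass the proxy so the feed request is actually routed through it
        uc = (HttpURLConnection) feedUrl.openConnection(proxy);
    }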

    With this in place, the proxy tool shows exactly what the connector sends:

    POST http://127.0.0.1:8080/hedgehog-searchEngine/xmlfeed HTTP/1.1
    Content-Type: multipart/form-data; boundary=<<
    Cache-Control: no-cache
    Pragma: no-cache
    User-Agent: Java/1.6.0_45
    Host: 127.0.0.1:8080
    Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
    Connection: keep-alive
    Content-Length: 31621
    
    --<<
    Content-Disposition: form-data; name="datasource"
    Content-Type: text/plain
    
    default_collectionName_dbconnector_1401370320421
    --<<
    Content-Disposition: form-data; name="feedtype"
    Content-Type: text/plain
    
    incremental
    --<<
    Content-Disposition: form-data; name="data"
    Content-Type: text/xml
    
    <?xml version='1.0' encoding='UTF-8'?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
    <gsafeed>
    <header>
    <datasource>default_collectionName_dbconnector_1401370320421</datasource>
    <feedtype>incremental</feedtype>
    </header>
    <group>
    <record url="googleconnector://default_collectionName_dbconnector_1401370320421.localhost/doc?docid=B/9795" displayurl="dbconnector://default_collectionName_dbconnector_1401370320421.localhost/B/9795" action="add" mimetype="text/html">
    <metadata>
    <meta name="google:displayurl" content="dbconnector://default_collectionName_dbconnector_1401370320421.localhost/B/9795"/>
    <meta name="google:mimetype" content="text/html"/>
    </metadata>
    <content encoding="base64binary">
    PGh0bWw+DQo8dGl0bGU+RGF0YWJhc2UgQ29ubmVjdG9yIFJlc3VsdCBkb2NJRD05Nzk1PC90aXRsZT4NCjxib2R5Pg0KPHRhYmxlIGJvcmRlcj0iMSI+DQo8dHIgYmdjb2xvcj0iIzlhY2QzMiI+DQo8dGg+ZG9jX3Rhc2tpZDwvdGg+PHRoPmRvY19zaXRlSUQ8L3RoPjx0aD5kb2NfZGF0ZTwvdGg+PHRoPmRvY190aXRsZTwvdGg+PHRoPmRvY19ocmVmPC90aD48dGg+ZG9jSUQ8L3RoPjx0aD5kb2NfY2F0ZUlEPC90aD48dGg+ZG9jX2NoaWxkY2F0ZUlEPC90aD4NCjwvdHI+DQo8dHI+DQo8dGQ+MTwvdGQ+PHRkPjE1PC90ZD48dGQ+MjAxMi0wOC0wMzwvdGQ+PHRkPuWMl+S6rOaWsOS4lue6qumlreW6l+WKnuWFrOalvDwvdGQ+PHRkPmh0dHA6Ly8yMTAuNzUuMjExLjUzL2djanN6bC5wclByb2plY3QucHJHQ0pTX1pMX1ZfUFJPSl9BUFBSX0lORk9fUVVFUlkuZG8/Y29kZT0zMTQ5NjImYW1wO3NlY1RhZz1wcm9qZWN0JmFtcDtzeXNvcmdhbmlkPTc1PC90ZD48dGQ+OTc5NTwvdGQ+PHRkPjE8L3RkPjx0ZD4xPC90ZD4NCjwvdHI+DQo8L3RhYmxlPg0KPC9ib2R5Pg0KPC9odG1sPg0K
    </content>
    </record>
    ……
    </group>
    </gsafeed>
    
    --<<--

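    Looking at the captured request, the connector uses POST and sends three form fields: datasource, feedtype and data. datasource is the connector instance name, feedtype is the feed type (incremental in this capture), and data carries the xmlfeed document itself.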
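
    To make the layout of the request body concrete, here is a minimal sketch that assembles the same three-part multipart/form-data body. The class and method names (FeedBodySketch, appendPart, buildBody) are hypothetical and are not taken from the connector source; only the boundary "<<", the part names and the content types are copied from the capture above. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the connector code.
final class FeedBodySketch {
    static final String BOUNDARY = "<<";   // boundary value seen in the capture
    static final String CRLF = "\r\n";

    // Appends one form-data part: boundary line, part headers, blank line, value.
    static void appendPart(StringBuilder sb, String name, String contentType, String value) {
        sb.append("--").append(BOUNDARY).append(CRLF)
          .append("Content-Disposition: form-data; name=\"").append(name).append('"').append(CRLF)
          .append("Content-Type: ").append(contentType).append(CRLF)
          .append(CRLF)
          .append(value).append(CRLF);
    }

    // Builds the whole body: datasource, feedtype, data, then the closing boundary.
    static String buildBody(String datasource, String feedtype, String xmlData) {
        StringBuilder sb = new StringBuilder();
        appendPart(sb, "datasource", "text/plain", datasource);
        appendPart(sb, "feedtype", "text/plain", feedtype);
        appendPart(sb, "data", "text/xml", xmlData);
        sb.append("--").append(BOUNDARY).append("--").append(CRLF);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "example_source" and "<gsafeed/>" are placeholder values.
        String body = buildBody("example_source", "incremental", "<gsafeed/>");
        System.out.print(body);
        // Content-Length is the byte length of the body, not its character count.
        System.out.println("bytes: " + body.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

    The receiving side has to undo exactly this framing: split on the boundary, read each part's headers, and treat the text after the blank line as the field value.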

    For the structure of the xmlfeed document, refer to the official DTD:

    <?xml version="1.0" encoding="UTF-8"?>
        <!ELEMENT gsafeed (header, group+)>
        <!ELEMENT header (datasource, feedtype)>
        <!-- datasource name should match the regex [a-zA-Z_][a-zA-Z0-9_-]*,
            the first character must be a letter or underscore,
            the rest of the characters can be alphanumeric, dash, or underscore. -->
        <!ELEMENT datasource (#PCDATA)>
        <!-- feedtype must be either 'full', 'incremental', or 'metadata-and-url' -->
        <!ELEMENT feedtype (#PCDATA)>
        <!-- group element lets you group records together and
            specify a common action for them -->
        <!ELEMENT group (record*)>
        <!-- record element can have attribute that overrides group's element-->
        <!ELEMENT record (metadata*,content*)>
        <!ELEMENT metadata (meta*)>
        <!ELEMENT meta EMPTY>
        <!ELEMENT content (#PCDATA)>
        <!-- last-modified date as per RFC822 -->
        <!-- default is 'add' -->
        <!ATTLIST group action (add|delete) "add">
        <!ATTLIST record
            url CDATA #REQUIRED
            displayurl CDATA #IMPLIED
            action (add|delete) #IMPLIED
            mimetype CDATA #IMPLIED
            last-modified CDATA #IMPLIED
            lock (true|false) "false"
            authmethod (none|httpbasic|ntlm|httpsso) "none">
        <!ATTLIST meta
            name CDATA #REQUIRED
            content CDATA #REQUIRED>
        <!-- if encoding is specified it must be base64binary as that is the only
            binary encoding that is supported -->
        <!ATTLIST content encoding (base64binary) #IMPLIED>
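
    The DTD comments state two constraints that a receiving server may want to check before accepting a feed: the datasource name pattern and the allowed feedtype values. The following is a minimal sketch of such a check. The class and method names (FeedHeaderCheck, isValidDatasource, isValidFeedtype) are hypothetical; the regular expression and the three feedtype values are taken from the DTD comments above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical validator for illustration only.
final class FeedHeaderCheck {
    // Regex quoted in the DTD comment: first char is a letter or underscore,
    // the rest may be alphanumeric, dash, or underscore.
    private static final Pattern DATASOURCE = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_-]*");

    // Feed types listed in the DTD comment.
    private static final Set<String> FEED_TYPES =
            new HashSet<String>(Arrays.asList("full", "incremental", "metadata-and-url"));

    static boolean isValidDatasource(String name) {
        // matches() requires the whole string to fit the pattern.
        return name != null && DATASOURCE.matcher(name).matches();
    }

    static boolean isValidFeedtype(String type) {
        return FEED_TYPES.contains(type);
    }

    public static void main(String[] args) {
        String ds = "default_collectionName_dbconnector_1401370320421";
        System.out.println(isValidDatasource(ds));        // datasource from the capture
        System.out.println(isValidDatasource("1bad"));    // starts with a digit
        System.out.println(isValidFeedtype("incremental"));
    }
}
```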

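    The receiving server can now accept this data and parse it.

    The parsing itself is not covered here; see the references below. I recommend the Woodstox parser, which implements the StAX specification.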

    Parsing XML with StAX, Part 1: An introduction to the Streaming API for XML (StAX)
    http://www.ibm.com/developerworks/cn/xml/x-stax1.html 

    Parsing XML with StAX, Part 2: Pull parsing and events
    http://www.ibm.com/developerworks/cn/xml/x-stax2.html 

    Parsing XML with StAX, Part 3: Using custom events and writing XML
    http://www.ibm.com/developerworks/cn/xml/x-stax3.html 

    The Geronimo renegade: Using integrated packages: Codehaus' Woodstox
    http://www.ibm.com/developerworks/cn/opensource/os-ag-renegade15/ 

    Woodstox official site

    http://woodstox.codehaus.org/ 
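
    As a rough illustration of what pull parsing such a feed can look like, here is a minimal sketch using the standard javax.xml.stream API (the StAX interface that Woodstox also implements). It is not the author's receiver code. The names GsaFeedReaderSketch and FeedRecord are hypothetical, the decoded content is assumed to be UTF-8 text, and java.util.Base64 requires Java 8 or later. DTD support is switched off so the parser does not try to load gsafeed.dtd.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reader for illustration only.
final class GsaFeedReaderSketch {

    // One <record>: its url attribute and its decoded content (null if absent).
    static final class FeedRecord {
        final String url;
        final String content;
        FeedRecord(String url, String content) { this.url = url; this.content = content; }
    }

    static List<FeedRecord> parse(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process the DOCTYPE or resolve external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        List<FeedRecord> records = new ArrayList<FeedRecord>();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            String url = null;
            String content = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("record".equals(name)) {
                        url = reader.getAttributeValue(null, "url");
                        content = null;
                    } else if ("content".equals(name)) {
                        String encoding = reader.getAttributeValue(null, "encoding");
                        String text = reader.getElementText();
                        if ("base64binary".equals(encoding)) {
                            // The MIME decoder tolerates line breaks inside the payload.
                            byte[] raw = Base64.getMimeDecoder().decode(text.trim());
                            text = new String(raw, StandardCharsets.UTF_8); // assumed charset
                        }
                        content = text;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    records.add(new FeedRecord(url, content));
                }
            }
        } finally {
            reader.close();
        }
        return records;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny placeholder feed; "aGVsbG8=" is the Base64 form of "hello".
        String xml = "<?xml version='1.0' encoding='UTF-8'?>"
                + "<!DOCTYPE gsafeed PUBLIC \"-//Google//DTD GSA Feeds//EN\" \"gsafeed.dtd\">"
                + "<gsafeed><header><datasource>example_source</datasource>"
                + "<feedtype>incremental</feedtype></header><group>"
                + "<record url=\"googleconnector://example_source.localhost/doc?docid=1\">"
                + "<content encoding=\"base64binary\">aGVsbG8=</content></record>"
                + "</group></gsafeed>";
        for (FeedRecord r : parse(xml)) {
            System.out.println(r.url + " -> " + r.content);
        }
    }
}
```

    A streaming reader fits this format because a feed can hold many records with large Base64 payloads; each record can be handled as it is read, without building the whole document in memory.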

    ---------------------------------------------------------------------------

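    This series, Enterprise Search Engine Development: the Connector, is my original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Email: chenying998179@163#com (replace # with .)

    Link to this article: http://www.cnblogs.com/chenying99/p/3765047.html 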

  • Original article: https://www.cnblogs.com/chenying99/p/3765047.html