  • Crawling data with Nutch 2.2.1 + MySQL

    Base environment: Linux (CentOS 6.5), Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7

    1. Download the source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and extract it.

    2. Switch the data store to MySQL

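      Edit ivy/ivy.xml under the Nutch root directory. The MySQL storage dependencies are commented out by default, so uncomment them and set gora-core to 0.2.1.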

       
    <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
    <!-- Uncomment this to use SQL as Gora backend. It should be noted that the
    gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3. Users should
    downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
    <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
    <!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
    <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

    3. Set the database URL, user name, and password in conf/gora.properties under the Nutch root directory, and comment out the original HSQLDB settings:
      
    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=
    # MySQL properties #
    ###############################
    gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
    gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
    gora.sqlstore.jdbc.user=user
    gora.sqlstore.jdbc.password=pwd
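
A stray space or line break inside gora.sqlstore.jdbc.url becomes part of the property value and is easy to overlook. The following is a minimal Python sketch for sanity-checking such a URL before starting a crawl. check_jdbc_url is a hypothetical helper written for this note, not part of Nutch or Gora, and it only inspects the string without connecting to MySQL.

```python
from urllib.parse import urlsplit, parse_qsl


def check_jdbc_url(url):
    """Split a MySQL JDBC URL into (host, port, database, params).

    Raises ValueError if the URL contains whitespace or lacks the
    "jdbc:" prefix. Hypothetical helper, for illustration only.
    """
    if any(ch.isspace() for ch in url):
        raise ValueError("JDBC URL contains whitespace: %r" % url)
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError("not a JDBC URL: %r" % url)
    # What remains looks like mysql://host:3306/nutch?key=value&...
    parts = urlsplit(url[len(prefix):])
    params = dict(parse_qsl(parts.query))
    return parts.hostname, parts.port, parts.path.lstrip("/"), params
```

Calling check_jdbc_url with the value of gora.sqlstore.jdbc.url returns the host, port, database name, and a dictionary of connection parameters, or raises ValueError when the value contains whitespace.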
    

    4. Edit conf/nutch-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
     
    <configuration>
    <property>
    <name>http.agent.name</name>
    <value>My Spider</value>
    </property>
     
    <property>
    <name>http.accept.language</name>
    <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
    </property>
     
    <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
    <description>The character encoding to fall back to when no other information
    is available</description>
    </property>
     
    <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore</value>
    </property>
     
    <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
    </property>
     
    </configuration>

    5. Build the source with Ant

      Run ant in the Nutch root directory:

    job:
          [jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job
    
    runtime:
        [mkdir] Created dir: /home/hadoop/nutch221/runtime
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/local
        [mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
         [copy] Copying 26 files to /home/hadoop/nutch221/runtime/local/conf
         [copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
         [copy] Copying 100 files to /home/hadoop/nutch221/runtime/local/lib
         [copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
         [copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test
    
    BUILD SUCCESSFUL
    Total time: 41 seconds

    The build succeeded.

    6. Create the database and the webpage table

    CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
    USE nutch;

    CREATE TABLE `webpage` (`id` varchar(767) CHARACTER SET latin1 NOT NULL,
    `headers` blob,
    `text` mediumtext DEFAULT NULL,
    `status` int(11) DEFAULT NULL,
    `markers` blob,
    `parseStatus` blob,
    `modifiedTime` bigint(20) DEFAULT NULL,
    `score` float DEFAULT NULL,
    `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
    `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
    `content` mediumblob,
    `title` varchar(2048) DEFAULT NULL,
    `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
    `fetchInterval` int(11) DEFAULT NULL,
    `prevFetchTime` bigint(20) DEFAULT NULL,
    `inlinks` mediumblob,
    `prevSignature` blob,
    `outlinks` mediumblob,
    `fetchTime` bigint(20) DEFAULT NULL,
    `retriesSinceFetch` int(11) DEFAULT NULL,
    `protocolStatus` blob,
    `signature` blob,
    `metadata` blob,
    PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
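
One plausible reason the id column is declared as latin1 while the table default is utf8: with the default InnoDB row format in MySQL 5.5, an index key is limited to 767 bytes, and MySQL's utf8 charset reserves up to 3 bytes per character. The arithmetic can be sketched as follows; the function names are made up for this illustration.

```python
# InnoDB index key limit (bytes) for the default row format in MySQL 5.5.
INNODB_MAX_KEY_BYTES = 767

# Maximum bytes per character for the two charsets used in the table.
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def key_bytes(chars, charset):
    """Worst-case byte length of a VARCHAR(chars) key in the given charset."""
    return chars * BYTES_PER_CHAR[charset]


def fits_in_index(chars, charset):
    """True if a VARCHAR(chars) column can be fully indexed."""
    return key_bytes(chars, charset) <= INNODB_MAX_KEY_BYTES


def max_indexable_chars(charset):
    """Largest VARCHAR length that still fits in one index key."""
    return INNODB_MAX_KEY_BYTES // BYTES_PER_CHAR[charset]
```

Under these assumptions a varchar(767) primary key fits only as latin1; declared as utf8 it would need 2301 bytes, and the longest fully indexable utf8 key would be 255 characters.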
    7. Run the crawl:
    bin/nutch crawl urls -depth 3
     
    Once the crawl finishes, the fetched content can be inspected in MySQL.
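
The crawl command in step 7 takes urls as its seed directory, which has to exist and contain at least one text file with one start URL per line. The post does not show how it was created, so the following is only a sketch: write_seed_file and the file name seed.txt are assumptions, and any seed URL passed in is the reader's choice.

```python
import os


def write_seed_file(seed_dir, urls, filename="seed.txt"):
    """Create seed_dir if needed and write one URL per line.

    Returns the path of the written file. The file name is arbitrary;
    it is an assumption of this sketch, not a Nutch requirement.
    """
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url.strip() + "\n")
    return path
```

For example, write_seed_file("urls", ["http://nutch.apache.org/"]) run from the directory where bin/nutch is invoked would produce urls/seed.txt.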
    8. Run the indexing step:
    bin/nutch elasticindex clustername -all
     
    Problem encountered: step 7 failed with the following exception:
    [hadoop@master bin]$ nutch crawl urls -depth 3
    Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.nutch.storage.StorageUtils.getDataStoreClass(StorageUtils.java:89)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:73)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:136)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

    #####################

    Following this online write-up did not solve it: http://blog.sina.com.cn/s/blog_3c9872d00101p4f0.html

    #Official solution:

    #http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%3CCAErFeLSwoZ2UhxMA1iYi7H-L52Ojo-j9KoWT7xDittBzvB0F0A@mail.gmail.com%3E

    ######################

    20141103

    Solution: rebuilding the project (running ant again, as in step 5) fixed it.
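
A ClassNotFoundException for org.apache.gora.sql.store.SqlStore generally means the gora-sql jar is not on the classpath of the runtime being launched, which is consistent with a rebuild fixing it after ivy.xml was changed. Below is a rough Python sketch for checking this, assuming the jars end up in runtime/local/lib as the Ant output in step 5 suggests. The helper names and the jar name prefixes are assumptions derived from the dependency names in ivy.xml.

```python
import glob
import os


def find_jars(lib_dir, name_prefix):
    """Return sorted jar file names in lib_dir that start with name_prefix."""
    pattern = os.path.join(lib_dir, name_prefix + "*.jar")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


def missing_backends(lib_dir, prefixes=("gora-sql", "mysql-connector-java")):
    """Return the prefixes for which no jar was found in lib_dir."""
    return [p for p in prefixes if not find_jars(lib_dir, p)]
```

If missing_backends("runtime/local/lib") returns a non-empty list, the corresponding dependency was probably not retrieved by the last build, and ivy.xml is worth rechecking before rebuilding.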

    A new problem then appeared:

    ./nutch crawl ../urls -depth 3
    InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
    Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local713211278_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)

     ./nutch crawl ../urls -depth 3 -topN 5
    InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
    Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local1302478362_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
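
The traces above only report that the inject job failed, and the underlying cause is normally recorded in the Nutch log rather than on the console. A small Python sketch for pulling the relevant lines out of a log, assuming the local runtime writes to logs/hadoop.log; error_lines is a hypothetical helper and the marker strings are a guess at what is worth looking for.

```python
def error_lines(log_text, markers=("ERROR", "Exception", "Caused by")):
    """Return the lines of log_text that contain any of the markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]
```

A typical use would be to read the log file into a string, pass it to error_lines, and look at the last few entries returned.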

    References:

    Reference tutorial: http://nlp.solutions.asia/?p=362

    https://issues.apache.org/jira/browse/NUTCH-1473

  • Original article: https://www.cnblogs.com/zhanggl/p/3968130.html