  • Nutch Study Notes, Part 1: Environment Setup

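    Environment: Ubuntu

    Overview:

    Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

    Hadoop, Tika, and Gora all grew out of the Nutch project.

    First install SVN and Ant. (These notes use Nutch by building it from source.)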

    sudo apt-get install ant
    sudo apt-get install subversion


    hu@hu-VirtualBox:~/data/nutch$ svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
    hu@hu-VirtualBox:~/data/nutch$ cd release-1.6/
    hu@hu-VirtualBox:~/data/nutch/release-1.6$ ant
    hu@hu-VirtualBox:~/data/nutch/release-1.6$ cd runtime/

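    Note: the runtime directory contains two subdirectories, one for each of Nutch's two run modes. local runs on a single machine; deploy depends on Hadoop.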
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ ls
    deploy  local

    So how are Nutch and Hadoop connected?
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ ls deploy/
    apache-nutch-1.6.job  bin

    Through the nutch script, which uses the hadoop command to submit apache-nutch-1.6.job to Hadoop's JobTracker.
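
    The dispatch described above can be sketched roughly as follows. This is a simplified illustration, not the actual contents of bin/nutch: the helper name nutch_launch_plan is hypothetical, and the local-mode classpath is left as a placeholder. The idea is that a *.job file in the runtime directory signals deploy mode, in which case the class is handed to the hadoop command; otherwise it runs in a local JVM.

```shell
# Simplified sketch of the local/deploy decision; nutch_launch_plan is a
# hypothetical helper, not a function from the real bin/nutch script.
nutch_launch_plan() {
  runtime_dir=$1   # e.g. runtime/deploy or runtime/local
  class=$2         # e.g. org.apache.nutch.crawl.Crawl
  job_file=""
  # A *.job file in the runtime directory signals deploy mode.
  for f in "$runtime_dir"/*nutch*.job; do
    if [ -f "$f" ]; then
      job_file=$f
    fi
  done
  if [ -n "$job_file" ]; then
    # Deploy mode: hand the job file to Hadoop.
    echo "hadoop jar $job_file $class"
  else
    # Local mode: run the class directly in a local JVM.
    echo "java -cp <local classpath> $class"
  fi
}

nutch_launch_plan runtime/deploy org.apache.nutch.crawl.Crawl
```

    The function only prints the command it would choose, so it is safe to run without Hadoop installed.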

    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ cd local/
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ mkdir urls
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ touch urls/url.txt
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ vi urls/url.txt
    Note: enter the URL to crawl in urls/url.txt: http://blog.tianya.cn
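
    The mkdir, touch, and vi steps above can also be done non-interactively. A minimal sketch, assuming it is run from runtime/local and that the seed file holds one URL per line:

```shell
# Create the seed directory and write the start URL (one URL per line).
mkdir -p urls
printf '%s\n' 'http://blog.tianya.cn' > urls/url.txt

# Show the result to confirm the seed list.
cat urls/url.txt
```

    Note that this overwrites any existing urls/url.txt; use >> instead of > to append more seed URLs.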


    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch crawl
    Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ nohup ./bin/nutch crawl urls -dir data -threads 100 -depth 3 &

    Note: for a summary of the run, check nohup.out:
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat nohup.out
    For the full details, see logs/hadoop.log.

    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls logs/
    hadoop.log
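
    Rather than reading both files in full, it can help to pull out only the lines that look like errors. A small sketch, assuming the file names above; the helper scan_crawl_logs and the 'Exception|ERROR' pattern are illustrative choices, not part of Nutch:

```shell
# Hypothetical helper: print lines that look like errors, with file names
# and line numbers, from whichever of the given log files exist.
scan_crawl_logs() {
  for log in "$@"; do
    if [ -f "$log" ]; then
      grep -H -n -E 'Exception|ERROR' "$log"
    fi
  done
  # Finding no matches is not a failure.
  return 0
}

scan_crawl_logs nohup.out logs/hadoop.log
```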

    Checking nohup.out reveals an exception:
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat nohup.out
    solrUrl is not set, indexing will be skipped...
    crawl started in: data
    rootUrlDir = urls
    threads = 100
    depth = 3
    solrUrl=null
    Injector: starting at 2013-12-08 21:10:30
    Injector: crawlDb: data/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    solrUrl is not set, indexing will be skipped...
    crawl started in: data
    rootUrlDir = urls
    threads = 100
    depth = 3
    solrUrl=null
    Injector: starting at 2013-12-08 21:10:38
    Injector: crawlDb: data/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: total number of urls rejected by filters: 0
    Injector: total number of urls injected after normalization and filtering: 1
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2013-12-08 21:10:53, elapsed: 00:00:14
    Generator: starting at 2013-12-08 21:10:53
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: Partitioning selected urls for politeness.
    Generator: segment: data/segments/20131208211101
    Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15
    Fetcher: No agents listed in 'http.agent.name' property.
    Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
        at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1389)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1274)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

    [Solution]
    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ vi conf/nutch-site.xml
    Open conf/nutch-site.xml and add the "http.agent.name" property. (conf/nutch-default.xml holds the default settings.)
    <configuration>
        <property>
          <name>http.agent.name</name>
          <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0; WUID=11ec69f3ac129124d5a2480d127648e0; WTB=2938) Gecko/20100101 Firefox/20.0</value>
          <description>HTTP 'User-Agent' request header. MUST NOT be empty -
          please set this to a single word uniquely related to your organization.

          NOTE: You should also check other related properties:

            http.robots.agents
            http.agent.description
            http.agent.url
            http.agent.email
            http.agent.version

          and set their values appropriately.

          </description>
        </property>
    </configuration>
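
    The property description above asks for a single word related to your organization rather than a full browser User-Agent string. A minimal sketch of such a configuration, written from the shell: the agent name MyNutchSpider is a placeholder, and the NUTCH_SITE_FILE variable is a hypothetical override introduced here only so the target path can be changed.

```shell
# Assumption: run from runtime/local. This overwrites conf/nutch-site.xml,
# so back up any existing file first. "MyNutchSpider" is a placeholder name.
site_file=${NUTCH_SITE_FILE:-conf/nutch-site.xml}
mkdir -p "$(dirname "$site_file")"

cat > "$site_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
</configuration>
EOF

# Confirm the property was written.
grep -c '<name>http.agent.name</name>' "$site_file"
```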

    (If you instead edit the configuration file in the source tree, i.e. release-1.6/conf/nutch-site.xml, you need to rebuild with ant after changing it.)

    hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls data/
    crawldb  linkdb  segments

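    Next time: how to inspect the crawled data in detail.

    Summary: the key to getting started with Nutch is reading through the nutch script.

    References: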

    http://yangshangchuan.iteye.com/category/275433

    http://www.oschina.net/translate/nutch-tutorial  Nutch tutorial

  • Original article: https://www.cnblogs.com/huligong1234/p/3464371.html