  • nutch install

    on ubuntu
    http://wiki.apache.org/nutch/NutchTutorial#A1_Setup_Nutch_from_binary_distribution

    It is best to follow the official documentation; earlier I googled a Chinese installation tutorial written for the old 1.0 release, and it led me completely astray.

    Steps

    1 Setup Nutch from binary distribution

    • Unzip your binary Nutch package to $HOME/nutch-1.3
    • cd $HOME/nutch-1.3/runtime/local

    From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory.
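
    A minimal shell sketch of these two steps; the archive name (apache-nutch-1.3-bin.tar.gz here) and the unpacked directory are assumptions, so adjust them to match your download:

    cd $HOME
    tar -xzf apache-nutch-1.3-bin.tar.gz    # assumed archive name
    cd $HOME/nutch-1.3/runtime/local        # adjust if the archive unpacks under a different folder
    export NUTCH_RUNTIME_HOME=$(pwd)        # referred to as ${NUTCH_RUNTIME_HOME} below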

    2. Verify your Nutch installation

    • run "bin/nutch" - You can confirm a correct installation if you seeing the following:

    Usage: nutch [-core] COMMAND

    Some troubleshooting tips:

    • Run the following command if you are seeing "Permission denied":

    chmod +x bin/nutch
    • Set JAVA_HOME if you see a "JAVA_HOME not set" error. On a Mac, you can run the following command or add it to ~/.bashrc:

    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
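
    On Ubuntu the JDK usually lives under /usr/lib/jvm instead; the path below is only an example and depends on which JDK package is installed:

    readlink -f $(which java)                      # prints where the installed java actually lives
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk   # example path; adjust to your JDK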

    3. Crawl your first website

    • Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

    <property>
     <name>http.agent.name</name>
     <value>My Nutch Spider</value>
    </property>
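
    The property element must sit inside the <configuration> root of conf/nutch-site.xml. A minimal sketch of the whole file, assuming a fresh install where no other properties have been added yet:

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
      </property>
    </configuration>
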
    • mkdir -p urls
    • Create a text file named nutch inside the urls/ directory with the following content (one URL per line for each site you want Nutch to crawl); a shell sketch of both steps follows the example URL below.

    http://nutch.apache.org/
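
    Both the directory and the seed file can be created from the shell (the file name nutch matches the step above):

    mkdir -p urls
    echo "http://nutch.apache.org/" > urls/nutch   # one URL per line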

    • Edit the file conf/regex-urlfilter.txt and replace

    # accept anything else
    +.

    with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

     +^http://([a-z0-9]*\.)*nutch.apache.org/

    This will include any URL in the nutch.apache.org domain.
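
    The leading + in regex-urlfilter.txt is Nutch's include marker, not part of the regular expression itself, so a quick optional sanity check of the pattern with grep -E (a helper step, not part of the tutorial) drops it:

    # prints the URL if the pattern matches it; no output means the filter would reject it
    echo "http://nutch.apache.org/" | grep -E '^http://([a-z0-9]*\.)*nutch.apache.org/'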

    3.1 Using the Crawl Command

    Now we are ready to initiate a crawl. The crawl command takes the following parameters:

    • -dir dir names the directory to put the crawl in.

    • -threads threads determines the number of threads that will fetch in parallel.

    • -depth depth indicates the link depth from the root page that should be crawled.

    • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

    • Run the following command:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    • Now you should be able to see the following directories created:

    crawl/crawldb
    crawl/linkdb
    crawl/segments
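
    One way to confirm that the crawl actually produced data is to dump statistics from the crawl database (readdb is a standard Nutch command; the exact output format varies between versions):

    bin/nutch readdb crawl/crawldb -stats
    ls crawl/segments/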

