Run Nutch in Eclipse on Linux and Windows (Nutch version 1.0)
Tested with
Nutch release 1.0
Eclipse 3.3 (Europa) and 3.4 (Ganymede)
Java 1.6
Ubuntu (should work on most platforms though)
Windows XP and Vista
Before you start
Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it is very useful to be able to debug Nutch in Eclipse. Sometimes examining the logs (logs/hadoop.log) is the quicker way to debug a problem.
Steps
For Windows Users
If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from the Cygwin website.
Install cygwin and add it to the PATH environment variable. You can set PATH from Control Panel > System > Advanced tab > Environment Variables, then edit or add PATH.
Example PATH:
C:\Sun\SDK\bin;C:\cygwin\bin
If you run "bash" from the Windows command line (Start > Run... > cmd.exe) from any directory, it should successfully start cygwin.
If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or turn off Vista's User Account Control (UAC). Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied
Translator's note: similar permission problems can often be fixed with chmod.
Install Nutch
Grab a fresh release of Nutch 1.0, or download and untar the official 1.0 release.
Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory.
Create a new Java Project in Eclipse
File > New > Project > Java project > click Next
Name the project (Nutch_Trunk for instance)
Select "Create project from existing source" and use the location where you downloaded Nutch
Click on Next, and wait while Eclipse is scanning the folders
Add the folder "conf" to the classpath (right-click on the project, select "Properties", then "Java Build Path" in the left menu, then the "Libraries" tab. Click the "Add Class Folder..." button and select "conf" from the list)
Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top (by checking it and clicking the "Top" button). This is required so Eclipse will take config resources (nutch-default.xml, nutch-site.xml, etc.) from our "conf" folder and not from somewhere else.
Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries.
Click the "Source" tab and set the default output folder to "Nutch_Trunk/bin/tmp_build". (You may need to create the tmp_build folder.)
Click the "Finish" button
DO NOT add "build" to the classpath!
Configure Nutch
See the Tutorial
Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
Make sure Nutch is configured correctly before testing it in Eclipse
Missing org.farng and com.etranslate
Eclipse will complain about some import statements in the parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.
Download them here:
Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (first refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click "Add Jars..." and add each .jar file individually. If that does not work, you may try clicking "Add External JARs" and then point to the two directories above).
Two Errors with RTFParseFactory
If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see NUTCH-644 and NUTCH-705) but the fix was not included in the official 1.0 release because of licensing issues. So you will need to manually alter the code to remove these 2 build errors.
In RTFParseFactory.java:
Add the following import statement:
import org.apache.nutch.parse.ParseResult;
Change
public Parse getParse(Content content) {
to
public ParseResult getParse(Content content) {
In the getParse function, replace
return new ParseStatus(ParseStatus.FAILED,
                       ParseStatus.FAILED_EXCEPTION,
                       e.toString()).getEmptyParse(conf);
with
return new ParseStatus(ParseStatus.FAILED,
                       ParseStatus.FAILED_EXCEPTION,
                       e.toString()).getEmptyParseResult(content.getUrl(), getConf());
In the getParse function, replace
return new ParseImpl(text,
                     new ParseData(ParseStatus.STATUS_SUCCESS,
                                   title,
                                   OutlinkExtractor.getOutlinks(text, this.conf),
                                   content.getMetadata(),
                                   metadata));
with
return ParseResult.createParseResult(content.getUrl(),
                                     new ParseImpl(text,
                                                   new ParseData(ParseStatus.STATUS_SUCCESS,
                                                                 title,
                                                                 OutlinkExtractor.getOutlinks(text, this.conf),
                                                                 content.getMetadata(),
                                                                 metadata)));
In TestRTFParser.java, replace
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
with
parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
Once you have made these changes and saved the files, Eclipse should build with no errors.
Build Nutch
If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.
Create Eclipse launcher
Menu Run > "Run..."
Create "New" for "Java Application"
Set in Main class:
org.apache.nutch.crawl.Crawl
On tab Arguments, Program Arguments:
urls -dir crawl -depth 3 -topN 50
In VM arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Click on "Run"
If all works, you should see Nutch getting busy at crawling
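The launcher settings can also be captured in a small Java class, which may be easier to share than an Eclipse launch configuration. This is only a sketch and not part of Nutch. The class name CrawlLauncherSketch is made up. It assumes the Nutch classes are on the project classpath. It also assumes that setting the two hadoop.log properties before the Crawl class is loaded has the same effect as passing them as -D VM arguments. Reflection is used only so that the sketch compiles without Nutch.

```java
import java.lang.reflect.Method;

public class CrawlLauncherSketch {

    /** The same program arguments as in the Eclipse launcher. */
    public static String[] programArguments() {
        return new String[] {"urls", "-dir", "crawl", "-depth", "3", "-topN", "50"};
    }

    public static void main(String[] args) throws Exception {
        // Assumption: equivalent to -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
        // as long as this runs before any Nutch or Hadoop class is loaded.
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        // Look up the main class named in the launcher and call its main method.
        Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
        Method main = crawl.getMethod("main", String[].class);
        main.invoke(null, (Object) programArguments());
    }
}
```

If Nutch is not on the classpath, the call to Class.forName fails with a ClassNotFoundException, which is itself a quick hint that the build path is incomplete.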
Debug Nutch in Eclipse (not yet tested for 0.9)
Set breakpoints and debug a crawl
It can be tricky to find out where to set breakpoints, because of the Hadoop jobs. Here are a few good places to set breakpoints:
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
If things do not work...
Yes, Nutch and Eclipse can be difficult companions sometimes.
Java Heap Size problem
If the crawler throws an IOException early in the crawl (Exception in thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find lines similar to this in hadoop.log:
2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
then you should increase the amount of RAM available to applications run from Eclipse.
Set it in:
Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
I've set mine to
-Xms5m -Xmx150m
because I have about 200MB of RAM left after running all apps.
-Xms (initial Java heap size)
-Xmx (maximum Java heap size)
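To confirm that the VM arguments actually reached the launched JVM, a tiny class run from the same Eclipse setup can print the heap limits. The following is a minimal sketch using only the standard library; the class name HeapCheck is made up for illustration. The reported maximum is usually close to the -Xmx value but need not match it exactly.

```java
public class HeapCheck {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most the JVM will try to use (governed by -Xmx).
        System.out.println("max heap:     " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently reserved (starts near -Xms).
        System.out.println("current heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free in heap: " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

If the maximum printed here does not change after editing the default VM arguments, the launch is probably using a different JRE entry than the one that was edited.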
Eclipse: Cannot create project content in workspace
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) under my workspace. When I tried to create the project from existing code, Eclipse did not let me do it from source code inside the workspace. I moved the source code out of my workspace and it worked fine.
plugin dir not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, probably better, in nutch-site.xml:
<property>
<name>plugin.folders</name>
<value>/home/....../nutch-0.9/src/plugin</value>
</property>
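When the plugin directory is not found, it helps to see which configuration files the classpath actually resolves and what plugin.folders is set to in each. The following is a minimal diagnostic sketch using only the standard library; it does not use the Nutch API, and the class name PluginFoldersCheck is made up. It assumes the usual property/name/value layout of the Nutch configuration files. For relative values it only checks the path against the JVM's working directory, which may differ from how Nutch itself resolves the path.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PluginFoldersCheck {

    /** Returns the value of the named property, or null if it is absent. */
    public static String readProperty(InputStream xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        for (String file : new String[] {"nutch-default.xml", "nutch-site.xml"}) {
            URL url = cl.getResource(file);
            if (url == null) {
                System.out.println(file + ": not found on the classpath");
                continue;
            }
            System.out.println(file + " loaded from " + url);
            InputStream in = url.openStream();
            try {
                String value = readProperty(in, "plugin.folders");
                System.out.println("  plugin.folders = " + value);
                if (value != null) {
                    // The value may be a comma-separated list of directories.
                    for (String dir : value.split(",")) {
                        File f = new File(dir.trim());
                        System.out.println("  " + f.getAbsolutePath()
                                + " is a directory: " + f.isDirectory());
                    }
                }
            } finally {
                in.close();
            }
        }
    }
}
```

Run it with the same classpath as the crawl or the unit tests. The printed URLs show whether the files come from the "conf" folder, from src/test, or from somewhere unexpected.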
No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
NOTE: Additional note for people who want to run Eclipse with the latest Nutch code
If you are getting the following exception:
org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
Execute 'ant job' (which is the default) after downloading Nutch through SVN
Update "plugin.folders" (in nutch-default.xml) to build/plugins (where ant builds plugins)
If it still fails, increase your memory allocation or find a simpler website to crawl.
Unit tests work in eclipse but fail when running ant in the command line
Suppose your unit tests work perfectly in Eclipse, but each and every one fails when running ant test on the command line, including the ones you haven't modified.
Check if you defined the plugin.folders property in hadoop-site.xml. In that case, try removing it from that file and adding it directly to nutch-site.xml.
Run ant test again. That should have solved the problem.
If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target?
classNotFound
Open the class itself, right-click
Refresh the build dir
debugging hadoop classes
Sometimes it makes sense to also have the Hadoop classes available during debugging. So, you can check out the Hadoop sources on your machine and attach the sources to the hadoop-xxx.jar. Alternatively, you can:
Remove the hadoopXXX.jar from your classpath libraries
Check out the Hadoop branch that is used within Nutch
Configure a Hadoop project similar to the Nutch project within your Eclipse
Add the Hadoop project as a dependent project of the Nutch project
You can now also set breakpoints within Hadoop classes, like InputFormat implementations etc.
Failed to get the current user's information
On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start cygwin. If it doesn't, type "path" to see your path. You should see the cygwin bin directory within the path (e.g., C:\cygwin\bin). See the steps for adding this to your PATH at the top of the article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so it will use the new PATH.
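Since the command prompt and Eclipse may see different PATH values, it can help to run the check from inside Eclipse as well. The following is a minimal sketch using only the standard library; the class name BashOnPath is made up for illustration. It lists every PATH entry that contains a bash executable, as seen by a JVM launched from Eclipse.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BashOnPath {

    /** Returns every file with one of the given names found in a PATH-style string. */
    public static List<File> findOnPath(String path, String... names) {
        List<File> hits = new ArrayList<File>();
        if (path == null) {
            return hits;
        }
        // File.pathSeparator is ";" on Windows and ":" on Linux.
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile()) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        System.out.println("PATH seen by this JVM: " + path);
        List<File> hits = findOnPath(path, "bash.exe", "bash");
        if (hits.isEmpty()) {
            System.out.println("bash not found on PATH");
        } else {
            for (File hit : hits) {
                System.out.println("found " + hit.getAbsolutePath());
            }
        }
    }
}
```

If bash is found from the command prompt but not here, Eclipse is most likely still running with the old PATH and needs a restart.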
Original credits: RenaudRichardet
Chinese translation: 空缘
The translator's ability is limited; please excuse any mistakes.
Comments welcome: guhuatian<a>gmail.com