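Running:
1. Download jspider-0.5.0-dev.zip and unzip it.
2. Choose Start -> Run -> cmd to open a command prompt, then change to the jspider-0.5.0-dev/bin directory.
3. Try crawling the site http://j-spider.sourceforge.net:
jspider http://j-spider.sourceforge.net >> out.txt
The following is displayed on screen: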
JSpider v0.5.0 DEV (http://j-spider.sourceforge.net)
Build: 20030502
Started from .
[Engine] jspider.home=..
[Engine] default output folder=..\output
[Engine] starting with configuration 'default'
Two new files now appear in the bin directory: out.txt and velocity.log.
The contents of out.txt are as follows:
------------------------------------------------------------
JSpider startup script
JSPIDER_HOME=..
------------------------------------------------------------
INFO [core.impl.PluginFactory] Loading 4 plugins.
INFO [core.impl.PluginFactory] Loading plugin configuration 'console'
INFO [mod.plugin.console.ConsolePlugin] Prefix set to '[Plugin] '
INFO [core.impl.PluginFactory] Plugin not configured for local event filtering
INFO [core.impl.PluginFactory] Plugin Name : Console writer JSpider module
INFO [core.impl.PluginFactory] Plugin Version : v1.0
INFO [core.impl.PluginFactory] Plugin Vendor : http://www.javacoding.net
INFO [core.impl.PluginFactory] Loading plugin configuration 'velocity'
INFO [core.impl.PluginFactory] Plugin uses local event filtering
INFO [core.impl.PluginFactory] Plugin Name : Velocity Template JSpider module
INFO [core.impl.PluginFactory] Plugin Version : v1.0
INFO [core.impl.PluginFactory] Plugin Vendor : http://www.javacoding.net
INFO [core.impl.PluginFactory] Loading plugin configuration 'statusbasedfilewriter'
INFO [mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin] initialized.
INFO [core.impl.PluginFactory] Plugin not configured for local event filtering
INFO [core.impl.PluginFactory] Plugin Name : Status based Filewriter JSpider plugin
INFO [core.impl.PluginFactory] Plugin Version : v1.0
INFO [core.impl.PluginFactory] Plugin Vendor : http://www.javacoding.net
INFO [core.impl.PluginFactory] Loading plugin configuration 'xmldump'
INFO [core.impl.PluginFactory] Plugin uses local event filtering
INFO [core.impl.PluginFactory] Plugin Name : Velocity Template JSpider module
INFO [core.impl.PluginFactory] Plugin Version : v1.0
INFO [core.impl.PluginFactory] Plugin Vendor : http://www.javacoding.net
INFO [core.impl.PluginFactory] Loaded 4 plugins.
INFO [mod.plugin.velocity.VelocityPlugin] writing trace file: true
INFO [mod.plugin.velocity.VelocityPlugin] writing dump file: true
INFO [mod.plugin.velocity.VelocityPlugin] Velocity template folder : velocity
INFO [mod.plugin.velocity.VelocityPlugin] Writing to trace file: ./velocity-trace.out
INFO [mod.plugin.velocity.VelocityPlugin] Writing to dump file: ./velocity-dump.out
INFO [mod.plugin.velocity.VelocityPlugin] writing trace file: false
INFO [mod.plugin.velocity.VelocityPlugin] writing dump file: true
INFO [mod.plugin.velocity.VelocityPlugin] Velocity template folder : xmldump
INFO [mod.plugin.velocity.VelocityPlugin] Writing to dump file: ./xml-dump.xml
INFO [core.storage.StorageFactory] Storage provider class is 'class net.javacoding.jspider.core.storage.memory.InMemoryStorageProvider'
INFO [core.SpiderContext] default user Agent is 'JSpider v0.5.0-dev (http://j-spider.sourceforge.net)'
INFO [core.task.SchedulerFactory] TaskScheduler provider class is 'class net.javacoding.jspider.core.task.impl.DefaultSchedulerProvider'
INFO [core.Spider] Spider born - threads: spiders: 5, thinkers: 1
[Plugin] Module : Console writer JSpider module
[Plugin] Version: v1.0
[Plugin] Vendor : http://www.javacoding.net
[Plugin] Spidering Started, baseURL = http://j-spider.sourceforge.net
INFO [core.SpiderContext] using userAgent 'JSpider v0.5.0-dev (http://j-spider.sourceforge.net)' for site 'http://j-spider.sourceforge.net'
[Plugin] site discovered : http://j-spider.sourceforge.net
[Plugin] resource discovered: http://j-spider.sourceforge.net
INFO [core.throttle.ThrottleFactory] Throttle provider class is 'class net.javacoding.jspider.core.throttle.impl.DistributedLoadThrottleProvider'
[Plugin] Job monitor: 0% (0/1) [S:0% (0/1) | T:0% (0/0)] [blocked:1] [assigned:1]
[Plugin] resource discovered: http://j-spider.sourceforge.net/robots.txt
[Plugin] 200 - http://j-spider.sourceforge.net/robots.txt - text/plain 527 461 ms
INFO [mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin] creating file for status '200'
[Plugin] robots.txt fetched from site [Site: http://j-spider.sourceforge.net - ROBOTSTXT_HANDLED *]
[Plugin] net.javacoding.jspider.api.event.site.UserAgentObeyedEvent obeyed rules for useragent 'JSpider' as found in robots.txt on site http://j-spider.sourceforge.net
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:20% [idle: 80%, blocked: 20%, busy: 0%], size: 5
[Plugin] Job monitor: 66% (2/3) [S:50% (1/2) | T:100% (1/1)] [blocked:0] [assigned:1]
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:20% [idle: 80%, blocked: 0%, busy: 20%], size: 5
……
[Plugin] Job monitor: 66% (2/3) [S:50% (1/2) | T:100% (1/1)] [blocked:0] [assigned:1]
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:20% [idle: 80%, blocked: 0%, busy: 20%], size: 5
[Plugin] 200 - http://j-spider.sourceforge.net - text/html 5687 9673 ms
[Plugin] resource discovered: http://j-spider.sourceforge.net/css/ie.css
[Plugin] resource discovered: http://j-spider.sourceforge.net/img/grey.gif
……
[Plugin] resource discovered: http://j-spider.sourceforge.net/img/title_information.gif
INFO [core.SpiderContext] site http://www.sourceforge.net must not be handled.
[Plugin] site discovered : http://www.sourceforge.net
……
[Plugin] http://j-spider.sourceforge.net parsed (handled)
[Plugin] 200 - http://j-spider.sourceforge.net/css/ie.css - text/css 114 440 ms
[Plugin] http://j-spider.sourceforge.net/css/ie.css - Ignored for parsing
[Plugin] Job monitor: 58% (29/50) [S:12% (3/24) | T:100% (26/26)] [blocked:0] [assigned:6]
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:100% [idle: 0%, blocked: 100%, busy: 0%], size: 5
[Plugin] 200 - http://j-spider.sourceforge.net/img/grey.gif - image/gif 49 441 ms
[Plugin] http://j-spider.sourceforge.net/img/grey.gif - Ignored for parsing
[Plugin] Job monitor: 60% (31/51) [S:16% (4/24) | T:100% (27/27)] [blocked:0] [assigned:6]
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:100% [idle: 0%, blocked: 100%, busy: 0%], size: 5
ERROR [core.task.work.SpiderHttpURLTask] exception during spidering
java.io.IOException
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at net.javacoding.jspider.core.task.work.SpiderHttpURLTask.execute(Unknown Source)
at net.javacoding.jspider.core.threading.WorkerThread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: http://j-spider.sourceforge.net/img/logo.gif
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
2 more
[Plugin] 404 - ERROR !!!http://j-spider.sourceforge.net/img/logo.gif
INFO [mod.plugin.statusbasedfilewriter.StatusBasedFileWriterPlugin] creating file for status '404'
[Plugin] Job monitor: 62% (32/51) [S:20% (5/24) | T:100% (27/27)] [blocked:0] [assigned:6]
……
[Plugin] 200 - http://j-spider.sourceforge.net/img/title_other.gif - image/gif 1044 9403 ms
[Plugin] http://j-spider.sourceforge.net/img/title_other.gif - Ignored for parsing
INFO [core.Spider] Stopped spider workers
INFO [core.Spider] Stopped thinker workers
[Plugin]
SPIDERING SUMMARY :
known urls . : 47
visited urls .. : 27
parsed urls : 11
parse ignored urls .. : 16
parse error urls . : 0
not visited urls . : 20
fetching ignored urls .. : 19
forbidden urls : 0
fetch error urls . : 1
not yet visited urls .. : 0
[Plugin] Spidering Stopped
INFO [mod.plugin.velocity.VelocityPlugin] writing dump - this could take a while
INFO [mod.plugin.velocity.VelocityPlugin] writing dump - this could take a while
[Plugin] Job monitor: 100% (92/92) [S:100% (28/28) | T:100% (64/64)] [blocked:0] [assigned:0]
INFO [core.Spider] Spidering done!
INFO [core.Spider] Elapsed time : 46127
[Plugin] ThreadPool Thinkers occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 1
[Plugin] ThreadPool Spiders occupation:0% [idle: 100%, blocked: 0%, busy: 0%], size: 5
As the log shows, the actual spider, parse, and dump actions are all carried out by plugins. velocity.log is the log of the velocity plugin.
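Because the console plugin writes one line per fetched resource, out.txt can be summarized with a few lines of script. The following is a minimal sketch, not part of JSpider: the function name `tally_statuses` and the regular expression are assumptions, derived only from the line shapes visible in the log above.

```python
import re
from collections import Counter

# Matches console-plugin result lines as they appear in the log above, e.g.
#   [Plugin] 200 - http://host/path - text/css 114 440 ms
#   [Plugin] 404 - ERROR !!!http://host/path
RESULT_LINE = re.compile(r"^\[Plugin\] (\d{3}) - (?:ERROR !!!)?(\S+)")


def tally_statuses(lines):
    """Count fetched resources per HTTP status code."""
    counts = Counter()
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the log above; other line types are ignored.
sample = [
    "[Plugin] 200 - http://j-spider.sourceforge.net/robots.txt - text/plain 527 461 ms",
    "[Plugin] resource discovered: http://j-spider.sourceforge.net/css/ie.css",
    "[Plugin] 200 - http://j-spider.sourceforge.net/css/ie.css - text/css 114 440 ms",
    "[Plugin] 404 - ERROR !!!http://j-spider.sourceforge.net/img/logo.gif",
    "INFO [core.Spider] Stopped spider workers",
]

print(dict(tally_statuses(sample)))  # {'200': 2, '404': 1}
```

To run it against a real crawl, pass the open file instead of the sample list, for example `tally_statuses(open("out.txt"))`.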
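The default output directory is output. It contains seven files: 200.out, 404.out, log4j.out, README.txt, velocity-dump.out, velocity-trace.out, and xml-dump.xml. These files record the results of the crawl. The *.out files are plain text and can be opened in any text editor. The output directory contains no web pages, which means that the default configuration does not save the pages it fetches.
The configuration files are in the conf directory. There are two kinds:
(1) *.properties: program configuration files, which control the program's behavior.
For example, the contents of conf\default\plugins\download\sites.properties: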
# -----------------------------------------------------------------------------
# Websites configuration file
# -----------------------------------------------------------------------------
#
# $Id: sites.properties,v 1.4 2003/04/25 21:28:55 vanrogu Exp $
#
# -----------------------------------------------------------------------------
jspider.site.config.base=base
jspider.site.config.default=skip
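To inspect such a file from a script, the key/value pairs can be read with a small parser. This is a minimal sketch under stated assumptions: `parse_properties` is a hypothetical helper, and it handles only the simple subset of the .properties format used in the file above (`#` comments and `key=value` lines, with no line continuations or escapes). What the values `base` and `skip` mean to JSpider is not explained here; the sketch only reads them.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# The two settings from the sites.properties file shown above.
sites_properties = """\
# Websites configuration file
jspider.site.config.base=base
jspider.site.config.default=skip
"""

props = parse_properties(sites_properties)
print(props["jspider.site.config.base"])     # base
print(props["jspider.site.config.default"])  # skip
```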
(2) *.vm: output format files, which control the program's output. JSpider uses the third-party tool Velocity for this.
*.vm files are Velocity templates. For example, conf\default\plugins\velocity\engineSpideringStoppedEvent.vm:
[${eventName}]
known urls . : ${event.summary.known}
visited urls .. : ${event.summary.visited}
parsed urls : ${event.summary.parsed}
parse ignored urls .. : ${event.summary.ignoredForParsing}
parse error urls . : ${event.summary.parseErrors}
not visited urls . : ${event.summary.notVisited}
fetching ignored urls .. : ${event.summary.ignoredForFetching}
forbidden urls : ${event.summary.forbidden}
fetch error urls . : ${event.summary.fetchErrors}
not yet visited urls .. : ${event.summary.unvisited}
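Each `${event.summary.xxx}` reference in the template is replaced by the corresponding value when the event is written out. The sketch below imitates that substitution to show the idea. It is not Velocity: the `render` function and the nested-dictionary context are assumptions made for illustration, whereas Velocity itself resolves such references against Java objects. The numbers are taken from the summary in the log above.

```python
import re

# Matches references of the form ${a.b.c}
REFERENCE = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def render(template, context):
    """Replace each ${dotted.path} with the value found in nested dicts."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)
    return REFERENCE.sub(lookup, template)


# Three lines in the style of engineSpideringStoppedEvent.vm.
template = (
    "known urls : ${event.summary.known}\n"
    "visited urls : ${event.summary.visited}\n"
    "not visited urls : ${event.summary.notVisited}\n"
)

# Values from the SPIDERING SUMMARY of the run shown above.
context = {"event": {"summary": {"known": 47, "visited": 27, "notVisited": 20}}}

print(render(template, context))
```

In this run the figures are consistent with one another: the 47 known URLs are the 27 visited plus the 20 not visited.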