$ cd ./anthelion/anthelion/target/classes
$ java -Xmx15G -cp ../Anthelion-1.0.0-jar-with-dependencies.jar com.yahoo.research.robme.anthelion.simulation.CCFakeCrawler ./index ./network ./label ../../config/baseline.properties result.log
Necessary files:
- index: the mapping between ID and URL
- network: the graph including the IDs from the index
- label: list of the IDs which fulfil the target function
- properties: configuration file (a set of configuration files can be found in the resource folder of the distribution)
- result: the file where information about crawl performance and the crawling process is stored
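
Because the simulation is started with a large heap and several input paths, it can help to confirm that the inputs are in place before launching the JVM. The following is a minimal sketch, not part of the Anthelion distribution: the helper name check_inputs is hypothetical, and it only tests that each path exists and is readable, without validating the file contents.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (an assumption, not shipped with Anthelion).
# Usage: check_inputs PATH [PATH...]
# Returns 0 if every path exists and is readable, 1 otherwise.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            # Report each missing input on stderr and keep checking the rest.
            echo "missing or unreadable: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running `check_inputs ./index ./network ./label ../../config/baseline.properties` from the target/classes directory before the java command above reports any missing input up front. The result file is not checked because it is an output location.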
The files that we used to measure performance when crawling for HTML pages containing Microdata, Microformats, and RDFa are available on the dedicated page of the WebDataCommons project: http://webdatacommons.org/structureddata/anthelion/
Available actions within the simulation process:
- Run "init" to initialize the crawler (load the network and labels, and create the features).
- Run "start" to start the crawler and simulate a crawl. Output is written to result.log.
- Use "stop" to stop the simulation.
- Run "exit" to shut down.
- Use "status" to observe the crawling process.