  • Web Data Mining (16): Aperture Data Extraction (9): Data Sources

    One of the central concepts of Aperture is the notion of a DataSource. A DataSource contains all information necessary to locate the individual information resources in a physical source. For example, a FileSystemDataSource holds a root directory, a set of patterns that describe what files to include or exclude, a maximum depth, etc., thereby effectively describing a set of files.
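To make the idea of "describing a set of files" concrete, here is a minimal plain-Java sketch. It is not Aperture code: `FileSetDescription`, its `describes` method, and the choice of regular expressions over the relative path are all assumptions made for illustration. The point is only that a root, include/exclude patterns, and a maximum depth are enough to decide whether a file belongs to the source, without opening anything.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical plain-Java analogy, NOT an Aperture class. It only describes
// a set of files; it never opens or reads any of them.
final class FileSetDescription {
    private final Path root;
    private final List<Pattern> includes; // regexes over the '/'-separated relative path
    private final List<Pattern> excludes;
    private final int maxDepth;           // 1 = direct children of the root only

    FileSetDescription(Path root, List<Pattern> includes,
                       List<Pattern> excludes, int maxDepth) {
        this.root = root.normalize();
        this.includes = List.copyOf(includes);
        this.excludes = List.copyOf(excludes);
        this.maxDepth = maxDepth;
    }

    // True if the path lies below the root, within maxDepth, matches at least
    // one include pattern (or no include patterns are set) and no exclude pattern.
    boolean describes(Path file) {
        Path normalized = file.normalize();
        if (!normalized.startsWith(root) || normalized.equals(root)) {
            return false;
        }
        Path relative = root.relativize(normalized);
        if (relative.getNameCount() > maxDepth) {
            return false;
        }
        String name = relative.toString().replace('\\', '/');
        boolean included = includes.isEmpty()
                || includes.stream().anyMatch(p -> p.matcher(name).matches());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(name).matches());
        return included && !excluded;
    }
}

final class FileSetDescriptionDemo {
    public static void main(String[] args) {
        FileSetDescription texts = new FileSetDescription(
                Paths.get("/data/docs"),
                List.of(Pattern.compile(".*\\.txt")),
                List.of(Pattern.compile("tmp/.*")),
                2);
        System.out.println(texts.describes(Paths.get("/data/docs/notes/a.txt"))); // true
        System.out.println(texts.describes(Paths.get("/data/docs/tmp/a.txt")));   // false
    }
}
```

The real FileSystemDataSource stores equivalent settings in its configuration; how it represents patterns internally may differ from this sketch.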
    One of the main purposes of a DataSource is to hold all data needed by a Crawler to crawl the physical source and retrieve all the individual resources in it. There are quite a few DataSource subclasses in Aperture. The following diagram contains a selection of them.


    The DataSource implementations currently available provide specific 'get' and 'set' methods for the configuration properties accepted by the data source, which gives a convenient interface that abstracts from the underlying RDF properties. All configuration data is stored in an RDFContainer. Each data source type comes with its own specific properties. There is also a set of generic properties used by many data source types (username, password, etc.). You can look at the source code of the DataSource implementation class of your choice to see which properties are used. Note that the data source classes are not stored in SVN. They are generated automatically from an RDF file that describes the class. The generation is done by a Maven plugin, configured through entries in the pom.xml file of the datasource module. If you'd like to develop your own data source implementation, try to mimic the existing implementations or ask on aperture-devel for help.
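The pattern of typed accessors layered over a generic property container can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the generated Aperture code: `PropertyContainer` stands in for the RDFContainer (here just a map), `FolderSourceConfig` stands in for a generated DataSource class, and the `urn:example:` property identifiers are made up. Failing with an IllegalStateException when no configuration has been set is also this sketch's own choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a generic property container. Aperture keeps
// configuration in an RDFContainer; this sketch uses a plain map instead.
final class PropertyContainer {
    private final Map<String, String> values = new HashMap<>();

    void put(String propertyUri, String value) {
        values.put(propertyUri, value);
    }

    String get(String propertyUri) {
        return values.get(propertyUri);
    }
}

// Hypothetical typed wrapper: callers use getters and setters and never
// touch the property identifiers directly.
final class FolderSourceConfig {
    // Made-up property identifiers, for illustration only.
    static final String ROOT_FOLDER = "urn:example:rootFolder";
    static final String MAX_DEPTH = "urn:example:maximumDepth";

    private PropertyContainer configuration;

    void setConfiguration(PropertyContainer configuration) {
        this.configuration = configuration;
    }

    void setRootFolder(String path) {
        requireConfiguration().put(ROOT_FOLDER, path);
    }

    String getRootFolder() {
        return requireConfiguration().get(ROOT_FOLDER);
    }

    void setMaximumDepth(int depth) {
        requireConfiguration().put(MAX_DEPTH, Integer.toString(depth));
    }

    // Returns null when the property has not been set.
    Integer getMaximumDepth() {
        String raw = requireConfiguration().get(MAX_DEPTH);
        return raw == null ? null : Integer.valueOf(raw);
    }

    // All accessors delegate to the container, so it must be set first.
    private PropertyContainer requireConfiguration() {
        if (configuration == null) {
            throw new IllegalStateException("setConfiguration must be called first");
        }
        return configuration;
    }
}

final class FolderSourceConfigDemo {
    public static void main(String[] args) {
        FolderSourceConfig source = new FolderSourceConfig();
        source.setConfiguration(new PropertyContainer());
        source.setRootFolder("/data/docs");
        System.out.println(source.getRootFolder());
    }
}
```

Because every accessor reads from or writes to the container, the wrapper itself holds no state of its own, which is why a configuration has to be attached before any property is set.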

    It is worth mentioning that DataSource classes only describe a data source. They do not contain any resources that would enable direct access to the source, such as InputStreams or Readers (at least, that was not the designers' intention). Any such resource is obtained by the crawler at the start of a crawl and may be encapsulated in a DataObject returned by an Accessor or Crawler. The following code demonstrates how to create and configure a FileSystemDataSource:

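    import java.io.File;
    import org.ontoware.rdf2go.RDF2Go;
    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.node.URI;
    import org.semanticdesktop.aperture.datasource.filesystem.FileSystemDataSource;
    import org.semanticdesktop.aperture.rdf.RDFContainer;
    import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;

    // determine the root folder of the source (backslashes must be escaped in Java strings)
    File rootFolder = new File("D:\\path\\to\\the\\root\\folder");
    // create the model that will store the data source configuration
    Model model = RDF2Go.getModelFactory().createModel();
    // don't forget to open it before use
    model.open();
    // determine a URI to identify the DataSource
    URI id = model.createURI("urn:test:testsource");
    // wrap the model in an RDFContainer
    RDFContainer configuration = new RDFContainerImpl(model, id);
    // create the DataSource instance
    FileSystemDataSource source = new FileSystemDataSource();
    // set the configuration (it is empty at the moment)
    source.setConfiguration(configuration);
    // and set the root folder (possible now that a configuration is attached)
    source.setRootFolder(rootFolder.getAbsolutePath());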
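The separation between describing a source and accessing it can be illustrated with a small plain-Java sketch. Nothing here is Aperture API: `FolderDescription` and `FolderCrawler` are hypothetical names, and the crawler is reduced to listing the regular files directly inside the root. What it shows is that the description is plain data, while streams are opened and closed only while a crawl is running.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical description object: plain data, no open streams or handles.
final class FolderDescription {
    private final String rootFolder;

    FolderDescription(String rootFolder) {
        this.rootFolder = rootFolder;
    }

    String getRootFolder() {
        return rootFolder;
    }
}

// Hypothetical crawler: acquires and releases every resource inside crawl().
final class FolderCrawler {
    // Returns one "fileName:byteCount" entry per regular file directly in the root.
    List<String> crawl(FolderDescription description) {
        List<String> results = new ArrayList<>();
        try (Stream<Path> entries = Files.list(Paths.get(description.getRootFolder()))) {
            List<Path> files = entries
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
            for (Path file : files) {
                // The stream exists only for the duration of this block.
                try (InputStream in = Files.newInputStream(file)) {
                    results.add(file.getFileName() + ":" + in.readAllBytes().length);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }
}

final class FolderCrawlerDemo {
    public static void main(String[] args) {
        String root = args.length > 0 ? args[0] : ".";
        for (String entry : new FolderCrawler().crawl(new FolderDescription(root))) {
            System.out.println(entry);
        }
    }
}
```

Keeping the description free of open resources means it can be created, stored, and passed around long before (or without) any crawl taking place.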
    
  • Original article: https://www.cnblogs.com/chenying99/p/3194778.html