  • Web Data Mining (11): Data Extraction with Aperture (7): Important APIs in Aperture

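    In my view, describing Aperture's APIs in the abstract would only leave readers confused; an API stripped of its concrete context feels lifeless. People come to understand things by moving from the specific to the general, from the concrete to the abstract. For that reason, this article works through examples that supply the context.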

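    This article first demonstrates a simple data extraction program. The basic flow is:

    1. Identify the file's MIME type from its InputStream;

    2. Use the identified MIME type to look up an ExtractorFactory, then obtain an Extractor from it;

    3. Call the Extractor's extract method to populate the RDFContainer;

    4. Output the Model in RDF format.

    The code is as follows: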

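    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.PrintWriter;
    import java.nio.charset.Charset;
    import java.util.Set;

    import org.ontoware.rdf2go.RDF2Go;
    import org.ontoware.rdf2go.model.Model;
    import org.ontoware.rdf2go.model.Syntax;
    import org.ontoware.rdf2go.model.node.URI;
    import org.ontoware.rdf2go.model.node.impl.URIImpl;
    import org.semanticdesktop.aperture.extractor.Extractor;
    import org.semanticdesktop.aperture.extractor.ExtractorFactory;
    import org.semanticdesktop.aperture.extractor.ExtractorRegistry;
    import org.semanticdesktop.aperture.extractor.impl.DefaultExtractorRegistry;
    import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
    import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;
    import org.semanticdesktop.aperture.rdf.RDFContainer;
    import org.semanticdesktop.aperture.rdf.impl.RDFContainerImpl;
    import org.semanticdesktop.aperture.util.IOUtil;
    import org.semanticdesktop.aperture.vocabulary.NIE;

    public class ExtractorExample {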
        public static void main(String[] args) throws Exception {
            // create a MimeTypeIdentifier
            MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();
            // create an ExtractorRegistry containing all available
            // ExtractorFactories
            ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();
            // read as many bytes of the file as desired by the MIME type identifier
            File file = new File("/home/chenying/web/news1.html");
            FileInputStream stream = new FileInputStream(file);
            BufferedInputStream buffer = new BufferedInputStream(stream);
            byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
            stream.close();
            // let the MimeTypeIdentifier determine the MIME type of this file
            String mimeType = identifier.identify(bytes, file.getPath(), null);
            // skip when the MIME type could not be determined
            if (mimeType == null) {
                System.err.println("MIME type could not be established.");
                return;
            }
            //System.out.println("mimeType:"+mimeType);
            // create the RDFContainer that will hold the RDF model
            URI uri = new URIImpl(file.toURI().toString());
            Model model = RDF2Go.getModelFactory().createModel();
            model.open();
            RDFContainer container = new RDFContainerImpl(model, uri);
            // determine and apply an Extractor that can handle this MIME type
            Set factories=extractorRegistry.getExtractorFactories(mimeType);
            //Set factories = extractorRegistry.get(mimeType);
            if (factories != null && !factories.isEmpty()) {
                // just fetch the first available Extractor
                ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
                Extractor extractor = factory.get();
     
                // apply the extractor on the specified file
                // (just open a new stream rather than buffer the previous stream)
                stream = new FileInputStream(file);
                buffer = new BufferedInputStream(stream, 8192);
                extractor.extract(uri, buffer, Charset.forName("utf-8"), mimeType, container);
                stream.close();
            }
            // add the MIME type as an additional statement to the RDF model
            container.add(NIE.mimeType, mimeType);
            // report the output to System.out
            //container.getModel().writeTo(new PrintWriter(System.out),Syntax.Ntriples);
            container.getModel().writeTo(new PrintWriter(System.out),Syntax.RdfXml);
        }
    }

    Running the class above prints the Model in Syntax.RdfXml format to the Eclipse console. My output was:

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    
    <rdf:Description rdf:about="file:/home/chenying/web/news1.html">
        <rdf:type rdf:resource="http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#HtmlDocument"/>
        <plainTextContent xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">本文件为测试的解析文件
     </plainTextContent>
        <mimeType xmlns="http://www.semanticdesktop.org/ontologies/2007/01/19/nie#">text/html</mimeType>
    </rdf:Description>
    
    </rdf:RDF>

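    Much of the core work in the example above was coded by hand. If that were all Aperture could do, it would be rather discouraging. Still, I believe a simple example is the way in to the more advanced usage; without it, we tend to get lost in the maze of advanced features.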
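    To make step 1 of the first example less of a black box, here is a rough sketch of what "magic byte" MIME identification means: inspect the first bytes of a file and compare them against known signatures. MagicSniffer and its rules are hypothetical names invented for illustration; this is not Aperture's MagicMimeTypeIdentifier, which supports far more types and rules. Like the example above, the sketch returns null when no rule matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper (not part of Aperture): guesses a MIME type from leading bytes. */
final class MagicSniffer {

    // Well-known file signatures: PDF files start with "%PDF-", PNG files with an 8-byte header.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Returns a MIME type, or null when no rule matches. */
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        // HTML has no fixed signature, so look for a telltale tag in the leading text.
        String text = new String(head, StandardCharsets.ISO_8859_1).toLowerCase(Locale.ROOT);
        if (text.contains("<html") || text.contains("<!doctype html")) return "text/html";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] html = "<!DOCTYPE html><html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(sniff(html));
        System.out.println(sniff("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(sniff(new byte[] {1, 2, 3}));
    }
}
```

    A real identifier reads only as many bytes as its longest rule needs, which is the role identifier.getMinArrayLength() plays in the example above.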

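    Next comes a slightly more advanced example. The basic flow is:

    1. Create a Model;

    2. Create an RDFContainer that wraps the Model;

    3. Create a FileSystemDataSource and set its properties;

    4. Create a FileSystemCrawler and set its DataSource, DataAccessorRegistry, and CrawlerHandler (the callback handler);

    5. Call the FileSystemCrawler's crawl method.

    The code is as follows: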

    public class TutorialCrawlingExample {
    
        public static void main(String[] args) throws Exception {
            // create a new ExampleFileCrawler instance
            TutorialCrawlingExample crawler = new TutorialCrawlingExample();
    
            if (args.length != 1) {
                System.err.println("Specify the root folder");
                System.exit(-1);
            }      
            // start crawling and exit afterwards
            crawler.doCrawling(new File(args[0]));
        }

        public void doCrawling(File rootFile) throws Exception {
            // create a model that will store the data source configuration
            Model model = RDF2Go.getModelFactory().createModel();
            // open the model
            model.open();
            // .. and wrap it in an RDFContainer
            RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"), false);
            // now create the data source
            FileSystemDataSource source = new FileSystemDataSource();
            // and set the configuration container
            source.setConfiguration(configuration);
            // now we can call the type-specific setters in each DataSource class
            source.setRootFolder(rootFile.getAbsolutePath());
            // setup a crawler that can handle this type of DataSource
            FileSystemCrawler crawler = new FileSystemCrawler();
            crawler.setDataSource(source);
            crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
            crawler.setCrawlerHandler(new TutorialCrawlerHandler());
            // start crawling
            crawler.crawl();
        }
    }

    class TutorialCrawlerHandler extends CrawlerHandlerBase {

        // our 'persistent' modelSet
        private ModelSet modelSet;

        public TutorialCrawlerHandler() throws ModelException {
            super(new MagicMimeTypeIdentifier(), new DefaultExtractorRegistry(), new DefaultSubCrawlerRegistry());
            modelSet = RDF2Go.getModelFactory().createModelSet();
            modelSet.open();
        }

        public void crawlStopped(Crawler crawler, ExitCode exitCode) {
            try {
                //modelSet.writeTo(System.out, Syntax.Trix);
                modelSet.writeTo(System.out, Syntax.RdfXml);
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                modelSet.close();
            }
        }

        public RDFContainer getRDFContainer(URI uri) {
            // we create a new in-memory temporary model for each data source
            Model model = RDF2Go.getModelFactory().createModel(uri);
            // A model needs to be opened before being wrapped in an RDFContainer
            model.open();
            return new RDFContainerImpl(model, uri);
        }

        public void objectNew(Crawler crawler, DataObject object) {
            // first we try to extract the information from the binary file
            try {
                processBinary(crawler, object);
            } catch (Exception x) {
                // do some proper logging now in real applications
                x.printStackTrace();
            }
            // then we add this information to our persistent model
            modelSet.addModel(object.getMetadata().getModel());
            // don't forget to dispose of the DataObject
            object.dispose();
        }

        public void objectChanged(Crawler crawler, DataObject object) {
            // first we remove old information about the data object
            modelSet.removeModel(object.getID());
            // then we try to extract metadata and fulltext from the file
            try {
                processBinary(crawler, object);
            } catch (Exception x) {
                // do some proper logging now in real applications
                x.printStackTrace();
            }
            // and then we add the information from the temporary model to our
            // 'persistent' model
            modelSet.addModel(object.getMetadata().getModel());
            // don't forget to dispose of the DataObject
            object.dispose();
        }

        public void objectRemoved(Crawler crawler, URI uri) {
            // an object has been removed, we delete it from the rdf store
            modelSet.removeModel(uri);
        }
    }

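    After supplying the program argument (the root folder to crawl), run the class above; it likewise prints its output in Syntax.RdfXml format.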

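    In the example above we never obtain a file's InputStream by hand; Aperture does that automatically. TutorialCrawlerHandler is a helper class declared in the same source file, and it serves as the callback class for the FileSystemCrawler instance. This approach resembles SAX-style XML parsing under Java's JAXP specification: in both cases the framework drives the traversal and reports each event to a handler that we supply.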
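    For readers who have not used SAX, the sketch below shows the same callback shape using only the JDK. The parser walks the document and calls startElement for every element it meets, much as the crawler calls objectNew for every file. SaxCallbackDemo and ElementCollector are hypothetical names chosen for this illustration; the sketch involves no Aperture code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class SaxCallbackDemo {

    /** Handler whose callback records every element name the parser reports. */
    static final class ElementCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Invoked by the parser, never by our own code: the framework drives the traversal.
            names.add(qName);
        }
    }

    /** Parses the XML string and returns element names in document order. */
    static List<String> elementNames(String xml) throws Exception {
        ElementCollector handler = new ElementCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<files><file/><file/></files>")); // prints [files, file, file]
    }
}
```

    The handler holds the state (here a list, in the crawler example a ModelSet), while the framework decides when each callback fires.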

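    From the TutorialCrawlerHandler class we can see that the Model objects are stored in a ModelSet object (presumably a collection of models). When writing the callback methods we could also persist the data to the file system; the details are omitted here.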

    --------------------------------------------------------------------------- 

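    This Web Data Mining series is my own original work.

    Author: 刺猬的温驯, cnblogs (博客园)

    Article link: http://www.cnblogs.com/chenying99/archive/2013/06/15/3137067.html

    Copyright of this article belongs to the author. Reproduction or commercial distribution without the author's consent is strictly prohibited; violators will be held legally responsible.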

  • Original article: https://www.cnblogs.com/chenying99/p/3137067.html