  • Apache Jackrabbit Source Code Study (Part 1)

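    A few years ago an expert wrote a series titled "Jackrabbit Made Simple" (深入浅出 jackrabbit), available at http://ahuaxuan.iteye.com/category/65829

    I learned a great deal from it (without its help, getting to grips with Jackrabbit would probably have taken me much longer). That series is old, though: it was written against Jackrabbit 1.7, which differs somewhat from the current release. I could not resist the urge to record my own reading of the Apache Jackrabbit source code, shallow as my understanding may be, in the hope of deepening my grasp of programming and perhaps helping those who come after.

    (Note: this article may still be under revision; repost it at your own risk.)

    In the current version, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.

    This is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.

    The source code of LazyTextExtractorField is as follows: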

    /**
     * <code>LazyTextExtractorField</code> implements a Lucene field with a String
     * value that is lazily initialized from a given {@link Reader}. In addition
     * this class provides a method to find out whether the purpose of the reader
     * is to extract text and whether the extraction process is already finished.
     *
     * @see #isExtractorFinished()
     */
    public class LazyTextExtractorField extends AbstractField {
    
        /**
         * The logger instance for this class.
         */
        private static final Logger log =
            LoggerFactory.getLogger(LazyTextExtractorField.class);
    
        /**
         * The exception used to forcibly terminate the extraction process
         * when the maximum field length is reached.
         */
        private static final SAXException STOP =
            new SAXException("max field length reached");
    
        /**
         * The extracted text content of the given binary value.
         * Set to non-null when the text extraction task finishes.
         */
        private volatile String extract = null;
    
        /**
         * Creates a new <code>LazyTextExtractorField</code> with the given
         * <code>name</code>.
         *
         * @param name the name of the field.
         * @param reader the reader where to obtain the string from.
         * @param highlighting set to <code>true</code> to
         *                     enable result highlighting support
         */
        public LazyTextExtractorField(
                Parser parser, InternalValue value, Metadata metadata,
                Executor executor, boolean highlighting, int maxFieldLength) {
            super(FieldNames.FULLTEXT,
                    highlighting ? Store.YES : Store.NO,
                    Field.Index.ANALYZED,
                    highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
            executor.execute(
                    new ParsingTask(parser, value, metadata, maxFieldLength));
        }
    
        /**
         * Returns the extracted text. This method blocks until the text
         * extraction task has been completed.
         *
         * @return the string value of this field
         */
        public synchronized String stringValue() {
            try {
                while (!isExtractorFinished()) {
                    wait();
                }
                return extract;
            } catch (InterruptedException e) {
                log.error("Text extraction thread was interrupted", e);
                return "";
            }
        }
    
        /**
         * @return always <code>null</code>
         */
        public Reader readerValue() {
            return null;
        }
    
        /**
         * @return always <code>null</code>
         */
        public byte[] binaryValue() {
            return null;
        }
    
        /**
         * @return always <code>null</code>
         */
        public TokenStream tokenStreamValue() {
            return null;
        }
    
        /**
         * Checks whether the text extraction task has finished.
         *
         * @return <code>true</code> if the extracted text is available
         */
        public boolean isExtractorFinished() {
            return extract != null;
        }
    
        private synchronized void setExtractedText(String value) {
            extract = value;
            notify();
        }
    
        /**
         * Releases all resources associated with this field.
         */
        public void dispose() {
            // TODO: Cause the ContentHandler below to throw an exception
        }
    
        /**
         * The background task for extracting text from a binary value.
         */
        private class ParsingTask extends DefaultHandler implements Runnable {
    
            private final Parser parser;
    
            private final InternalValue value;
    
            private final Metadata metadata;
    
            private final int maxFieldLength;
    
            private final StringBuilder builder = new StringBuilder();
    
            public ParsingTask(
                    Parser parser, InternalValue value, Metadata metadata,
                    int maxFieldLength) {
                this.parser = parser;
                this.value = value;
                this.metadata = metadata;
                this.maxFieldLength = maxFieldLength;
            }
    
            public void run() {
                try {
                    InputStream stream = value.getStream();
                    try {
                        parser.parse(stream, this, metadata, new ParseContext());
                    } finally {
                        stream.close();
                    }
                } catch (Throwable t) {
                    if (t != STOP) {
                        log.warn("Failed to extract text from a binary property", t);
                    }
                } finally {
                    value.discard();
                }
                setExtractedText(builder.toString());
            }
    
            @Override
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                builder.append(
                        ch, start,
                        Math.min(length, maxFieldLength - builder.length()));
                if (builder.length() >= maxFieldLength) {
                    throw STOP;
                }
            }
    
            @Override
            public void ignorableWhitespace(char[] ch, int start, int length)
                    throws SAXException {
                characters(ch, start, length);
            }
    
        }
    
    }

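    As the code shows, text extraction from rich documents is done in the ParsingTask class, a Runnable that the constructor hands to an Executor, so extraction runs asynchronously.

    ParsingTask also extends DefaultHandler, which implements the four interfaces EntityResolver, DTDHandler, ContentHandler, and ErrorHandler. This is the default adapter pattern: DefaultHandler supplies empty implementations, so a subclass only has to override the callbacks it needs.

    JAXP parses XML through an event-listener model (SAX). The most important of the interfaces above is ContentHandler. ParsingTask implements it indirectly and appends each piece of text it receives to its private final StringBuilder builder = new StringBuilder() field.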
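
    The accumulate-and-stop idea can be tried without Tika or Lucene. The following is a minimal sketch that uses only the JDK's SAX parser. The class name TextCollector and its extract helper are hypothetical and are not part of Jackrabbit; the sketch only mirrors how ParsingTask overrides characters() and aborts with a shared STOP exception.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data up to a maximum length. */
class TextCollector extends DefaultHandler {

    // Shared sentinel thrown to abort parsing once the limit is reached.
    private static final SAXException STOP =
            new SAXException("max length reached");

    private final StringBuilder builder = new StringBuilder();
    private final int maxLength;

    TextCollector(int maxLength) {
        this.maxLength = maxLength;
    }

    // Only this callback is overridden; DefaultHandler supplies the rest.
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        builder.append(ch, start,
                Math.min(length, maxLength - builder.length()));
        if (builder.length() >= maxLength) {
            throw STOP;
        }
    }

    /** Parses the XML string and returns at most maxLength characters of text. */
    static String extract(String xml, int maxLength) {
        TextCollector collector = new TextCollector(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), collector);
        } catch (SAXException e) {
            if (e != STOP) { // STOP is the expected way to end early
                throw new IllegalStateException(e);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return collector.builder.toString();
    }
}
```

    With this sketch, TextCollector.extract("<a>hello <b>world</b></a>", 100) returns "hello world", and the same call with a limit of 5 returns "hello".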

    At the end of run(), the task publishes the accumulated text by calling setExtractedText(builder.toString()).
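
    The hand-off between the background task and a caller blocked in stringValue() can be reduced to a few lines. This is a rough sketch using only the JDK; LazyText and its Supplier argument are hypothetical stand-ins for the field and for the Tika parse call. Unlike the class above, the sketch calls notifyAll() and restores the interrupt flag, which are choices made here and not what the Jackrabbit code does.

```java
import java.util.concurrent.Executor;
import java.util.function.Supplier;

/** Hypothetical holder for a value that a background task computes. */
class LazyText {

    // Stays null until the background task has finished.
    private volatile String text = null;

    LazyText(Executor executor, Supplier<String> extractor) {
        executor.execute(() -> {
            String result;
            try {
                result = extractor.get();
            } catch (RuntimeException e) {
                result = ""; // publish an empty result so readers never hang
            }
            setText(result);
        });
    }

    boolean isFinished() {
        return text != null;
    }

    /** Blocks until the background task has published its result. */
    synchronized String get() {
        try {
            while (!isFinished()) {
                wait(); // releases the monitor until setText() notifies
            }
            return text;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
    }

    private synchronized void setText(String value) {
        text = value;
        notifyAll();
    }
}
```

    A caller would create the holder with something like new LazyText(Executors.newSingleThreadExecutor(), () -> "extracted") and later call get(), which returns as soon as the task has stored its result.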

    Note that the parser object here is not a class taken directly from Apache Tika. Jackrabbit wraps Tika in its own JackrabbitParser class.

    The source code of JackrabbitParser is as follows:

    /**
     * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
     * for all parsing requests, but sets it up with Jackrabbit-specific
     * configuration and implements backwards compatibility support for old
     * <code>textExtractorClasses</code> configurations.
     *
     * @since Apache Jackrabbit 2.0
     */
    class JackrabbitParser implements Parser {
    
        /**
         * Logger instance.
         */
        private static final Logger logger =
            LoggerFactory.getLogger(JackrabbitParser.class);
    
        /**
         * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
         */
        private static volatile boolean blocked = false;
    
        /**
         * The configured Tika parser.
         */
        private final AutoDetectParser parser;
    
        /**
         * Creates a parser using the default Jackrabbit-specific configuration
         * settings.
         */
        public JackrabbitParser() {
            InputStream stream =
                JackrabbitParser.class.getResourceAsStream("tika-config.xml");
            try {
                if (stream != null) {
                    try {
                        parser = new AutoDetectParser(new TikaConfig(stream));
                    } finally {
                        stream.close();
                    }
                } else {
                    parser = new AutoDetectParser();
                }
            } catch (Exception e) {
                // Should never happen
                throw new RuntimeException(
                        "Unable to load embedded Tika configuration", e);
            }
        }
    
        /**
         * Backwards compatibility method to support old Jackrabbit 1.x
         * <code>textExtractorClasses</code> configurations. Implements a best
         * effort mapping from the old-style text extractor classes to
         * corresponding Tika parsers.
         *
         * @param classes configured list of text extractor classes
         */
        public void setTextFilterClasses(String classes) {
            Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();
    
            StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
            while (tokenizer.hasMoreTokens()) {
                String name = tokenizer.nextToken();
                if (name.equals(
                        "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                    parsers.put(MediaType.text("html"), new HtmlParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-excel"), parser);
                    parsers.put(MediaType.application("msexcel"), parser);
                    parsers.put(MediaType.application("excel"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                    parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                        || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                    parsers.put(MediaType.application("mspowerpoint"), parser);
                    parsers.put(MediaType.application("powerpoint"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-word"), parser);
                    parsers.put(MediaType.application("msword"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-word"), parser); 
                    parsers.put(MediaType.application("msword"), parser);
                    parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                    parsers.put(MediaType.application("mspowerpoint"), parser);
                    parsers.put(MediaType.application("vnd.ms-excel"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                    Parser parser = new OpenDocumentParser();
                    parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                    parsers.put(MediaType.application("pdf"), new PDFParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                    parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                    Parser parser = new ImageParser();
                    parsers.put(MediaType.image("png"), parser);
                    parsers.put(MediaType.image("apng"), parser);
                    parsers.put(MediaType.image("mng"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                    Parser parser = new RTFParser();
                    parsers.put(MediaType.application("rtf"), parser);
                    parsers.put(MediaType.text("rtf"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                    Parser parser = new XMLParser();
                    parsers.put(MediaType.APPLICATION_XML, parser);
                    parsers.put(MediaType.text("xml"), parser);
                } else {
                    logger.warn("Ignoring unknown text extractor class: {}", name);
                }
            }
    
            parser.setParsers(parsers);
        }
    
        /**
         * Delegates the call to the configured {@link AutoDetectParser}.
         */
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return parser.getSupportedTypes(context);
        }
    
        /**
         * Delegates the call to the configured {@link AutoDetectParser}.
         */
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            waitIfBlocked();
            parser.parse(stream, handler, metadata, context);
        }
    
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            parse(stream, handler, metadata, new ParseContext());
        }
    
        /**
         * Waits until text extraction is no longer blocked. The block is only
         * ever activated in the Jackrabbit test suite when testing delayed
         * text extraction.
         *
         * @throws TikaException if the block was interrupted
         */
        private synchronized static void waitIfBlocked() throws TikaException {
            try {
                while (blocked) {
                    JackrabbitParser.class.wait();
                }
            } catch (InterruptedException e) {
                throw new TikaException("Text extraction block interrupted", e);
            }
        }
    
        /**
         * Blocks all text extraction tasks.
         */
        static synchronized void block() {
            blocked = true;
        }
    
        /**
         * Unblocks all text extraction tasks.
         */
        static synchronized void unblock() {
            blocked = false;
            JackrabbitParser.class.notifyAll();
        }
    
    }

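    The actual parsing is delegated to AutoDetectParser. As described in my earlier Apache Tika source code study, AutoDetectParser extends CompositeParser, and CompositeParser does its work by calling the collection of Parser instances it holds. This is the composite pattern (the top-down, "safe" variant of the composite pattern).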
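
    To make the "safe" composite idea concrete, here is a simplified sketch that does not use Tika. The names TextParser and CompositeTextParser are hypothetical, and keying the children by a media type string is an assumption made for illustration, not Tika's actual API. The point is that the child-management method (register) exists only on the composite, while the component interface exposes nothing but parse.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical component interface: leaves and composites both implement it. */
interface TextParser {
    String parse(String mediaType, String content);
}

/** Hypothetical composite: picks a child parser by media type and delegates. */
class CompositeTextParser implements TextParser {

    private final Map<String, TextParser> parsers = new HashMap<>();
    private final TextParser fallback;

    CompositeTextParser(TextParser fallback) {
        this.fallback = fallback;
    }

    // Child management lives only on the composite ("safe" variant).
    void register(String mediaType, TextParser parser) {
        parsers.put(mediaType, parser);
    }

    @Override
    public String parse(String mediaType, String content) {
        // Delegate to the matching child, or to the fallback if none matches.
        return parsers.getOrDefault(mediaType, fallback)
                .parse(mediaType, content);
    }
}
```

    For example, after registering (type, s) -> s.trim() for "text/plain", a call to parse("text/plain", "  hi ") is routed to that child, while an unregistered type goes to the fallback parser passed to the constructor.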

    ---------------------------------------------------------------------------

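    This Apache Jackrabbit source code study series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html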

  • Original article: https://www.cnblogs.com/chenying99/p/2997156.html