  • Apache Jackrabbit Source Code Study (1)

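    A few years ago an expert wrote the series "深入浅出 jackrabbit" (Jackrabbit in Depth), available at http://ahuaxuan.iteye.com/category/65829

    I learned a great deal from reading it (without its help, I would probably have spent much longer working out how Jackrabbit fits together). That series is now quite old, though: it covered Jackrabbit 1.7, which differs somewhat from the latest release. So, shallow as my own understanding may be, I could not resist writing down my reading of the Apache Jackrabbit source code, in the hope of deepening my own understanding of programming and perhaps helping those who come after.

    (Note: this article may still be under revision. Reposting it now may mislead both you and your readers.)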

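    In the current version, Jackrabbit extracts text from rich documents through Apache Tika, which differs from earlier versions.

    The feature is implemented mainly by the LazyTextExtractorField class, which extends Lucene's abstract class AbstractField.

    The source of LazyTextExtractorField is as follows: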

    /**
     * <code>LazyTextExtractorField</code> implements a Lucene field with a String
     * value that is lazily initialized from a given {@link Reader}. In addition
     * this class provides a method to find out whether the purpose of the reader
     * is to extract text and whether the extraction process is already finished.
     *
     * @see #isExtractorFinished()
     */
    public class LazyTextExtractorField extends AbstractField {
    
        /**
         * The logger instance for this class.
         */
        private static final Logger log =
            LoggerFactory.getLogger(LazyTextExtractorField.class);
    
        /**
         * The exception used to forcibly terminate the extraction process
         * when the maximum field length is reached.
         */
        private static final SAXException STOP =
            new SAXException("max field length reached");
    
        /**
         * The extracted text content of the given binary value.
         * Set to non-null when the text extraction task finishes.
         */
        private volatile String extract = null;
    
        /**
         * Creates a new <code>LazyTextExtractorField</code> with the given
         * <code>name</code>.
         *
         * @param name the name of the field.
         * @param reader the reader where to obtain the string from.
         * @param highlighting set to <code>true</code> to
         *                     enable result highlighting support
         */
        public LazyTextExtractorField(
                Parser parser, InternalValue value, Metadata metadata,
                Executor executor, boolean highlighting, int maxFieldLength) {
            super(FieldNames.FULLTEXT,
                    highlighting ? Store.YES : Store.NO,
                    Field.Index.ANALYZED,
                    highlighting ? TermVector.WITH_OFFSETS : TermVector.NO);
            executor.execute(
                    new ParsingTask(parser, value, metadata, maxFieldLength));
        }
    
        /**
         * Returns the extracted text. This method blocks until the text
         * extraction task has been completed.
         *
         * @return the string value of this field
         */
        public synchronized String stringValue() {
            try {
                while (!isExtractorFinished()) {
                    wait();
                }
                return extract;
            } catch (InterruptedException e) {
                log.error("Text extraction thread was interrupted", e);
                return "";
            }
        }
    
        /**
         * @return always <code>null</code>
         */
        public Reader readerValue() {
            return null;
        }
    
        /**
         * @return always <code>null</code>
         */
        public byte[] binaryValue() {
            return null;
        }
    
        /**
         * @return always <code>null</code>
         */
        public TokenStream tokenStreamValue() {
            return null;
        }
    
        /**
         * Checks whether the text extraction task has finished.
         *
         * @return <code>true</code> if the extracted text is available
         */
        public boolean isExtractorFinished() {
            return extract != null;
        }
    
        private synchronized void setExtractedText(String value) {
            extract = value;
            notify();
        }
    
        /**
         * Releases all resources associated with this field.
         */
        public void dispose() {
            // TODO: Cause the ContentHandler below to throw an exception
        }
    
        /**
         * The background task for extracting text from a binary value.
         */
        private class ParsingTask extends DefaultHandler implements Runnable {
    
            private final Parser parser;
    
            private final InternalValue value;
    
            private final Metadata metadata;
    
            private final int maxFieldLength;
    
            private final StringBuilder builder = new StringBuilder();
    
            public ParsingTask(
                    Parser parser, InternalValue value, Metadata metadata,
                    int maxFieldLength) {
                this.parser = parser;
                this.value = value;
                this.metadata = metadata;
                this.maxFieldLength = maxFieldLength;
            }
    
            public void run() {
                try {
                    InputStream stream = value.getStream();
                    try {
                        parser.parse(stream, this, metadata, new ParseContext());
                    } finally {
                        stream.close();
                    }
                } catch (Throwable t) {
                    if (t != STOP) {
                        log.warn("Failed to extract text from a binary property", t);
                    }
                } finally {
                    value.discard();
                }
                setExtractedText(builder.toString());
            }
    
            @Override
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                builder.append(
                        ch, start,
                        Math.min(length, maxFieldLength - builder.length()));
                if (builder.length() >= maxFieldLength) {
                    throw STOP;
                }
            }
    
            @Override
            public void ignorableWhitespace(char[] ch, int start, int length)
                    throws SAXException {
                characters(ch, start, length);
            }
    
        }
    
    }

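    As the code shows, text extraction for rich documents is handled by the inner class ParsingTask, a Runnable that the constructor hands to an Executor, so extraction runs asynchronously. A caller of stringValue() blocks until the task has finished.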
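
    The hand-off between the background task and the blocking reader can be tried in isolation. The following is a minimal sketch, not Jackrabbit code: LazyValueDemo and its upper-casing "extraction" are hypothetical stand-ins, and it calls notifyAll() where the original calls notify(), so that several waiting readers would all be woken.

```java
import java.util.Locale;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for LazyTextExtractorField: a value computed in the background. */
public class LazyValueDemo {

    /** Set to non-null once the background task has finished. */
    private volatile String extract = null;

    public LazyValueDemo(Executor executor, final String input) {
        // Hand the slow work to the executor; the constructor returns immediately.
        executor.execute(new Runnable() {
            public void run() {
                // Placeholder for the real parsing work.
                setExtractedText(input.toUpperCase(Locale.ROOT));
            }
        });
    }

    public boolean isFinished() {
        return extract != null;
    }

    /** Blocks the caller until the background task has published its result. */
    public synchronized String stringValue() {
        try {
            while (!isFinished()) {
                wait();
            }
            return extract;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            return "";
        }
    }

    private synchronized void setExtractedText(String value) {
        extract = value;
        notifyAll(); // wake every thread blocked in stringValue()
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            LazyValueDemo field = new LazyValueDemo(executor, "hello");
            System.out.println(field.stringValue());
        } finally {
            executor.shutdown();
        }
    }
}
```

    The loop around wait() matters: a thread may wake up without the result being ready, so the condition is rechecked every time.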

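    The task class also extends DefaultHandler, which implements four interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler. This is the default adapter pattern: DefaultHandler supplies empty implementations of every callback, so a subclass only has to override the methods of the target interface that it actually needs.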
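
    As a small illustration of the default adapter idea, the hypothetical ElementCounter below extends DefaultHandler and overrides only startElement; every other callback of the four interfaces keeps its inherited no-op behaviour. It uses the JDK's built-in SAX parser rather than Tika, an assumption made to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that cares about a single SAX event. */
public class ElementCounter extends DefaultHandler {

    private int count = 0;

    // The only callback we override; all others are inherited no-ops.
    @Override
    public void startElement(
            String uri, String localName, String qName, Attributes attributes) {
        count++;
    }

    /** Parses the given XML string and returns the number of start tags seen. */
    public static int countElements(String xml) {
        try {
            ElementCounter handler = new ElementCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```

    Implementing ContentHandler directly would force the class to spell out around a dozen methods, most of them empty.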

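    Under the JAXP specification, SAX parsing of XML is event-driven: the parser reports what it reads by calling listener methods. The most important of the interfaces above is ContentHandler, which ParsingTask implements indirectly through DefaultHandler. Each chunk of text it receives is appended to the builder field (private final StringBuilder builder = new StringBuilder()), and once maxFieldLength is reached it throws the shared STOP exception to abort parsing.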
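
    The accumulate-and-stop logic of ParsingTask can be sketched outside Jackrabbit as follows. BoundedTextHandler is a hypothetical name, and the input is parsed with the JDK SAX parser instead of Tika. The sketch also records a flag before throwing, rather than comparing the caught exception with STOP, since a parser is free to wrap exceptions thrown by a handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects character data up to a fixed limit. */
public class BoundedTextHandler extends DefaultHandler {

    /** Shared sentinel thrown to abort parsing once enough text is collected. */
    private static final SAXException STOP =
            new SAXException("max field length reached");

    private final int maxLength;
    private final StringBuilder builder = new StringBuilder();
    private boolean limitReached = false;

    public BoundedTextHandler(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Append only as many characters as still fit under the limit.
        builder.append(
                ch, start, Math.min(length, maxLength - builder.length()));
        if (builder.length() >= maxLength) {
            limitReached = true;
            throw STOP;
        }
    }

    /** Returns the text content of the XML, cut off at maxLength characters. */
    public static String extract(String xml, int maxLength) {
        BoundedTextHandler handler = new BoundedTextHandler(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Reaching the limit is the expected way out; anything else is an error.
            if (!handler.limitReached) {
                throw new IllegalStateException("Unable to parse XML", e);
            }
        }
        return handler.builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<doc>hello <b>world</b> again</doc>", 8));
    }
}
```

    Throwing from the callback is the only way a SAX handler can make the parser stop early, which is why an exception is used for what is not really an error.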

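    At the end of run(), the task publishes the collected text by calling setExtractedText(builder.toString()).

    Note that the parser object used here is not one of Apache Tika's own parser classes. Jackrabbit wraps Tika in its own JackrabbitParser class, which implements Tika's Parser interface.

    The source of JackrabbitParser is as follows: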

    /**
     * Jackrabbit wrapper for Tika parsers. Uses a Tika {@link AutoDetectParser}
     * for all parsing requests, but sets it up with Jackrabbit-specific
     * configuration and implements backwards compatibility support for old
     * <code>textExtractorClasses</code> configurations.
     *
     * @since Apache Jackrabbit 2.0
     */
    class JackrabbitParser implements Parser {
    
        /**
         * Logger instance.
         */
        private static final Logger logger =
            LoggerFactory.getLogger(JackrabbitParser.class);
    
        /**
         * Flag for blocking all text extraction. Used by the Jackrabbit test suite.
         */
        private static volatile boolean blocked = false;
    
        /**
         * The configured Tika parser.
         */
        private final AutoDetectParser parser;
    
        /**
         * Creates a parser using the default Jackrabbit-specific configuration
         * settings.
         */
        public JackrabbitParser() {
            InputStream stream =
                JackrabbitParser.class.getResourceAsStream("tika-config.xml");
            try {
                if (stream != null) {
                    try {
                        parser = new AutoDetectParser(new TikaConfig(stream));
                    } finally {
                        stream.close();
                    }
                } else {
                    parser = new AutoDetectParser();
                }
            } catch (Exception e) {
                // Should never happen
                throw new RuntimeException(
                        "Unable to load embedded Tika configuration", e);
            }
        }
    
        /**
         * Backwards compatibility method to support old Jackrabbit 1.x
         * <code>textExtractorClasses</code> configurations. Implements a best
         * effort mapping from the old-style text extractor classes to
         * corresponding Tika parsers.
         *
         * @param classes configured list of text extractor classes
         */
        public void setTextFilterClasses(String classes) {
            Map<MediaType, Parser> parsers = new HashMap<MediaType, Parser>();
    
            StringTokenizer tokenizer = new StringTokenizer(classes, ", \t\n\r\f");
            while (tokenizer.hasMoreTokens()) {
                String name = tokenizer.nextToken();
                if (name.equals(
                        "org.apache.jackrabbit.extractor.HTMLTextExtractor")) {
                    parsers.put(MediaType.text("html"), new HtmlParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.MsExcelTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-excel"), parser);
                    parsers.put(MediaType.application("msexcel"), parser);
                    parsers.put(MediaType.application("excel"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsOutlookTextExtractor")) {
                    parsers.put(MediaType.application("vnd.ms-outlook"), new OfficeParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.MsPowerPointExtractor")
                        || name.equals("org.apache.jackrabbit.extractor.MsPowerPointTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                    parsers.put(MediaType.application("mspowerpoint"), parser);
                    parsers.put(MediaType.application("powerpoint"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsWordTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-word"), parser);
                    parsers.put(MediaType.application("msword"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.MsTextExtractor")) {
                    Parser parser = new OfficeParser();
                    parsers.put(MediaType.application("vnd.ms-word"), parser); 
                    parsers.put(MediaType.application("msword"), parser);
                    parsers.put(MediaType.application("vnd.ms-powerpoint"), parser);
                    parsers.put(MediaType.application("mspowerpoint"), parser);
                    parsers.put(MediaType.application("vnd.ms-excel"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), parser);
                    parsers.put(MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.OpenOfficeTextExtractor")) {
                    Parser parser = new OpenDocumentParser();
                    parsers.put(MediaType.application("vnd.oasis.opendocument.database"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.formula"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.graphics"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.presentation"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.spreadsheet"), parser);
                    parsers.put(MediaType.application("vnd.oasis.opendocument.text"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.calc"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.draw"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.impress"), parser);
                    parsers.put(MediaType.application("vnd.sun.xml.writer"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.PdfTextExtractor")) {
                    parsers.put(MediaType.application("pdf"), new PDFParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.PlainTextExtractor")) {
                    parsers.put(MediaType.TEXT_PLAIN, new TXTParser());
                } else if (name.equals("org.apache.jackrabbit.extractor.PngTextExtractor")) {
                    Parser parser = new ImageParser();
                    parsers.put(MediaType.image("png"), parser);
                    parsers.put(MediaType.image("apng"), parser);
                    parsers.put(MediaType.image("mng"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.RTFTextExtractor")) {
                    Parser parser = new RTFParser();
                    parsers.put(MediaType.application("rtf"), parser);
                    parsers.put(MediaType.text("rtf"), parser);
                } else if (name.equals("org.apache.jackrabbit.extractor.XMLTextExtractor")) {
                    Parser parser = new XMLParser();
                    parsers.put(MediaType.APPLICATION_XML, parser);
                    parsers.put(MediaType.text("xml"), parser);
                } else {
                    logger.warn("Ignoring unknown text extractor class: {}", name);
                }
            }
    
            parser.setParsers(parsers);
        }
    
        /**
         * Delegates the call to the configured {@link AutoDetectParser}.
         */
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return parser.getSupportedTypes(context);
        }
    
        /**
         * Delegates the call to the configured {@link AutoDetectParser}.
         */
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            waitIfBlocked();
            parser.parse(stream, handler, metadata, context);
        }
    
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            parse(stream, handler, metadata, new ParseContext());
        }
    
        /**
         * Waits until text extraction is no longer blocked. The block is only
         * ever activated in the Jackrabbit test suite when testing delayed
         * text extraction.
         *
         * @throws TikaException if the block was interrupted
         */
        private synchronized static void waitIfBlocked() throws TikaException {
            try {
                while (blocked) {
                    JackrabbitParser.class.wait();
                }
            } catch (InterruptedException e) {
                throw new TikaException("Text extraction block interrupted", e);
            }
        }
    
        /**
         * Blocks all text extraction tasks.
         */
        static synchronized void block() {
            blocked = true;
        }
    
        /**
         * Unblocks all text extraction tasks.
         */
        static synchronized void unblock() {
            blocked = false;
            JackrabbitParser.class.notifyAll();
        }
    
    }
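
    The block()/unblock() gate in the class above is independent of Tika and can be tried on its own. The sketch below is a simplified illustration under assumptions: GateDemo, its parse method (which only trims a string) and runGated are hypothetical names, and only the three gate methods mirror the Jackrabbit code.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo of a class-level gate built on wait() and notifyAll(). */
public class GateDemo {

    private static boolean blocked = false;

    static synchronized void block() {
        blocked = true;
    }

    static synchronized void unblock() {
        blocked = false;
        GateDemo.class.notifyAll(); // release every thread waiting at the gate
    }

    private static synchronized void waitIfBlocked()
            throws InterruptedException {
        while (blocked) {
            GateDemo.class.wait();
        }
    }

    /** Hypothetical guarded operation; stands in for the delegated parse call. */
    static String parse(String input) throws InterruptedException {
        waitIfBlocked();
        return input.trim();
    }

    /** Closes the gate, starts a worker, opens the gate, returns the worker's result. */
    public static String runGated(String input) {
        AtomicReference<String> result = new AtomicReference<>();
        block();
        Thread worker = new Thread(() -> {
            try {
                result.set(parse(input));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        // The gate is still closed, so the worker cannot have produced a result.
        if (result.get() != null) {
            throw new IllegalStateException("worker passed a closed gate");
        }
        unblock();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(runGated("  text  "));
    }
}
```

    Because the methods are static and synchronized, they all lock on the class object, which is why wait() and notifyAll() are called on GateDemo.class.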

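    The actual parsing is delegated to the AutoDetectParser class. If you have read my earlier Apache Tika source code study, you will know that AutoDetectParser extends CompositeParser, and that CompositeParser does its work by calling on the collection of Parser instances it holds. This is the composite pattern (the top-down, "safe" variant of composite).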
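
    To make the composite idea concrete, here is a minimal sketch. It is not Tika's CompositeParser: the TextParser interface, CompositeTextParser and the three toy parsers are all hypothetical, and the media type is passed as a plain string. "Safe" here means that the child-management method exists only on the composite, not on the component interface.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a composite that dispatches by media type. */
public class CompositeSketch {

    /** Component: a much smaller stand-in for a parser interface. */
    public interface TextParser {
        String parse(String mediaType, String content);
    }

    /** Composite: itself a TextParser, delegating to its children. */
    public static class CompositeTextParser implements TextParser {

        private final Map<String, TextParser> parsers = new HashMap<>();
        private final TextParser fallback;

        public CompositeTextParser(TextParser fallback) {
            this.fallback = fallback;
        }

        // Child management lives only on the composite ("safe" variant).
        public void register(String mediaType, TextParser parser) {
            parsers.put(mediaType, parser);
        }

        @Override
        public String parse(String mediaType, String content) {
            TextParser parser = parsers.get(mediaType);
            return (parser != null ? parser : fallback)
                    .parse(mediaType, content);
        }
    }

    /** Builds a small composite and runs one parse request through it. */
    public static String extract(String mediaType, String content) {
        // Unknown types fall back to a parser that extracts nothing.
        CompositeTextParser composite =
                new CompositeTextParser((type, text) -> "");
        composite.register("text/plain", (type, text) -> text);
        composite.register("text/html",
                (type, text) -> text.replaceAll("<[^>]*>", ""));
        return composite.parse(mediaType, content);
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html", "<p>hi</p>"));
    }
}
```

    Callers see a single TextParser and never need to know how many concrete parsers sit behind it, which is the same position LazyTextExtractorField is in when it receives a Parser.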

    ---------------------------------------------------------------------------

    This Apache Jackrabbit source code study series is my original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Article link: http://www.cnblogs.com/chenying99/archive/2013/04/03/2997156.html

  • Original article: https://www.cnblogs.com/chenying99/p/2997156.html