  • Language detection libraries

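    5.3.3 Detecting the language of a document
    If you do not know the language of a document or query (which is often the case), language
    detection software can identify it with reasonable accuracy. In Java, several language
    detection libraries are available, for example:
    • Apache Tika (http://tika.apache.org/)
    • Language detection (http://code.google.com/p/language-detection/)
    The Language detection library claims to support 53 languages with 99% accuracy, which is
    quite a lot.
    Keep in mind that the longer the text, the more accurate the detection. Because query text
    is usually short, expect some errors when identifying the language of a query.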
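
    Why short queries are harder to classify can be seen in a toy sketch. The class below is
    hypothetical and is not how Apache Tika or the Language detection library work internally;
    it simply counts hits against tiny hand-picked stopword lists (an assumption made for
    illustration) and returns "unknown" when the text carries no evidence at all.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ToyLanguageGuesser {
    // Tiny, hand-picked stopword lists; purely illustrative, not a real language model.
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("en", Set.of("the", "and", "is", "of", "to", "in"));
        STOPWORDS.put("de", Set.of("der", "die", "und", "ist", "das", "nicht"));
        STOPWORDS.put("fr", Set.of("le", "la", "et", "est", "les", "des"));
    }

    /** Returns the language code with the most stopword hits, or "unknown" if there are none. */
    public static String guess(String text) {
        // Split on anything that is not a letter, after lowercasing.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // A full sentence contains several function words, so there is evidence to count.
        System.out.println(guess("The parser is part of the toolkit and is easy to use"));
        System.out.println(guess("Der Hund ist nicht im Haus und die Katze ist das auch nicht"));
        // A typical two-word query contains none, so the guess falls back to "unknown".
        System.out.println(guess("tika pdf"));
    }
}
```

    Real detectors rely on much richer statistics than this, but the same effect applies: a
    two-word query gives the detector very little to work with.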

    Apache Tika - a content analysis toolkit

    The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page. Please see the Getting Started page for more information on how to start using Tika.

    The Parser and Detector pages describe the main interfaces of Tika and how they work.

    If you're interested in contributing to Tika, please see the Contributing page or send an email to the Tika development list.

    Tika is a project of the Apache Software Foundation, and was formerly a subproject of Apache Lucene.

    The Parser interface

    The org.apache.tika.parser.Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:

    void parse(
        InputStream stream, ContentHandler handler, Metadata metadata,
        ParseContext context) throws IOException, SAXException, TikaException;

    The parse method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The parse context argument is used to specify context information (like the current locale) that is not related to any individual document.
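
    To make "XHTML SAX events" more concrete, here is a minimal sketch of the receiving side: a
    ContentHandler that collects text and counts paragraphs. It uses only the JDK. The JDK's own
    SAX parser, fed a hard-coded XHTML string, stands in for a Tika parser as the event source;
    the class name XhtmlTextCollector and the sample markup are assumptions made for
    illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int paragraphs = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Count each opening <p> element (the default factory is not namespace-aware,
        // so the element name arrives in qName).
        if ("p".equals(qName)) {
            paragraphs++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data may arrive in several chunks; append every chunk.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Separate paragraphs with a newline in the collected text.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    public String getText() {
        return text.toString().trim();
    }

    public int getParagraphCount() {
        return paragraphs;
    }

    /** Feeds an XHTML string through the JDK SAX parser into a new collector. */
    public static XhtmlTextCollector collect(String xhtml) {
        XhtmlTextCollector handler = new XhtmlTextCollector();
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        XhtmlTextCollector h = collect("<html><body><p>Hello</p><p>Tika</p></body></html>");
        System.out.println(h.getParagraphCount() + " paragraph(s)");
        System.out.println(h.getText());
    }
}
```

    Since the handler parameter of parse is a SAX ContentHandler and DefaultHandler implements
    that interface, a handler of this shape could in principle be passed to a Tika parser in
    place of the stand-in event source used here.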

  • Original article: https://www.cnblogs.com/rsapaper/p/9848968.html