  • The Stanford NLP (Natural Language Processing) Group


    Stanford Word Segmenter



    Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
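
    To make the English case concrete, here is a minimal sketch of punctuation splitting and possessive separation. It is an illustration only, not the tokenizer shipped with any Stanford tool; the class name SimpleEnglishTokenizer and its regular expression are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Stanford Word Segmenter.
public class SimpleEnglishTokenizer {
    // Alternatives, tried in order at each position:
    //   1. a possessive clitic ('s) that directly follows a letter
    //   2. a run of letters or digits
    //   3. any other single non-space character (punctuation)
    private static final Pattern TOKEN =
        Pattern.compile("(?<=\\p{L})'s\\b|[\\p{L}\\p{N}]+|\\S");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One token per line: the possessive and each punctuation mark
        // come out as separate tokens.
        for (String token : tokenize("The parser's output, however, was wrong.")) {
            System.out.println(token);
        }
    }
}
```

    A whitespace-and-punctuation rule like this is not enough for languages such as Arabic or Chinese, which is why they need a trained segmenter.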

    The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.

    The system requires Java 1.6+ to be installed. We recommend at least 1 GB of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), you can lower the memory limit by changing the java -mx1g option in the run scripts.
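
    If you are unsure which heap limit a given java invocation actually applies, a small check like the following can help. It is a sketch for illustration; the class name HeapCheck is hypothetical and not part of the segmenter distribution.

```java
// Hypothetical helper for illustration; not part of the Stanford Word Segmenter.
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in mebibytes. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

    Compile it with javac and run it with the same memory option as the run scripts (for example, java -mx1g HeapCheck) to see the limit that option yields on your JVM. The reported value may be somewhat below the requested size.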

    Arabic

    Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis.

    The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in:

    Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL.

    Chinese

    Chinese is normally written without spaces between words (as are some
    other languages). This software splits Chinese text into a sequence
    of words, defined according to a particular word segmentation standard.
    It is a Java implementation of the CRF-based Chinese Word Segmenter
    described in:

    Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.
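
    Sequence-labeling segmenters are commonly framed as deciding, for each character, whether it starts a new word. The sketch below shows only the final decoding step under that assumption: turning per-character boundary labels into words. The class BoundaryDecoder and the hand-written labels are hypothetical; the real model predicts its labels from learned features that are not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Stanford Word Segmenter.
public class BoundaryDecoder {
    /**
     * Rebuilds words from per-character labels.
     * startsWord[i] is true when character i begins a new word.
     */
    public static List<String> decode(String text, boolean[] startsWord) {
        if (text.length() != startsWord.length) {
            throw new IllegalArgumentException("one label per character is required");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // Close the word in progress when a new one starts.
            if (startsWord[i] && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Four characters, written as Unicode escapes: "I love Beijing".
        String text = "\u6211\u7231\u5317\u4EAC";
        // Hand-written labels: the last two characters form one word.
        boolean[] labels = {true, true, true, false};
        System.out.println(String.join(" ", decode(text, labels)));
    }
}
```

    For the input 我爱北京 with these labels, the decoder groups the characters as 我 / 爱 / 北京. Different segmentation standards correspond to different gold labels for the same text.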

    Two models with two different segmentation standards are included: the Chinese Penn Treebank standard and the Peking University standard.

    On May 21, 2008, we released a version that makes use of lexicon
    features. With external lexicon features, the segmenter segments more
    consistently and also achieves a higher F-measure when trained and tested
    on the bakeoff data. This version is close to the CRF-Lex segmenter described in:

    Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.
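
    For readers unfamiliar with the metric, the following sketch computes a word-level F-measure for one sentence by comparing character spans of the gold and predicted words. It is a simplified illustration, not the official bakeoff scoring script; the class name SegmentationF1 is hypothetical, and both segmentations are assumed to cover the same underlying character string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not the bakeoff scorer.
public class SegmentationF1 {
    // Converts a word sequence into a set of "start:end" character spans.
    private static Set<String> spans(List<String> words) {
        Set<String> out = new HashSet<>();
        int pos = 0;
        for (String w : words) {
            out.add(pos + ":" + (pos + w.length()));
            pos += w.length();
        }
        return out;
    }

    /** Harmonic mean of word-level precision and recall. */
    public static double f1(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predSpans = spans(predicted);
        Set<String> correct = new HashSet<>(goldSpans);
        correct.retainAll(predSpans); // a word counts only if both boundaries match
        if (correct.isEmpty()) {
            return 0.0;
        }
        double precision = (double) correct.size() / predSpans.size();
        double recall = (double) correct.size() / goldSpans.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("ab", "c", "de");
        List<String> predicted = Arrays.asList("ab", "cde");
        // One of two predicted words is correct (precision 1/2),
        // and one of three gold words is found (recall 1/3).
        System.out.println(f1(gold, predicted));
    }
}
```

    In this example only the first word matches in both segmentations, so precision is 1/2, recall is 1/3, and the F-measure is 0.4.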

    The older version (2006-05-11), which does not use external lexicon
    features, remains available for download, but we recommend using the
    latest version.

    Another new feature of the latest release is that the segmenter can now output k-best segmentations.
    An example of how to train the segmenter is now also available.

    Download

    The segmenter is available for download, licensed under the GNU
    General Public License (v2 or later). Source is included.
    The package includes components for command-line invocation and a Java API.
    The segmenter code is dual licensed (in a similar manner to MySQL, etc.).
    Open source licensing is under the full GPL, which allows many free uses.
    For distributors of proprietary software, commercial licensing with a
    ready-to-sign agreement is available.
    If you don't need a commercial license, but would like to support
    maintenance of these tools, we welcome gift funding.

    The download is a compressed archive consisting of
    model files, compiled code, and source files. Once you unpack the archive,
    you should have everything needed. Simple scripts are included to
    invoke the segmenter.

    Download Stanford Word Segmenter version 2012-11-11

    Mailing Lists

    We have 3 mailing lists for the Stanford Word Segmenter, all of which are shared
    with other JavaNLP tools (with the exception of the parser). Each address is
    at @lists.stanford.edu:

    1. java-nlp-user: This is the best list to post to in order
      to ask questions, make announcements, or hold discussions among JavaNLP
      users. You have to subscribe to be able to use it.
      Join the list via this webpage or by emailing
      java-nlp-user-join@lists.stanford.edu. (Leave the
      subject and message body empty.) You can also
      look at the list archives.
    2. java-nlp-announce: This list is used only to announce
      new versions of Stanford JavaNLP tools, so it is very low volume (expect 1-3
      messages a year). Join the list via this webpage or by emailing
      java-nlp-announce-join@lists.stanford.edu. (Leave the
      subject and message body empty.)
    3. java-nlp-support: This list goes only to the software
      maintainers. It's a good address for licensing questions, etc. For
      general use and support questions, please join and use
      java-nlp-user.

      You cannot join java-nlp-support, but you can mail questions to
      java-nlp-support@lists.stanford.edu.
  • Original source: https://www.cnblogs.com/lexus/p/2778392.html