  • SENNA

    SENNA is software distributed under a non-commercial license. It outputs a host of Natural Language Processing (NLP) predictions: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), semantic role labeling (SRL) and syntactic parsing (PSG).

    SENNA is fast because it uses a simple architecture, self-contained because it does not rely on the output of existing NLP systems, and accurate because it offers state-of-the-art or near state-of-the-art performance.

    SENNA is written in ANSI C, with about 3500 lines of code. It requires about 200MB of RAM and should run on any IEEE floating point computer.

    Proceed to the download page. Read the compilation section if you want to compile SENNA yourself. Try out the sanity check, and read about the usage.

    New in SENNA v3.0 (August 2011)

    Here are the main changes compared to SENNA v2.0:

    • Syntactic parsing.
    • We now include our original word embeddings, used to train each task.
    • Bug fix: tokens made of numbers are now output correctly (instead of the numbers being replaced by "0").
    • Option -offsettags, which outputs start/end offsets (in the sentence) of each token.

    DISCLAIMER: Our word embeddings differ from Joseph Turian's embeddings (even though it is unfortunate they have been called "Collobert & Weston embeddings" in several papers). Our embeddings were trained for about 2 months on Wikipedia.

    Details

    Details of SENNA's POS, CHK, NER and SRL tasks are given in a JMLR paper. The techniques were later extended and applied to syntactic parsing (PSG), and published in an AISTATS paper. If you use SENNA, please consider citing the appropriate papers.

    R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu and P. Kuksa. Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research (JMLR), 2011.

    R. Collobert. Deep Learning for Efficient Discriminative Parsing, in International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

    Download

    SENNA sources are provided so you can adapt SENNA to your needs (under our license constraints), and assess its simplicity.

    We also provide binaries for a few platforms, in the same archive. The name of the executable for each platform is as follows:

    • Linux 64 bits: senna-linux64
    • Windows 32 bits: senna-win32.exe
    • Mac OS X Snow Leopard Intel 64 bits: senna-osx

    Everything is included in a single tar-gzipped file (185MB). Proceed to the download page.

    Compilation

    Compiling SENNA is straightforward, as it is written in ANSI C and does not require external libraries. For speed, however, we recommend using the Intel MKL library.

    Linux

    On Linux/Unix/Mac OS X systems, use the gcc compiler:
    gcc -o senna -O3 -ffast-math *.c
    
    You might want to add additional suitable optimization flags for your platform. SENNA also compiles fine with the Intel compiler (icc).

    If speed is critical, we recommend compiling SENNA with the Intel MKL library, which provides a very efficient BLAS. Add the definition USE_MKL_BLAS, as well as the correct MKL libraries and include path.

    gcc -o senna -O3 -ffast-math *.c -DUSE_MKL_BLAS [...]
    

    SENNA also compiles with ATLAS BLAS. On our platform, the handcrafted code compiled with the gcc command line shown above was faster. However, if you want to use it, you can compile it with:

    gcc -o senna -O3 -ffast-math *.c -DUSE_ATLAS_BLAS [...]
    

    Mac OS X

    Assuming you installed the XCode tools (which are provided on the Mac OS X DVD/CDs), simply compile with gcc:
    gcc -o senna -O3 -ffast-math *.c -DUSE_APPLE_BLAS -framework Accelerate
    
    This will compile against the Apple BLAS libraries included in your system. As on Linux, we recommend using the Intel MKL library instead of the Apple BLAS libraries. The following command line can be used (replacing the dots with the correct library and include paths):
    gcc -o senna -O3 -ffast-math *.c -DUSE_MKL_BLAS [...]
    

    Windows

    SENNA compiles fine under Windows. You will have to create a Win32 console project under Microsoft Visual Studio (you can download the Express Edition). Add all the header and C files to the project, and build the solution. You can also use the command line (after opening a Visual Studio Command Prompt) in the following way:
    cl /O2 /Fesenna.exe *.c
    

    We recommend using Intel MKL for speed. See your MKL manual for adding the proper libraries and includes. Also add the preprocessor definition USE_MKL_BLAS to the project. Using the command line, it would be:

    cl /O2 /Fesenna.exe *.c /DUSE_MKL_BLAS [...]
    

    Sanity Check

    Run the following command in a console:
    senna < sanity-test-input.txt > sanity-test-result.txt
    
    SENNA should create a file sanity-test-result.txt which should be identical to the provided sanity-test-output.txt file.
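
    The comparison can be scripted. The following is a minimal sketch in Python, assuming both files sit in the current directory; the helper name files_identical is hypothetical and not part of SENNA.

```python
import filecmp


def files_identical(result_path, expected_path):
    """Return True if the two files have exactly the same contents.

    files_identical is a hypothetical helper, not part of SENNA.
    """
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting the file size and modification time alone.
    return filecmp.cmp(result_path, expected_path, shallow=False)
```

    After running the sanity check, files_identical("sanity-test-result.txt", "sanity-test-output.txt") should return True. The comparison is byte-exact, so a difference in line endings would also count as a mismatch.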

    The file sanity-test-input.txt comes from the CoNLL 2000 chunking test set. SENNA will output all tags for this file. It should run in about 90 seconds on a decent computer (using MKL).

    Usage

    SENNA reads input sentences from the standard input and outputs tags into the standard output. The most likely command line usage for SENNA is therefore:
    senna [options] < input.txt > output.txt
    
    Of course, you can also run SENNA interactively by leaving out the redirections < and >.
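
    Because SENNA only uses standard input and output, it can also be driven from another program. The sketch below is one way to do this from Python; the function names and the default executable path "./senna" are assumptions for illustration, and the options are passed through unchanged.

```python
import subprocess


def build_senna_command(executable="./senna", options=()):
    """Return the argument list for one SENNA run.

    "./senna" is an assumed location; adjust it to your binary.
    """
    return [executable, *options]


def run_senna(text, executable="./senna", options=()):
    """Send `text` to the process on stdin and return its stdout.

    Hypothetical wrapper: one sentence per input line, as SENNA expects.
    """
    completed = subprocess.run(
        build_senna_command(executable, options),
        input=text,           # becomes the child's standard input
        capture_output=True,  # collect stdout and stderr
        text=True,            # exchange str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout
```

    For example, run_senna("He left .\n", options=["-pos"]) would request POS tags only. If the script is not started from the SENNA directory, the -path option described below is needed so that SENNA can find its data/ and hash/ directories.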

    Each input line is treated as one sentence. SENNA has its own tokenizer for separating words, which can be deactivated with the -usrtokens option.

    SENNA outputs one line per token, with all the corresponding tags (in IOBES format) on the same line. An empty line is inserted between output sentences. The first column is the token. By default, tags for all tasks follow (POS, CHK, NER and SRL). The SRL tags are preceded by a column that indicates whether SENNA considered the token an SRL verb ("-" if not). Then there is one column per SRL verb.
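
    The column layout described above can be read back with a few lines of code. The following Python sketch assumes the columns are separated by whitespace and that tokens contain no whitespace; both function names are hypothetical, and SAMPLE is a hand-written illustration of the layout, not real SENNA output.

```python
# Hand-written illustration (token, POS, CHK, NER), NOT real SENNA output.
SAMPLE = (
    "John\tNNP\tS-NP\tS-PER\n"
    "lives\tVBZ\tS-VP\tO\n"
    "in\tIN\tS-PP\tO\n"
    "New\tNNP\tB-NP\tB-LOC\n"
    "York\tNNP\tE-NP\tE-LOC\n"
    "\n"
)


def parse_senna_output(output):
    """Split output into sentences; each sentence is a list of rows,
    and each row is the list of columns for one token."""
    sentences, current = [], []
    for line in output.splitlines():
        columns = line.split()  # assumes whitespace-separated columns
        if columns:
            current.append(columns)
        elif current:           # an empty line closes the sentence
            sentences.append(current)
            current = []
    if current:                 # last sentence without a trailing empty line
        sentences.append(current)
    return sentences


def iobes_spans(tags):
    """Turn IOBES tags into (label, start, end) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                      # single-token span
            spans.append((label, i, i + 1))
        elif prefix == "B":                    # span begins
            start = i
        elif prefix == "E" and start is not None:
            spans.append((label, start, i + 1))
            start = None
        elif prefix == "O":                    # outside any span
            start = None
        # "I" tags continue the open span; nothing to record.
    return spans
```

    With the sample above, the NER column is column index 3 of each row, and passing those tags to iobes_spans groups "New York" into a single LOC span. The column indices shift when options such as -notokentags or -offsettags are used.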

    SENNA supports the following options:

    • -h
      Display inline help.
    • -verbose
      Display model information (on the standard error output, so it does not mess up the tag outputs).
    • -notokentags
      Do not output tokens (first output column).
    • -offsettags
      Output start/end character offsets (in the sentence) for each token.
    • -iobtags
      Output IOB tags instead of IOBES.
    • -brackettags
      Output 'bracket' tags instead of IOBES.
    • -path <path>
      Specify the path to the SENNA data/ and hash/ directories, if you do not run SENNA in its original directory. The path must end with "/".
    • -usrtokens
      Use user's tokens (space separated) instead of SENNA tokenizer.
    • -posvbs
      Use verbs output by the POS tagger instead of SRL-style verbs for the SRL task. You might want to use this because the SRL training task ignores some verbs (many "be" and "have"), which might not be what you want.
    • -usrvbs <file>
      Use user's verbs (given in <file>) instead of SENNA verbs for SRL task. The file must contain one line per token, with an empty line between each sentence. A line which is not a "-" corresponds to a verb.
    • -pos
      -chk
      -ner
      -srl
      -psg

      Instead of outputting tags for all tasks, SENNA will output tags only for the specified task or tasks.
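
    The file format expected by -usrvbs is simple enough to generate programmatically. The sketch below follows the description given for that option: one line per token, "-" for non-verbs, and an empty line between sentences. Writing the token itself on verb lines is an assumption (the text only requires that the line is not "-"), and the function name is hypothetical. The token count must match the tokens SENNA sees, so this is probably easiest to get right together with -usrtokens.

```python
def format_user_verbs(sentences):
    """Build the contents of a -usrvbs file.

    `sentences` is a list of sentences, each a list of
    (token, is_verb) pairs. Hypothetical helper, not part of SENNA.
    """
    blocks = []
    for sentence in sentences:
        # "-" marks a non-verb; any other content marks a verb.
        lines = [token if is_verb else "-" for token, is_verb in sentence]
        blocks.append("\n".join(lines))
    # One empty line between sentences, newline at the end of the file.
    return "\n\n".join(blocks) + "\n"
```

    The returned string can be written to a file and passed as senna -usrvbs <file>.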

    Remarks

    SENNA does not handle -LRB-, -RRB-, ... tokens. Please replace these tokens in your input text with the appropriate (, ), .... Leaving them in hurts performance (e.g., POS accuracy drops from 97.29% to 97.00%).
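
    The replacement can be done as a preprocessing step before the text reaches SENNA. Here is a minimal Python sketch; only -LRB- and -RRB- are named above, so the square and curly bracket entries are an assumption based on the usual Penn Treebank escapes, and the names PTB_BRACKETS and unescape_brackets are hypothetical.

```python
import re

# -LRB- and -RRB- come from the remark above. The other four entries are
# an assumption (Penn Treebank convention); remove them if they do not
# apply to your data.
PTB_BRACKETS = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

# One alternation that matches any of the escaped tokens literally.
_BRACKET_PATTERN = re.compile("|".join(re.escape(k) for k in PTB_BRACKETS))


def unescape_brackets(line):
    """Replace escaped bracket tokens in one input line."""
    return _BRACKET_PATTERN.sub(lambda m: PTB_BRACKETS[m.group(0)], line)
```

    Applying unescape_brackets to every line of the input file before piping it into senna leaves all other text untouched.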

    Performance

    We report SENNA performance here as per-word accuracy for POS and as F1 score for all the other tasks. Timing is the time SENNA needs to pass over the given test data set (MacBook Pro i7, 2.8GHz, Intel MKL). For PSG, the F1 score is computed over all sentences (for sentences with fewer than 40 words, we get 88.5%).

    Task                             Benchmark                  Performance         Timing (s)
    Part of Speech (POS)             (Toutanova et al, 2003)    97.29% (Accuracy)   3
    Chunking (CHK)                   CoNLL 2000                 94.32% (F1)         2
    Named Entity Recognition (NER)   CoNLL 2003                 89.59% (F1)         2
    Semantic Role Labeling (SRL)     CoNLL 2005                 75.49% (F1)         36
    Syntactic Parsing (PSG)          Penn Treebank              87.92% (F1)         74

    Feedback

    Please email Ronan Collobert with any problem reports or positive feedback. We will be glad to hear from you.

  • Original article: https://www.cnblogs.com/lexus/p/2966343.html