  • How to use the MSTParser dependency parsing tool

    http://blog.csdn.net/hellonlp/article/details/7694284

    Note: you must use the source code from [link missing].

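    I recently needed dependency parsing for an experiment, and finding a suitable tool was a real headache.

    The Stanford parser's dependency output did not feel right: some nodes have more than one parent, which seems to conflict with the strict definition of a dependency tree. So I did not use the Stanford tool.

    MINIPAR seemed like a very usable dependency parser, but it has not been kept up to date: I could not even get the makefile in its demo to work, and there are few fixes to be found online. I had no choice but to give it up.

    Finally I found MSTParser. I have to say, it turns up surprisingly rarely in Baidu results.

    MSTParser is an open-source dependency parsing tool written in Java. It runs on both Linux and Windows.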

    Download: http://sourceforge.net/projects/mstparser/

    Documentation: http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html

                   http://www.seas.upenn.edu/~strctlrn/MSTParser/README

    First, a look at the README's table of contents:

    ----------------
    Contents
    ----------------
    
    1. Compiling
    
    2. Example of usage
    
    3. Running the parser
       a. Input data format
       b. Training a parser
       c. Running a trained model on new data
       d. Evaluating output
    
    4. Memory/Disk space and performance issues
    
    5. Reproducing results in HLT-EMNLP and ACL papers

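    In short: (1) compiling; (2) a usage example; (3) running the parser, covering (a) the input data format, (b) training a parser, (c) running a trained model on new data and (d) evaluating the output; (4) memory/disk space and performance issues; (5) the papers about the parser.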

     

    Now we can compile the files by following the README.

    ----------------
    1. Compiling
    ----------------
    
    To compile the code, first unzip/tar the downloaded file:
    
    > gunzip mstparser.tar.gz
    > tar -xvf mstparser.tar
    > cd MSTParser
    
    Next, run the following command
    
    > javac -classpath ".:lib/trove.jar" mstparser/DependencyParser.java
    
    This will compile the package.

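    Note:

    On Windows the commands should be:

    > cd c:\java\MSTParser

    > javac -classpath ".;lib/trove.jar" mstparser/DependencyParser.java

    The colon in the middle of the original classpath becomes a semicolon on Windows. (If compilation fails, check your Java version: the README says the tool is used with Java 1.4 and 1.5.)

    OK, at this point the .java files under the mstparser folder should all have been compiled into .class files.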
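
The separator difference can be captured in a few lines. The following Python sketch is only an illustration: the helper name `javac_command` is made up here and is not part of MSTParser. It rebuilds the README's compile command and takes the classpath separator from `os.pathsep`, which is `:` on Linux and `;` on Windows.

```python
import os


def javac_command(sep=os.pathsep):
    """Rebuild the README's compile command with a given classpath separator.

    javac_command is a hypothetical helper for illustration only.
    """
    classpath = sep.join([".", "lib/trove.jar"])
    return ["javac", "-classpath", classpath, "mstparser/DependencyParser.java"]


# Show both variants regardless of the platform this runs on.
print(" ".join(javac_command(":")))  # Linux form
print(" ".join(javac_command(";")))  # Windows form
```

Calling `javac_command()` without an argument picks the separator of the current platform, so the same script could be used on both systems.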

    Moving on.

     

    The README gives a usage example:

    ---------------------
    2. Example Usage
    ---------------------
    
    In the directory data/ there are examples of training and testing data. Data
    format is described in the next section.
    
    train.ulab/test.ulab
    - training and testing data with unlabeled trees
    
    train.lab/test.lab
    - training and testing data with labeled trees
    
    To run an unlabeled parser type:
    
    > java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
      train train-file:data/train.ulab model-name:dep.model \
      test test-file:data/test.ulab output-file:out.txt \
      eval gold-file:data/test.ulab
      
    This will train a parser on the training data, run it on the testing data and
    evaluate the output against the gold standard. The results from running the
    parser are in the file out.txt and the trained model in dep.model.
    
    To train a labeled parser run the same command but use the labeled training
    and testing files.
    
     

    The command to focus on:

    > java -classpath ".:lib/trove.jar" -Xmx1441m mstparser.DependencyParser \
      train train-file:data/train.ulab model-name:dep.model \
      test test-file:data/test.ulab output-file:out.txt \
      eval gold-file:data/test.ulab

    (zcl: here the README's

          java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser

     was changed to

          java -classpath ".:lib/trove.jar" -Xmx1441m mstparser.DependencyParser

     For the reason, see http://webservices.ctocio.com.cn/java/315/9330315.shtml)

    Training command (with the Windows classpath separator):
    java -classpath ".;lib/trove.jar" -Xmx1441m mstparser/DependencyParser train train-file:data/train.ulab model-name:dep.model

    Test command (with the Windows classpath separator):
    java -classpath ".;lib/trove.jar" -Xmx1441m mstparser/DependencyParser test test-file:data/zcltest.ulab output-file:out.txt eval gold-file:data/zcltest.ulab

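    In these commands:

    train train-file:data/train.ulab model-name:dep.model \

    The training file is train.ulab in the data folder, and the trained model is saved to dep.model.

    test test-file:data/test.ulab output-file:out.txt \

    The test file is test.ulab in the data folder (*** note: pay attention to the file format, which is covered later), and the output is written to out.txt.

    eval gold-file:data/test.ulab

    The gold standard for evaluation is test.ulab in the data folder (the same file that was just used for testing).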
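
To make the three stages easier to see, here is a small Python sketch that assembles the same argument list piece by piece. It is an illustration under assumptions: `mstparser_args` is a hypothetical helper, the 1441 MB heap default is simply the value used in this post, and only flags that appear in the README are emitted.

```python
def mstparser_args(train_file=None, test_file=None, gold_file=None,
                   model="dep.model", output_file="out.txt",
                   sep=":", heap_mb=1441):
    """Assemble a java command line for MSTParser (hypothetical helper).

    A stage (train / test / eval) is added only when its file is given.
    """
    cmd = ["java", "-classpath", sep.join([".", "lib/trove.jar"]),
           "-Xmx%dm" % heap_mb, "mstparser.DependencyParser"]
    if train_file:
        cmd += ["train", "train-file:" + train_file]
    if train_file or test_file:
        # Training writes the model, testing reads it.
        cmd.append("model-name:" + model)
    if test_file:
        cmd += ["test", "test-file:" + test_file]
    if test_file or gold_file:
        # Testing writes the output file, evaluation reads it.
        cmd.append("output-file:" + output_file)
    if gold_file:
        cmd += ["eval", "gold-file:" + gold_file]
    return cmd


full = mstparser_args(train_file="data/train.ulab",
                      test_file="data/test.ulab",
                      gold_file="data/test.ulab")
print(" ".join(full))
```

The returned list could be handed to `subprocess.run` from the MSTParser directory; passing `sep=";"` gives the Windows form of the classpath.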

     

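    The commands above use the ulab files (ulab should be read as "unlabeled"); the same commands also work with the lab (labeled) files in the data folder.

    (The difference between labeled and unlabeled files is explained in the file format section below.)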

     

    Moving on.

    -------------------------
    3. Running the Parser
    -------------------------

    Now we start actually using the tool.

    -------------------------
    3a. Input data format
    -------------------------
    
    Example data sets are given in the data/ directory.
    
    Each sentence in the data is represented by 3 or 4 lines and sentences are
    separated by a blank line. The general format is:
    
    w1    w2    ...    wn
    p1    p2    ...    pn
    l1    l2    ...    ln
    d1    d2    ...    dn
    
    ....
    
    
    Where,
    - w1 ... wn are the n words of the sentence (tab delimited)
    - p1 ... pn are the POS tags for each word
    - l1 ... ln are the labels of the incoming edge to each word
    - d1 ... dn are integers representing the position of each word's parent
    
    For example, the sentence "John hit the ball" would be:
    
    John	hit	the	ball
    N	V	D	N
    SBJ	ROOT	MOD	OBJ
    2	0	4	2
    
    Note that hit's parent is indexed by 0 since it is the root.
    
    If you wish to only train or test an unlabeled parser, then simply leave out
    the third line for each sentence, e.g.,
    
    John	hit	the	ball
    N	V	D	N
    2	0	4	2
    
    The parser will automatically detect that it should produce unlabeled trees.
    
    Note that this format is the same for training AND for running the parser on
    new data. Of course, you may not always know the gold standard. In this case,
    just substitute lines 3 (the edge labels) and lines 4 (the parent indexes) with
    dummy values. The parser just ignores these values and produces its own.
    
     

    The above describes the input data format:

    John	hit	the	ball
    N	V	D	N
    SBJ	ROOT	MOD	OBJ
    2	0	4	2

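    Line 1 holds the words.

    Line 2 holds the POS tag of each word.

    Line 3 holds the relation label on each dependency arc.

    Line 4 holds the position of each word's parent (0 means the word is the root; the 2 under John means that John's parent is hit, at position 2).

    This is the lab file format; leaving out the third line gives the ulab file format.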
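
To make the line layout concrete, the following Python sketch reads text in this format and prints the parent of each word. It is not MSTParser's own reader (that lives in the Java sources); `parse_mst` is a hypothetical helper that assumes tab-separated fields and a blank line between sentences.

```python
import re


def parse_mst(text):
    """Parse lab/ulab text into a list of dicts (hypothetical helper)."""
    sentences = []
    # Sentences are assumed to be separated by a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [line.strip().split("\t") for line in block.split("\n")]
        if len(rows) not in (3, 4):
            raise ValueError("expected 3 or 4 lines per sentence, got %d" % len(rows))
        if len({len(row) for row in rows}) != 1:
            raise ValueError("all lines of a sentence need the same number of fields")
        sentences.append({
            "words": rows[0],
            "tags": rows[1],
            "labels": rows[2] if len(rows) == 4 else None,  # None for ulab
            "heads": [int(h) for h in rows[-1]],            # 0 marks the root
        })
    return sentences


sample = "John\thit\tthe\tball\nN\tV\tD\tN\nSBJ\tROOT\tMOD\tOBJ\n2\t0\t4\t2\n"
sent = parse_mst(sample)[0]
for word, head in zip(sent["words"], sent["heads"]):
    # Parent positions are 1-based, so position k is words[k - 1].
    parent = "ROOT" if head == 0 else sent["words"][head - 1]
    print(word, "<-", parent)
```

A three-line (ulab) sentence goes through the same function and simply comes back with `labels` set to `None`.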

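    Note: both the training file and the test file need this format. This is a little odd: other parsers only need the sentence itself as input (that is, just the first line), whereas this one needs a lot of additional information.

    If the goal is simply to parse new sentences, the test file is prepared as follows. If train.lab is chosen as the training file (training and test files must use the same format: either both the four-line lab format or both the three-line ulab format), the input test file looks like this:

    John	hit	the	ball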
    N	V	D	N
    LAB	LAB	LAB	LAB
    0	0	0	0

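    The third and fourth lines are filled with LAB and 0 respectively.

    ******* Note: the separator between fields is "\t" (a tab), not a space. Spaces cause an error. (If in doubt, look at the source file DependencyParser.java.)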
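
Writing the placeholder lines by hand is error-prone, so here is a small Python sketch that produces them. `to_mst_input` is a hypothetical helper; it assumes the sentence is already tokenized and POS-tagged, and it fills the label and parent lines with the dummy values LAB and 0 suggested by the README.

```python
def to_mst_input(words, tags, labeled=True):
    """Format one tagged sentence as MSTParser test input (hypothetical helper).

    Fields are joined with tabs; labels and parents are dummy values.
    """
    if len(words) != len(tags):
        raise ValueError("need exactly one POS tag per word")
    n = len(words)
    lines = ["\t".join(words), "\t".join(tags)]
    if labeled:
        # Four-line (lab) format: placeholder edge labels.
        lines.append("\t".join(["LAB"] * n))
    # Placeholder parent indexes; the parser produces its own.
    lines.append("\t".join(["0"] * n))
    return "\n".join(lines) + "\n"


blocks = [
    to_mst_input(["John", "hit", "the", "ball"], ["N", "V", "D", "N"]),
    to_mst_input(["the", "ball"], ["D", "N"]),
]
# Each block ends with a newline, so joining with "\n" leaves a blank line between sentences.
print("\n".join(blocks), end="")
```

Pass `labeled=False` when the model was trained on ulab data, so that the test file has three lines per sentence.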

    ----------------------------
    3b. Training the parser
    ----------------------------
    
    If you have a set of labeled data, first place it in the format described
    above.
    
    If your training data is in a file train.txt, you can then run the command:
    
    > java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
      train train-file:train.txt
    
    This will train a parser with all the default properties. Additional
    properties can be described with the following flags:
    
    train
    - if present then parser will train a new model
    
    train-file:file.txt
    - use data in file.txt to train the parser
    
    model-name:model.name
    - store trained model in file called model.name
    
    training-iterations:numIters
    - Run training algorithm for numIters epochs, default is 10
    
    decode-type:type
    - type is either "proj" or "non-proj", e.g. decode-type:proj
    - Default is "proj"
    - "proj" use the projective parsing algorithm during training
      - i.e. The Eisner algorithm
    - "non-proj" use the non-projective parsing algorithm during training
      - i.e. The Chu-Liu-Edmonds algorithm
    
    training-k:K
    - Specifies the k-best parse set size to create constraints during training
    - Default is 1
    - For non-projective parsing algorithm, k-best decoding is approximate
    
    loss-type:type
    - type is either "punc" or "nopunc", e.g. loss-type:punc
    - Default is "punc"
    - "punc" include punctuation in hamming loss calculation
    - "nopunc" do not include punctuation in hamming loss calculation
    
    create-forest:cf
    - cf is either "true" or "false"
    - Default is "true"
    - If create-forest is false, it will not create the training parse forest (see
      section 4). It assumes it has been created.
    - This flag is useful if you are training many models on the same data and
      features but using different parameters (e.g. training iters, decoding type).
    
    order:ord
    - ord is either 1 or 2
    - Default is 1
    - Specifies the order/scope of features. 1 only has features over single edges
      and 2 has features over pairs of adjacent edges in the tree.
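
The optional training flags listed above all take the form name:value. As a rough sketch, the Python function below turns keyword arguments into those flag strings and rejects values the README does not list. `training_flags` is a hypothetical helper; the defaults are the ones stated in the README.

```python
def training_flags(iterations=10, decode_type="proj", k=1,
                   loss_type="punc", create_forest=True, order=1):
    """Build the optional training flags as name:value strings (hypothetical helper)."""
    if decode_type not in ("proj", "non-proj"):
        raise ValueError("decode-type must be 'proj' or 'non-proj'")
    if loss_type not in ("punc", "nopunc"):
        raise ValueError("loss-type must be 'punc' or 'nopunc'")
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if iterations < 1 or k < 1:
        raise ValueError("training-iterations and training-k must be positive")
    return [
        "training-iterations:%d" % iterations,
        "decode-type:" + decode_type,
        "training-k:%d" % k,
        "loss-type:" + loss_type,
        "create-forest:" + ("true" if create_forest else "false"),
        "order:%d" % order,
    ]


# Example: a second-order, non-projective model with default values elsewhere.
print(" ".join(training_flags(decode_type="non-proj", order=2)))
```

The resulting strings would be appended after `train train-file:...` on the command line.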
    
    
    ------------------------------------------------
    3c. Running a trained model on new data
    ------------------------------------------------
    
    This section assumes you have trained a model and it is stored in dep.model.
    
    First, format your data properly (section 3a).
    
    It should be noted that the parser assumes both words and POS tags. To
    generate POS tags for your data I suggest using the Ratnaparkhi POS tagger
    or another tagger of your choice.
    
    The parser also assumes that the edge label and parent index lines are
    in the input. However, these can just be artificially inserted (e.g. with lines
    of "LAB ... LAB" and "0 ... 0") since the parser will produce these lines
    as output.
    
    If the data is in a file called test.txt, run the command:
    
    > java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
      test model-name:dep.model test-file:test.txt output-file:out.txt
    
    This will create an output file "out.txt" with the predictions of the parser.
    Other properties can be defined with the following flags:
    
    test
    - If included a trained parser will be run on the testing data
    
    test-file:file.txt
    - The file containing the data to run the parser on
    
    model-name:model.name
    - The name of the stored model to be used
    
    output-file:out.txt
    - The result of running the parser on the new data
    
    decode-type:type
    - See section 3b.
    
    order:ord
    - See section 3b. THIS NEEDS TO HAVE THE SAME VALUE OF THE TRAINED MODEL!!
    
    Note that if you train a labeled model, you should only run it expecting
    labeled output (e.g. the test data should have 4 lines per sentence).
    And if you train an unlabeled model, you should only run it expecting
    unlabeled output (e.g. the test data should have 3 lines per sentence).
    
    
    ------------------------
    3d. Evaluating Output
    ------------------------
    
    This section describes a simple class for evaluating the output of
    the parser against a gold standard.
    
    Assume you have a gold standard, say test.txt and the output of the parser
    say out.txt, then run the following command:
    
    > java -classpath ".:lib/trove.jar" -Xmx1800m mstparser.DependencyParser \
      eval gold-file:test.txt output-file:out.txt
    
    This will return both labeled and unlabeled accuracy (if the data sets contain
    labeled trees) as well as complete sentence accuracy, again labeled and
    unlabeled.
    
    We should note that currently this evaluation script includes all punctuation.
    In future releases we will modify this class to allow for the evaluation to
    ignore punctuation, which is standard for English (Yamada and Matsumoto 03).
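
For readers who want to see what these numbers mean, here is a Python sketch of the metrics. It is not the parser's evaluation class. `evaluate` is a hypothetical helper, and it assumes that a token counts as labeled-correct only when both its parent index and its edge label match the gold standard. Punctuation is included, as in the README's description.

```python
def evaluate(gold, pred):
    """Compute accuracy figures from (heads, labels) pairs per sentence.

    Hypothetical helper; labels may be None for unlabeled data.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted data must be non-empty and aligned")
    labeled = all(g[1] is not None and p[1] is not None for g, p in zip(gold, pred))
    tokens = head_ok = both_ok = complete = complete_lab = 0
    for (g_heads, g_labels), (p_heads, p_labels) in zip(gold, pred):
        if len(g_heads) != len(p_heads):
            raise ValueError("sentence length mismatch")
        hits = [g == p for g, p in zip(g_heads, p_heads)]
        tokens += len(hits)
        head_ok += sum(hits)
        complete += all(hits)  # whole sentence has correct parents
        if labeled:
            lab_hits = [h and gl == pl
                        for h, gl, pl in zip(hits, g_labels, p_labels)]
            both_ok += sum(lab_hits)
            complete_lab += all(lab_hits)
    result = {"unlabeled": head_ok / tokens,
              "unlabeled_complete": complete / len(gold)}
    if labeled:
        result["labeled"] = both_ok / tokens
        result["labeled_complete"] = complete_lab / len(gold)
    return result


gold = [([2, 0, 4, 2], ["SBJ", "ROOT", "MOD", "OBJ"])]
pred = [([2, 0, 2, 2], ["SBJ", "ROOT", "MOD", "OBJ"])]
print(evaluate(gold, pred))
```

In this toy example one of the four parents is wrong, so the token accuracies are 3 out of 4 and the sentence is not counted as complete.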
    
    

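    These are the separate commands for training a model, running the model on data, and evaluating the results. They are simple enough that they need no further explanation.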

     

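    That is the MSTParser tool as I used it in this experiment.

    Points to watch:

    1. The commands need adjusting on Windows (a semicolon instead of a colon in the classpath).

    2. You have to POS-tag the training data and arrange it in the required format yourself.

    3. When no gold answers are available, fill in LAB and 0, and make sure every separator is "\t".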

  • Original article: https://www.cnblogs.com/carl2380/p/3056491.html