A Guide to Chinese Word Segmentation with CRF++

http://blog.csdn.net/marising/article/details/5769653

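A while ago I wrote some notes on Chinese word segmentation that mentioned the CRF approach. I have looked into it further recently, so I am writing the method down for future reference. Li Monan (李沫南) has also optimized CRF++; see http://www.coreseek.cn/opensource/CRF/. I think CRF++ still has considerable room for optimization, and I will come back to it when I have time.

The People's Daily corpus is already segmented. The code I post below converts the corpus into the training data that CRF++ needs; after that you only have to adjust the template and train. A reader has since provided more detailed material, which you may want to consult:

The original program does have some problems, and the author probably has not had time to fix them. I fixed the original poster's program, got segmentation working, and uploaded the code together with the processed training and test corpora: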

http://x-algo.cn/index.php/2016/02/27/crf-of-chinese-word-segmentation/

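1 Download and installation

For the concepts behind CRFs, please search online; I will not repeat them here. The official site is http://crfpp.sourceforge.net/

I use Ubuntu, so I downloaded the source: get CRF++-0.54.tar.gz from http://sourceforge.net/projects/crfpp/files/

Install gcc, g++, and make first if you do not have them.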
% ./configure
% make
% sudo make install

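2 Testing and a first try
The source package contains an example directory; run ./exec.sh there to try it out.
exec.sh #training and test script
template #template file
test.data #test file
train.data #training file
Open them and take a look.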

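3 Preparing the corpus and writing the template

I use the 6-tag scheme with a 6-line template.
S: single-character word; B: first character of a word; E: last character of a word; M1/M2/M: characters inside a word.

A 1-character word:
和 S
A 2-character word (in the actual data each character is on its own line; I put them on one line here only for layout):
中 B 国 E
A 3-character word:
进 B 一 M 步 E
A 5-character word:
发 B 展 M1 中 M2 国 M 家 E
Longer words:
中 B 华 M1 人 M2 民 M 共 M 和 M 国 E
Punctuation marks are treated as single-character words (tagged S).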

Download people-daily.txt.gz from the bamboo project.
The peopledata.py file:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Convert the segmented People's Daily corpus into CRF++ training and test data.

import sys

#home_dir = "D:/source/NLP/people_daily//"
home_dir = "/home/lhb/workspace/CRF_data/"


def splitWord(words):
    # One character per element; each character becomes one row in the output.
    return list(words)


# 4 tag
# S/B/E/M
def get4Tag(li):
    length = len(li)
    if length == 1:
        return ['S']
    elif length == 2:
        return ['B', 'E']
    elif length > 2:
        return ['B'] + ['M'] * (length - 2) + ['E']


# 6 tag
# S/B/E/M/M1/M2
def get6Tag(li):
    length = len(li)
    if length == 1:
        return ['S']
    elif length == 2:
        return ['B', 'E']
    elif length == 3:
        return ['B', 'M', 'E']
    elif length == 4:
        return ['B', 'M1', 'M', 'E']
    elif length == 5:
        return ['B', 'M1', 'M2', 'M', 'E']
    elif length > 5:
        return ['B', 'M1', 'M2'] + ['M'] * (length - 4) + ['E']


def saveDataFile(trainobj, testobj, isTest, word, handle, tag):
    # Route the word to the test file or to the training file.
    if isTest:
        saveTrainFile(testobj, word, handle, tag)
    else:
        saveTrainFile(trainobj, word, handle, tag)


def saveTrainFile(fiobj, word, handle, tag):
    if len(word) > 0:
        wordli = splitWord(word)
        if tag == '4':
            tagli = get4Tag(wordli)
        if tag == '6':
            tagli = get6Tag(wordli)
        # One row per character: character, part of speech, tag.
        for w, t in zip(wordli, tagli):
            fiobj.write(w + '\t' + handle + '\t' + t + '\n')
    else:
        # An empty word writes a blank line, which separates sequences.
        fiobj.write('\n')


# Tags: B, M, M1, M2, E, S
def convertTag(tag):
    fiobj = open(home_dir + 'people-daily.txt', 'r', encoding='utf-8')
    trainobj = open(home_dir + tag + '.train.data', 'w', encoding='utf-8')
    testobj = open(home_dir + tag + '.test.data', 'w', encoding='utf-8')

    i = 0
    for a in fiobj:
        i += 1
        a = a.strip('\r\n\t ')
        words = a.split(' ')
        # Every 10th line goes to the test set.
        test = (i % 10 == 0)
        for word in words:
            word = word.strip('\t ')
            if len(word) > 0:
                # Drop a leading '[' and everything from ']' onward.
                i1 = word.find('[')
                if i1 >= 0:
                    word = word[i1+1:]
                i2 = word.find(']')
                if i2 > 0:
                    word = word[:i2]
                word_hand = word.rsplit('/', 1)
                if len(word_hand) != 2:
                    continue    # skip tokens that are not a word/POS pair
                w, h = word_hand
                if h == 'nr':    # person name
                    if w.find('·') >= 0:
                        # Split the name on the middle dot and save each part.
                        for tmp in w.split('·'):
                            saveDataFile(trainobj, testobj, test, tmp, h, tag)
                        continue
                if h != 'm':    # tokens tagged 'm' are left out
                    saveDataFile(trainobj, testobj, test, w, h, tag)
                if h == 'w':    # punctuation ends the current sequence
                    saveDataFile(trainobj, testobj, test, "", "", tag)

    fiobj.close()
    trainobj.close()
    testobj.close()


if __name__ == '__main__':
    if len(sys.argv) < 2 or sys.argv[1] not in ('4', '6'):
        print('tag[6,4] convert raw data to tag.train.data and tag.test.data')
    else:
        tag = sys.argv[1]
        convertTag(tag)

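Download and unpack the corpus, then run the script to prepare the data. Remember to change home_dir to the directory that holds the corpus:
python ./peopledata.py 6

90% of the data is used for training and 10% for testing. The generated files are:
6.test.data
6.train.data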

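The template file is written as follows.
template:

# Unigram
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
U04:%x[0,0]/%x[1,0]
U05:%x[-1,0]/%x[1,0]

# Bigram
B

%x[row,column] selects a row, relative to the current character, and a column: [-1,0] is the first column of the previous character, [0,0] the first column of the current character, and [1,0] the first column of the next character.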
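
To make the %x[row,column] notation concrete, here is a minimal sketch that expands template lines for one position in a sequence. It is not CRF++ code: expand_template, the sample rows (including their part-of-speech values), and the "<PAD>" boundary marker are all assumptions made for illustration, and CRF++ handles positions outside the sequence with its own internal markers.

```python
import re

# Matches a macro such as %x[-1,0] and captures the row offset and the column.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(template_line, rows, pos):
    """Expand every %x[row,col] in template_line for the token at index pos.

    rows is a list of column lists, one per character.
    """
    def substitute(match):
        offset, col = int(match.group(1)), int(match.group(2))
        index = pos + offset
        if index < 0 or index >= len(rows):
            return "<PAD>"  # hypothetical boundary marker, not the CRF++ one
        return rows[index][col]

    return MACRO.sub(substitute, template_line)


# Hypothetical rows: column 0 is the character, column 1 a part-of-speech label.
rows = [["发", "n"], ["展", "n"], ["中", "n"]]
for line in ["U00:%x[-1,0]", "U01:%x[0,0]", "U03:%x[-1,0]/%x[0,0]", "U05:%x[-1,0]/%x[1,0]"]:
    print(expand_template(line, rows, 1))
```

Each expanded string is one feature for the current character, so the six-line template yields six unigram features per position.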

4 Running and viewing the results
The 6exec.sh file:

#!/bin/sh
# -f 3: use only features that occur at least 3 times; -c 4.0: cost hyper-parameter
./crf_learn -f 3 -c 4.0 template 6.train.data 6.model > 6.train.rst
./crf_test -m 6.model 6.test.data > 6.test.rst
./crfeval.py 6.test.rst

#./crf_learn -a MIRA -f 3 template train.data model
#./crf_test -m model test.data
#rm -f model

The 6-tag result:
WordCount from test result: 109805
WordCount from golden data: 109948
WordCount of correct segs : 106145
P = 0.966668, R = 0.965411, F-score = 0.966039
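
crf_test appends the predicted tag as the last column of each row, so the segmented text can be read back from the result file. The following is a minimal sketch of that step, not part of the original scripts: rows_to_words and the sample rows (including their part-of-speech values) are hypothetical, and the rule that a word ends at E or S is the same one crfeval.py uses for counting.

```python
def rows_to_words(lines):
    """Rebuild words from crf_test-style rows (character first, predicted tag last)."""
    words, current = [], ""
    for line in lines:
        cols = line.split()
        if not cols:
            # Blank line: sequence boundary; flush any unfinished word.
            if current:
                words.append(current)
                current = ""
            continue
        char, tag = cols[0], cols[-1]
        current += char
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical rows: character, part of speech, gold tag, predicted tag.
sample = [
    "中\tn\tB\tB",
    "国\tn\tE\tE",
    "和\tc\tS\tS",
    "",
    "进\td\tB\tB",
    "一\td\tM\tM",
    "步\td\tE\tE",
]
print(" ".join(rows_to_words(sample)))
```

To apply it to a real result, pass the lines of 6.test.rst, for example rows_to_words(open("6.test.rst", encoding="utf-8")).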

5 Adjusting the tags and the template
The 4-tag scheme (S/B/M/E) is the 6-tag scheme without M1 and M2.
python ./peopledata.py 4
The 4exec.sh file:

#!/bin/sh
./crf_learn -f 3 -c 4.0 template 4.train.data 4.model > 4.train.rst
./crf_test -m 4.model 4.test.data > 4.test.rst
./crfeval.py 4.test.rst

The 4-tag result:
lhb@localhost:~/workspace/CRF_data$ ./crfeval.py 4.test.rst
WordCount from test result: 109844
WordCount from golden data: 109948
WordCount of correct segs : 105985
P = 0.964868, R = 0.963956, F-score = 0.964412

The 6-tag and 4-tag results differ only slightly (F-score 0.966039 versus 0.964412), with 6 tags ahead.
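
The scores can be recomputed from the three word counts that crfeval.py prints. A minimal sketch follows; the function name prf is my own, and the counts are copied from the two outputs above.

```python
def prf(correct, predicted, gold):
    """Return (precision, recall, F-score) from word counts."""
    p = correct / predicted   # share of predicted words that are correct
    r = correct / gold        # share of gold words that were found
    return p, r, 2 * p * r / (p + r)


# Counts copied from the 6-tag and 4-tag crfeval.py outputs.
six_tag = prf(correct=106145, predicted=109805, gold=109948)
four_tag = prf(correct=105985, predicted=109844, gold=109948)

print("6-tag: P = %f, R = %f, F = %f" % six_tag)
print("4-tag: P = %f, R = %f, F = %f" % four_tag)
print("F-score gain from 6 tags: %.6f" % (six_tag[2] - four_tag[2]))
```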

Training took 10062.00s with 6 tags and 4208.71s with 4 tags, so the small gain costs roughly 2.4 times the training time.

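Differences between 6-tag labeling variants

1) M placed just before E:
发 B 展 M1 中 M2 国 M 家 E
2) M placed right after B:
发 B 展 M 中 M1 国 M2 家 E
3) M placed between M1 and M2:
发 B 展 M1 中 M 国 M2 家 E
The first variant works best, though the difference is small.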
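
The three layouts can be generated mechanically. This is a sketch under assumptions: the article only shows them for a 5-character word, so the way plain M is repeated for longer words in variants 2 and 3 is my own extrapolation, and words shorter than 5 characters are left out. The function name six_tags is hypothetical.

```python
def six_tags(length, variant=1):
    """Tag a word of `length` characters (length >= 5) under one of three layouts."""
    if length < 5:
        raise ValueError("this sketch only covers words of 5 or more characters")
    fill = ["M"] * (length - 4)  # plain M covers what B, M1, M2 and E do not
    if variant == 1:      # M just before E
        middle = ["M1", "M2"] + fill
    elif variant == 2:    # M right after B
        middle = fill + ["M1", "M2"]
    elif variant == 3:    # M between M1 and M2
        middle = ["M1"] + fill + ["M2"]
    else:
        raise ValueError("variant must be 1, 2 or 3")
    return ["B"] + middle + ["E"]


word = "发展中国家"
for v in (1, 2, 3):
    print(v, " ".join(c + " " + t for c, t in zip(word, six_tags(len(word), v))))
```

Variant 1 produces the same tags as get6Tag in peopledata.py for words of 5 or more characters.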
Writing the template

I tried a 12-line template that also uses the part-of-speech column as a feature, but training was so slow that I shut the machine down before it finished. It should perform better than the 6-line template; you can try the following.

# Unigram
U00:%x[-1,1]
U01:%x[0,1]
U02:%x[1,1]
U03:%x[-1,1]/%x[0,1]
U04:%x[0,1]/%x[1,1]
U05:%x[-1,1]/%x[1,1]
U06:%x[-1,0]
U07:%x[0,0]
U08:%x[1,0]
U09:%x[-1,0]/%x[0,0]
U10:%x[0,0]/%x[1,0]
U11:%x[-1,0]/%x[1,0]

# Bigram
B

A reader asked me for the crfeval.py file, so here it is:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Score crf_test output; columns: character, part of speech, gold tag, predicted tag.

import sys

if __name__ == "__main__":
    try:
        file = open(sys.argv[1], "r", encoding="utf-8")
    except (IndexError, IOError):
        print("result file is not specified, or open failed!")
        sys.exit()

    wc_of_test = 0
    wc_of_gold = 0
    wc_of_correct = 0
    flag = True    # True while every tag of the current predicted word matches gold

    for l in file:
        if not l.strip():
            continue    # blank lines separate sequences

        _, _, g, r = l.strip().split()

        if r != g:
            flag = False

        if r in ('E', 'S'):
            wc_of_test += 1
            if flag:
                wc_of_correct += 1
            flag = True

        if g in ('E', 'S'):
            wc_of_gold += 1

    print("WordCount from test result:", wc_of_test)
    print("WordCount from golden data:", wc_of_gold)
    print("WordCount of correct segs :", wc_of_correct)

    # Precision
    P = wc_of_correct / float(wc_of_test)
    # Recall
    R = wc_of_correct / float(wc_of_gold)

    print("P = %f, R = %f, F-score = %f" % (P, R, (2*P*R)/(P+R)))

Original URL: https://www.cnblogs.com/DjangoBlog/p/6207617.html