  • How to generate a new dictionary file of mmseg

    0. Usage of mmseg-node, as mentioned on GitHub:
    var mmseg = require("mmseg");
    var q = mmseg.open('/usr/local/etc/');
    console.log(q.segmentSync("我是中文分词"));

    # "/usr/local/etc" is the directory that holds mmseg's dictionary; it must contain the dictionary file "uni.lib".
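    Since mmseg.open() is given a directory rather than a file, it can help to confirm that the directory really contains uni.lib before starting Node. The following is a minimal sketch; the helper name has_dictionary and the MMSEG_DICT_DIR variable are assumptions made for illustration and are not part of mmseg or mmseg-node.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if DIR contains a readable uni.lib.
has_dictionary() {
    [ -r "$1/uni.lib" ]
}

# /usr/local/etc is the path used in the snippet above; adjust it to your
# install. MMSEG_DICT_DIR is an assumed variable name, not an mmseg setting.
dict_dir=${MMSEG_DICT_DIR:-/usr/local/etc}
if has_dictionary "$dict_dir"; then
    echo "found $dict_dir/uni.lib"
else
    echo "no uni.lib in $dict_dir" >&2
fi
```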

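    1. So we need to generate a dictionary file. Before that, install coreseek; see http://www.coreseek.cn/products-install/install_on_bsd_linux/
    Before installing, it is recommended to read the README in the source package. Versions 4.0/4.1 can be installed by following the 3.2 steps, which are the same. If you run into problems, see the detailed installation instructions.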

    ## Download coreseek (3.2.14, 4.0.1, or 4.1)
    $ wget http://www.coreseek.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz
    $ # or http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.0.1-beta.tar.gz
    $ # or http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
    $ tar xzvf coreseek-3.2.14.tar.gz   # or coreseek-4.0.1-beta.tar.gz, or coreseek-4.1-beta.tar.gz
    $ cd coreseek-3.2.14                # or coreseek-4.0.1-beta, or coreseek-4.1-beta

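    ## Prerequisite: first install the operating system's basic development libraries and the MySQL client libraries, to support the mysql and xml data sources
    ## Install mmseg
    $ cd mmseg-3.2.14
    $ ./bootstrap # warnings in the output can be ignored; errors must be resolved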
    $ ./configure --prefix=/usr/local/mmseg3
    $ make && make install
    $ cd ..

    ## Install coreseek
    $ cd csft-3.2.14   # or csft-4.0.1, or csft-4.1
    $ sh buildconf.sh # warnings in the output can be ignored; errors must be resolved
    $ ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql ## if configure reports a MySQL problem, see the MySQL data source installation notes
    ## On Debian 5 / Ubuntu 9/10, install the MySQL dependencies with:
    $ apt-get install mysql-client libmysqlclient15-dev libxml2-dev libexpat1-dev

    $ make && make install
    $ cd ..

    ## Test mmseg segmentation and coreseek search (set the locale to zh_CN.UTF-8 beforehand so that Chinese text displays correctly)
    $ cd testpack
    $ cat var/test/test.xml # Chinese text should display correctly here
    $ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml # the content of test.xml is segmented using the default vocabulary, which comes from the dictionary file "/usr/local/mmseg3/etc/uni.lib"
    $ /usr/local/coreseek/bin/indexer -c etc/csft.conf --all # rebuild the index
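    The test above depends on the shell locale being UTF-8. A quick way to check this before running mmseg might look like the sketch below; is_utf8_locale is a hypothetical helper name, and the check only inspects the locale name rather than querying the system.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a locale name selects a UTF-8 encoding.
is_utf8_locale() {
    case $1 in
        *.UTF-8|*.utf8|*.UTF8|*.utf-8) return 0 ;;
        *) return 1 ;;
    esac
}

# The effective character-type locale follows LC_ALL, then LC_CTYPE, then LANG.
current=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
if is_utf8_locale "$current"; then
    echo "locale '$current' is UTF-8"
else
    echo "locale '$current' is not UTF-8; try: export LANG=zh_CN.UTF-8" >&2
fi
```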

    2. Generate a new dictionary:
    # Write the new words to word_new_input.txt, one word per line, and cd to the directory that holds the file.
    # For example (without the # at the beginning of each line):
    #雅阁
    #马自达

    # Now cd to your new vocabulary directory:
    $ cd ~/projects/mmseg-3.2.14/new2
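    $ awk '{print $1"\t1\nx:1"}' word_new_input.txt > word_new_gen.txt # unigram.txt format: "word<TAB>frequency", followed by a line "x:1"
    $ cat ../data/unigram.txt word_new_gen.txt > word_merged.txt && mv word_merged.txt word_new_gen.txt # prepend the default vocabulary; do not redirect into a file that is also being read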
    $ /usr/local/mmseg3/bin/mmseg -u word_new_gen.txt # generates word_new_gen.txt.lib
    $ mv word_new_gen.txt.lib uni.lib # rename
    #$ cp -r /usr/local/mmseg3/etc ~/ # back up your existing dictionary files first
    $ sudo cp uni.lib /usr/local/mmseg3/etc/ # replace the dictionary file with the new one
    ## Now cd to your coreseek-3.2.14/testpack directory
    $ /usr/local/coreseek/bin/indexer -c ~/projects/coreseek-3.2.14/testpack/etc/csft.conf --all # rebuild the index
    # The command above prints output like the following:
    Coreseek Fulltext 3.2 [ Sphinx 0.9.9-release (r2117)]
    Copyright (c) 2007-2011,
    Beijing Choice Software Technologies Inc (http://www.coreseek.com)

    using config file 'etc/csft.conf'...
    indexing index 'xml'...
    collected 3 docs, 0.0 MB
    sorted 0.0 Mhits, 100.0% done
    total 3 docs, 7585 bytes
    total 0.010 sec, 746334 bytes/sec, 295.18 docs/sec
    total 2 reads, 0.000 sec, 4.2 kb/call avg, 0.0 msec/call avg
    total 7 writes, 0.000 sec, 3.1 kb/call avg, 0.0 msec/call avg
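
    The word-list conversion and merge from step 2 can also be wrapped in two small functions. This is only a sketch under stated assumptions: it assumes unigram.txt uses two lines per entry (the word, a tab and a frequency, then a line "x:1"), and the names make_entries and merge_dictionary are hypothetical, not part of mmseg. Blank lines and repeated words in the input are skipped, and the merge goes through a temporary file so the output path may be the same as an input path.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of mmseg.

# Read one word per line on stdin and print unigram-style entries
# (assumed format: "word<TAB>1" followed by "x:1"), skipping blank
# lines and words that were already seen.
make_entries() {
    awk 'NF && !seen[$1]++ { printf "%s\t1\nx:1\n", $1 }'
}

# Concatenate a base dictionary source and the entries for a word list
# into OUT. The result is written to a temporary file first, so OUT may
# safely be the same path as one of the inputs.
merge_dictionary() {
    base=$1 words=$2 out=$3
    tmp=$(mktemp) || return 1
    { cat "$base" && make_entries < "$words"; } > "$tmp" && mv "$tmp" "$out"
}

# Demo with an inline word list; the repeated word is emitted once.
printf '雅阁\n马自达\n雅阁\n' | make_entries
```

    With these in place, "merge_dictionary ../data/unigram.txt word_new_input.txt word_new_gen.txt" would produce the file that is then passed to "mmseg -u".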

    # The new dictionary is now stored in /usr/local/mmseg3/etc/
    3. Test the new dictionary:
    3.1 The file "var/test/newtest.txt" contains sentences that use the new words:
    $ /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/newtest.txt
    雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x
    马自达/x 的/x 重量/x 是/x 多少/x ?/x
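
    To check the result without reading the output by eye, the "word/x" tokens can be matched mechanically. The sketch below assumes the output format shown above (tokens separated by single spaces, each ending in "/x"); has_token is a hypothetical helper, not an mmseg feature.

```shell
#!/bin/sh
# Hypothetical helper: read "word/x word/x ..." lines on stdin and succeed
# if WORD appears as one whole token (not split, not part of a longer one).
has_token() {
    tr ' ' '\n' | grep -qxF -- "$1/x"
}

# Sample line copied from the mmseg output above.
sample='雅阁/x 现在/x 卖/x 多少/x 钱/x ?/x'
if printf '%s\n' "$sample" | has_token '雅阁'; then
    echo "雅阁 is a single token"
fi
```

    In practice the mmseg command from 3.1 would be piped into has_token in place of the sample string.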
    3.2 Or you can test it from the CoffeeScript REPL:

    david@Wade:~/node/node$ coffee
    coffee> mmseg=require('mmseg')
    { open: [Function],
    clean: [Function],
    uniq: [Function] }
    coffee> q= mmseg.open( '/usr/local/mmseg3/etc/')
    {}
    coffee> console.log q.segmentSync('我喜欢开雅阁')
    [ '我', '喜欢', '开', '雅阁' ]
    undefined
    coffee> console.log q.segmentSync('我喜欢开丰田') #丰田 is NOT in the new dictionary
    [ '我', '喜欢', '开', '丰', '田' ]
    undefined
    coffee> console.log q.segmentSync '我喜欢开马自达'
    [ '我', '喜欢', '开', '马自达' ]


  • Original article: https://www.cnblogs.com/no7dw/p/3553911.html