  • About Japanese word segmentation (Japanese segmenter)

    1. Main tool: JapaneseTokenizer

    https://pypi.org/project/JapaneseTokenizer/

    Install: pip install JapaneseTokenizer

    Supported Tokenizers

    1.1 MeCab

    For installation, see: https://www.dazhuanlan.com/2020/02/13/5e45085eac4da/

    Installing MeCab

    1. First, download the MeCab source (mecab-0.996.tar.gz)
    2. Then open a Terminal and create a directory

    $ sudo mkdir /usr/local/mecab

    3. Extract, configure, compile, and install

    $ cd $HOME/Downloads
    $ tar xvfz mecab-0.996.tar.gz
    $ cd mecab-0.996
    $ ./configure --enable-utf8-only --prefix=/usr/local/mecab
    $ make
    $ sudo make install
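
After the build, it can help to confirm that the binary actually landed under the chosen prefix. The following is a minimal sketch, assuming the /usr/local/mecab prefix used above; the helper names find_mecab and mecab_version are hypothetical and not part of MeCab or JapaneseTokenizer.

```python
import os
import shutil
import subprocess


def find_mecab(prefix="/usr/local/mecab"):
    """Return the path of the mecab executable, or None if it is not found.

    The prefix's bin directory is searched first, then the regular PATH.
    """
    bin_dir = os.path.join(prefix, "bin")
    search_path = bin_dir + os.pathsep + os.environ.get("PATH", "")
    return shutil.which("mecab", path=search_path)


def mecab_version(mecab_path):
    """Run `mecab --version` and return its output as a string."""
    result = subprocess.run(
        [mecab_path, "--version"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_mecab()
    if path is None:
        print("mecab not found; check the --prefix passed to ./configure")
    else:
        print(path, "->", mecab_version(path))
```

If the function returns None, the most likely cause is a prefix that differs from the one passed to ./configure.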

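    Installing the IPA dictionary

    The IPA dictionary is based on the IPA corpus, with parameters estimated using CRF (required).

    1. First, download the IPA dictionary source (mecab-ipadic-2.7.0-20070801.tar.gz)
    2. Extract, configure, compile, and install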
    $ cd $HOME/Downloads
    $ tar xvfz mecab-ipadic-2.7.0-20070801.tar.gz
    $ cd mecab-ipadic-2.7.0-20070801
    $ ./configure --prefix=/usr/local/mecab --with-mecab-config=/usr/local/mecab/bin/mecab-config --with-charset=utf8
    $ make
    $ sudo make install
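
To check that the dictionary was installed where MeCab expects it, one option is to ask mecab-config for the dictionary directory and look for the compiled ipadic files. This is a sketch under the assumption that the prefix is /usr/local/mecab as above; query_dicdir and ipadic_installed are hypothetical helper names.

```python
import os
import subprocess


def query_dicdir(mecab_config="/usr/local/mecab/bin/mecab-config"):
    """Ask mecab-config where dictionaries are installed."""
    result = subprocess.run(
        [mecab_config, "--dicdir"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def ipadic_installed(dicdir):
    """Return True if a compiled ipadic (its sys.dic file) exists under dicdir."""
    return os.path.isfile(os.path.join(dicdir, "ipadic", "sys.dic"))


if __name__ == "__main__":
    try:
        dicdir = query_dicdir()
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run mecab-config:", exc)
    else:
        print("dicdir:", dicdir)
        print("ipadic installed:", ipadic_installed(dicdir))
```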

    Example

    First add MeCab to the PATH in the shell:

    $ export PATH=/usr/local/mecab/bin:$PATH

    Then run the following Python code:

    import JapaneseTokenizer
    input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
    # ipadic is a well-maintained dictionary
    mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
    print(mecab_wrapper.tokenize(input_sentence).convert_list_object())
    
    # neologd is a dictionary generated automatically from a huge web corpus;
    # it must be installed separately (mecab-ipadic-neologd)
    mecab_neologd_wrapper = JapaneseTokenizer.MecabWrapper(dictType='neologd')
    print(mecab_neologd_wrapper.tokenize(input_sentence).convert_list_object())
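
For reference, the MeCab binary installed above can also be called directly, without the wrapper. The sketch below parses MeCab's default ipadic output, where each line is the surface form, a tab, and nine comma-separated features, and the sentence ends with EOS. The function names, the field names, and the hand-written sample string are assumptions for illustration only; none of them come from JapaneseTokenizer.

```python
import subprocess

# Names for the nine comma-separated ipadic features, in output order.
# These English labels are illustrative, not official identifiers.
IPADIC_FIELDS = (
    "pos", "pos_detail1", "pos_detail2", "pos_detail3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
)


def parse_mecab_output(text):
    """Turn MeCab's default output into a list of dicts, one per token.

    Uses a naive comma split, so quoted features containing commas
    are not handled.
    """
    tokens = []
    for line in text.splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",")
        # Unknown words may carry fewer than nine features; pad with "*".
        features += ["*"] * (len(IPADIC_FIELDS) - len(features))
        token = {"surface": surface}
        token.update(zip(IPADIC_FIELDS, features))
        tokens.append(token)
    return tokens


def run_mecab(sentence, mecab_path="mecab"):
    """Pipe one sentence through the mecab binary and parse the result."""
    result = subprocess.run(
        [mecab_path], input=sentence,
        capture_output=True, text=True, encoding="utf-8", check=True,
    )
    return parse_mecab_output(result.stdout)


# Hand-written sample in ipadic output format, so the parser can be
# tried without MeCab installed.
SAMPLE_OUTPUT = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)

if __name__ == "__main__":
    for token in parse_mecab_output(SAMPLE_OUTPUT):
        print(token["surface"], token["pos"], token["base_form"])
```

With MeCab on the PATH, run_mecab(input_sentence) would return the same structure for the example sentence.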
  • Original post: https://www.cnblogs.com/lingwang3/p/14424336.html