
How to make LibShortText handle Chinese documents

LibShortText is another notable tool from Prof. Chih-Jen Lin, following libsvm and liblinear. Its main features are:

It is more efficient than general text-mining packages. On a typical computer, processing and training 10 million short texts takes only around half an hour.
The fast training and testing is built upon the linear classifier LIBLINEAR.
Default options often work well without tedious tuning.
An interactive tool for error analysis is included. Based on the property that each short text contains few words, LibShortText provides details in predicting each text.

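So how can a tool like this be used on Chinese text?
I tried its automatic unigram feature generation on Chinese input and found that Chinese characters were not counted as unigrams at all.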

So I emailed the author.
His reply:

Unfortunately I don't think our code can now support Chinese
documents.
Chih-Jen

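However, that impression came entirely from my own limited skill and unfamiliarity with Python. In the post at http://guoze.me/2014/09/25/libshorttext-introduction/ the author points out that you can plug in your own Chinese word segmenter to make the program work on Chinese.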
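
To make the idea of a custom tokenizer concrete, here is a minimal sketch in plain Python. It is not the code from the linked post and not part of LibShortText: the function name `chinese_char_tokenizer` is hypothetical, and it simply emits one token per Chinese character as a stand-in for a real word segmenter. How such a function is registered with LibShortText is not shown here; see the linked post and the LibShortText documentation for that step.

```python
# -*- coding: utf-8 -*-
import re

# One token per CJK ideograph, or one token per run of ASCII letters/digits.
# Assumption: character-level tokens are good enough for a first experiment;
# a real word segmenter could replace this function later.
_TOKEN_RE = re.compile(u"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def chinese_char_tokenizer(text):
    """Split text into single Chinese characters and lowercased ASCII words."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


if __name__ == "__main__":
    for token in chinese_char_tokenizer(u"LibShortText 处理中文"):
        print(token)
```

With character tokens like these, every Chinese character becomes its own unigram, so the characters are no longer dropped during feature generation. Swapping in a proper segmenter only requires keeping the same shape: a function that takes a string and returns a list of tokens.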


Original post: https://www.cnblogs.com/zklidd/p/4079668.html