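How to keep tokens shorter than 2 characters in sklearn's CountVectorizer/TfidfVectorizer
In sklearn, sklearn.feature_extraction.text.CountVectorizer() and sklearn.feature_extraction.text.TfidfVectorizer() discard tokens shorter than 2 characters by default when splitting text into tokens, because the default token_pattern is `(?u)\b\w\w+\b`. Consider the following example: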
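```
from sklearn.feature_extraction.text import CountVectorizer

x = ['i love you', 'i hate you', 'i']
# min_df=1 is the default and keeps every term; newer releases may reject an integer 0
vect = CountVectorizer(min_df=1)
x_train = vect.fit_transform(x)
x_train.toarray()
```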
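After running this, the vocabulary contains only `hate`, `love`, and `you`. The single-character token `i` is dropped, so the third document is encoded as a row of zeros.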
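The filtering can be reproduced without sklearn by applying the documented default pattern directly with Python's `re` module. This is only a minimal sketch: the constant names `DEFAULT_PATTERN` and `RELAXED_PATTERN` are made up for illustration, and the vectorizers also lowercase the text before matching, which this snippet skips.

```python
import re

# Default token_pattern documented for CountVectorizer / TfidfVectorizer:
# a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern (illustrative name): one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

text = "i love you"
default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['love', 'you']
print(relaxed_tokens)  # ['i', 'love', 'you']
```

The only difference between the two patterns is `\w\w+` versus `\w+`, which is why changing `token_pattern` is enough to keep one-character tokens.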

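So what should we do if we want to keep single-character tokens such as 'i'? The approach is as follows:
We can set the minimum document frequency (min_df) and, more importantly, specify the pattern used to split the text into tokens. For example: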
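```
from sklearn.feature_extraction.text import CountVectorizer

x = ['i love you', 'i hate you', 'i']
# \w+ (instead of the default \w\w+) accepts tokens that are one character long
vect = CountVectorizer(min_df=1, token_pattern=r'(?u)\b\w+\b')
x_train = vect.fit_transform(x)
x_train.toarray()
```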
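Result: the vocabulary now includes `i` alongside `hate`, `love`, and `you`, so the third document is no longer encoded as all zeros.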
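To check the effect side by side, the sketch below fits one vectorizer with the default settings and one with the relaxed pattern, then compares their vocabularies. The variable names are illustrative. It reads the fitted `vocabulary_` attribute so that it does not depend on a particular feature-name helper, and it assumes the same `token_pattern` argument is passed to `TfidfVectorizer`, as the title suggests.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["i love you", "i hate you", "i"]

# Default settings: tokens need at least two word characters.
default_vect = CountVectorizer()
default_vect.fit(docs)

# Relaxed pattern: single-character tokens are kept.
relaxed_vect = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
relaxed_counts = relaxed_vect.fit_transform(docs).toarray()

default_vocab = sorted(default_vect.vocabulary_)
relaxed_vocab = sorted(relaxed_vect.vocabulary_)
print(default_vocab)   # ['hate', 'love', 'you']
print(relaxed_vocab)   # ['hate', 'i', 'love', 'you']
print(relaxed_counts)  # one row per document, columns in relaxed_vocab order

# TfidfVectorizer accepts the same token_pattern argument.
tfidf_vect = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_vect.fit(docs)
tfidf_vocab = sorted(tfidf_vect.vocabulary_)
```

With the relaxed pattern the third document, which consists only of `i`, gets a non-zero entry in the `i` column.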
Original article: https://www.cnblogs.com/fujian-code/p/9033253.html