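Chapter 1 Supervised Learning
1.1 Preparation
On Windows, the simplest route is to install Anaconda, which bundles the commonly used Python libraries.
On other platforms it is even simpler: just make sure the following packages are installed.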
- NumPy: http://docs.scipy.org/doc/numpy-1.10.1/user/install.html
- scikit-learn: http://scikit-learn.org/stable/install.html
- matplotlib: http://matplotlib.org/1.4.2/users/installing.html
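1.2 Data preprocessing
Data preprocessing is the most tedious and time-consuming part of the whole machine learning workflow, and it also has a large effect on the final results. Do not neglect it.
Mean removal
- Standardization transforms each feature so that it has zero mean and unit variance, as a standard normal (Gaussian) distribution has.
- preprocessing.scale rescales each column to zero mean and unit variance.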
import numpy as np
from sklearn import preprocessing
data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])
data
array([[ 3. , -1.5, 2. , -5.4],
[ 0. , 4. , -0.3, 2.1],
[ 1. , 3.3, -1.9, -4.3]])
data_standardized = preprocessing.scale(data)
print("Mean = ", data_standardized.mean(axis = 0))
print("Std deviation = ", data_standardized.std(axis = 0))
data_standardized
Mean = [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Std deviation = [1. 1. 1. 1.]
array([[ 1.33630621, -1.40451644, 1.29110641, -0.86687558],
[-1.06904497, 0.84543708, -0.14577008, 1.40111286],
[-0.26726124, 0.55907936, -1.14533633, -0.53423728]])
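The same numbers can be reproduced by hand, which makes the transformation explicit: subtract each column's mean and divide by its standard deviation. This is a minimal sketch that assumes the population standard deviation (`ddof=0`, NumPy's default), which matches the output printed above; the names `col_mean`, `col_std` and `manual` are only illustrative.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# z-score: (x - mean) / std, applied per column through broadcasting
manual = (data - col_mean) / col_std

print(np.round(manual, 4))
```

Each column of `manual` should again have a mean of about 0 and a standard deviation of 1.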
Range scaling
data_scaler = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled = data_scaler.fit_transform(data)
data_scaled
array([[1. , 0. , 1. , 0. ],
[0. , 1. , 0.41025641, 1. ],
[0.33333333, 0.87272727, 0. , 0.14666667]])
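For the range (0, 1) used here, min-max scaling amounts to (x - min) / (max - min) computed per column. A hand-rolled sketch of that formula follows; it assumes no column is constant (otherwise the division is by zero), and the variable names are illustrative.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1
manual_scaled = (data - col_min) / (col_max - col_min)

print(np.round(manual_scaled, 4))
```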
Normalization
data_normalized = preprocessing.normalize(data, norm = 'l1')
data_normalized
array([[ 0.25210084, -0.12605042, 0.16806723, -0.45378151],
[ 0. , 0.625 , -0.046875 , 0.328125 ],
[ 0.0952381 , 0.31428571, -0.18095238, -0.40952381]])
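Unlike the two transformations above, which work column by column, normalization works row by row: with the L1 norm, each sample is divided by the sum of the absolute values of its entries. A small sketch of that computation, with illustrative variable names:

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# L1 norm of every row, kept as a column vector so it broadcasts over the row
row_l1 = np.abs(data).sum(axis=1, keepdims=True)

manual_normalized = data / row_l1

print(np.round(manual_normalized, 4))
```

After this step the absolute values in every row sum to 1.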
Binarization
data_binarized = preprocessing.Binarizer(threshold = 2).transform(data)
data_binarized
array([[1., 0., 0., 0.],
[0., 1., 0., 1.],
[0., 1., 0., 0.]])
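Binarization with a threshold of 2 is a strict comparison: values greater than 2 become 1, everything else (including a value of exactly 2) becomes 0. The sketch below reproduces the result with plain NumPy; `manual_binarized` is an illustrative name.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Strictly greater than the threshold -> 1.0, otherwise 0.0
manual_binarized = (data > 2).astype(float)

print(manual_binarized)
```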
One-hot encoding
encoder = preprocessing.OneHotEncoder()
# Fit on the data; the categories are learned separately for each column
encoder.fit([
[0, 2, 1, 12],
[1, 3, 5, 3],
[2, 3, 2, 12],
[1, 2, 4, 3]
])
encoded_vector = encoder.transform([ [2, 3, 5, 3] ]).toarray()
encoded_vector
array([[0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0.]])
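Analysis of the result
- encoder.fit learns the distinct values (categories) of each column from the training data.
- Column 1, [0, 1, 2, 1], has 3 distinct values [0, 1, 2], encoded as [100, 010, 001].
- Column 2, [2, 3, 3, 2], has 2 distinct values [2, 3], encoded as [10, 01].
- Column 3, [1, 5, 2, 4], has 4 distinct values [1, 2, 4, 5], encoded as [1000, 0100, 0010, 0001].
- Column 4, [12, 3, 12, 3], has 2 distinct values [3, 12], encoded as [10, 01].
- For the input [2, 3, 5, 3], the leading 2 maps to [001], the 3 to [01], the 5 to [0001], and the final 3 to [10]; concatenating these blocks gives the 11-element vector above.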
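The column-by-column reasoning can be sketched in plain NumPy. This is not how scikit-learn implements the encoder, only an illustration of the same idea; `one_hot` is a hypothetical helper, and it simply raises an error for values that were not seen during fitting.

```python
import numpy as np

train = np.array([[0, 2, 1, 12],
                  [1, 3, 5, 3],
                  [2, 3, 2, 12],
                  [1, 2, 4, 3]])

# Sorted distinct values of each column, e.g. the third column -> [1, 2, 4, 5]
categories = [np.unique(train[:, j]) for j in range(train.shape[1])]

def one_hot(sample, categories):
    """Concatenate one indicator block per column (hypothetical helper)."""
    blocks = []
    for value, cats in zip(sample, categories):
        block = np.zeros(len(cats))
        # Position of the value among the sorted categories;
        # raises IndexError if the value was never seen
        block[np.where(cats == value)[0][0]] = 1.0
        blocks.append(block)
    return np.concatenate(blocks)

encoded = one_hot([2, 3, 5, 3], categories)
print(encoded)
```

The block sizes are 3, 2, 4 and 2, which is why the encoded vector has 11 elements.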
1.3 Defining a label encoder
import numpy as np
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']
label_encoder.fit(input_classes)
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)
audi --> 0
bmw --> 1
ford --> 2
toyota --> 3
labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print("labels = ", labels)
print("encoded_labels = ", encoded_labels)
labels = ['toyota', 'ford', 'audi']
encoded_labels = [3 2 0]
The reverse operation: recover the original strings from the integer codes
encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("encoded_labels = ", encoded_labels)
print("decoded_labels = ", decoded_labels)
encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = ['ford' 'bmw' 'audi' 'toyota' 'bmw']
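The mapping printed above is just the position of each label in the sorted list of distinct labels. A dictionary-based sketch of the same encoding and decoding, using illustrative names (`classes`, `to_index`) and no scikit-learn:

```python
input_classes = ['audi', 'ford', 'audi', 'toyota', 'ford', 'bmw']

# Sorted distinct labels; the position in this list is the integer code
classes = sorted(set(input_classes))
to_index = {label: i for i, label in enumerate(classes)}

# Encode strings to integers, then decode integers back to strings
encoded = [to_index[label] for label in ['toyota', 'ford', 'audi']]
decoded = [classes[i] for i in [2, 1, 0, 3, 1]]

print(encoded)
print(decoded)
```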
1.4 Building a linear regressor
import sys
import numpy as np
Read the data from the file
- X holds the input values
- y holds the target values
filename = "data_singlevar.txt"
X = []
y = []
# Each line holds one comma-separated "x,y" pair
with open(filename, 'r') as f:
    for line in f:
        xt, yt = [float(i) for i in line.split(',')]
        X.append(xt)
        y.append(yt)
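The contents of data_singlevar.txt are not shown here; the parsing loop only requires that every line holds two comma-separated numbers. The sketch below runs the same loop on a made-up in-memory sample (the three value pairs are invented for illustration and are not the real file contents).

```python
import io

# Made-up sample in the assumed "x,y" per-line layout (not the real file)
sample = io.StringIO("4.94,4.37\n-1.58,1.70\n-4.45,1.88\n")

X, y = [], []
for line in sample:
    # float() ignores the trailing newline left on the second field
    xt, yt = [float(v) for v in line.split(',')]
    X.append(xt)
    y.append(yt)

print(X)
print(y)
```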
Split the data into a training set and a test set
- Use the first 80% of the data for training and the remaining 20% for testing
num_training = int(0.8 * len(X))
num_test = len(X) - num_training
# train data
X_train = np.array(X[:num_training]).reshape((num_training, 1))
y_train = np.array(y[:num_training])
# test_data
X_test = np.array(X[num_training:]).reshape((num_test, 1))
y_test = np.array(y[num_training:])
Create the regressor object
from sklearn import linear_model
# Create the linear regression object
linear_regressor = linear_model.LinearRegression()
# Fit the model on the training set
linear_regressor.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
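For a single input variable, the ordinary least squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below works this out on made-up points that lie exactly on y = 2x + 1, so the expected slope and intercept are known in advance; the data is invented and is not taken from data_singlevar.txt.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# The fitted line passes through the point (mean_x, mean_y)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

On a fitted scikit-learn model, the corresponding values are available as `linear_regressor.coef_[0]` and `linear_regressor.intercept_`.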
Plot the fit on the training data
import matplotlib.pyplot as plt
# Show figures inline in Jupyter
#%matplotlib inline
y_train_pred = linear_regressor.predict(X_train)
plt.figure()
plt.scatter(X_train, y_train, color = 'green')
plt.plot(X_train, y_train_pred, color='red', linewidth=4)
plt.title('Training data')
plt.show()
y_train_pred
array([4.850913 , 2.29390029, 1.16834408, 0.5369345 , 2.43508504,
1.52130596, 3.05472923, 1.64288172, 3.4273001 , 3.76457478,
4.06655328, 2.552739 , 2.5566608 , 3.39984751, 3.52534506,
1.28991984, 4.38421897, 4.54109091, 3.04296383, 4.25087781,
3.80379277, 3.93321212, 3.32925513, 3.32141154, 3.9881173 ,
2.63509677, 1.83504985, 3.1292434 , 1.56052395, 3.34102053,
3.88222874, 0.42320234, 3.63123363, 2.64686217, 1.4114956 ,
2.11741935, 4.14106745, 3.27434995, 4.49010753, 4.43912415])
Run the test data through the model and check the result
y_test_pred = linear_regressor.predict(X_test)
plt.scatter(X_test, y_test, color = 'green')
plt.plot(X_test, y_test_pred, color = 'red', linewidth = 2)
plt.title('Test Data')
Text(0.5, 1.0, 'Test Data')
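1.5 Computing regression accuracy
Accuracy metrics
- Mean absolute error: the average of the absolute errors over all data points in the dataset.
- Mean squared error: the average of the squared errors over all data points. This is the most commonly used metric.
- Median absolute error: the median of the absolute errors over all data points. It is robust to outliers.
- Explained variance score: measures how well the model accounts for the variation in the dataset. A score of 1.0 means the model is perfect.
- R2 score: read as "R squared", this is the coefficient of determination, which measures how well the model predicts unseen samples. The best possible score is 1.0.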
import sklearn.metrics as sm
print("Mean absolute error = ", round(sm.mean_absolute_error(y_test, y_test_pred), 2))
print("Mean squared error = ", round(sm.mean_squared_error(y_test, y_test_pred), 2))
print("Median absolute error = ", round(sm.median_absolute_error(y_test, y_test_pred), 2))
print("Explained variance score = ", round(sm.explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(sm.r2_score(y_test, y_test_pred), 2))
Mean absolute error = 0.54
Mean squared error = 0.38
Median absolute error = 0.54
Explained variance score = 0.68
R2 score = 0.68
- In general, aim for the lowest mean squared error and the highest explained variance score.
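To make the five definitions concrete, the sketch below computes each metric directly from its formula with NumPy. The targets and predictions are made-up values used only for illustration; they are unrelated to the regression results above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

err = y_true - y_pred

mae = np.mean(np.abs(err))        # mean absolute error
mse = np.mean(err ** 2)           # mean squared error
medae = np.median(np.abs(err))    # median absolute error

# Explained variance: 1 - Var(error) / Var(target)
explained_variance = 1 - np.var(err) / np.var(y_true)

# R2: 1 - (sum of squared errors) / (total sum of squares around the mean)
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, medae)
print(round(explained_variance, 3), round(r2, 3))
```

The last two scores differ only when the errors do not average to zero: explained variance ignores a constant offset in the predictions, while R2 penalizes it.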
1.6 Saving the model
Save the model
import pickle
output_model_file = 'saved_model.pkl'
with open(output_model_file, 'wb') as f:
    pickle.dump(linear_regressor, f)
Load the model
with open(output_model_file, 'rb') as f:
    model_linregr = pickle.load(f)
y_test_pred_new = model_linregr.predict(X_test)
print("", round(sm.mean_absolute_error(y_test, y_test_pred_new), 2))
0.54
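The round trip can also be tried without touching the disk. In the sketch below, `pickle.dumps` and `pickle.loads` are the in-memory counterparts of `dump` and `load`, and a plain dictionary stands in for a fitted model's parameters (the slope and intercept values are hypothetical).

```python
import pickle

# Hypothetical stand-in for a fitted model's parameters
params = {"slope": 2.0, "intercept": 1.0}

blob = pickle.dumps(params)     # bytes, in the same format dump() writes to a file
restored = pickle.loads(blob)   # a new object with equal contents

print(restored == params, restored is params)
```

The restored object is an equal copy rather than the same object, which is why the reloaded regressor reproduces the earlier error value. Unpickling can execute arbitrary code, so only load pickle files from sources you trust.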