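Many frameworks provide a Pipeline mechanism: a sequence of operations is packaged into a single flow, and calling it runs the steps in the planned order. netty has ChannelPipeline, for example, and TensorFlow's computation graph follows the same idea.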
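To make the idea concrete, here is a minimal sketch of such a mechanism in plain Python. The `SimplePipeline` class and its `run` method are hypothetical names invented for illustration; they do not belong to netty, TensorFlow, or sklearn.

```python
from typing import Any, Callable, List, Tuple


class SimplePipeline:
    """Hypothetical minimal pipeline: holds named steps and runs them in order."""

    def __init__(self, steps: List[Tuple[str, Callable[[Any], Any]]]):
        self.steps = steps

    def run(self, data: Any) -> Any:
        # Feed the output of each step into the next one.
        for _name, step in self.steps:
            data = step(data)
        return data


# The flow is declared once, then executed with a single call.
pipeline = SimplePipeline(steps=[
    ("strip", str.strip),   # remove surrounding whitespace
    ("lower", str.lower),   # normalize case
    ("split", str.split),   # tokenize on whitespace
])

result = pipeline.run("  Hello Pipeline World  ")
print(result)  # ['hello', 'pipeline', 'world']
```

The `(name, step)` tuple layout mirrors the `steps=[...]` argument that sklearn's `Pipeline` accepts, which is why it is used here.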
The following is a brief introduction to using Pipeline in sklearn:
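from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Preprocessor for categorical features: fill missing values with the
# most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Preprocessor for numerical features: fill missing values with a constant
# (fill_value defaults to 0 for numeric data)
numerical_transformer = SimpleImputer(strategy='constant')

# Apply the numerical and categorical preprocessors to their respective columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['Age']),
        ('cat', categorical_transformer, ['Embarked'])
    ])

# Define the Pipeline from the preprocessor and the chosen model
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0))
])

# Use the pipeline.
# X (a DataFrame containing the 'Age' and 'Embarked' columns) and y (the labels)
# are assumed to be loaded already.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Train. With default settings these transformers work on copies of their input,
# so .copy() is a precaution here rather than a requirement.
my_pipeline.fit(X_train.copy(), y_train.copy())

# Predict: the same preprocessing is applied to X_valid automatically
preds = my_pipeline.predict(X_valid)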
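The snippet above does not define `X` and `y`, so it cannot be run on its own. The sketch below is a self-contained variant of the same pipeline. The ten-row toy DataFrame is made up for illustration and is not real data; only the column names `Age` and `Embarked` come from the original code. It also checks, under default settings, whether fitting modifies the training frame, and shows how a fitted step can be retrieved by name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (invented for this example), with missing values
# in both the numerical and the categorical column.
X = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0, 14.0, 58.0],
    "Embarked": pd.Series(
        ["S", "C", "S", np.nan, "Q", "S", "C", "S", np.nan, "Q"], dtype="object"
    ),
})
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numerical_transformer = SimpleImputer(strategy="constant")

preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, ["Age"]),
    ("cat", categorical_transformer, ["Embarked"]),
])

my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Keep a snapshot so we can verify that fit() leaves X_train untouched.
X_before = X_train.copy()
my_pipeline.fit(X_train, y_train)
unchanged = X_train.equals(X_before)  # NaNs in the same positions compare equal

preds = my_pipeline.predict(X_valid)

# Fitted steps can be looked up by the names given in `steps`.
model = my_pipeline.named_steps["model"]

print("input unchanged after fit:", unchanged)
print("number of predictions:", len(preds))
print("fitted model type:", type(model).__name__)
```

Because the preprocessing is part of the pipeline, `predict` only needs the raw validation frame; the imputation and encoding learned during `fit` are reused automatically.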