This post introduces Adaboost, the adaptive boosting algorithm in machine learning. My main references are Dr. Li Hang's "Statistical Learning Methods" (《统计学习方法》) and a very talented fellow student's blog; the code I used in my experiments was adapted from his. I haven't been working with machine learning for very long, so if anything in this post is wrong, please point it out!
Machine learning is a promising field, and it has been a hot topic in recent years: Google's "super brain", Baidu's Institute of Deep Learning, and so on. Not long ago Google publicly unveiled its self-driving car, and Microsoft demonstrated real-time translation for Skype.
Adaboost is a very powerful machine learning method: it builds a "strong classifier" out of several "weak classifiers", so that even a team of "mediocre players" can, together, perform like an expert. Given a data set to classify, it is obviously far easier to construct a very simple classifier with merely passable accuracy than a carefully engineered one with excellent accuracy. A "weak classifier" is exactly such a simple classifier. How weak is "weak"? Its accuracy only needs to be better than random guessing (which is correct 50% of the time for two classes). This requirement is almost trivially easy to satisfy: even if your classifier is correct less than 50% of the time, flipping its output yields one that beats chance. A "strong classifier", by contrast, is one that classifies well. But a question immediately arises: can the "mediocre players" really become experts? Researchers settled this long ago (Schapire showed that weak learnability implies strong learnability), and the answer is yes: weak classifiers can be combined into a strong one. Adaboost, short for Adaptive Boosting, is one concrete way of doing so.
As an example, suppose we want to decide whether a person is a man or a woman. We can construct a classifier described in words as follows:
IF hair length is at most 15 cm THEN man ELSE woman
This classifier could hardly be simpler: it uses a single rule, and the decision is admittedly crude. From everyday experience, it should achieve some classification accuracy, but we are still not satisfied.
What about adding another classifier? (This one may be poking fun at myself...)
IF height is greater than 170 cm THEN man ELSE woman
Beyond these, we could list other weak classifiers. What the Adaboost algorithm does is knead them together: each weak classifier casts a weighted vote, and the combined vote finally decides whether a given person is a man. The problem therefore reduces to determining the weight of each weak classifier, as in the sketch below.
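To make the weighted vote concrete, here is a minimal sketch using the two stumps above; the voting weights 0.6 and 0.4 are made-up values for illustration only, not the result of any training:

#include <iostream>

// Weighted vote of two hypothetical stumps; the weights are made up.
int classify (double hair_cm , double height_cm)
{
    double score = 0;
    score += 0.6 * (hair_cm <= 15.0   ? +1 : -1); // "short hair -> man"
    score += 0.4 * (height_cm > 170.0 ? +1 : -1); // "tall -> man"
    return score > 0 ? +1 : -1;                   // +1: man, -1: woman
}

int main ()
{
    // the two stumps disagree here; the larger 0.6 vote wins, output +1 (man)
    std::cout << classify (10.0 , 165.0) << std::endl;
}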
The basic idea of Adaboost is as follows:
Assign every sample an initial weight (for example 1/SAMPLE_NUM). In each round, first use the weighted sample set to pick the best weak classifier (the one for which the total weight of misclassified samples is smallest); then use that classifier's error to compute its weight in the final strong classifier, and adjust the sample weights: samples it misclassified get larger weights, correctly classified samples get smaller ones. The loop stops when a condition is met (the iteration limit is reached, or every sample is classified correctly). The outputs of the weak classifiers, each scaled by its own weight, are combined linearly into the final strong classifier.
/*
 * Initialize the weight of every sample: w_0 = 1/N; fix the number of weak classifiers T
 * i = 0
 * while i < T:
 *     using the sample weights, pick the weak classifier with the smallest classification error
 *     (the classification error err is the sum of the weights of the samples it misclassifies)
 *     classifier weight: alpha_i = 0.5 * log((1-err)/err)
 *     update the sample weights:
 *         if this round's best weak classifier classifies the sample correctly: eta = exp(-alpha_i)
 *         else: eta = exp(alpha_i)
 *         w_(i+1) = w_i * eta
 *     normalize w_(i+1) so that the weights sum to 1
 * strong classifier = sign( sum_i alpha_i * f_i(x) )
 */
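As a quick numerical sanity check (the value err = 0.3 is made up for illustration), one round of this update works out as follows; note that exp(alpha) equals sqrt((1-err)/err), which is exactly the multiplier used in updateWeight() further below:

#include <cmath>
#include <iostream>

// One round of the Adaboost update with a made-up error of 0.3.
int main ()
{
    double err   = 0.3;
    double alpha = 0.5 * std::log ((1.0 - err) / err); // classifier weight, ~0.4236
    double up    = std::exp (alpha);   // multiplier for misclassified samples, ~1.5275
    double down  = std::exp (-alpha);  // multiplier for correct samples, ~0.6547
    std::cout << alpha << " " << up << " " << down << std::endl;
    // exp(alpha) == sqrt((1-err)/err) and exp(-alpha) == sqrt(err/(1-err)),
    // the same factors that updateWeight() in weakclassifier.cpp applies
}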
Or, as shown in the figure below (the notation differs slightly):
This is how a number of weak classifiers are combined into a strong one.
Below is the example code. It uses a few OpenCV library functions (mainly for visualization) and separates points lying in different regions of a 2-D plane; the weak classifiers are very simple if-else rules (decision stumps) of the form
h_j(x) = +1 if p_j * f_j(x) < p_j * θ_j, and -1 otherwise. Here f_j(x) is one element of the feature vector x, θ_j is a threshold, and p_j takes the value +1 or -1, serving to flip the direction of the inequality (see Li Hang's "Statistical Learning Methods" for details).
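In code the stump is a three-line function; this little sketch mirrors the h_fun () that appears in weakclassifier.cpp below:

#include <iostream>

// decision stump: parity flips the direction of the inequality
int stump (double fx , double thresh , int parity)
{
    return (parity * fx < parity * thresh) ? 1 : -1;
}

int main ()
{
    std::cout << stump (10.0 , 15.0 , +1) << std::endl; // prints  1: 10 < 15
    std::cout << stump (10.0 , 15.0 , -1) << std::endl; // prints -1: inequality flipped
}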
The project consists of weakclassifier.h, weakclassifier.cpp, and the test file main.cpp. First, weakclassifier.h:
#ifndef WEAKCLASSIFIER_H
#define WEAKCLASSIFIER_H

#include <opencv2/opencv.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;
using namespace cv;

#define P_SAMPLE_NUM 300 // number of positive samples
#define N_SAMPLE_NUM 400 // number of negative samples
#define SAMPLE_NUM (P_SAMPLE_NUM+N_SAMPLE_NUM)
#define MAX_FEATURE 40   // number of candidate features (projection directions)
#define X_MAX 300
#define Y_MAX 300
//#define CIRCLE_TEST

// decision function of a weak classifier (a decision stump)
int h_fun (double x , double thresh , int parity);

// one training sample: _features holds the features, _label the class label,
// _eigen caches the feature currently being examined, _weight is the Adaboost weight
struct SampleData
{
    double _features[MAX_FEATURE];
    int _label;
    double _eigen;
    double _weight;
};

struct Weakclassifier
{
public:
    double _threshold;
    int _feature;
    int _parity;
    double _error;
    double _alpha;
};

struct StrongClassifier
{
    int _nweak;                    // number of weak classifiers = number of training iterations
    vector<Weakclassifier> _weak;  // the weak classifiers found so far
    SampleData* _samples;          // pointer to the sample data
    IplImage* frame;               // image used for visualization

    StrongClassifier (SampleData* samples , int nweak)
    {
        this->_nweak = nweak;
        this->_samples = samples;
        this->frame = cvCreateImage (cvSize (X_MAX , Y_MAX) , IPL_DEPTH_8U , 3);
        // fill the image with white
        for (int i = 0; i < Y_MAX; i++)
            for (int j = 0; j < X_MAX; j++)
                CV_IMAGE_ELEM (frame , uchar , i , 3 * j) =
                CV_IMAGE_ELEM (frame , uchar , i , 3 * j + 1) =
                CV_IMAGE_ELEM (frame , uchar , i , 3 * j + 2) = 255;
    }

    // copy the chosen feature of every sample into _eigen
    void getEigenVal (const int& feature);
    // sort the samples by _eigen
    void sortSamples ();
    // update the sample weights after a round
    void updateWeight (const Weakclassifier& best);
    // train the strong classifier
    void train ();
    // find the best weak classifier for one feature in the current iteration
    bool getWeakclassifier (Weakclassifier& weakc , const int& feature);
    // classification result of the current strong classifier for one sample
    int getClassifyResult (const SampleData& s);
    // visualize the decision regions
    void drawResult ();
    void display ()
    {
        cvShowImage ("show" , frame);
    }
};

bool Compare_fun (SampleData s1 , SampleData s2);

#endif // WEAKCLASSIFIER_H
weakclassifier.cpp:
#include "weakclassifier.h" //define the class weakclassifier int h_fun (double x , double thresh , int parity) { return (parity*x<parity*thresh) ? 1 : -1; } bool Compare_fun (SampleData s1 , SampleData s2) { return s1._eigen < s2._eigen ? true : false; } void swapSampleData (SampleData& s1 , SampleData& s2) { SampleData tmp = s2; s2 = s1; s1 = tmp; return; } void generateFeatures (SampleData& s,const int x,const int y) { #ifdef CIRCLE_TEST for (int i = 0; i<MAX_FEATURE-1;i++) { s._features[i]= std::cos (CV_PI*i/MAX_FEATURE)*x+ std::sin (CV_PI*i/MAX_FEATURE)*y; } s._features[MAX_FEATURE-1] = (x-150)*(x-150)+(y-150)*(y-150); #else for (int i = 0; i<MAX_FEATURE;i++) { s._features[i] = std::cos (CV_PI*i/MAX_FEATURE)*x+ std::sin (CV_PI*i/MAX_FEATURE)*y; } #endif return; } void StrongClassifier::sortSamples () { std::sort (_samples , _samples+SAMPLE_NUM , Compare_fun); /* SampleData* Psamples = &(this->_samples[0]); for (int i = 0; i<SAMPLE_NUM;i++) { double mineigen = Psamples[i]._eigen; int index = i; for (int j = i; j<SAMPLE_NUM; j++) { if (mineigen>Psamples[j]._eigen) { index = j; mineigen = Psamples[j]._eigen; } } if (index!=i) swapSampleData (Psamples[index] , Psamples[i]); } */ return; } // 得到Eigen的值 void StrongClassifier::getEigenVal (const int& feature) { for (int i = 0; i<SAMPLE_NUM;i++) { this->_samples[i]._eigen = this->_samples[i]._features[feature]; } return; } bool StrongClassifier::getWeakclassifier (Weakclassifier& weakc , const int& feature) { /*1.将SAMPLEDATA按照feature的值顺序排列*/ // 产生eigen this->getEigenVal (feature); // 排序 this->sortSamples (); /**/ // 统计正样本权重和负样本权重 double pos_weight = 0; double neg_weight = 0; for (int i = 0; i<SAMPLE_NUM; i++) { if (_samples[i]._label==1) pos_weight += _samples[i]._weight; else neg_weight += _samples[i]._weight; } // 按照训练算法训练 double loss_pos_weight = 0 , loss_neg_weight = 0; double besterror = 0.5; int bestparity = 0; double bestthresh = -1; // for (int i = 1; i<SAMPLE_NUM; i++) { if (_samples[i-1]._label==1) loss_pos_weight +=_samples[i-1]._weight; else loss_neg_weight += _samples[i-1]._weight; // FP + FN if ((loss_pos_weight + neg_weight - loss_neg_weight) < besterror) { besterror = loss_pos_weight + neg_weight - loss_neg_weight; bestparity = -1; //the optimal threshold is the half of the sum of kth and (k+1)th bestthresh = (_samples[i]._eigen + _samples[i-1]._eigen) / 2; } // FN+FP else if (loss_neg_weight + pos_weight - loss_pos_weight < besterror) { besterror = loss_neg_weight + pos_weight - loss_pos_weight; bestparity = 1; bestthresh = (_samples[i]._eigen + _samples[i-1]._eigen) / 2; } } CV_Assert (besterror>=0); weakc._threshold = bestthresh; weakc._error = besterror; weakc._parity = bestparity; weakc._feature = feature; weakc._alpha = 0.5*std::log ((1.0-besterror)/(besterror+1E-8)); return true; } //训练 强分类器 void StrongClassifier::train () { int classifier_num = this->_nweak; //要用到的弱分类器数量 Weakclassifier besth , h_tmp; for (int i = 0; i<classifier_num; i++) //最多这么多弱分类器,每个弱分类器对应了一个特征 { /*1 . 找到 使得加权误差最小的弱分类器 */ //找到最优的分类器 double curerrror = 0.5; for (int j = 0; j<MAX_FEATURE; j++) { this->getWeakclassifier (h_tmp , j); if (h_tmp._error<curerrror) { curerrror = h_tmp._error; besth = h_tmp; } } CV_Assert (curerrror<0.5); this->_weak.push_back (besth); //找到了这次迭代步骤中的最优弱分类器 std::cout<<"****************************************************"<<endl; std::cout<<"Best Classifier :" <<i<<" Complete! 
" <<endl; std::cout<<"Threshold: "<<besth._threshold<<" "<<"Parity: "<<besth._parity<<" " <<endl<<"Error "<<besth._error<<" "<<"Alpha "<<besth._alpha<<" "<< "Feature Index:"<<besth._feature<<endl; //update the weight this->updateWeight (besth); int errorcount = 0; for (int j = 0; j<SAMPLE_NUM; j++) { SampleData* Ps = &(_samples[j]); if (this->getClassifyResult (*Ps)!=Ps->_label) errorcount++; } cout<<"There is "<<errorcount<< " error !"<<endl; cout<<"--------------------------------------"<<endl; /*画图*/ this->drawResult (); int c = waitKey (); while (c!=27) { c = waitKey (); } if (errorcount==0) { break; } } return; } void StrongClassifier::drawResult () { for (int i = 0; i < Y_MAX; i++) for (int j = 0; j < X_MAX; j++) CV_IMAGE_ELEM (frame , uchar , i , 3 * j) = CV_IMAGE_ELEM (frame , uchar , i , 3 * j + 1) = CV_IMAGE_ELEM (frame , uchar , i , 3 * j + 2) = 0; for (int y = 0; y<Y_MAX; y += 1) { for (int x = 0; x<X_MAX; x += 1) { SampleData s; generateFeatures (s , x , y); int label = this->getClassifyResult (s); if (label==1) { CV_IMAGE_ELEM (frame , uchar , y , 3 * x + 1) = 255; } else { CV_IMAGE_ELEM (frame , uchar , y , 3 * x + 2) = 255; } } } cvShowImage ("Img" , this->frame); } int StrongClassifier::getClassifyResult (const SampleData& s) { double res = 0; int curWeakNum = this->_weak.size (); Weakclassifier* Pweak; for (int i = 0; i<curWeakNum; i++) { Pweak = &(this->_weak[i]); res += Pweak->_alpha*h_fun (s._features[Pweak->_feature], Pweak->_threshold , Pweak->_parity); } int label = res>0 ? 1 : -1; return label; } void StrongClassifier::updateWeight (const Weakclassifier& best) { double weight_sum = 0; double weight[SAMPLE_NUM]; double weight_tmp; for (int i = 0; i<SAMPLE_NUM;i++) { SampleData* Ps = _samples+i; int label = h_fun (Ps->_features[best._feature] , best._threshold , best._parity); CV_Assert (Ps->_label==1 || Ps->_label==-1); if (label!=Ps->_label) //预测错了 { weight_tmp = Ps->_weight*std::sqrt ((1-best._error)/best._error); } else { weight_tmp = Ps->_weight*std::sqrt (best._error/(1-best._error)); } weight_sum += weight_tmp; weight[i] = weight_tmp; } for (int i = 0; i<SAMPLE_NUM;i++) { SampleData* Ps = _samples+i; Ps->_weight = weight[i]/weight_sum; } return; }
The test file main.cpp:
#include "weakclassifier.h" #include <iostream> #include <time.h> using std::cout; using std::cin; using namespace cv; double point_x[SAMPLE_NUM] = { }; double point_y[SAMPLE_NUM] = { }; void generateTrainMat (Mat& trainMat) { trainMat.create (SAMPLE_NUM , MAX_FEATURE+1 , CV_64FC1); //trainMat srand (time (0)); int counter = 0; int random_x , random_y; while (counter < P_SAMPLE_NUM) { random_x = rand () % 300 - 150; random_y = rand () % 300 - 150; #ifdef CIRCLE_TEST if (random_x * random_x + random_y * random_y > 2500 && random_x*random_x+random_y*random_y<3600) continue; #else if (random_x * random_x + random_y * random_y > 2500) continue; #endif point_x[counter] = random_x + 150; point_y[counter] = random_y + 150; trainMat.at<double> (counter , 0) = 1; #ifdef CIRCLE_TEST for (int j = 0; j<MAX_FEATURE-1; j++) trainMat.at<double> (counter , j+1) = std::cos (CV_PI*j/MAX_FEATURE)*point_x[counter]+ std::sin (CV_PI*j/MAX_FEATURE)*point_y[counter]; trainMat.at<double> (counter , MAX_FEATURE) = random_x * random_x + random_y * random_y; #else for (int j = 0; j<MAX_FEATURE; j++) trainMat.at<double> (counter , j+1) = std::cos (CV_PI*j/MAX_FEATURE)*point_x[counter]+ std::sin (CV_PI*j/MAX_FEATURE)*point_y[counter]; #endif counter++; } while (counter < SAMPLE_NUM) { random_x = rand () % 300 - 150; random_y = rand () % 300 - 150; #ifdef CIRCLE_TEST if (random_x * random_x + random_y * random_y < 2500 || random_x*random_x+random_y*random_y>3600) continue; #else if (random_x * random_x + random_y * random_y < 2500) continue; #endif point_x[counter] = random_x + 150; point_y[counter] = random_y + 150; trainMat.at<double> (counter , 0) = -1; #ifdef CIRCLE_TEST for (int j = 0; j<MAX_FEATURE-1; j++) trainMat.at<double> (counter , j+1) = std::cos (CV_PI*j/MAX_FEATURE)*point_x[counter]+ std::sin (CV_PI*j/MAX_FEATURE)*point_y[counter]; trainMat.at<double> (counter , MAX_FEATURE) = random_x * random_x + random_y * random_y; #else for (int j = 0; j<MAX_FEATURE; j++) trainMat.at<double> (counter , j+1) = std::cos (CV_PI*j/MAX_FEATURE)*point_x[counter]+ std::sin (CV_PI*j/MAX_FEATURE)*point_y[counter]; #endif counter++; } } void displayPic (const StrongClassifier& cl) { // display int i = 0; for (i = 0; i<P_SAMPLE_NUM;i++) { // display cvCircle (cl.frame , cvPoint (int (point_x[i]) , int (point_y[i])) , 3 , cvScalar (0 , 255 , 0) , 1); } for (; i<SAMPLE_NUM; i++) { cvCircle (cl.frame , cvPoint (int (point_x[i]) , int (point_y[i])) , 3 , cvScalar (0 , 0 , 255) , 1); } return; } void generateSampleData (SampleData* samples , const Mat& trainMat) { for (int i = 0; i<SAMPLE_NUM; i++) { samples[i]._label = (int) (trainMat.at<double> (i , 0)); for (int j = 1; j<=MAX_FEATURE; j++) { samples[i]._features[j-1] = trainMat.at<double> (i , j); } samples[i]._weight = 1.0/SAMPLE_NUM; } } int main () { Mat trainMat; generateTrainMat (trainMat); SampleData samples[SAMPLE_NUM]; generateSampleData (samples , trainMat); StrongClassifier classifier(samples,100); displayPic (classifier); classifier.display (); classifier.train (); waitKey (); }
The results:
First, the sample data set:
The classification result after one iteration:
After 5 iterations:
After 28 iterations:
After 31 iterations, the number of classification errors reaches 0.