Implementing Linear Regression with MapReduce

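1. Software versions:
Hadoop 2.6.0 (the source is compiled in IDEA against CDH 5.7.3, which corresponds to Hadoop 2.6.0); the cluster runs stock Hadoop 2.6.4. JDK 1.8, IntelliJ IDEA 14.
The source code can be downloaded from https://github.com/fansy1990/linear_regression.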
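2. Approach:
This post implements linear regression with a single variable (y = theta0 + theta1 * x), the simplest linear model. The parameters are updated with the large-data approach from the Coursera Machine Learning course, namely stochastic gradient descent. Batch gradient descent would also work; only the implementation in LinearRegressionJob would differ. Readers unfamiliar with stochastic or batch gradient descent should read up on them first. The approach is as follows: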
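2.1 Shuffle Data:
Stochastic gradient descent needs the input records in random order, so the first step is to shuffle the original data.
The idea: the Mapper emits a random value as the key and the current record as the value; the Reducer simply iterates over all values of each key and writes out each value together with NullWritable.get().
An extra parameter, randN, is added here. It specifies how many consecutive input records share one random key in the Mapper. If randN is 1, every record gets its own random key; if randN is 2, every two records share one key; if randN is 0 or negative, all records share the same key. (Note that even in that last case the values seen by the Reducer are in fact out of order. Readers may want to think about why.)
The core of the Mapper's map method is shown below: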
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (randN <= 0) { // randN <= 0: keep the current key, never draw a new one
        context.write(randFloatKey, value);
        return;
    }
    if (++countI >= randN) { // with randN == 1, every record gets a fresh random key
        randFloatKey.set(random.nextFloat());
        countI = 0;
    }
    context.write(randFloatKey, value);
}
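To see the effect of randN without a cluster, the grouping can be imitated in plain Java. This is only a sketch under assumptions: the class ShuffleSketch, its shuffle method, and the seed parameter are hypothetical names that are not part of the original project, and a TreeMap stands in for the framework's sort-by-key with a single reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Plain-Java imitation of the shuffle job (hypothetical helper, not from the project). */
final class ShuffleSketch {

    /** Returns the records in the order a single reducer would write them. */
    static List<String> shuffle(List<String> records, int randN, long seed) {
        Random random = new Random(seed);
        float key = random.nextFloat(); // initial key, assumed to be chosen before the first record
        int count = 0;
        // The TreeMap plays the role of the framework: it groups values by key and sorts the keys.
        Map<Float, List<String>> grouped = new TreeMap<>();
        for (String record : records) {
            if (randN > 0 && ++count >= randN) {
                key = random.nextFloat(); // draw a new key every randN records
                count = 0;
            }
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        List<String> out = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            out.addAll(values); // the reducer writes every value of every key
        }
        return out;
    }
}
```

Calling ShuffleSketch.shuffle(records, 1, seed) gives each record its own key, so the output order follows the sorted random keys. With randN <= 0 this sketch returns the input order unchanged, which is a simplification: on a real cluster the values of one key arrive from several mappers in no guaranteed order.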
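2.2 Linear Regression:
Linear regression uses stochastic gradient descent to update theta0 and theta1 (only the single-variable case is implemented, so there are just two parameters). Every Mapper starts from the same initial parameters (theta0=1 and theta1=0) and updates theta0 and theta1 using only its own share of the data. The update rule is:
theta0 = theta0 - alpha * (h(x) - y)
theta1 = theta1 - alpha * (h(x) - y) * x
where h(x) = theta0 + theta1 * x. Note that the two parameters must be updated simultaneously: both updates use the values of theta0 and theta1 from before the step. The core code is shown below:
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    float[] xy = Utils.str2float(value.toString().split(splitter));
    float x = xy[0];
    float y = xy[1];
    // Simultaneous update: compute h(x) - y once from the current theta0 and theta1
    float error = theta0 + theta1 * x - y;
    theta0 -= alpha * error;     // the intercept's feature is the constant 1
    theta1 -= alpha * error * x;
}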
Each Mapper then simply writes its theta values in its cleanup method:
protected void cleanup(Context context) throws IOException, InterruptedException {
    theta0_1.set(theta0 + splitter + theta1);
    context.write(theta0_1, NullWritable.get());
}
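What a single Mapper computes can be reproduced outside Hadoop. The following is a minimal sketch, assuming the standard squared-error gradient (factor 1 for theta0, factor x for theta1); the class SgdSketch and its fit method are hypothetical names used only for illustration.

```java
/** One pass of stochastic gradient descent over a mapper's records (hypothetical helper). */
final class SgdSketch {

    /** Visits each (x, y) pair once and returns {theta0, theta1}. */
    static float[] fit(float[][] xy, float theta0, float theta1, float alpha) {
        for (float[] point : xy) {
            float x = point[0];
            float y = point[1];
            // h(x) - y, computed once so that both updates see the same old parameters
            float error = theta0 + theta1 * x - y;
            theta0 -= alpha * error;
            theta1 -= alpha * error * x;
        }
        return new float[]{theta0, theta1};
    }
}
```

For example, starting from theta0=1, theta1=0 with alpha=0.1, the single record (2, 5) gives an error of -4, so the parameters move to roughly 1.4 and 0.8. Records that already lie on the current line leave the parameters unchanged.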
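Because each mapper has already updated the theta parameters, no reducer is needed. Also, because the test data set is small, mapreduce.input.fileinputformat.split.maxsize is set to a small value so that more than one mapper runs; readers should choose it according to the size of their own data. The core of the Driver class is shown below:
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 700L); // small splits, so several mappers run
job.setNumReduceTasks(0); // map-only job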
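How many mappers a given maxsize produces can be estimated with a little arithmetic. The sketch below is a simplified model of the split calculation in Hadoop's FileInputFormat as I understand it (split size is max(minSize, min(maxSize, blockSize)), and a final remainder of up to 10% is folded into the last split). The class SplitCountSketch and the file lengths used with it are hypothetical; it covers one non-empty, splittable file only.

```java
/** Simplified model of how many input splits one file yields (hypothetical helper). */
final class SplitCountSketch {
    // Assumed slack factor: a tail of up to 10% of a split is not cut off separately.
    private static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileLength, long blockSize, long minSize, long maxSize) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long remaining = fileLength;
        int splits = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;               // a full-size split
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++;               // the last, possibly shorter or slightly longer, split
        }
        return splits;
    }
}
```

Under this model a hypothetical 1400-byte file with maxsize 700 yields 2 splits, and a 1500-byte file yields 3. Each split is processed by one mapper, so each split contributes one theta pair to the output of this step.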
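2.3 Combine Theta:
Step 2.2 produces one theta pair per mapper. How should these be merged? Can they simply be averaged? For single-variable linear regression, the average can be used directly as the final merged theta. For other regression problems (specifically those with several local minima), merging the theta values this way would be problematic.
If only the average were needed, adding a Reducer to step 2.2 would be enough. Here a different way of merging is proposed: each theta is weighted according to its global error. The Mapper's setup therefore reads all the theta values output by step 2.2, the map function accumulates the error of each theta over the input records, and the data sent to the reducer is each theta together with its error. The core code is shown below: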
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    float[] xy = Utils.str2float(value.toString().split(splitter));
    for (int i = 0; i < thetas.size(); i++) {
        // squared error = (theta0 + theta1 * x - y)^2
        float residual = thetas.get(i)[0] + thetas.get(i)[1] * xy[0] - xy[1];
        thetaErrors[i] += residual * residual;
        thetaNumbers[i] += 1;
    }
}
protected void cleanup(Context context) throws IOException, InterruptedException {
    for (int i = 0; i < thetas.size(); i++) {
        theta.set(thetas.get(i));
        floatAndLong.set(thetaErrors[i], thetaNumbers[i]);
        context.write(theta, floatAndLong);
    }
}
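The reason each mapper emits a sum of squared errors plus a record count, rather than a finished error value, is that sums from different mappers can be added together before the square root is taken. A small plain-Java sketch of that idea follows; the class ThetaErrorSketch and the sample numbers are hypothetical.

```java
/** Per-mapper partial errors and their merge into one RMSE (hypothetical helper). */
final class ThetaErrorSketch {

    /** Partial result of one mapper for one theta: {sum of squared errors, record count}. */
    static double[] partial(float[] theta, float[][] xy) {
        double sumSquared = 0.0;
        long count = 0;
        for (float[] p : xy) {
            double residual = theta[0] + theta[1] * p[0] - p[1];
            sumSquared += residual * residual;
            count++;
        }
        return new double[]{sumSquared, count};
    }

    /** Adds up the partials of several mappers and returns the root-mean-square error. */
    static double rmse(double[]... partials) {
        double sumSquared = 0.0;
        double count = 0.0;
        for (double[] p : partials) {
            sumSquared += p[0];
            count += p[1];
        }
        return Math.sqrt(sumSquared / count);
    }
}
```

With theta = (0, 1), one mapper holding (1, 2) and (2, 2) reports the partial (1, 2), another holding (3, 5) reports (4, 1), and the merged error is sqrt(5 / 3).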
On the Reducer side, the errors for each key (that is, each theta value) are summed, and the cleanup method merges the theta values by weighting. The core code is shown below:
protected void reduce(FloatAndFloat key, Iterable<FloatAndLong> values, Context context)
        throws IOException, InterruptedException {
    float sumF = 0.0f;
    long sumL = 0L;
    for (FloatAndLong value : values) {
        sumF += value.getSumFloat();
        sumL += value.getSumLong();
    }
    // root-mean-square error of this theta over the whole data set
    float rmse = (float) Math.sqrt((double) sumF / sumL);
    theta_error.add(new float[]{key.getTheta0(), key.getTheta1(), rmse});
    logger.info("theta:{}, error:{}", new Object[]{key.toString(), rmse});
}
protected void cleanup(Context context) throws IOException, InterruptedException {
    // How should the thetas be weighted?
    // Option 1: the smaller the error, the larger the weight
    // Option 2: plain average
    float[] theta_all = new float[2];
    if ("average".equals(method)) {
        for (int i = 0; i < theta_error.size(); i++) {
            theta_all[0] += theta_error.get(i)[0];
            theta_all[1] += theta_error.get(i)[1];
        }
        theta_all[0] /= theta_error.size();
        theta_all[1] /= theta_error.size();
    } else {
        float sumErrors = 0.0f; // sum of the inverse errors, used to normalize the weights
        for (float[] d : theta_error) {
            sumErrors += 1 / d[2];
        }
        for (float[] d : theta_error) {
            theta_all[0] += d[0] * 1 / d[2] / sumErrors;
            theta_all[1] += d[1] * 1 / d[2] / sumErrors;
        }
    }
    context.write(new FloatAndFloat(theta_all), NullWritable.get());
}
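The two merge strategies are easy to compare on made-up numbers. The sketch below restates them as plain Java; the class CombineSketch and the sample values are hypothetical, and each row is assumed to hold {theta0, theta1, error} with a strictly positive error.

```java
/** Plain average versus inverse-error weighting of several theta pairs (hypothetical helper). */
final class CombineSketch {

    /** Unweighted mean of all theta pairs; rows are {theta0, theta1, error}. */
    static float[] average(float[][] thetaError) {
        float[] merged = new float[2];
        for (float[] d : thetaError) {
            merged[0] += d[0];
            merged[1] += d[1];
        }
        merged[0] /= thetaError.length;
        merged[1] /= thetaError.length;
        return merged;
    }

    /** Each theta pair is weighted by 1/error, normalized so that the weights sum to 1. */
    static float[] inverseErrorWeighted(float[][] thetaError) {
        float sumInverse = 0f;
        for (float[] d : thetaError) {
            sumInverse += 1f / d[2]; // assumes error > 0
        }
        float[] merged = new float[2];
        for (float[] d : thetaError) {
            float weight = (1f / d[2]) / sumInverse;
            merged[0] += d[0] * weight;
            merged[1] += d[1] * weight;
        }
        return merged;
    }
}
```

For the rows {0, 0, 1} and {3, 6, 2}, the average is (1.5, 3), while the weights 2/3 and 1/3 give (1, 2): the merged value is pulled toward the theta with the smaller error.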
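2.4 Validation
Validation here means computing the global error of the merged theta from step 2.3. Since step 2.3 also computes the global error of each individual theta, the values can be compared to see which theta is best. The Mapper from step 2.3 can be reused as-is, and the reducer is similar to the one in step 2.3, except that the final merge in cleanup is not needed.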
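Once every candidate, merged or not, has a global error, picking the best one is a simple minimum search. A minimal sketch, assuming rows of {theta0, theta1, error}; the class BestThetaSketch is a hypothetical name and this selection step is not part of the jobs described here.

```java
/** Picks the candidate theta with the smallest global error (hypothetical helper). */
final class BestThetaSketch {

    /** Returns the index of the row {theta0, theta1, error} with the smallest error. */
    static int bestIndex(float[][] thetaError) {
        int best = 0;
        for (int i = 1; i < thetaError.length; i++) {
            if (thetaError[i][2] < thetaError[best][2]) {
                best = i;
            }
        }
        return best;
    }
}
```

The rows passed in would be the per-mapper theta pairs from step 2.2 plus the merged pair from step 2.3, each with the error computed over the full data set.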
3. Results:
3.1 Shuffle Job
Test class:
public static void main(String[] args) throws Exception {
    args = new String[]{
        "hdfs://master:8020/user/fanzhe/linear_regression.txt",
        "hdfs://master:8020/user/fanzhe/shuffle_out",
        "1"
    };
    ToolRunner.run(Utils.getConf(), new ShuffleDataJob(), args);
}

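Original data (linear_regression.txt can be downloaded from the resource folder of the source code):
6.1101,17.592
5.5277,9.1302
8.5186,13.662
...
Shuffle output:
Each run should produce a different output (random numbers are used); the data is indeed randomized.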
3.2 Linear Regression
Test class:
public static void main(String[] args) throws Exception {
    // <input> <output> <theta0;theta1;alpha> <splitter>
    // Note: the third argument is separated by semicolons
    args = new String[]{
        "hdfs://master:8020/user/fanzhe/shuffle_out",
        "hdfs://master:8020/user/fanzhe/linear_regression",
        "1;0;0.01",
        ","
    };
    ToolRunner.run(Utils.getConf(), new LinearRegressionJob(), args);
}
Output:
The output shows that the two results differ considerably, mainly because the test data set is small. With a larger data set that has been shuffled well, the two values should be close.
3.3 Combine Theta
Test class:
public static void main(String[] args) throws Exception {
    // <input> <output> <theta_path> <splitter> <average|weight>
    args = new String[]{
        "hdfs://master:8020/user/fanzhe/shuffle_out",
        "hdfs://master:8020/user/fanzhe/single_linear_regression_error",
        "hdfs://master:8020/user/fanzhe/linear_regression",
        ",",
        "weight"
    };
    ToolRunner.run(Utils.getConf(), new SingleLinearRegressionError(), args);
}
The theta values are merged by weighting here. Readers can pass average instead to use the plain average.
Result:

According to the log, the lower of the two theta values has the smaller error. The merged parameter values are:

The merged result lies between the two theta values.
With averaging, the output is:

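3.4 Validation
Validation test class:
public static void main(String[] args) throws Exception {
    // <input> <output> <theta_path> <splitter>
    args = new String[]{
        "hdfs://master:8020/user/fanzhe/shuffle_out",
        "hdfs://master:8020/user/fanzhe/last_linear_regression_error",
        "hdfs://master:8020/user/fanzhe/single_linear_regression_error",
        ","
    };
    ToolRunner.run(Utils.getConf(), new LastLinearRegressionError(), args);
}
Output:

The results show that the merged theta does not perform as well as one of the original theta pairs, although this may be related to the amount of data. Based on the output, the merged theta can be compared with the pre-merge values, and the best theta used as the final output.
With averaging, the output is:

The results above show that the weighted combination performs slightly better than the averaged one.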
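4. Summary
1. This algorithm only applies when there is a single local optimum (which is then the global optimum); otherwise the merge stage will have problems.
2. On a small data set, the merged result is not as good as the best pre-merge solution. This may be a data issue and remains to be verified.
3. Intuitively, a weighted combination should in general perform better than a plain average.
Share, grow, enjoy.
When reposting, please credit the blog: http://blog.csdn.net/fansy1990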

Original article: https://www.cnblogs.com/zhchoutai/p/8444091.html
