zoukankan html css js c++ java

Alink漫谈(十五) ：多层感知机之迭代优化

Alink漫谈(十五) ：多层感知机之迭代优化

0x00 摘要

Alink 是阿里巴巴基于实时计算引擎 Flink 研发的新一代机器学习算法平台，是业界首个同时支持批式算法、流式算法的机器学习平台。本文和前文将带领大家来分析Alink中多层感知机的实现。

因为Alink的公开资料太少，所以以下均为自行揣测，肯定会有疏漏错误，希望大家指出，我会在日后随时更新。

0x01 前文回顾

从前文 ALink漫谈(十四) ：多层感知机之总体架构我们了解了多层感知机的概念以及Alink中的整体架构，下面我们就要开始介绍如何优化。

1.1 基本概念

我们再次温习下基本概念：

神经元输入：类似于线性回归 z = w1x1+ w2x2 +⋯ + wnxn = wT・x（linear threshold unit (LTU)）。
神经元输出：激活函数，类似于二值分类，模拟了生物学中神经元只有激发和抑制两种状态。增加偏值，输出层哪个节点权重大，输出哪一个。
采用Hebb准则，下一个权重调整方法参考当前权重和训练效果。
如何自动化学习计算权重——backpropagation，即首先正向做一个计算，根据当前输出做一个error计算，作为指导信号反向调整前一层输出权重使其落入一个合理区间，反复这样调整到第一层，每轮调整都有一个学习率，调整结束后，网络越来越合理。

1.2 误差反向传播算法

基于误差反向传播算法（backpropagation，BP）的前馈神经网络训练过程可以分为以下三步：

在前向传播时计算每一层的净输入z(l)和激活值a(l)，直至最后一层；
(反向传播阶段)将激励响应同训练输入对应的目标输出求差，从而获得隐层和输出层的响应误差。用误差反向传播计算每一层的误差项 δ(l)；
对于每个突触上的权重，按照以下步骤进行更新：将输入激励和响应误差相乘，从而获得权重的梯度；将这个梯度乘上一个比例并取反后加到权重上。即计算每一层参数的偏导数，并更新参数。

这个比例(百分比)将会影响到训练过程的速度和效果，因此称为”训练因子”。梯度的方向指明了误差扩大的方向，因此在更新权重的时候需要对其取反，从而减小权重引起的误差。

1.3 总体逻辑

我们回顾下总体逻辑。

在这里插入图片描述

0x02 训练神经网络

这部分是重头戏，所以要从总体逻辑中剥离出来单独说。

FeedForwardTrainer.train 函数会完成神经网络的训练，即：

DataSet<DenseVector> weights = trainer.train(trainData, getParams());

其逻辑大致如下：

1）初始化模型 initCoef = initModel(data, this.topology);
2）把训练数据压缩成向量，trainData = stack()。这里应该是为了减少传输数据量。需要注意，其把输入数据从二元组<label index, vector> 转换为三元组 Tuple3.of((double) count, 0., stacked);
3）生成优化目标函数 new AnnObjFunc(topology...)
4）构建训练器 FeedForwardTrainer
5）训练器会基于目标函数构建优化器 Optimizer optimizer = new Lbfgs，即使用 L-BFGS 来训练神经网络；

2.1 初始化模型

初始化模型相关代码如下：

DataSet<DenseVector> initCoef = initModel(data, this.topology); // 随机初始化权重系数
optimizer.initCoefWith(initCoef);

public void initCoefWith(DataSet<DenseVector> initCoef) {
	this.coefVec = initCoef; // 就是赋值在内部变量上
}

这里需要和线性回归比较下：

线性回归中，假如有两个特征，则初始系数为 coef = {DenseVector} "0.001 0.0 0.0 0.0"，4个数值具体是 "权重, b, w1, w2"。
多层感知机这里，初始化系数则是一个 DenseVector。

initModel 主要就是随机初始化权重系数。精华函数在 topology.getWeightSize() 部分。

private DataSet<DenseVector> initModel(DataSet<?> inputRel, final Topology topology) {
        if (initialWeights != null) {
          ......
        } else {
            return BatchOperator.getExecutionEnvironmentFromDataSets(inputRel).fromElements(0)
                .map(new RichMapFunction<Integer, DenseVector>() {
                    ......
                    @Override
                    public DenseVector map(Integer value) throws Exception {
                        // 这里是精华，获取了系数的size
                        DenseVector weights = DenseVector.zeros(topology.getWeightSize());                           
                        
                        for (int i = 0; i < weights.size(); i++) {
                            weights.set(i, random.nextGaussian() * initStdev);//随机初始化
                        }
                        return weights;
                    }
                })
                .name("init_weights");
        }
}

topology.getWeightSize() 调用的是 FeedForwardTopology的函数，就是遍历各层，累加求系数和。

@Override
public int getWeightSize() {
        int s = 0;
        for (Layer layer : layers) { //遍历各层
            s += layer.getWeightSize();
        }
        return s;
}

AffineLayer系数大小如下：

@Override
public int getWeightSize() {
		return numIn * numOut + numOut;
}

FuntionalLayer 没有系数

@Override
public int getWeightSize() {
	return 0;
}

SoftmaxLayerWithCrossEntropyLoss 没有系数

@Override
public int getWeightSize() {
	return 0;
}

回顾本例对应拓扑：

this = {FeedForwardTopology@4951} 
 layers = {ArrayList@4944}  size = 4
      0 = {AffineLayer@4947} // 仿射层
       numIn = 4
       numOut = 5
      1 = {FuntionalLayer@4948} 
       activationFunction = {SigmoidFunction@4953}  // 激活函数
      2 = {AffineLayer@4949} // 仿射层
       numIn = 5
       numOut = 3
      3 = {SoftmaxLayerWithCrossEntropyLoss@4950}  // 激活函数

所以最后 DenseVector 向量大小是两个仿射层的和（4 + 5 * 5）+ (5 + 3 * 3) = 43。

2.2 压缩数据

这里把输入数据二元组 <label index, vector> 转换，压缩到 DenseVector中。这里压缩的目的应该是为了减少传输数据量。

如果输入是：

batchData = {ArrayList@11527}  size = 64
 0 = {Tuple2@11536} "(2.0,6.5 2.8 4.6 1.5)"
  f0 = {Double@11572} 2.0
  f1 = {DenseVector@11573} "6.5 2.8 4.6 1.5"
 1 = {Tuple2@11537} "(1.0,6.1 3.0 4.9 1.8)"
  f0 = {Double@11575} 1.0
  f1 = {DenseVector@11576} "6.1 3.0 4.9 1.8"
.....

则最终输出的压缩数据是：

batch = {Tuple3@11532} 
 f0 = {Double@11578} 18.0
 f1 = {Double@11579} 0.0
 f2 = {DenseVector@11580} "6.5 2.8 4.6 1.5 6.1 3.0 4.9 1.8 7.3 2.9 6.3 1.8 5.7 2.8 4.5 1.3 6.4 2.8 5.6 2.1 6.7 2.5 5.8 1.8 6.3 3.3 4.7 1.6 7.2 3.6 6.1 2.5 7.2 3.0 5.8 1.6 4.9 2.4 3.3 1.0 7.4 2.8 6.1 1.9 6.5 3.2 5.1 2.0 6.6 2.9 4.6 1.3 7.9 3.8 6.4 2.0 5.2 2.7 3.9 1.4 6.4 2.7 5.3 1.9 6.8 3.0 5.5 2.1 5.7 2.5 5.0 2.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0 1.0"

具体代码如下：

static DataSet<Tuple3<Double, Double, Vector>>
    stack(DataSet<Tuple2<Double, DenseVector>> data ...) {
    
        return data
            .mapPartition(new MapPartitionFunction<Tuple2<Double, DenseVector>, Tuple3<Double, Double, Vector>>() {
                @Override
                public void mapPartition(Iterable<Tuple2<Double, DenseVector>> samples,
                                         Collector<Tuple3<Double, Double, Vector>> out) throws Exception {
                    List<Tuple2<Double, DenseVector>> batchData = new ArrayList<>(batchSize);
                    int cnt = 0;
                    Stacker stacker = new Stacker(inputSize, outputSize, onehot);
                    for (Tuple2<Double, DenseVector> sample : samples) {
                        batchData.set(cnt, sample);
                        cnt++;
                        if (cnt >= batchSize) { // 如果已经大于默认的数据块大小，就直接发送
                            // 把batchData的x-vec压缩到DenseVector中
                            Tuple3<Double, Double, Vector> batch = stacker.stack(batchData, cnt);
                            out.collect(batch);
                            cnt = 0;
                        }
                    }

                    // 如果压缩成功，则输出
                    if (cnt > 0) { // 没有大于默认数据块大小，也发送。cnt就是目前的数据块大小，针对本实例，是19，这也是后面能看到 matrix 维度 19 的来源。
                        Tuple3<Double, Double, Vector> batch = stacker.stack(batchData, cnt);
                        out.collect(batch); 
                    }
                }
            })
            .name("stack_data");
}

2.3 生成优化目标函数

回顾关于损失函数和目标函数的说明：

损失函数：计算的是一个样本的误差；
代价函数：是整个训练集上所有样本误差的平均，经常和损失函数混用；
目标函数：代价函数 + 正则化项；

多层感知机中生成目标函数代码如下：

final AnnObjFunc annObjFunc = new AnnObjFunc(topology, inputSize, outputSize, onehotLabel, optimizationParams);

AnnObjFuncs是多层感知机优化的目标函数，其定义如下：

topology 就是神经网络的拓扑；
stacker 是用来压缩解压（后续L-BFGS是向量操作，所以也需要从矩阵到向量来回转换）；
topologyModel 是计算模型；

我们可以看到，在 AnnObjFunc 的 API 中调用时候，会看 AnnObjFunc.topologyModel 是否有值，如果没有就生成。

public class AnnObjFunc extends OptimObjFunc {
    private Topology topology;
    private Stacker stacker;
    private transient TopologyModel topologyModel = null;
    
    @Override
    protected double calcLoss(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector) {
        // 看 AnnObjFunc.topologyModel 是否有值，如果没有就生成
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        return topologyModel.computeGradient(unstacked.f0, unstacked.f1, null);
    }

    @Override
    protected void updateGradient(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector, DenseVector updateGrad) {
        // 看 AnnObjFunc.topologyModel 是否有值，如果没有就生成
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        topologyModel.computeGradient(unstacked.f0, unstacked.f1, updateGrad);
    }
}

2.4 生成目标函数中的拓扑模型

如上所述，这个具体生成是在 AnnObjFunc 的 API 中调用时候，会看 AnnObjFunc.topologyModel 是否有值，如果没有就生成。这里会根据 FeedForwardTopology 的 layers 来生成拓扑模型。

public TopologyModel getModel(DenseVector weights) {
	FeedForwardModel feedForwardModel = new FeedForwardModel(this.layers);
	feedForwardModel.resetModel(weights);
	return feedForwardModel;
}

拓扑模型定义如下，能够看到里面有具体层 & 层模型。

public class FeedForwardModel extends TopologyModel {
    private List<Layer> layers; //具体层 
    private List<LayerModel> layerModels; //层模型
    /**
     * Buffers of outputs of each layers.
     */
    private transient List<DenseMatrix> outputs = null;
    /**
     * Buffers of deltas of each layers.
     */
    private transient List<DenseMatrix> deltas = null;
    
    public FeedForwardModel(List<Layer> layers) {
        this.layers = layers;
        this.layerModels = new ArrayList<>(layers.size());
        for (int i = 0; i < layers.size(); i++) {
            layerModels.add(layers.get(i).createModel());
        }
    }    
}

优化函数是根据每一层来建立模型，比如

public LayerModel createModel() {
	return new AffineLayerModel(this);
}

下面就看看具体各层模型的状况。

2.4.1 AffineLayerModel

定义如下，其具体函数比如 eval，computePrevDelta，grad 我们后续会提及。

public class AffineLayerModel extends LayerModel {
    private DenseMatrix w;
    private DenseVector b;

    // buffer for holding gradw and gradb
    private DenseMatrix gradw;
    private DenseVector gradb;

    private transient DenseVector ones = null;
}

2.4.2 FuntionalLayerModel

定义如下，其具体函数比如 eval，computePrevDelta，grad 我们后续会提及。

public class FuntionalLayerModel extends LayerModel {
    private FuntionalLayer layer;
}

2.4.3 SoftmaxLayerModelWithCrossEntropyLoss

这个就是针对最终输出层。

SoftmaxWithLoss层包含softmax和求交叉熵两部分。

对于softmax的输入向量 z，其输出向量的第 k 个值为：

[a_k = Softmax(z)_k = frac{e^{Z_k}}{sum_i e^{Z_i}} ]

而交叉熵损失函数：

[Loss = - sum_i y_i . ln(a_i) ]

损失函数对 a 求导，可得到：

[δ_L^i = frac{∂Loss}{∂Z_i} = a_i - y_i ]

具体代码如下，基本就是数学公式的实现。

public class SoftmaxLayerModelWithCrossEntropyLoss extends LayerModel
    implements AnnLossFunction {		
    @Override
    public double loss(DenseMatrix output, DenseMatrix target, DenseMatrix delta) {
        int batchSize = output.numRows();
        MatVecOp.apply(output, target, delta, (o, t) -> t * Math.log(o));
        double loss = -(1.0 / batchSize) * delta.sum();
        MatVecOp.apply(output, target, delta, (o, t) -> o - t);
        return loss;
    }
}

2.4.3 最终模型

最终的模型如下：

this = {FeedForwardModel@10575} 

     layers = {ArrayList@10576}  size = 4
          0 = {AffineLayer@10601} 
          1 = {FuntionalLayer@10335} 
          2 = {AffineLayer@10602} 
          3 = {SoftmaxLayerWithCrossEntropyLoss@10603} 

     layerModels = {ArrayList@10579}  size = 4
          0 = {AffineLayerModel@10581} 
          1 = {FuntionalLayerModel@10433} 
          2 = {AffineLayerModel@10582} 
          3 = {SoftmaxLayerModelWithCrossEntropyLoss@10583} 

     outputs = null
     deltas = null

用图形化简要表示如下（FeedForwardModel中省略了层）：

在这里插入图片描述

2.5 生成优化器

回顾下概念。

针对目标函数最普通的优化是穷举法，对各个可能的权值遍历，找寻使损失最小的一组，但是这种方法会陷入嵌套循环里，使得运行速率大打折扣，于是就有了优化器。优化器的目的使快速找到目标函数最优解。

损失函数的目的是在训练过程中找到最合适的一组权值序列，也就是让损失函数降到尽可能低。优化器或优化算法用于将损失函数最小化，在每个训练周期或每轮后更新权重和偏置值，直到损失函数达到全局最优。

我们要寻找损失函数的最小值，首先一定有一个初始的权值，伴随一个计算得出来的损失。那么我们要想到损失函数最低点，要考虑两点：往哪个方向前进；前进多少距离。

因为我们想让损失值下降得最快，肯定是要找当前损失值在损失函数的切线方向，这样走最快。如果是在三维的损失函数上面，则更加明显，三维平面上的点做切线会产生无数个方向，而梯度就是函数在某个点无数个变化的方向中变化最快的那个方向，也就是斜率最大的那个方向。梯度是有方向的，负梯度方向就是下降最快的方向。

那么梯度下降是怎么运行的呢，刚才我们找到了逆梯度方向来作为我们前进的方向，接下来只需要解决走多远的问题。于是我们引入学习率，我们要走的距离就是梯度的大小*学习率。因为越优化梯度会越小，因此移动的距离也会越来越短。学习率的设置不能太大，否则可能会跳过最低点而导致梯度爆炸或者梯度消失；也不能设置太小，否则梯度下降可能十分缓慢。

优化算法有两种：

一阶优化算法。这些算法使用与参数相关的梯度值最小化或最大化代价函数。一阶导数告诉我们函数是在某一点上递减还是递增，简而言之，它给出了与曲面切线。
二阶优化算法。这些算法使用二阶导数来最小化代价函数，也称为Hessian。由于二阶导数的计算成本很高，所以不常使用二阶导数。二阶导数告诉我们一阶导数是递增的还是递减的，这表示了函数的曲率。二阶导数为我们提供了一个与误差曲面曲率相接触的二次曲面。

多层感知机这里使用L-BFGS优化器来训练神经网络。

Optimizer optimizer = new Lbfgs(
            data.getExecutionEnvironment().fromElements(annObjFunc),
            trainData,
            BatchOperator
                .getExecutionEnvironmentFromDataSets(data)
                .fromElements(inputSize),
            optimizationParams
        );
optimizer.initCoefWith(initCoef);

0x03 L-BFGS训练

这部分又比较复杂，需要单独拿出来说，就是用优化器来训练。

optimizer.optimize()
  .map(new MapFunction<Tuple2<DenseVector, double[]>, DenseVector>() {
    @Override
    public DenseVector map(Tuple2<DenseVector, double[]> value) throws Exception {
        return value.f0;
    }
});

关于 L-BFGS 的细节，可以参见前面文章。 Alink漫谈(十一) ：线性回归之 L-BFGS优化。

这里把 Lbfgs 类拿出来概要说明：

public class Lbfgs extends Optimizer {
    public DataSet <Tuple2 <DenseVector, double[]>> optimize() {
       DataSet <Row> model = new IterativeComQueue()
           ......
          .add(new CalcGradient())
           ......
          .add(new CalDirection(...))
          .add(new CalcLosses(...))
           ......
          .add(new UpdateModel(...))
           ......
          .exec();
    }
}

能够看到其中几个关键步骤：

CalcGradient() 计算梯度
CalDirection(...) 计算方向
CalcLosses(...) 计算损失
UpdateModel(...) 更新模型

算法框架都是基本不变的，所差别的就是具体目标函数和损失函数的不同。比如线性回归采用的是UnaryLossObjFunc，损失函数是 SquareLossFunc。

我们把多层感知机特殊之处填充关键步骤，得到与线性回归的区别。

CalcGradient() 计算梯度
- 1）调用 AnnObjFunc.updateGradient；
  - 1.1）调用目标函数中拓扑模型 topologyModel.computeGradient 来计算
    - 1.1.1）计算各层的输出；forward(data, true)
    - 1.1.2）计算输出层损失；labelWithError.loss
    - 1.1.3）计算各层的Delta；layerModels.get(i).computePrevDelta
    - 1.1.4）计算各层梯度；`layerModels.get(i).grad
CalDirection(...) 计算方向
- 这里的实现没有用到目标函数的拓扑模型。
CalcLosses(...) 计算损失
- 1）调用 AnnObjFunc.updateGradient；
  - 1.1）调用目标函数中拓扑模型 topologyModel.computeGradient 来计算
    - 1.1.1）计算各层的输出；forward(data, true)
    - 1.1.2）计算输出层损失；labelWithError.loss
UpdateModel(...) 更新模型
- 这里的实现没有用到目标函数的拓扑模型。

3.1 CalcGradient 计算梯度

CalcGradient.calc 函数中会调用到目标函数的计算梯度功能。

// calculate local gradient
Double weightSum = objFunc.calcGradient(labledVectors, coef, grad.f0);

objFunc.calcGradient函数就是基类 OptimObjFunc 的实现，此处才会调用到 AnnObjFunc.updateGradient 的具体实现。

updateGradient(labelVector, coefVector, grad);

3.1.1 目标函数

回顾目标函数定义：

public class AnnObjFunc extends OptimObjFunc {
    
    protected void updateGradient(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector, DenseVector updateGrad) {
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        topologyModel.computeGradient(unstacked.f0, unstacked.f1, updateGrad);
    }    
}

首先是生成拓扑模型；这步骤已经在前面提到了。
然后是 unstacked = stacker.unstack(labledVector);，解压数据，返回 return Tuple2.of(features, labels);。
最后就是计算梯度；这是 FeedForwardModel 类实现的。

3.1.2 计算梯度

计算梯度（此函数也负责计算损失）代码在 FeedForwardModel.computeGradient，大致逻辑如下：

CalcGradient.calc 会调用 objFunc.calcGradient（OptimObjFunc 的实现）

1）调用 AnnObjFunc.updateGradient；
- 1.1）调用目标函数中拓扑模型 topologyModel.computeGradient 来计算
  - 1.1.1）计算各层的输出；forward(data, true)
  - 1.1.2）计算输出层损失；labelWithError.loss
  - 1.1.3）计算各层的Delta；layerModels.get(i).computePrevDelta
  - 1.1.4）计算各层梯度；`layerModels.get(i).grad

代码如下：

public double computeGradient(DenseMatrix data, DenseMatrix target, DenseVector cumGrad) {
        // data 是 x，target是y
        // 计算各层的输出
        outputs = forward(data, true); 
    
        int currentBatchSize = data.numRows();
        if (deltas == null || deltas.get(0).numRows() != currentBatchSize) {
            deltas = new ArrayList<>(layers.size() - 1);
            int inputSize = data.numCols();
            for (int i = 0; i < layers.size() - 1; i++) {
                int outputSize = layers.get(i).getOutputSize(inputSize);
                deltas.add(new DenseMatrix(currentBatchSize, outputSize));
                inputSize = outputSize;
            }
        }
        int L = layerModels.size() - 1;
        AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
        // 计算损失
        double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
        if (cumGrad == null) {
            return loss; // 如果只计算损失，则直接返回。
        }
        // 计算Delta；
        for (int i = L - 1; i >= 1; i--) {
            layerModels.get(i).computePrevDelta(deltas.get(i), outputs.get(i), deltas.get(i - 1));
        }
        int offset = 0;
        // 计算梯度；
        for (int i = 0; i < layerModels.size(); i++) {
            DenseMatrix input = i == 0 ? data : outputs.get(i - 1);
            if (i == layerModels.size() - 1) {
                layerModels.get(i).grad(null, input, cumGrad, offset);
            } else {
                layerModels.get(i).grad(deltas.get(i), input, cumGrad, offset);
            }
            offset += layers.get(i).getWeightSize();
        }
        return loss;
}

3.1.2.1 计算各层的输出

回忆下示例代码，我们设置神经网络是：

.setLayers(new int[]{4, 5, 3})

具体说就是调用各层模型的 eval 函数，其中第一层不好写入循环，所以单独写出。

public class FeedForwardModel extends TopologyModel {
    public List<DenseMatrix> forward(DenseMatrix data, boolean includeLastLayer) {
        .....
        layerModels.get(0).eval(data, outputs.get(0));
        int end = includeLastLayer ? layers.size() : layers.size() - 1;
        for (int i = 1; i < end; i++) {
            layerModels.get(i).eval(outputs.get(i - 1), outputs.get(i));
        }
        return outputs;
	}
}

具体各层调用如下：

AffineLayerModel.eval

这就是简单的仿射变换 WX + b，然后放入output。其中

@Override
public void eval(DenseMatrix data, DenseMatrix output) {
        int batchSize = data.numRows();
        for (int i = 0; i < batchSize; i++) {
            for (int j = 0; j < this.b.size(); j++) {
                output.set(i, j, this.b.get(j));
            }
        }
        BLAS.gemm(1., data, false, this.w, false, 1., output);
}

其中 w, b 都是预置的。

this = {AffineLayerModel@10581} 
 w = {DenseMatrix@10592} "mat[4,5]:
  0.07807905200944776,-0.03040913035034301,.....
"
 b = {DenseVector@10593} "-0.058043439717701296 0.1415366160323592 0.017773419483873353 -0.06802435221045448 0.022751460286303204"
 gradw = {DenseMatrix@10594} "mat[4,5]:
  0.0,0.0,0.0,0.0,0.0
  0.0,0.0,0.0,0.0,0.0
  0.0,0.0,0.0,0.0,0.0
  0.0,0.0,0.0,0.0,0.0
"
 gradb = {DenseVector@10595} "0.0 0.0 0.0 0.0 0.0"
 ones = null

FuntionalLayerModel.eval

实现如下

public void eval(DenseMatrix data, DenseMatrix output) {
        for (int i = 0; i < data.numRows(); i++) {
            for (int j = 0; j < data.numCols(); j++) {
                output.set(i, j, this.layer.activationFunction.eval(data.get(i, j)));
            }
        }
}

类变量为

this = {FuntionalLayerModel@10433} 
 layer = {FuntionalLayer@10335} 
  activationFunction = {SigmoidFunction@10755}

输入是

data = {DenseMatrix@10642} "mat[19,5]:
  0.09069152145840428,-0.4117319046979133,-0.273491600786707,-0.3638766081567865,-0.17552469317567304
"
 m = 19
 n = 5
 data = {double[95]@10668}

其中，activationFunction 就调用到了 SigmoidFunction.eval。

public class SigmoidFunction implements ActivationFunction {
    @Override
    public double eval(double x) {
        return 1.0 / (1 + Math.exp(-x));
    }
}

SoftmaxLayerModelWithCrossEntropyLoss.eval

这里就是计算了最终输出。

    public void eval(DenseMatrix data, DenseMatrix output) {
        int batchSize = data.numRows();
        for (int ibatch = 0; ibatch < batchSize; ibatch++) {
            double max = -Double.MAX_VALUE;
            for (int i = 0; i < data.numCols(); i++) {
                double v = data.get(ibatch, i);
                if (v > max) {
                    max = v;
                }
            }
            double sum = 0.;
            for (int i = 0; i < data.numCols(); i++) {
                double res = Math.exp(data.get(ibatch, i) - max);
                output.set(ibatch, i, res);
                sum += res;
            }
            for (int i = 0; i < data.numCols(); i++) {
                double v = output.get(ibatch, i);
                output.set(ibatch, i, v / sum);
            }
        }
    }

3.1.2.2 计算损失

代码是：

AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
if (cumGrad == null) {
	return loss; // 可以直接返回
}

如果不需要计算梯度，则直接返回，否则继续进行。我们这里继续执行。

具体就是调用SoftmaxLayerModelWithCrossEntropyLoss的损失函数，就是输出层的损失。

output就是输出层，target就是label y。计算损失几乎和常规一样，只不过多了一个除以 batchSize。

public double loss(DenseMatrix output, DenseMatrix target, DenseMatrix delta) {
        int batchSize = output.numRows();
        MatVecOp.apply(output, target, delta, (o, t) -> t * Math.log(o));
        double loss = -(1.0 / batchSize) * delta.sum();
        MatVecOp.apply(output, target, delta, (o, t) -> o - t);
        return loss;
}

3.1.2.3 计算delta

梯度下降法需要计算损失函数对参数的偏导数，如果用链式法则对每个参数逐一求偏导，涉及到矩阵微分，效率比较低。所以在神经网络中经常使用反向传播算法来高效地计算梯度。具体就是在利用误差反向传播算法计算出每一层的误差项后，就可以得到每一层参数的梯度。

我们定义delta为隐含层的加权输入影响总误差的程度，即 delta_i 表示第l层神经元的误差项，它用来表示第l层神经元对最终损失的影响，也反映了最终损失对第l层神经元的敏感程度。

上面这个公式就是误差的反向传播公式！因为第l层的误差项可以通过第l+1层的误差项计算得到。反向传播算法的含义是：第 l 层的一个神经元的误差项等于该神经元激活函数的梯度，再乘上所有与该神经元相连接的第l+1层的神经元的误差项的权重和。这里 W 是所有权重与偏置的集合)。

这个可能不好理解，找到三种解释，也许能够帮助大家增加理解：

因为梯度下降法是沿梯度（斜率）的反方向移动，所以我们往回倒，就要乘以梯度，再爬回去。
或者可以看做错误（delta）的反向传递。经过一个线就乘以线的权重，经过点就乘以节点的偏导（sigmoid的偏导形式简洁）。
或者这么理解：输出层误差在转置权重矩阵的帮助下，传递到了隐藏层，这样我们就可以利用间接误差来更新与隐藏层相连的权重矩阵。权重矩阵在反向传播的过程中同样扮演着运输兵的作用，只不过这次是搬运的输出误差，而不是输入信号。

具体代码如下

for (int i = L - 1; i >= 1; i--) {
	layerModels.get(i).computePrevDelta(deltas.get(i), outputs.get(i), deltas.get(i - 1));
}

需要注意的是：

最后一层的delta已经在前一个loss计算出来，通过loss函数参数存储在 deltas.get(L - 1) 之中；
循环是从后数第二层向前计算；

本实例中，output是四层（0~3），deltas是三层（0~2）。

用output第4层的数值，来计算deltas第3层的数值。

用output第3层的数值和 deltas第3层的数值，来计算deltas第2层的数值。具体细化到每层：

AffineLayerModel

public void computePrevDelta(DenseMatrix delta, DenseMatrix output, DenseMatrix prevDelta) {
	BLAS.gemm(1.0, delta, false, this.w, true, 0., prevDelta);
}

gemm函数作用是 C := alpha * A * B + beta * C . 这里可能是以为 b 小，就省略了计算。（存疑，如果有哪位朋友知道原因，请告知，谢谢）

public static void gemm(double alpha, DenseMatrix matA, boolean transA, DenseMatrix matB, boolean transB, double beta, DenseMatrix matC)

在这里是：1.0 * delta * this.w + 0 * prevDelta。delta 不转置，this.w 转置

FuntionalLayerModel

这里是导数 * delta，导数就是变化率。

public void computePrevDelta(DenseMatrix delta, DenseMatrix output, DenseMatrix prevDelta) {
        for (int i = 0; i < delta.numRows(); i++) {
            for (int j = 0; j < delta.numCols(); j++) {
                double y = output.get(i, j);
                prevDelta.set(i, j, this.layer.activationFunction.derivative(y) * delta.get(i, j));
            }
        }
}

激活函数是 The sigmoid function. f(x) = 1 / (1 + exp(-x)).

public class SigmoidFunction implements ActivationFunction {
    @Override
    public double eval(double x) {
        return 1.0 / (1 + Math.exp(-x));
    }

    @Override
    public double derivative(double z) {
        return (1 - z) * z; // 这里
    }
}

3.1.2.4 计算梯度

这里是从前往后计算，最后累计在cumGrad。

int offset = 0;
for (int i = 0; i < layerModels.size(); i++) {
    DenseMatrix input = i == 0 ? data : outputs.get(i - 1);
    if (i == layerModels.size() - 1) {
    	layerModels.get(i).grad(null, input, cumGrad, offset);
    } else {
    	layerModels.get(i).grad(deltas.get(i), input, cumGrad, offset);
    }
    offset += layers.get(i).getWeightSize();
}

AffineLayerModel

因为导数有两部分: w , b, 所以这里有分为两部分计算。unpack 是为了解压缩，pack目的是最后L-BFGS是必须用向量来计算。

    public void grad(DenseMatrix delta, DenseMatrix input, DenseVector cumGrad, int offset) {
        unpack(cumGrad, offset, this.gradw, this.gradb);
        int batchSize = input.numRows();
        // 计算w
        BLAS.gemm(1.0, input, true, delta, false, 1.0, this.gradw);
        if (ones == null || ones.size() != batchSize) {
            ones = DenseVector.ones(batchSize);
        }
        // 计算b
        BLAS.gemv(1.0, delta, true, this.ones, 1.0, this.gradb);
        pack(cumGrad, offset, this.gradw, this.gradb);
    }

FuntionalLayerModel

这里就没有梯度计算‘

public void grad(DenseMatrix delta, DenseMatrix input, DenseVector cumGrad, int offset) {}

最后类变量如下：

this = {FeedForwardModel@10394} 

    layers = {ArrayList@10405}  size = 4
     0 = {AffineLayer@10539} 
     1 = {FuntionalLayer@10377} 
     2 = {AffineLayer@10540} 
     3 = {SoftmaxLayerWithCrossEntropyLoss@10541} 

    layerModels = {ArrayList@10401}  size = 4
     0 = {AffineLayerModel@10543} 
     1 = {FuntionalLayerModel@10376} 
     2 = {AffineLayerModel@10544} 
     3 = {SoftmaxLayerModelWithCrossEntropyLoss@10398} 

     outputs = {ArrayList@10399}  size = 4
      0 = {DenseMatrix@10374} "mat[19,5]:
  0.5258035858891295,0.40832346939250874,0.4339942803542127,0.4146645474481978,0.45503123177429533..."
      1 = {DenseMatrix@10374} "mat[19,5]:
  0.5258035858891295,0.40832346939250874,0.4339942803542127,0.4146645474481978,0.45503123177429533..."
      2 = {DenseMatrix@10533} "mat[19,3]:
  0.31968260294191225,0.3305393733681367,0.3497780236899511..."
      3 = {DenseMatrix@10533} "mat[19,3]:
  0.31968260294191225,0.3305393733681367,0.3497780236899511...."
         
     deltas = {ArrayList@10400}  size = 3
      0 = {DenseMatrix@10375} "mat[19,5]:
  0.0052001689807435756,-0.002841992490130668,0.02414893572802383."
      1 = {DenseMatrix@10379} "mat[19,5]:
  0.02085622230356763,-0.011763437253154471,0.09830897540282763,-0.005205953747031061."
      2 = {DenseMatrix@10528} "mat[19,3]:
  -0.6803173970580878,0.3305393733681367,0.3497780236899511."

3.2 CalDirection 计算方向

这里的实现没有用到目标函数的拓扑模型。

3.3 CalcLosses 计算损失

会在 objFunc.calcSearchValues 中就直接进入到了 AnnObjFunc 类内部。

计算损失代码如下：

for (Tuple3<Double, Double, Vector> labelVector : labelVectors) {
    for (int i = 0; i < numStep + 1; ++i) {
		losses[i] += calcLoss(labelVector, stepVec[i]) * labelVector.f0;
    }
}

AnnObjFunc 的 calcLoss代码如下，可见是调用其拓扑模型来完成计算。

protected double calcLoss(Tuple3<Double, Double, Vector> labledVector, DenseVector coefVector) {
        if (topologyModel == null) {
            topologyModel = topology.getModel(coefVector);
        } else {
            topologyModel.resetModel(coefVector);
        }
        Tuple2<DenseMatrix, DenseMatrix> unstacked = stacker.unstack(labledVector);
        return topologyModel.computeGradient(unstacked.f0, unstacked.f1, null);
}

这里调用的是 computeGradient 来计算损失，会提前返回

@Override
public double computeGradient(DenseMatrix data, DenseMatrix target, DenseVector cumGrad) {
        outputs = forward(data, true);
        ......
        AnnLossFunction labelWithError = (AnnLossFunction) this.layerModels.get(L);
        double loss = labelWithError.loss(outputs.get(L), target, deltas.get(L - 1));
        if (cumGrad == null) {
            return loss; // 这里计算返回
        }
        ...
}

3.4 UpdateModel 更新模型

这里没有用到目标函数的拓扑模型。

0x04 输出模型

多层感知机比普通算法更耗费内存，我需要再IDEA中增加VM启动参数，才能运行成功。

-Xms256m -Xmx640m -XX:PermSize=128m -XX:MaxPermSize=512m

这里要小小吐槽一下Alink，在本地调试时候，没有办法修改Env的参数，比如心跳时间等。造成了调试的不方便。

输出模型算法如下：

        // output model
        DataSet<Row> modelRows = weights
            .flatMap(new RichFlatMapFunction<DenseVector, Row>() {
                @Override
                public void flatMap(DenseVector value, Collector<Row> out) throws Exception {
                    List<Tuple2<Long, Object>> bcLabels = getRuntimeContext().getBroadcastVariable("labels");
                    Object[] labels = new Object[bcLabels.size()];
                    bcLabels.forEach(t2 -> {
                        labels[t2.f0.intValue()] = t2.f1;
                    });

                    MlpcModelData model = new MlpcModelData(labelType);
                    model.labels = Arrays.asList(labels);
                    model.meta.set(ModelParamName.IS_VECTOR_INPUT, isVectorInput);
                    model.meta.set(MultilayerPerceptronTrainParams.LAYERS, layerSize);
                    model.meta.set(MultilayerPerceptronTrainParams.VECTOR_COL, vectorColName);
                    model.meta.set(MultilayerPerceptronTrainParams.FEATURE_COLS, featureColNames);
                    model.weights = value;
                    new MlpcModelDataConverter(labelType).save(model, out);
                }
            })
            .withBroadcastSet(labels, "labels");

// 当运行时候，参数如下：
value = {DenseVector@13212} 
     data = {double[43]@13252} 
      0 = -39.6567702949874
      1 = 16.74206842333768
      2 = 64.49084799006972
      3 = -1.6630682281137472
  ......

其中模型数据类定义如下

public class MlpcModelData {
    public Params meta = new Params();
    public DenseVector weights;
    public TypeInformation labelType;
    public List<Object> labels;
}

最终模型数据大致如下：

model = {Tuple3@13307} "
     f0 = {Params@13308} "Params {vectorCol=null, isVectorInput=false, layers=[4,5,3], featureCols=["sepal_length","sepal_width","petal_length","petal_width"]}"
      params = {HashMap@13326}  size = 4
       "vectorCol" -> null
       "isVectorInput" -> "false"
       "layers" -> "[4,5,3]"
       "featureCols" -> "["sepal_length","sepal_width","petal_length","petal_width"]"
     f1 = {ArrayList@13309}  size = 1
      0 = "{"data":[-39.65676994108487,16.742068271166456,64.49084741971454,-1.6630682163468897,-66.71571933711216,-75.86297804171262,62.609759182998204,-101.47431688844591,31.546529394499977,17.597934397561986,85.36235323961661,-126.30772079054803,326.2329896163572,-29.720070636859894,-180.1693204840142,47.70255002863321,-63.44460914025362,136.6269589647343,-0.6446457887679123,-81.86976832863223,-16.333532816181705,15.4253068036318,-11.297177263474234,-1.1338164486683862,1.3011810728093451,-261.50388539155716,223.36901758842117,38.01966001651569,231.51463912615586,-152.59659885027318,-79.02863627644948,-48.28342595225583,-63.63975869014504,111.98667709535484,153.39174290331553,-121.04900950767653,-32.47876659498367,137.82909902591624,-43.99785013791728,-93.99354048054636,42.85135076273807,-24.8725999157641,-17.962438639217815]}"
       value = {char[829]@13325} 
       hash = 0
     f2 = {Arrays$ArrayList@13310}  size = 3
      0 = "Iris-setosa"
      1 = "Iris-virginica"
      2 = "Iris-versicolor"

0xFF 参考

深度学习中的深度前馈网络简介

Deep Learning 中文翻译

https://github.com/fengbingchun/NN_Test

深度学习入门——Affine层（仿射层-矩阵乘积）

机器学习——多层感知机MLP的相关公式

多层感知器速成

神经网络（多层感知器）信用卡欺诈检测（一）

手撸ANN之——损失层

【机器学习】人工神经网络ANN

人工神经网络（ANN）的公式推导

[深度学习] [梯度下降]用代码一步步理解梯度下降和神经网络(ANN))

softmax和softmax loss详细解析

Softmax损失函数及梯度的计算

Softmax vs. Softmax-Loss: Numerical Stability

【技术综述】一文道尽softmax loss及其变种

深度学习之前馈神经网络（前向传播和误差反向传播）

查看全文

相关阅读:
《数据结构》张明瑞清华大学计算机科学与技术专业大二
 Android studio 快捷键
 android下的样式
 Android照片墙应用实现，再多的图片也不怕崩溃
 菜单
 Android下拉刷新完全解析，教你如何一分钟实现下拉刷新功能
 自动文本提示控件
 Android高效加载大图、多图解决方案，有效避免程序OOM
notification+service实现消息推送
 常见对话框

原文地址：https://www.cnblogs.com/rossiXYZ/p/13399527.html

Alink漫谈(十五) ：多层感知机 之 迭代优化

Alink漫谈(十五) ：多层感知机 之 迭代优化

0x00 摘要

0x01 前文回顾

1.1 基本概念

1.2 误差反向传播算法

1.3 总体逻辑

0x02 训练神经网络

2.1 初始化模型

2.2 压缩数据

2.3 生成优化目标函数

2.4 生成目标函数中的拓扑模型

2.4.1 AffineLayerModel

2.4.2 FuntionalLayerModel

2.4.3 SoftmaxLayerModelWithCrossEntropyLoss

2.4.3 最终模型

2.5 生成优化器

0x03 L-BFGS训练

3.1 CalcGradient 计算梯度

3.1.1 目标函数

3.1.2 计算梯度

3.1.2.1 计算各层的输出

AffineLayerModel.eval

FuntionalLayerModel.eval

SoftmaxLayerModelWithCrossEntropyLoss.eval

3.1.2.2 计算损失

3.1.2.3 计算delta

AffineLayerModel

FuntionalLayerModel

3.1.2.4 计算梯度

AffineLayerModel

FuntionalLayerModel

3.2 CalDirection 计算方向

3.3 CalcLosses 计算损失

3.4 UpdateModel 更新模型

0x04 输出模型

0xFF 参考

Alink漫谈(十五) ：多层感知机之迭代优化

Alink漫谈(十五) ：多层感知机之迭代优化