Machine Learning Interview

     I do NOT update this article any more. Notes of this kind will be kept personal starting today.

    Q1: Assuming that we train the neural network on the same total number of training examples, how should we set the batch size and the number of iterations? (where batch size * number of iterations = number of training examples shown to the neural network, with the same training example potentially being shown several times)

    It has been observed in practice that training with a larger batch significantly degrades the quality of the model, as measured by its ability to generalize. Large-batch methods tend to converge to sharp minimizers of the training and testing functions, and sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers. Large-batch methods are almost invariably attracted to regions with sharp minima and, unlike small-batch methods, are unable to escape the basins of these minimizers (these observations are due to Keskar et al., 2016, "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima").
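
    A quick way to probe this empirically is sketched below (not from the original article; the synthetic data, the network, and the hyperparameters are my own illustrative choices, assuming PyTorch is available). It trains the same small network for the same number of epochs with a small and with a large batch size and prints the train/test accuracy of each, so the generalization gap can be compared. On such a toy problem the gap may be small; the effect is most pronounced on deep networks and real datasets.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)
    X = torch.randn(2000, 20)
    y = (X[:, :5].sum(dim=1) > 0).long()              # synthetic binary labels
    X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

    def run(batch_size, epochs=30):
        model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        loader = DataLoader(TensorDataset(X_tr, y_tr),
                            batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            for xb, yb in loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        with torch.no_grad():
            train_acc = (model(X_tr).argmax(1) == y_tr).float().mean().item()
            test_acc = (model(X_te).argmax(1) == y_te).float().mean().item()
        return train_acc, test_acc

    for bs in (16, 1000):                             # small batch vs one huge batch
        print(bs, run(bs))                            # compare the train/test accuracy gap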

    When the step size is too large relative to the curvature of the cost surface, a gradient descent step can actually move uphill. To see this, make a second-order Taylor series approximation to the cost function f(x) around the current point x_0, with gradient g and Hessian H evaluated at x_0:

        f(x) ≈ f(x_0) + (x − x_0)^T g + (1/2) (x − x_0)^T H (x − x_0)

    Take the gradient descent step x = x_0 − ε g:

        f(x_0 − ε g) ≈ f(x_0) − ε g^T g + (1/2) ε^2 g^T H g

    Whenever the curvature term (1/2) ε^2 g^T H g exceeds ε g^T g, the step increases the cost, so near a sharp minimum (large eigenvalues of H) only very small steps make progress.
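
    The toy numpy sketch below (not from the original article; the quadratic and the step sizes are my own illustrative choices) makes this concrete: on a quadratic with strong curvature along one axis, a step size of 0.05 already moves uphill.

    import numpy as np

    H = np.array([[100.0, 0.0],            # strong curvature along the first axis
                  [0.0,   1.0]])
    f = lambda x: 0.5 * x @ H @ x          # quadratic cost with Hessian H
    x0 = np.array([1.0, 1.0])
    g = H @ x0                             # gradient of f at x0

    for eps in (0.001, 0.05):              # small vs too-large step size
        x1 = x0 - eps * g                  # plain gradient descent step
        print(eps, f(x0), f(x1))           # with eps = 0.05 the cost increases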

    Q2: why do we need orthogonalization in neural network training?

    Orthogonalization means choosing tuning knobs so that each one affects a single aspect of performance: fit the training set better (bigger network, better optimizer), then fit the dev set (regularization, more data), then the test set (bigger dev set), then the real world (change the dev set or the cost function). Because the knobs do not interfere with each other, we can diagnose and fix one problem at a time. With acknowledgment to Week 1, Course 2, deeplearning.ai.

     Q3: why do we use logistic regression?

    Classification requires the model output to lie between 0 and 1 so that it can be interpreted as a probability, which linear regression does not guarantee. In addition, the Gaussian (normally distributed) error assumption behind linear regression does not hold for binary labels, so logistic regression instead models the log-odds with a linear function and passes it through a sigmoid.
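
    A minimal numpy sketch of logistic regression (not from the original article; the synthetic data and hyperparameters are illustrative): the sigmoid squashes the linear score into (0, 1) and the weights are fit by gradient descent on the log-loss.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic binary labels

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid output in (0, 1)
        grad_w = X.T @ (p - y) / len(y)           # gradient of the average log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)                                   # fitted decision boundary w.x + b = 0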

    Q4: Derive multinomial regression

    Thanks to https://en.wikipedia.org/wiki/Multinomial_logistic_regression
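
    A compressed version of the derivation on that page (a sketch; class K is taken as the reference class and the \beta_k are per-class weight vectors): model the log-odds of every class against the reference class as a linear function of x,

        \ln \frac{P(y=k \mid x)}{P(y=K \mid x)} = \beta_k \cdot x, \qquad k = 1, \dots, K-1,

    so P(y=k \mid x) = P(y=K \mid x)\, e^{\beta_k \cdot x}. Since the probabilities must sum to one,

        P(y=K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}}, \qquad
        P(y=k \mid x) = \frac{e^{\beta_k \cdot x}}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}},

    which is the softmax form; the \beta_k are then estimated by maximizing the multinomial log-likelihood.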

    Q5: why is gradient descent with momentum usually better than the plain gradient descent?

    Gradient descent with momentum keeps an exponentially weighted average of the recently computed gradients and moves along that averaged direction. Gradient components that oscillate from step to step largely cancel out, while components that consistently point toward the optimum accumulate, so the iterates oscillate less and approach the optimal point faster.
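
    A minimal numpy sketch of the momentum update (not from the original article; it assumes the common formulation v = beta*v + (1 - beta)*grad on an ill-conditioned quadratic of my choosing, where plain gradient descent at the same learning rate diverges on the steep coordinate).

    import numpy as np

    def momentum_step(w, v, grad, lr=0.1, beta=0.9):
        """One gradient-descent-with-momentum update."""
        v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
        w = w - lr * v                     # move along the smoothed direction
        return w, v

    grad_f = lambda w: np.array([1.0, 25.0]) * w   # steep in w[1], shallow in w[0]
    w, v = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(100):
        w, v = momentum_step(w, v, grad_f(w))
    print(w)                                       # close to the optimum at the origin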

    Q6: The difference between max pooling and average pooling?

    We perform pooling to increase translation invariance, to reduce computational complexity (2*2 max or average pooling with stride 2 discards 75% of the activations), and to summarize the features of a local neighbourhood.

    Max pooling keeps the strongest activation in each window, so it emphasizes salient features such as edges, whereas average pooling smooths the responses over the window. In global average pooling, a tensor with dimensions h×w×d is reduced to dimensions 1×1×d by averaging over the entire spatial extent.
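
    A minimal numpy sketch (not from the original article) of 2*2 max pooling, 2*2 average pooling, and global average pooling on an h×w×d tensor, assuming even h and w.

    import numpy as np

    def pool2x2(x, mode="max"):
        """2x2 pooling with stride 2 on an (h, w, d) array; h and w must be even."""
        h, w, d = x.shape
        windows = x.reshape(h // 2, 2, w // 2, 2, d)   # split into 2x2 spatial windows
        return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

    x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
    print(pool2x2(x, "max").shape)      # (2, 2, 3): 75% of the values are discarded
    print(pool2x2(x, "mean").shape)     # (2, 2, 3)
    print(x.mean(axis=(0, 1)).shape)    # (3,): global average pooling, i.e. 1x1xd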

    Q7: why do we use weight decay?

    To avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function E(w) to

        Ẽ(w) = E(w) + (λ/2) ||w||^2

    and the new gradient descent step will be

        w ← w − η ∇E(w) − η λ w = (1 − η λ) w − η ∇E(w)

    The new term coming from the regularization causes the weight to decay in proportion to its size.
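
    A minimal numpy sketch of one such step (not from the original article; the learning rate, the decay coefficient, and the example gradient are illustrative).

    import numpy as np

    def sgd_weight_decay_step(w, grad, lr=0.1, lam=0.01):
        """One SGD step on the regularized cost E(w) + (lam/2)*||w||^2."""
        return (1.0 - lr * lam) * w - lr * grad   # the (1 - lr*lam) factor shrinks w

    w = np.array([2.0, -3.0])
    grad = np.array([0.5, -0.5])                  # gradient of the data term E(w) only
    print(sgd_weight_decay_step(w, grad))         # each weight decays in proportion to its size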

    Q8: what is a deconvolutional layer?

    "Deconvolution layer" is a very unfortunate name for what is really a transposed convolution: it up-samples by padding the input with zeros and then performing an ordinary convolution. For example, if a 2*2 input is padded with 2 zeros on each side and convolved with a 3*3 kernel at stride 1, the output size becomes [(2+2*2-3)/1+1]=4. For visualizations see https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers
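
    A minimal PyTorch sketch (not from the original article) checking the shape arithmetic: a transposed convolution of a 2*2 input with a 3*3 kernel at stride 1 produces a 4*4 output, the same result as padding the input with 2 zeros on every side and convolving with the spatially flipped kernel.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 2, 2)                       # batch, channels, 2x2 input
    k = torch.randn(1, 1, 3, 3)                       # single 3x3 kernel

    up = F.conv_transpose2d(x, k, stride=1)           # the "deconvolution"
    eq = F.conv2d(x, k.flip(-1).flip(-2), padding=2)  # zero-pad by 2, flipped kernel

    print(up.shape, eq.shape)                         # both torch.Size([1, 1, 4, 4])
    print(torch.allclose(up, eq, atol=1e-6))          # True: it is the same operation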

    Q9: Why don't we use mse for classification?

    With a softmax (or sigmoid) output, the gradient of the MSE loss with respect to the logits contains the derivative of the activation, e.g. σ'(z) = σ(z)(1 − σ(z)), which is nearly zero whenever the unit saturates, even when the prediction is confidently wrong, so learning stalls. Cross-entropy cancels this factor and leaves a gradient proportional to (prediction − target), which stays large for confident, wrong predictions.
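
    A minimal numpy sketch (not from the original article) comparing the two gradients with respect to the logit z of a sigmoid unit when the prediction is saturated and wrong (target y = 1, large negative z).

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    y, z = 1.0, -8.0                   # confidently wrong prediction
    p = sigmoid(z)

    grad_mse = (p - y) * p * (1 - p)   # d/dz of 0.5*(p - y)^2 carries sigma'(z), ~ -3e-4
    grad_ce = p - y                    # d/dz of the cross-entropy loss, ~ -1.0

    print(grad_mse, grad_ce)           # the MSE gradient has nearly vanished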
