Machine Learning Interview

     I do NOT update this article any more. Notes of this kind will be kept personal starting today.

    Q1: Assuming that we train the neural network on the same total number of training examples, how should we set the batch size and the number of iterations? (where batch size * number of iterations = number of training examples shown to the neural network, with the same training example potentially being shown several times)

    It has been observed in practice that training with a larger batch significantly degrades the quality of the model, as measured by its ability to generalize. Large-batch methods tend to converge to sharp minimizers of the training and testing functions, and sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers. Large-batch methods are almost invariably attracted to regions with sharp minima and, unlike small-batch methods, are unable to escape the basins of these minimizers (these observations are due to Keskar et al., 2016, "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima").
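
    A quick way to probe this empirically is sketched below (not from the original article; the synthetic data, the network, and the hyperparameters are my own illustrative choices, assuming PyTorch is available). It trains the same small network for the same number of epochs with a small and with a large batch size and prints the train/test accuracy of each, so the generalization gap can be compared. On such a toy problem the gap may be small; the effect is most pronounced on deep networks and real datasets.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    torch.manual_seed(0)
    X = torch.randn(2000, 20)
    y = (X[:, :5].sum(dim=1) > 0).long()              # synthetic binary labels
    X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

    def run(batch_size, epochs=30):
        model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        loader = DataLoader(TensorDataset(X_tr, y_tr),
                            batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            for xb, yb in loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        with torch.no_grad():
            train_acc = (model(X_tr).argmax(1) == y_tr).float().mean().item()
            test_acc = (model(X_te).argmax(1) == y_te).float().mean().item()
        return train_acc, test_acc

    for bs in (16, 1000):                             # small batch vs one huge batch
        print(bs, run(bs))                            # compare the train/test accuracy gap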

    When the step size is too large relative to the curvature of the cost surface, a gradient descent step can actually move uphill. To see this, make a second-order Taylor series approximation to the cost function f(x) around the current point x_0, with gradient g and Hessian H evaluated at x_0:

        f(x) ≈ f(x_0) + (x − x_0)^T g + (1/2) (x − x_0)^T H (x − x_0)

    Take the gradient descent step x = x_0 − ε g:

        f(x_0 − ε g) ≈ f(x_0) − ε g^T g + (1/2) ε^2 g^T H g

    Whenever the curvature term (1/2) ε^2 g^T H g exceeds ε g^T g, the step increases the cost, so near a sharp minimum (large eigenvalues of H) only very small steps make progress.
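
    The toy numpy sketch below (not from the original article; the quadratic and the step sizes are my own illustrative choices) makes this concrete: on a quadratic with strong curvature along one axis, a step size of 0.05 already moves uphill.

    import numpy as np

    H = np.array([[100.0, 0.0],            # strong curvature along the first axis
                  [0.0,   1.0]])
    f = lambda x: 0.5 * x @ H @ x          # quadratic cost with Hessian H
    x0 = np.array([1.0, 1.0])
    g = H @ x0                             # gradient of f at x0

    for eps in (0.001, 0.05):              # small vs too-large step size
        x1 = x0 - eps * g                  # plain gradient descent step
        print(eps, f(x0), f(x1))           # with eps = 0.05 the cost increases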

    Q2: why do we need orthogonalization in neural network training?

    Orthogonalization means choosing tuning knobs so that each one affects a single aspect of performance: fit the training set better (bigger network, better optimizer), then fit the dev set (regularization, more data), then the test set (bigger dev set), then the real world (change the dev set or the cost function). Because the knobs do not interfere with each other, we can diagnose and fix one problem at a time. With acknowledgment to Week 1, Course 2, deeplearning.ai.

     Q3: why do we use logistic regression?

    Classification requires the model output to lie between 0 and 1 so that it can be interpreted as a probability, which linear regression does not guarantee. In addition, the Gaussian (normally distributed) error assumption behind linear regression does not hold for binary labels, so logistic regression instead models the log-odds with a linear function and passes it through a sigmoid.
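
    A minimal numpy sketch of logistic regression (not from the original article; the synthetic data and hyperparameters are illustrative): the sigmoid squashes the linear score into (0, 1) and the weights are fit by gradient descent on the log-loss.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic binary labels

    w, b, lr = np.zeros(2), 0.0, 0.1
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid output in (0, 1)
        grad_w = X.T @ (p - y) / len(y)           # gradient of the average log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)                                   # fitted decision boundary w.x + b = 0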

    Q4: Derive multinomial regression

    Thanks to https://en.wikipedia.org/wiki/Multinomial_logistic_regression
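
    A compressed version of the derivation on that page (a sketch; class K is taken as the reference class and the \beta_k are per-class weight vectors): model the log-odds of every class against the reference class as a linear function of x,

        \ln \frac{P(y=k \mid x)}{P(y=K \mid x)} = \beta_k \cdot x, \qquad k = 1, \dots, K-1,

    so P(y=k \mid x) = P(y=K \mid x)\, e^{\beta_k \cdot x}. Since the probabilities must sum to one,

        P(y=K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}}, \qquad
        P(y=k \mid x) = \frac{e^{\beta_k \cdot x}}{1 + \sum_{j=1}^{K-1} e^{\beta_j \cdot x}},

    which is the softmax form; the \beta_k are then estimated by maximizing the multinomial log-likelihood.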

    Q5: why is gradient descent with momentum usually better than the plain gradient descent?

    Gradient descent with momentum keeps an exponentially weighted average of the recently computed gradients and moves along that averaged direction. Gradient components that oscillate from step to step largely cancel out, while components that consistently point toward the optimum accumulate, so the iterates oscillate less and approach the optimal point faster.
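
    A minimal numpy sketch of the momentum update (not from the original article; it assumes the common formulation v = beta*v + (1 - beta)*grad on an ill-conditioned quadratic of my choosing, where plain gradient descent at the same learning rate diverges on the steep coordinate).

    import numpy as np

    def momentum_step(w, v, grad, lr=0.1, beta=0.9):
        """One gradient-descent-with-momentum update."""
        v = beta * v + (1 - beta) * grad   # exponentially weighted average of gradients
        w = w - lr * v                     # move along the smoothed direction
        return w, v

    grad_f = lambda w: np.array([1.0, 25.0]) * w   # steep in w[1], shallow in w[0]
    w, v = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(100):
        w, v = momentum_step(w, v, grad_f(w))
    print(w)                                       # close to the optimum at the origin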

    Q6: The difference between max pooling and average pooling?

    We perform pooling to increase translation invariance, to reduce computational complexity (2*2 max or average pooling with stride 2 discards 75% of the activations), and to summarize the features of a local neighbourhood.

    Max pooling keeps the strongest activation in each window, so it emphasizes salient features such as edges, whereas average pooling smooths the responses over the window. In global average pooling, a tensor with dimensions h×w×d is reduced to dimensions 1×1×d by averaging over the entire spatial extent.
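
    A minimal numpy sketch (not from the original article) of 2*2 max pooling, 2*2 average pooling, and global average pooling on an h×w×d tensor, assuming even h and w.

    import numpy as np

    def pool2x2(x, mode="max"):
        """2x2 pooling with stride 2 on an (h, w, d) array; h and w must be even."""
        h, w, d = x.shape
        windows = x.reshape(h // 2, 2, w // 2, 2, d)   # split into 2x2 spatial windows
        return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

    x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
    print(pool2x2(x, "max").shape)      # (2, 2, 3): 75% of the values are discarded
    print(pool2x2(x, "mean").shape)     # (2, 2, 3)
    print(x.mean(axis=(0, 1)).shape)    # (3,): global average pooling, i.e. 1x1xd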

    Q7: why do we use weight decay?

    To avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function E(w) to

        Ẽ(w) = E(w) + (λ/2) ||w||^2

    and the new gradient descent step will be

        w ← w − η ∇E(w) − η λ w = (1 − η λ) w − η ∇E(w)

    The new term coming from the regularization causes the weight to decay in proportion to its size.
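
    A minimal numpy sketch of one such step (not from the original article; the learning rate, the decay coefficient, and the example gradient are illustrative).

    import numpy as np

    def sgd_weight_decay_step(w, grad, lr=0.1, lam=0.01):
        """One SGD step on the regularized cost E(w) + (lam/2)*||w||^2."""
        return (1.0 - lr * lam) * w - lr * grad   # the (1 - lr*lam) factor shrinks w

    w = np.array([2.0, -3.0])
    grad = np.array([0.5, -0.5])                  # gradient of the data term E(w) only
    print(sgd_weight_decay_step(w, grad))         # each weight decays in proportion to its size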

    Q8: what is a deconvolutional layer?

    "Deconvolution layer" is a very unfortunate name for what is really a transposed convolution: it up-samples by padding the input with zeros and then performing an ordinary convolution. For example, if a 2*2 input is padded with 2 zeros on each side and convolved with a 3*3 kernel at stride 1, the output size becomes [(2+2*2-3)/1+1]=4. For visualizations see https://datascience.stackexchange.com/questions/6107/what-are-deconvolutional-layers
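
    A minimal PyTorch sketch (not from the original article) checking the shape arithmetic: a transposed convolution of a 2*2 input with a 3*3 kernel at stride 1 produces a 4*4 output, the same result as padding the input with 2 zeros on every side and convolving with the spatially flipped kernel.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 2, 2)                       # batch, channels, 2x2 input
    k = torch.randn(1, 1, 3, 3)                       # single 3x3 kernel

    up = F.conv_transpose2d(x, k, stride=1)           # the "deconvolution"
    eq = F.conv2d(x, k.flip(-1).flip(-2), padding=2)  # zero-pad by 2, flipped kernel

    print(up.shape, eq.shape)                         # both torch.Size([1, 1, 4, 4])
    print(torch.allclose(up, eq, atol=1e-6))          # True: it is the same operation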

    Q9: Why don't we use mse for classification?

    With a softmax (or sigmoid) output, the gradient of the MSE loss with respect to the logits contains the derivative of the activation, e.g. σ'(z) = σ(z)(1 − σ(z)), which is nearly zero whenever the unit saturates, even when the prediction is confidently wrong, so learning stalls. Cross-entropy cancels this factor and leaves a gradient proportional to (prediction − target), which stays large for confident, wrong predictions.
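
    A minimal numpy sketch (not from the original article) comparing the two gradients with respect to the logit z of a sigmoid unit when the prediction is saturated and wrong (target y = 1, large negative z).

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    y, z = 1.0, -8.0                   # confidently wrong prediction
    p = sigmoid(z)

    grad_mse = (p - y) * p * (1 - p)   # d/dz of 0.5*(p - y)^2 carries sigma'(z), ~ -3e-4
    grad_ce = p - y                    # d/dz of the cross-entropy loss, ~ -1.0

    print(grad_mse, grad_ce)           # the MSE gradient has nearly vanished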
