Sec. 1: Data Augmentation
- horizontal flipping, random crops, and color jittering (see the sketch after this list)
- fancy PCA: jitter the RGB channels along the principal components of the pixel colors
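A minimal numpy sketch of the flip/crop augmentations (the crop size and flip probability are illustrative choices, not from the notes):
import numpy as np
def augment(img, crop=224):
    # img: H x W x C array with H, W >= crop
    if np.random.rand() < 0.5:                 # horizontal flip with probability 0.5
        img = img[:, ::-1, :]
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)   # random crop offsets
    left = np.random.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]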
Sec. 2: Pre-Processing
- zero-center and normalize
X -= np.mean(X, axis=0)  # zero-center: subtract the per-feature mean
X /= np.std(X, axis=0)   # normalize: divide by the per-feature standard deviation
- PCA Whitening
X -= np.mean(X, axis=0)             # zero-center the data
cov = np.dot(X.T, X) / X.shape[0]   # data covariance matrix
U, S, V = np.linalg.svd(cov)        # U: eigenbasis, S: eigenvalues
Xrot = np.dot(X, U)                 # decorrelate: rotate the data into the eigenbasis
Xwhite = Xrot / np.sqrt(S + 1e-5)   # whiten: scale each dimension by the eigenvalue square root (1e-5 avoids division by zero)
Sec. 3: Initialization
- All-zero initialization: every neuron computes the same output and therefore the same gradients, so all weights receive identical updates and never differentiate.
- Initialization with small random numbers: weights ~ 0.001 * N(0,1)
- Calibrating the variances: the output of a randomly initialized neuron has a variance that grows with the number of inputs n (derivation below, assuming independent, zero-mean weights and inputs):
\begin{align}
Var(X) &= E(X^2) - E(X)^2 \\
Var(s) &= Var\left(\sum_{i=1}^{n} w_i x_i\right) \\
&= \sum_{i=1}^{n} \left[ E(w_i^2 x_i^2) - E(w_i x_i)^2 \right] \\
&= \sum_{i=1}^{n} \left[ E(w_i^2)E(x_i^2) - E(w_i x_i)^2 \right] \\
&= \sum_{i=1}^{n} \left[ Var(w_i)Var(x_i) - 2E(w_i x_i)^2 + E(w_i^2)E(x_i)^2 + E(w_i)^2E(x_i^2) \right] \\
&= \sum_{i=1}^{n} Var(w_i)Var(x_i) \\
&= n\,Var(w)\,Var(x)
\end{align}
w = np.random.randn(n) / np.sqrt(n)  # n: the number of inputs; keeps the output variance independent of n
- Current Recommendation
an initialization specifically for ReLUs (He initialization):
w = np.random.randn(n) * np.sqrt(2.0 / n)  # compensates for ReLUs zeroing out half of their inputs
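A quick numpy check of the derivation above (illustrative, not part of the original notes): with unit-Gaussian weights the output variance grows like n, while the 1/sqrt(n) scaling keeps it near 1.
import numpy as np
n, trials = 1000, 10000
x = np.random.randn(trials, n)                   # zero-mean, unit-variance inputs
w_naive = np.random.randn(n)                     # uncalibrated Gaussian weights
w_calibrated = np.random.randn(n) / np.sqrt(n)   # calibrated: Var(w) = 1/n
print(np.var(x.dot(w_naive)))       # roughly n (about 1000)
print(np.var(x.dot(w_calibrated)))  # roughly 1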
Sec. 4: During Training
- Learning rate: decay it during training by dividing by 2 (or by 5) when the validation error plateaus (see the sketch after the table below)
- Fine-tune pre-trained models on your own data; which layers to retrain depends on how much data you have and how similar it is to the pre-training data:
| | very similar dataset | very different dataset |
|---|---|---|
| very little data | Use linear classification on the top layer | Try linear classification on features from different stages |
| quite a lot of data | Finetune a few layers | Finetune a large number of layers |
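A minimal fine-tuning sketch for the "very little data, similar dataset" case, assuming PyTorch and a recent torchvision are available (the model, class count, and schedule values are illustrative): freeze the pre-trained backbone, train only a new linear head, and divide the learning rate by 2 on a fixed schedule.
import torch
import torchvision
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone
for p in model.parameters():
    p.requires_grad = False                                    # freeze all pre-trained layers
model.fc = torch.nn.Linear(model.fc.in_features, 10)           # new trainable head (10 classes here)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # divide the LR by 2 every 10 epochs
# training loop: forward/backward on your data, optimizer.step(), then scheduler.step() once per epoch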
Sec. 5: Activation Functions
- Sigmoid. Cons: saturates and kills gradients; outputs are not zero-centered.
- tanh. Cons: saturates and kills gradients (though, unlike the sigmoid, its outputs are zero-centered).
- Rectified Linear Unit (ReLU). Pros: computationally cheap and non-saturating in the positive regime. Cons: dying ReLU (units stuck at zero output receive no gradient and stop learning).
- Leaky ReLU: uses a small fixed negative slope (e.g. 0.01) for x < 0 instead of zero, to mitigate dying ReLUs.
- Parametric ReLU (PReLU): the negative slope is learned during training.
- Randomized ReLU (RReLU): the negative slope is sampled randomly during training.
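A small numpy sketch of these activations for reference (the slope value is the usual illustrative default, not prescribed by the notes):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # saturates for large |x|
def relu(x):
    return np.maximum(0, x)               # zero gradient for x < 0 (dying ReLU risk)
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)  # small fixed negative slope
# PReLU is leaky_relu with alpha treated as a learned parameter;
# RReLU samples alpha at random during training.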
Sec. 6: Regularization
- L2 regularization: heavily penalizes peaky weight vectors and prefers diffuse weight vectors
- L1 regularization: drives many weights to exactly zero, giving explicit feature selection
- Max norm constraints: enforce an absolute upper bound on the magnitude of the weight vector
- Dropout: randomly dropping units during training amounts to sampling a sub-network within the full Neural Network; at test time the full network is used (see the sketch below)
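A minimal sketch of inverted dropout (here p = 0.5 is the probability of keeping a unit, an illustrative value):
import numpy as np
p = 0.5  # probability of keeping a unit active
def dropout_train(h):
    mask = (np.random.rand(*h.shape) < p) / p  # drop units and rescale by 1/p ("inverted" dropout)
    return h * mask
def dropout_test(h):
    return h                                   # no scaling needed at test time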
Sec. 7: Insights from Figures
- The loss curve: a roughly linear decrease suggests the learning rate is too low; a loss that barely decreases suggests it is too high
- The train/validation accuracy curves: a big gap means overfitting, so increase regularization (or collect more data); no gap means underfitting, so increase model capacity
Sec. 8: Ensemble
- Same model, different initialization
- Top models discovered during cross-validation
- Different checkpoints of a single model
- early fusion (combine features before the classifier) & late fusion (average the models' predictions)
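A tiny numpy sketch of late fusion by averaging class probabilities across trained models (the predict_proba interface is a hypothetical placeholder):
import numpy as np
def late_fusion(models, X):
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)  # average per-class probabilities
    return np.argmax(probs, axis=1)                                # predicted class per example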