  • Convolution Fundamental I

    Foundations of CNNs

    Learn to implement the foundational layers of CNNs (pooling, convolutions) and to stack them properly in a deep network to solve multi-class image classification problems.

    Computer vision

    Computer vision is one of the fields that is advancing rapidly thanks to deep learning.

    Applications of computer vision that use deep learning include:

        Self driving cars

        Face recognition

    Deep learning is also enabling new kinds of art to be created with computer vision, as we will see.

    Rapid progress in computer vision is enabling applications that weren't possible a few years ago.

    Deep learning techniques for computer vision are constantly evolving and producing new architectures that can also help in areas other than computer vision.

        For example, Andrew Ng took some ideas from computer vision and applied them to speech recognition.

    Examples of computer vision problems include:

        Image classification

        Object detection

            Detect objects and localize them.

        Neural style transfer

            Changes the style of an image using another image.

    One of the challenges of computer vision problems is that images can be very large, and we want a fast and accurate algorithm to work with them.

        For example, a 1000*1000 RGB image gives 3 million features/inputs to a fully connected neural network. If the following hidden layer contains 1000 units, then we need to learn a weight matrix of shape [1000, 3 million], which is 3 billion parameters in the first layer alone, and that is very computationally expensive!
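    The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative, and the 3 channels assume an RGB image.

```python
# Parameter count of the first fully connected layer for a 1000*1000 RGB image.
height, width, channels = 1000, 1000, 3   # assumption: 3 channels (RGB)
hidden_units = 1000                        # size of the first hidden layer

n_inputs = height * width * channels       # number of input features
n_weights = hidden_units * n_inputs        # weight matrix of shape [1000, 3 million]

print(f"inputs:  {n_inputs:,}")
print(f"weights: {n_weights:,}")
```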

    One of the solutions is to build the network using convolution layers instead of fully connected layers.

    Edge detection example

    The convolution operation is one of the fundamental building blocks of a CNN. A good example of convolution is image edge detection.

    Early layers of a CNN might detect edges, the middle layers will then detect parts of objects, and the later layers will put these parts together to produce an output.

    In an image we can detect vertical edges, horizontal edges, or use a full edge detector.

    Vertical edge detection

        An example of a convolution operation to detect vertical edges:

        In the last example a 6*6 matrix convolved with a 3*3 filter/kernel gives us a 4*4 matrix.

        To perform the convolution operation in TensorFlow you can use the function tf.nn.conv2d. In Keras you can use the Conv2D layer.

        The vertical edge detection filter finds 3*3 regions in an image where a bright region is followed by a dark region.

        If we apply this filter to a white region followed by a dark region, it finds the edge between the two colors as positive values. If we apply the same filter to a dark region followed by a white region, it gives negative values. If we don't care about the direction of the transition, we can take the absolute value of the output.
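        A minimal pure-Python sketch of this behaviour follows. The toy 6*6 image (bright left half, dark right half) and the filter values are assumptions chosen for illustration, and conv2d is a hypothetical helper, not a library function.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(f_h) for b in range(f_w)))
        out.append(row)
    return out

# Assumed toy image: bright (10) on the left half, dark (0) on the right half.
bright_to_dark = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
dark_to_bright = [row[::-1] for row in bright_to_dark]

# Assumed vertical edge filter: positive left column, negative right column.
vertical_filter = [[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]]

edges = conv2d(bright_to_dark, vertical_filter)     # positive response at the edge
flipped = conv2d(dark_to_bright, vertical_filter)   # negative response at the edge
unsigned = [[abs(v) for v in row] for row in flipped]

for row in edges:
    print(row)
```

        Each row of edges is [0, 30, 30, 0]: the 6*6 input became a 4*4 output, and the non-zero columns mark the edge. The flipped image gives -30 in the same positions, and taking the absolute value removes the sign.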

    Horizontal edge detection

        The horizontal edge filter has positive values in the top row, zeros in the middle row, and negative values in the bottom row.

    There are many ways to choose the numbers inside horizontal or vertical edge detectors. For example, the vertical Sobel filter puts more weight on the middle row.

    There is also the Scharr filter, which puts even more weight on the middle row.
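    For reference, here are the filters mentioned above written as nested lists. These are the commonly cited values; since the original figures are not reproduced here, treat the exact numbers as an assumption. The transpose helper is hypothetical and only shows that a horizontal filter is the vertical one turned on its side.

```python
# 3*3 filters, rows listed top to bottom (commonly cited values, assumed here).
vertical = [[1, 0, -1],
            [1, 0, -1],
            [1, 0, -1]]

horizontal = [[1, 1, 1],
              [0, 0, 0],
              [-1, -1, -1]]

sobel_vertical = [[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]]

scharr_vertical = [[3, 0, -3],
                   [10, 0, -10],
                   [3, 0, -3]]

def transpose(kernel):
    """Swap rows and columns of a square kernel."""
    return [list(col) for col in zip(*kernel)]

# Transposing a vertical filter gives the matching horizontal filter.
sobel_horizontal = transpose(sobel_vertical)
print(sobel_horizontal)
```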

    What we learned in deep learning is that we don't need to handcraft these numbers; we can treat them as weights and learn them. The network can learn horizontal, vertical, angled, or any other edge type automatically rather than having them designed by hand.

    Padding

    In order to build deep neural networks we really need to use padding.

    In the last section we saw that a 6*6 matrix convolved with a 3*3 filter/kernel gives us a 4*4 matrix.

    As a general rule, if an n*n matrix is convolved with an f*f filter/kernel, the result is an (n-f+1)*(n-f+1) matrix.

    The convolution operation shrinks the matrix if f>1.

    We want to apply the convolution operation multiple times, but if the image shrinks each time we will lose a lot of data in the process. Also, the edge pixels are used less than other pixels in an image.

    So the problems with convolutions are:

        The output shrinks.

        A lot of the information at the edges is thrown away.

    To solve these problems we can pad the input image before convolution by adding some rows and columns to it. We call the padding amount p: the number of rows/columns that we insert at the top, bottom, left, and right of the image.

    In almost all cases the padding values are zeros.

    The general rule now: if an n*n matrix is convolved with an f*f filter/kernel and padding p, the result is an (n+2p-f+1)*(n+2p-f+1) matrix.

    If n=6, f=3, and p=1, then the output image will have n+2p-f+1=6+2-3+1=6. We maintain the size of the image.

    A same convolution is a convolution with padding chosen so that the output size is the same as the input size. It is given by the equation:

        p=(f-1)/2

    In computer vision f is usually odd. One of the reasons is that an odd-sized filter has a center value.

    Strided convolution

    Strided convolution is another building block used in CNNs.

    We call the stride s.

    When we perform the convolution operation, s tells us the number of pixels we jump each time we move the filter/kernel. In the previous examples s was 1.

    Now the general rule is:

        If an n*n matrix is convolved with an f*f filter/kernel, padding p, and stride s, the result is a ((n+2p-f)/s+1)*((n+2p-f)/s+1) matrix.

    If (n+2p-f)/s+1 is a fraction, we take the floor of this value.
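    The general rule, including the floor, fits in a one-line function. This is a sketch; conv_output_size and same_padding are hypothetical helper names.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an n*n input, f*f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1   # integer division implements the floor

def same_padding(f):
    """Padding that keeps the size unchanged when s=1 and f is odd."""
    return (f - 1) // 2

print(conv_output_size(6, 3))                        # no padding, stride 1
print(conv_output_size(6, 3, p=same_padding(3)))     # same convolution
print(conv_output_size(7, 3, s=2))                   # strided convolution
print(conv_output_size(6, 3, s=2))                   # (6-3)/2+1 = 2.5, floored
```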

    In math textbooks the convolution operation flips the filter before applying it. What we have been doing is technically called cross-correlation, but by convention the deep learning literature calls it convolution.

    With a stride, a same convolution is still a convolution with padding chosen so that the output size is the same as the input size. The padding is given by the equation:

        p=(n*s-n+f-s)/2

        When s=1 this reduces to p=(f-1)/2

    Convolution over volumes

    We have seen how convolution works with 2D images; now let's see how to convolve 3D images (RGB images).

    We convolve an image of shape height, width, # of channels with a filter of shape height, width, same # of channels. Note that the number of channels in the image and in the filter must be the same.

    We can think of this as a stack of filters, one for each channel!

    Example

        Input image: 6*6*3

        Filter: 3*3*3

        Result image: 4*4*1

        In the last result p=0,s=1

    Note that the output here is only 2D.

    We can use multiple filters to detect multiple features or edges. Example:

        Input image: 6*6*3

        10 Filters: 3*3*3

        Result image: 4*4*10

        In the last result p=0,s=1
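    The shapes in these two examples can be reproduced with a small pure-Python sketch. The random image and filter values are placeholders, and conv_volume is a hypothetical helper written for clarity rather than speed.

```python
import random

def conv_volume(image, filters):
    """image: [n][n][c] nested lists; filters: list of [f][f][c] nested lists.
    Returns an [n-f+1][n-f+1][number of filters] nested list (p=0, s=1)."""
    n, c = len(image), len(image[0][0])
    f = len(filters[0])
    out_n = n - f + 1
    output = [[[0.0] * len(filters) for _ in range(out_n)] for _ in range(out_n)]
    for k, filt in enumerate(filters):
        assert len(filt[0][0]) == c, "filter channels must match image channels"
        for i in range(out_n):
            for j in range(out_n):
                # Multiply element-wise across height, width and channels, then sum.
                output[i][j][k] = sum(
                    image[i + a][j + b][ch] * filt[a][b][ch]
                    for a in range(f) for b in range(f) for ch in range(c))
    return output

random.seed(0)
image = [[[random.random() for _ in range(3)] for _ in range(6)] for _ in range(6)]
filters = [[[[random.gauss(0, 1) for _ in range(3)] for _ in range(3)] for _ in range(3)]
           for _ in range(10)]

result = conv_volume(image, filters)
shape = (len(result), len(result[0]), len(result[0][0]))
print(shape)
```

    Each filter collapses all input channels into a single 2D map; stacking the 10 maps gives the third dimension of the output.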

    One Layer of a Convolutional Network

    First we convolve some filters with a given input, then add a bias to each filter output, and then apply RELU to the result. Example:

        Input image: 6*6*3 # a0

        10 Filters: 3*3*3 # w1

        Result image: 4*4*10 # w1a0

        Adding b (bias) of shape 10*1 gives us: 4*4*10 image # w1a0+b

        Applying RELU gives us: 4*4*10 image # a1=RELU(w1a0+b)

        In the last result p=0,s=1

        Note that the number of parameters here is: (3*3*3*10)+10=280

    The last example forms one layer of a CNN.

    Note that no matter how large the input is, the number of parameters for the same filters stays the same. That makes the layer less prone to overfitting.
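    A short sketch of the parameter count; conv_layer_params is a hypothetical helper name. Note that the input height and width do not appear in the formula at all.

```python
def conv_layer_params(f, n_c_prev, n_filters):
    """Learnable parameters of a conv layer with f*f filters."""
    weights = f * f * n_c_prev * n_filters   # one f*f*n_c_prev volume per filter
    biases = n_filters                       # one bias per filter
    return weights + biases

# 10 filters of size 3*3*3, as in the example above.
print(conv_layer_params(f=3, n_c_prev=3, n_filters=10))
```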

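    Here is some notation we will use. If layer l is a conv layer:

        Hyperparameters: f[l]=filter size, p[l]=padding, s[l]=stride, nc[l]=number of filters

        Input: n[l-1]*n[l-1]*nc[l-1]

        Output: n[l]*n[l]*nc[l], where n[l]=floor((n[l-1]+2p[l]-f[l])/s[l])+1

        Each filter: f[l]*f[l]*nc[l-1]

        Weights: f[l]*f[l]*nc[l-1]*nc[l]

        Bias: nc[l] values, one per filter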

    A simple convolution network example

    Let's build a bigger example.

        Input image: a0=39*39*3

            n0=39 and nc0=3

        First layer (Conv layer):

            f1=3, s1=1, and p1=0

            number of filters=10

            Then the output is a1=37*37*10

                n1=37 and nc1=10

        Second layer (Conv layer):

            f2=5, s2=2, p2=0

            number of filters=20

            The output is a2=17*17*20

                n2=17, nc2=20

            Note that the shrinking goes much faster because the stride is 2

        Third layer (Conv layer):

            f3=5, s3=2, p3=0

            number of filters=40

            The output is a3=7*7*40

                n3=7, nc3=40

        Fourth layer (Fully connected softmax)

            a3=7*7*40=1960, flattened into a vector.
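    The shapes in this example can be traced programmatically. A sketch, with hypothetical helper and variable names:

```python
def conv_output_size(n, f, p, s):
    """floor((n + 2p - f) / s) + 1"""
    return (n + 2 * p - f) // s + 1

# Hyperparameters of the three conv layers from the example.
layers = [
    {"f": 3, "s": 1, "p": 0, "filters": 10},
    {"f": 5, "s": 2, "p": 0, "filters": 20},
    {"f": 5, "s": 2, "p": 0, "filters": 40},
]

n, n_c = 39, 3                      # input image a0 = 39*39*3
shapes = [(n, n, n_c)]
for layer in layers:
    n = conv_output_size(n, layer["f"], layer["p"], layer["s"])
    n_c = layer["filters"]
    shapes.append((n, n, n_c))

flattened = n * n * n_c             # length of the vector fed to the softmax layer
print(shapes)
print(flattened)
```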

    In the last example you saw that the image gets smaller after each layer, and that is the trend now.

    Types of layers in a convolutional network:

        Convolution #Conv

        Pooling #Pool

        Fully connected #FC

    Pooling layers

    Other than conv layers, CNNs often use pooling layers to reduce the size of the inputs, speed up computation, and make some of the detected features more robust.

    Max pooling example:

    This example has the hyperparameters f=2, s=2, and p=0.

    The intuition behind max pooling is that if a feature is detected anywhere in the filter window, a high number is kept. But the main reason people use pooling is that it works well in practice and reduces computation.

    Max pooling has no parameters to learn.

    Example of max pooling on a 3D input:

        Input: 4*4*10

        Max pooling size=2 and stride=2

        Output: 2*2*10

    Average pooling takes the average of the values instead of the max.

    Max pooling is used more often than average pooling in practice.

    If the stride of pooling equals the filter size, the pooling windows do not overlap and the input shrinks by that factor (for example, f=2 and s=2 halves the height and width).

    Hyperparameters summary

        f: filter size

        s: stride

        Padding is rarely used here

        Max or average pooling
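    A minimal sketch of max and average pooling on a single 4*4 channel. The input values are made up for illustration and pool2d is a hypothetical helper. On a 3D input the same operation is applied to each channel independently, which is why the number of channels does not change.

```python
def pool2d(matrix, f=2, s=2, mode="max"):
    """Pool a 2D list with an f*f window and stride s (no padding)."""
    n_h, n_w = len(matrix), len(matrix[0])
    combine = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for i in range(0, n_h - f + 1, s):
        row = []
        for j in range(0, n_w - f + 1, s):
            window = [matrix[i + a][j + b] for a in range(f) for b in range(f)]
            row.append(combine(window))
        out.append(row)
    return out

# Made-up 4*4 input for one channel.
x = [[1, 3, 2, 1],
     [2, 9, 1, 1],
     [1, 3, 2, 3],
     [5, 6, 1, 2]]

max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")
print(max_pooled)
print(avg_pooled)
```

    There is nothing to learn here: f, s, and the choice of max or average are fixed hyperparameters.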

    Convolutional neural network example

    Now we will work through a full CNN example. This example is something like LeNet-5, which was invented by Yann LeCun.

        Input image: a0=32*32*3

            n0=32 and nc0=3

        First layer (Conv layer): #Conv1

            f1=5, s1=1, and p1=0

            number of filters=6

            Then the output is a1=28*28*6

                n1=28 and nc1=6

            Then apply (Max pooling): #Pool1

                f1p=2 and s1p=2

                The output is a1=14*14*6

        Second layer (Conv layer): #Conv2

            f2=5, s2=1, p2=0

            number of filters=16

            The output is a2=10*10*16

                n2=10, nc2=16

            Then apply (Max pooling): #Pool2

                f2p=2 and s2p=2

                The output is a2=5*5*16

        Third layer (Fully connected) #FC3

            Number of neurons is 120

            The output is a3=120*1; the input is the 5*5*16=400 values of a2 flattened into a vector

        Fourth layer (Fully connected) #FC4

            Number of neurons is 84

            The output is a4=84*1

        Fifth layer (Softmax)

            Number of neurons is 10 if we need to identify, for example, the 10 digits

    Note that Conv1 and Pool1 together are treated as one layer.

    Some statistics about the last example:
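    A sketch that computes activation sizes and parameter counts for the layers of the example above, assuming 6 filters in Conv1 and 16 in Conv2 as stated there. The helper names are hypothetical, and the counts include one bias per filter or neuron.

```python
def conv_params(f, n_c_prev, n_filters):
    """(f*f*n_c_prev weights + 1 bias) per filter."""
    return (f * f * n_c_prev + 1) * n_filters

def fc_params(n_in, n_out):
    """(n_in weights + 1 bias) per output neuron."""
    return (n_in + 1) * n_out

rows = [
    # (layer, activation shape, parameter count)
    ("Input",   (32, 32, 3),  0),
    ("Conv1",   (28, 28, 6),  conv_params(5, 3, 6)),
    ("Pool1",   (14, 14, 6),  0),
    ("Conv2",   (10, 10, 16), conv_params(5, 6, 16)),
    ("Pool2",   (5, 5, 16),   0),
    ("FC3",     (120, 1),     fc_params(5 * 5 * 16, 120)),
    ("FC4",     (84, 1),      fc_params(120, 84)),
    ("Softmax", (10, 1),      fc_params(84, 10)),
]

total_params = 0
for name, shape, params in rows:
    size = 1
    for dim in shape:
        size *= dim                      # number of activation values
    total_params += params
    print(f"{name:8s} shape={shape!s:14s} size={size:5d} params={params}")
print("total parameters:", total_params)
```

    Pooling layers contribute no parameters, and the first fully connected layer holds most of them.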

    There are a lot of hyperparameters. To choose the value of each, follow the guidelines that we will discuss later, or check the literature and take some ideas and numbers from it.

    Usually the input size decreases over the layers while the number of filters increases.

    A CNN usually consists of one or more convolution layers (not just one as in the examples shown) followed by a pooling layer.

    Fully connected layers have the most parameters in the network.

    To get a feel for how to put these blocks together, look at other working examples first to build some intuition.

    Why convolutions?

    Two main advantages of convolutions are:

        Parameter sharing.

            A feature detector (such as a vertical edge detector) that's useful in one part of the image is probably useful in another part of the image.

        Sparsity of connections.

            In each layer, each output value depends only on a small number of inputs, which makes it translation invariant.
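    To get a feel for how much parameter sharing saves, the sketch below compares a conv layer with a fully connected layer that maps the same input to the same number of outputs. The sizes (32*32*3 input, six 5*5 filters, 28*28*6 output) are taken from the LeNet-like example in these notes; biases are ignored for the fully connected count.

```python
# Input and output volumes of the first conv layer in the LeNet-like example.
n_in = 32 * 32 * 3        # 3072 input values
n_out = 28 * 28 * 6       # 4704 output values

# Fully connected: every output is connected to every input.
fc_weights = n_in * n_out

# Convolution: six 5*5*3 filters shared across all positions, plus one bias each.
conv_params = (5 * 5 * 3 + 1) * 6

print(f"fully connected weights: {fc_weights:,}")
print(f"conv parameters:         {conv_params:,}")
```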

    Putting it all together

    Deep convolutional models: case studies

    Learn about the practical tricks and methods used in deep CNNs straight from the research papers.

    Why look at case studies?

    We learned about conv layers, pooling layers, and fully connected layers. It turns out that computer vision researchers have spent the past few years working on how to put these layers together.

    To get some intuition you have to look at examples that have been built.

    A neural network architecture that works well on one task can often also work well on other tasks.

    Here are some classical CNN networks:

        LeNet-5

        AlexNet

        VGG

    ResNet, the CNN architecture that won the ImageNet competition in 2015, has 152 layers!

    There is also an architecture called Inception, made by Google, that is very useful to learn and to apply to your tasks.

    Reading about and trying the mentioned models can give you a boost and a lot of ideas for solving your task.

    Classic networks

    In this section we will talk about the classic networks LeNet-5, AlexNet, and VGG.

    LeNet-5

    The goal of this model was to identify handwritten digits in a 32*32*1 grayscale image. Here is a drawing of it:

    This model was published in 1998. The last layer wasn't using softmax back then.

    It has about 60K parameters.

    The height and width of the image decrease while the number of channels increases.

    Conv -> Pool -> Conv -> Pool -> FC -> FC -> softmax: this type of arrangement is quite common.

    The activation functions used in the paper were sigmoid and tanh. Modern implementations use RELU in most cases.

    [LeCun et al., 1998. Gradient-based learning applied to document recognition]

    AlexNet

    Named after Alex Krizhevsky, who was the first author of the paper. The other authors include Geoffrey Hinton.

    The goal of the model was the ImageNet challenge, which classifies images into 1000 classes. Here is a drawing of the model:

    Summary:

    Conv -> Max-pool -> Conv -> Max-pool -> Conv -> Conv -> Conv -> Max-pool -> Flatten -> FC -> FC -> Softmax

    Similar to LeNet-5 but bigger.

    Has about 60 million parameters, compared to the 60K parameters of LeNet-5.

    It used the RELU activation function.

    The original paper describes training on multiple GPUs and Local Response Normalization (LRN).

        Multiple GPUs were used because GPUs were not that fast back then.

        Researchers later found that Local Response Normalization doesn't help much, so for now don't bother understanding or implementing it.

    This paper convinced computer vision researchers that deep learning is really important.

    VGG-16

    A modification of AlexNet.

    Instead of having a lot of hyperparameters, let's have a simpler network.

    Focus on having only these blocks:

        CONV = 3*3 filter, s=1, same padding

        MAX-POOL = 2*2, s=2

    Here is the architecture:

    This network is large even by modern standards. It has around 138 million parameters.

        Most of the parameters are in the fully connected layers.

    It needs about 96MB of memory per image for forward propagation alone!

        Most of the memory is used in the earlier layers.

    The number of filters increases from 64 to 128 to 256 to 512; 512 is used twice.

    Pooling is the only thing responsible for shrinking the spatial dimensions.

    There is another, bigger version called VGG-19. But most people use VGG-16 instead of VGG-19 because it performs about the same.

    The VGG paper is attractive because it tries to establish some rules for using CNNs.

    Special Networks

    Residual Networks (ResNets)

    Very, very deep NNs are difficult to train because of vanishing and exploding gradient problems.

    In this section we will learn about skip connections, which let you take the activation from one layer and feed it directly to another layer much deeper in the NN. This allows you to train large NNs, even ones with more than 100 layers.

    Residual block

    ResNets are built out of residual blocks.

    They add a shortcut/skip connection before the second activation.

    The authors of this block found that you can train deeper NNs by stacking this block.

    Residual Network

    A NN that consists of residual blocks.

    These networks can go deeper without hurting performance. For normal NNs (plain networks), theory tells us that going deeper should give a better solution to our problem, but because of the vanishing and exploding gradient problems the performance of the network suffers as it gets deeper. Thanks to residual networks we can now go as deep as we want.

    On the left is the normal NN and on the right is the ResNet. As you can see, the performance of the ResNet keeps improving as the network gets deeper.

    In some cases going deeper won't affect the performance; that depends on the problem at hand.

    Some people are now trying to train networks with 1000 layers, which isn't used in practice.

    Why ResNets work

    Let's look at an example that illustrates why ResNets work.

        We have a big NN as the following:

            X -> Big NN -> a[l]

        Let's add two layers to this network as a residual block:

            X -> Big NN -> a[l] -> Layer1 -> Layer2 -> a[l+2]

            And a[l] has a direct connection to a[l+2]

        Suppose we are using RELU activations.

        Then:

            a[l+2]=g(z[l+2]+a[l])=g(w[l+2]a[l+1]+b[l+2]+a[l])

    Then if we are using L2 regularization, for example, w[l+2] will tend to shrink toward zero. Let's say that w[l+2] and b[l+2] are both zero.

    Then a[l+2]=g(a[l])=a[l], because a[l] is the output of a RELU and has no negative values.

    This shows that the identity function is easy for a residual block to learn, and that is why it can train deeper NNs.

    It also means that the two layers we added don't hurt the performance of the big NN we started with.

    Note: the dimensions of z[l+2] and a[l] have to be the same in ResNets. If they have different dimensions, we multiply a[l] by a parameter matrix ws (which can be learned or fixed):

        a[l+2]=g(z[l+2]+ws*a[l]) # the added ws makes the dimensions equal

        ws can also be a zero padding

    Using a skip connection helps the gradient backpropagate and thus helps you train deeper networks.
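    The identity argument can be checked numerically with a small fully connected residual block. This is a sketch in plain Python; the weights and activations are made-up numbers, and residual_block is a hypothetical helper that follows the formula a[l+2]=g(z[l+2]+a[l]).

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(add(matvec(W1, a_l), b1))   # a[l+1] = g(z[l+1])
    z_l2 = add(matvec(W2, a_l1), b2)        # z[l+2] = w[l+2] a[l+1] + b[l+2]
    return relu(add(z_l2, a_l))             # a[l+2] = g(z[l+2] + a[l])

# a[l] is non-negative because it comes out of an earlier RELU.
a_l = [0.5, 0.0, 2.0]

# Arbitrary first-layer weights (made up for illustration).
W1 = [[0.2, -0.1, 0.4],
      [0.7, 0.3, -0.5],
      [-0.6, 0.1, 0.9]]
b1 = [0.1, -0.2, 0.0]

# Second layer shrunk all the way to zero.
W2 = [[0.0] * 3 for _ in range(3)]
b2 = [0.0] * 3

a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2)
```

    With w[l+2] and b[l+2] at zero the block returns a[l] unchanged, whatever the first layer's weights are.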

    Let's take a look at ResNet on images.

        Here is the architecture of ResNet-34:

    All the 3*3 convs are same convs.

    Keep the design of the network simple.

    spatial size / 2 -> # filters * 2

    No FC layers and no dropout are used.

    Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are the same or different. You are going to implement both of them.

    The dotted lines mark the case where the dimensions are different. To handle it, they downsample the input by 2 and then pad with zeros to match the two dimensions. There is another trick called the bottleneck, which we will explore later.

    Useful concept (Spectrum of Depth)

    Residual block types:

    Identity block:

        Note that the conv is followed by a batch norm (BN) before the RELU. The dimensions here are the same.

        This skip is over 2 layers. The skip connection can also jump over more than 2 layers.

    The convolutional block:

        The conv can be a bottleneck 1*1 conv.

    Network in Network and 1*1 convolutions

    A 1*1 convolution, also called Network in Network, is very useful in many CNN models.

    What does a 1*1 convolution do? Isn't it just multiplying by a number?

        Let's first consider an example:

            Input: 6*6*1

            Conv: 1*1*1, one filter # the 1*1 conv

            Output: 6*6*1

        Another example:

            Input: 6*6*32

            Conv: 1*1*32, 5 filters # the 1*1 conv

            Output: 6*6*5
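    A 1*1 convolution is a small fully connected layer applied to the channel vector at every pixel. The sketch below reproduces the shapes of the second example; the random values are placeholders and conv1x1 is a hypothetical helper. In a real layer a bias and a non-linearity such as RELU would normally follow.

```python
import random

def conv1x1(volume, filters):
    """volume: [h][w][c_in] nested lists; filters: [c_out][c_in].
    Returns [h][w][c_out]: a dot product over channels at each pixel."""
    return [[[sum(w * x for w, x in zip(filt, pixel)) for filt in filters]
             for pixel in row]
            for row in volume]

random.seed(0)
volume = [[[random.random() for _ in range(32)] for _ in range(6)] for _ in range(6)]
filters = [[random.gauss(0, 1) for _ in range(32)] for _ in range(5)]   # 5 filters of 1*1*32

out = conv1x1(volume, filters)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)
```

    The height and width are untouched; only the number of channels changes, from 32 to the number of filters.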

    It has been used in a lot of modern CNN implementations like ResNet and Inception models.

    A 1*1 convolution is useful when:

        We want to shrink the number of channels. We also call this feature transformation.

            In the second example above we shrank the input from 32 channels to 5.

        We will see later that this shrinking can save a lot of computation.

        If we set the number of 1*1 conv filters equal to the number of input channels, the output will have the same number of channels. The 1*1 conv then just adds a non-linearity, letting the network learn a more complex function.

    Fully connected layers can be replaced with 1*1 convolutions; Yann LeCun argues they are the same thing:

        In Convolutional Nets, there is no such thing as "fully-connected layers". There are only convolution layers with 1*1 convolution kernels and a full connection table.

    Inception network motivation

    When you design a CNN you have to decide on all the layers yourself. Will you pick a 3*3 conv, a 5*5 conv, or maybe a max pooling layer? You have so many choices.

    What Inception tells us is: why not use all of them at once?

    Inception module, naïve version:

    Note that the max pool uses same padding here.

    The input to the inception module is 28*28*192 and the output is 28*28*256.

    We apply all the convs and pools we might want and let the NN learn and decide which ones it wants to use most.

    The problem of computational cost in the Inception model:

        Let's focus on just the 5*5 conv from the last example.

        There are 32 same-padded filters of size 5*5, and the input is 28*28*192.

        The output should be 28*28*32.

        The total number of multiplications needed here is:

            Number of outputs * filter size * filter size * input channels

            Which equals: 28*28*32*5*5*192 = 120 million (approx.)

            120 million multiplications is still a problem for modern computers.

        Using a 1*1 convolution we can reduce 120 million to just about 12 million. Let's see how.

    Using a 1*1 convolution to reduce computational cost:

        The new architecture is:

            X0 shape is (28,28,192)

            We then apply 16 filters (1*1 convolution)

            That produces X1 of shape (28,28,16)

                Note that we have reduced the number of channels here.

            Then apply 32 filters (5*5 convolution)

            That produces X2 of shape (28,28,32)

        Now let's calculate the number of multiplications:

            For the first conv: 28*28*16*1*1*192 = 2.4 million (approx.)

            For the second conv: 28*28*32*5*5*16 = 10 million (approx.)

            So the total is about 12.4 million, which is very good compared to 120 million.
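    The exact counts behind these rounded figures can be computed directly. A sketch; conv_multiplications is a hypothetical helper that applies the formula outputs * filter size * filter size * input channels.

```python
def conv_multiplications(out_h, out_w, n_filters, f, n_c_in):
    """Multiplications needed for a conv layer with f*f filters."""
    return out_h * out_w * n_filters * f * f * n_c_in

# Direct 5*5 conv on the 28*28*192 input.
direct = conv_multiplications(28, 28, 32, 5, 192)

# 1*1 conv down to 16 channels, then the 5*5 conv.
reduce_step = conv_multiplications(28, 28, 16, 1, 192)
conv_step = conv_multiplications(28, 28, 32, 5, 16)
with_bottleneck = reduce_step + conv_step

print(f"direct:          {direct:,}")
print(f"with 1*1 conv:   {with_bottleneck:,}")
print(f"reduction:       {direct / with_bottleneck:.1f}x")
```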

    The 1*1 conv here is called a bottleneck layer.

    It turns out that the 1*1 conv won't hurt performance.

    Inception module, dimension reduction version:

    Example of inception model in Keras:

    Inception network (GoogLeNet)

    The Inception network consists of concatenated blocks of the Inception module.

    The name Inception was taken from a meme image that was taken from the movie Inception.

    Here is the full model:

    Sometimes a max-pool block is used before the inception module to reduce the dimensions of the inputs.

    There are 3 softmax branches at different positions to push the network toward its goal. They help ensure that the intermediate features are good enough for the network to learn from, and it turns out that softmax0 and softmax1 have a regularization effect.

    Since the development of the inception module, the authors and others have built further versions of this network, such as Inception v2, v3, and v4. There is also a network that uses the inception module and ResNet together.

  • Original article: https://www.cnblogs.com/kexinxin/p/9904366.html