The VGG16 structure is:
Img->
Conv1(3)->Conv1(3)->Pool1(2) ==>
Conv2(3)->Conv2(3)->Pool2(2) ==>
Conv3(3)->Conv3(3)->Conv3(3)->Pool3(2) ==>
Conv4(3)->Conv4(3)->Conv4(3)->Pool4(2) ==>
Conv5(3)->Conv5(3)->Conv5(3) ====> FC
The flow: http://ethereon.github.io/netscope/#/gist/dc5003de6943ea5a6b8b
In the Keras code (https://github.com/keras-team/keras/blob/master/keras/applications/vgg16.py):
# Block 1
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)
....
It looks like the Conv1_1 output is just passed to Conv1_2, and the same for the other blocks.
I can understand this:
Img->
Conv1(3)->Pool1(2) ==>
Conv2(3)->Pool2(2) ==>
Conv3(3)->Pool3(2) ==>
Conv4(3)->Pool4(2) ==>
Conv5(3) ====> FC
But I don't understand why Conv1_1 can connect to another layer of the same size, Conv1_2:
For example:
- Img: 224x224x3
- kernel: 3x3, depth: 3 (channels)

Conv1_1 uses 64 kernels, so there are 64 kernels of size 3x3x3 used to scan the 224x224x3 input (with padding 1), giving the Conv1_1 feature map of size 224x224x64.
Here, does it mean:
- Conv1_2 uses 64 kernels of size 3x3x64 to scan the Conv1_1 feature map (224x224x64), giving the Conv1_2 feature map of size 224x224x64?
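To check my understanding, here is a minimal shape-check sketch (assuming TensorFlow 2.x / tf.keras; the layer names just mirror the VGG16 code above):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D

# A dummy batch of one 224x224 RGB image, standing in for Img
img = tf.zeros((1, 224, 224, 3))

conv1_1 = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')
conv1_2 = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')

x = conv1_1(img)
print(x.shape)               # (1, 224, 224, 64)
print(conv1_1.kernel.shape)  # (3, 3, 3, 64): 64 kernels of 3x3x3

y = conv1_2(x)
print(y.shape)               # (1, 224, 224, 64)
print(conv1_2.kernel.shape)  # (3, 3, 64, 64): 64 kernels of 3x3x64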
If so, could you explain what the benefit is? I can't clearly understand the meaning of this flow. It is not clear why Conv1(3)->Conv1(3)->Pool1(2) is better than Conv1(3)->Pool1(2). My feeling is that another same-size Conv just makes the previous layer's output more blurred/ambiguous, unlike a Pooling layer, which concentrates the previous layer's output features.
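For reference, a rough sketch of the numbers behind the two flows I'm comparing (my own arithmetic, ignoring biases; the receptive-field sizes are the standard ones for stacked 3x3 convs):

# Block 1 weight counts, ignoring biases
conv1_1_weights = 3 * 3 * 3 * 64    # 1,728: 64 kernels of 3x3x3 on the RGB input
conv1_2_weights = 3 * 3 * 64 * 64   # 36,864: 64 kernels of 3x3x64 on the 64-channel map

# Conv1(3) -> Pool1(2): each pre-pool pixel sees a 3x3 input patch
# Conv1(3) -> Conv1(3) -> Pool1(2): two stacked 3x3 convs, so each
# pre-pool pixel sees a 5x5 input patch
print(conv1_1_weights, conv1_2_weights)  # 1728 36864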