I'm trying to understand Single Shot Multibox Detection by following a book adopted at 500 universities from 70 countries. The book describes the model as follows:
The complete single shot multibox detection model consists of five blocks. The feature maps produced by each block are used for both (i) generating anchor boxes and (ii) predicting classes and offsets of these anchor boxes. Among these five blocks, the first one is the base network block, the second to the fourth are downsampling blocks, and the last block uses global max-pooling to reduce both the height and width to 1. Technically, the second to the fifth blocks are all those multiscale feature map blocks in Fig. 14.7.1.
The following is Fig. 14.7.1, mentioned in the excerpt above.
I'm aware that the second to the fifth blocks are all multiscale feature map blocks. My only question is whether the base network also counts as a feature map block. I suspect it does, though the book doesn't say so explicitly, so I'd like confirmation.
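To make my reading of the excerpt concrete, here is a minimal pure-Python sketch of the five-block layout. It only tracks feature-map sizes and anchor counts, not the actual layers. The 256x256 input, the base network downsampling by a factor of 8, the halving in blocks two to four, and 4 anchors per pixel are all assumptions I picked for illustration, and the function names are hypothetical. The point is that, if I read the excerpt correctly, the first block's feature map gets anchor boxes just like the other four.

```python
# Sketch of the five-block layout described in the excerpt.
# Input size, downsampling factors, and anchors per pixel are
# illustrative assumptions, not values stated in the excerpt.

def block_output_size(index, size):
    """Return the feature-map height/width after block `index` (0-based)."""
    if index == 0:
        return size // 8   # base network: assumed to downsample by 8 overall
    if index == 4:
        return 1           # last block: global max-pooling down to 1x1
    return size // 2       # blocks 2-4: downsampling blocks, assumed to halve


def feature_map_sizes(input_size=256, num_blocks=5):
    """Feed the input through all blocks and record each output size."""
    sizes = []
    size = input_size
    for i in range(num_blocks):
        size = block_output_size(i, size)
        sizes.append(size)
    return sizes


def anchors_per_block(sizes, anchors_per_pixel=4):
    """Every block's feature map, including the base network's, gets anchors."""
    return [s * s * anchors_per_pixel for s in sizes]


sizes = feature_map_sizes()
counts = anchors_per_block(sizes)
for i, (s, n) in enumerate(zip(sizes, counts), start=1):
    print(f"block {i}: {s}x{s} feature map, {n} anchors")
print("total anchors:", sum(counts))
```

Under these assumptions the base network contributes the largest feature map and therefore most of the anchors, which is why I think it behaves like a feature map block even if the figure only labels blocks two to five that way.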