What methods are used to localize an object in a video and classify that object?
Example: I have a camera that detects a pickup truck driving into one of three garages (1, 2, 3). I need to know whether the truck was loaded or not (classification) and which garage it picked (localization). What would a schematic workflow for this problem look like?
Assume the camera is mounted in a fixed position.
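
To make the localization part concrete: because the camera does not move, each garage should always cover the same pixel area in the frame, so one possible approach is to test the center of the truck's bounding box against three fixed regions. The sketch below only illustrates that idea. The region coordinates, the names `GARAGE_REGIONS` and `locate_garage`, and the `(x_min, y_min, x_max, y_max)` box format are all assumptions for illustration, and the detector that produces the box and the loaded/unloaded classifier are not shown.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical garage regions for a 640x480 frame; with a fixed camera
# these would be marked once by hand on a reference image.
GARAGE_REGIONS: Dict[int, Box] = {
    1: (0, 100, 200, 400),
    2: (220, 100, 420, 400),
    3: (440, 100, 640, 400),
}


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def locate_garage(truck_box: Box,
                  regions: Dict[int, Box] = GARAGE_REGIONS) -> Optional[int]:
    """Return the id of the garage region containing the box center, or None."""
    cx, cy = box_center(truck_box)
    for garage_id, (x0, y0, x1, y1) in regions.items():
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return garage_id
    return None


# Example: a truck box assumed to come from some detector on the last frame
# before the truck disappears into a garage.
print(locate_garage((250, 150, 400, 350)))  # → 2
```

Under this assumption, the classification step (loaded or not) would be a separate model applied to the image crop inside the same bounding box.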