I was researching hierarchical object detection and ended up reading that YOLOv3 is state of the art for this kind of task; besides, its inference time makes it one of the best options for running on live video.
So, what I have in mind is to run a pose-estimation model (like OpenPose) over the live video, then focus only on the rectangles near the hands of the estimated pose in order to detect the object.
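As a rough sketch of the crop step: given wrist keypoints from the pose estimator, you can cut out small boxes around the hands and feed only those to the detector. The keypoint names and box size below are illustrative assumptions, not OpenPose's actual output format.

```python
import numpy as np

def hand_crops(frame, keypoints, box_size=96):
    """Return square crops centered on each wrist keypoint.

    frame     -- H x W x 3 image array
    keypoints -- dict of pixel coordinates; the key names here
                 ("left_wrist", "right_wrist") are hypothetical
    box_size  -- side length of the crop, an arbitrary choice
    """
    h, w = frame.shape[:2]
    crops = []
    for name in ("left_wrist", "right_wrist"):
        if name not in keypoints:
            continue  # keypoint not detected in this frame
        x, y = keypoints[name]
        half = box_size // 2
        # Clamp the box to the frame boundaries.
        x0, x1 = max(0, int(x) - half), min(w, int(x) + half)
        y0, y1 = max(0, int(y) - half), min(h, int(y) + half)
        crops.append(frame[y0:y1, x0:x1])
    return crops

# Dummy 480x640 frame with one wrist near the top-right corner;
# the crop gets clamped to the image border.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
crops = hand_crops(frame, {"left_wrist": (620, 10)})
print(crops[0].shape)  # (58, 68, 3)
```

Each crop would then go to YOLOv3 (resized to its input resolution), which also cuts the detector's search space compared to running it on the full frame.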
The previous approach sounds good, but I feel like I'm not taking advantage of the temporal features of the video. For example, YOLOv3 might not be very sure that someone has a cellphone from the hand rectangle alone, but if I also add the movement of the estimated pose (hand near the head for several frames), I could be more confident that they have a phone.
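One simple way to use that temporal cue, short of a full video model, is late fusion over a sliding window: accumulate the detector's per-frame confidence together with a binary pose cue ("hand near head"), and only commit to a decision when both persist. A minimal sketch, with all window sizes and thresholds being illustrative guesses rather than tuned values:

```python
from collections import deque

class TemporalGate:
    """Fuse per-frame detector confidence with a pose-based cue
    over a sliding window of recent frames."""

    def __init__(self, window=15, det_thresh=0.3, vote_frac=0.6):
        self.scores = deque(maxlen=window)   # detector confidences
        self.cues = deque(maxlen=window)     # 1.0 if hand near head
        self.det_thresh = det_thresh
        self.vote_frac = vote_frac

    def update(self, det_score, hand_near_head):
        self.scores.append(det_score)
        self.cues.append(1.0 if hand_near_head else 0.0)
        # Fire only when a weak detector signal persists AND the
        # pose cue holds for most of the window.
        mean_score = sum(self.scores) / len(self.scores)
        cue_frac = sum(self.cues) / len(self.cues)
        return mean_score > self.det_thresh and cue_frac > self.vote_frac

# Individually weak detections become a confident decision
# once they are consistent across frames.
gate = TemporalGate(window=5)
for score in (0.2, 0.4, 0.5, 0.45, 0.5):
    decision = gate.update(score, hand_near_head=True)
print(decision)  # True
```

This is only a hand-rolled heuristic; the same idea appears in more principled forms as temporal smoothing of detections (e.g. tracking-by-detection) or as recurrent/3D-conv video models, which might be the keywords to search for.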
But I cannot find a paper or approach close to the idea I have in mind, so I was wondering if someone here could give me a clue about what path I should follow.
Thanks in advance for any help!