Have a close look at the following image:-

The purple bar to the immediate left of Tux's right eye is 6 pixels by 1 pixel. That's equivalent to the 'coverage' of a single AES block when encrypting an RGB image (24 bits/pixel).
Since the AES key is fixed, and there is no relationship between adjacent blocks, the only variable is the so called plain text, e.g. Tux. So the 6 pixel strip is microscopically scrambled (intra-block diffusion), but an impression of Tux persists when viewed macroscopically (no inter-block diffusion). If the background is constant, the encrypted image part is constant.
And the relative distortion will decrease as the test image gets bigger. This is the nub of your question. I suggest the original Tuxes were much much bigger than the image you've linked to. That explains why the plain Tux and cypher Tux are virtually the same outline. As you suggest, you would expect to see a much more jagged edge to cipher Tux as the 6 pixel block shreds his outline. Especially the black parts which are quite a solid colour. This is not really apparent, thus I can only conclude the encryption was performed on a much larger portrait and then shrunk for posting. Internet lore.
This illustrates the need for some 'mode' of operation for a block primitive. That will create a relationship between adjacent blocks scrambling poor Tux into an indistinguishable coloured square, like so:-
