I would consider using what's commonly referred to as a texture atlas in game and graphics programming. In this case, I might call it just an "image atlas" instead. Basically, it just takes a bunch of related small images and combines them into a single large image.
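To make the idea concrete, here is a minimal sketch of the bookkeeping side of an atlas, assuming all the small images are the same size and are laid out in a simple grid. The function name `build_grid_atlas`, the sprite names, and the 4-column layout are made up for illustration; real atlas tools usually pack differently sized images more cleverly.

```python
def build_grid_atlas(names, tile_w, tile_h, columns):
    """Lay out equally sized images in a grid.

    Returns (atlas_width, atlas_height, rects), where rects maps each
    image name to its (x, y, width, height) rectangle inside the atlas.
    """
    rows = -(-len(names) // columns)  # ceiling division
    rects = {}
    for i, name in enumerate(names):
        col, row = i % columns, i // columns
        rects[name] = (col * tile_w, row * tile_h, tile_w, tile_h)
    return columns * tile_w, rows * tile_h, rects


# Hypothetical sprite names, each 36 x 36 pixels, packed 4 per row.
sprites = ["player", "enemy", "coin", "heart", "key"]
atlas_w, atlas_h, rects = build_grid_atlas(sprites, 36, 36, columns=4)

print(atlas_w, atlas_h)   # size of the single combined image
print(rects["key"])       # where to find the "key" sprite inside it
```

At draw time you would look up the rectangle for a sprite and copy or sample just that region of the one big image, instead of opening a separate file per sprite.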
This is usually a benefit for file size because many file systems allocate space in fixed-size blocks, often around 4 kilobytes. If you have thousands of images that are all just a little over 4k, then almost 50% of the space allocated for each image is wasted. Let's say you have a sprite that represents the player's character, stored uncompressed in 4-channel RGBA format (one byte per channel) at 36 x 36 pixels. It will be 36 x 36 x 4 = 5184 bytes. That doesn't fit in one 4096-byte block, so it takes two blocks, or 8192 bytes, and you're wasting 8192 - 5184 = 3008 bytes, or about 37% of the allocated space. If you put many of these images together, the percentage of waste goes way down, because the combined file as a whole will waste no more than 4096 - 1 = 4095 bytes. So combining 1000 files that each waste 2k on average saves about 2 MB of disk space.
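The arithmetic is easy to check with a few lines of Python. This is only a rough sketch: it assumes a 4096-byte block size, uncompressed pixel data, and it ignores file headers and any file system metadata. The helper names `on_disk_size` and `wasted_bytes` are just for this example.

```python
BLOCK_SIZE = 4096  # assumed file system block size, in bytes


def on_disk_size(payload_bytes, block_size=BLOCK_SIZE):
    """Space occupied once the file is rounded up to whole blocks."""
    blocks = -(-payload_bytes // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(payload_bytes, block_size=BLOCK_SIZE):
    """Allocated space that holds no actual data."""
    return on_disk_size(payload_bytes, block_size) - payload_bytes


sprite_bytes = 36 * 36 * 4  # 36 x 36 pixels, 4 bytes per pixel (RGBA)

# 1000 sprites stored as 1000 separate files.
separate_waste = 1000 * wasted_bytes(sprite_bytes)

# The same 1000 sprites stored back to back in one atlas file.
atlas_waste = wasted_bytes(1000 * sprite_bytes)

print("waste per sprite file:", wasted_bytes(sprite_bytes))
print("waste for 1000 separate files:", separate_waste)
print("waste for one atlas file:", atlas_waste)
```

With these assumptions the separate files waste around 3 MB between them, while the single atlas wastes less than one block.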
It's also a benefit in memory. For games, these images are usually loaded onto the graphics card. If you load each one separately, then you can easily get fragmented video memory as they're created and destroyed. You may end up with lots of little holes in video memory which together are large enough for a bigger allocation but, because they aren't contiguous, can't be used for it. Similar things can happen in RAM, though it's less of an issue these days given our very good virtual memory systems.
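Here is a toy model of that fragmentation problem. It is not how any real graphics driver manages memory; it just treats memory as 16 equal slots (an arbitrary number picked for the example) and shows that plenty of free space in total doesn't help if none of it is contiguous.

```python
def largest_free_run(slots):
    """Length of the longest run of consecutive free (False) slots."""
    best = run = 0
    for used in slots:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


# Toy memory of 16 slots; True means the slot is occupied.
slots = [True] * 16

# Destroy every other small image, leaving one-slot holes behind.
for i in range(0, 16, 2):
    slots[i] = False

total_free = slots.count(False)     # free slots overall
largest = largest_free_run(slots)   # biggest contiguous hole
can_fit_four = largest >= 4         # would a 4-slot allocation fit?

print(total_free, largest, can_fit_four)
```

Half of the memory is free here, yet a request for 4 contiguous slots fails. One big atlas allocation sidesteps this because it is created and destroyed as a single unit.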
I'm not much of an expert on networking, but I imagine you might see some savings there as well, for similar reasons to the disk space example above. I don't know if there are minimum packet sizes and what the ranges are on them, but it's something that might be worth thinking about.