You might want to look into autoencoders. They learn a simplified (compressed) representation of the input and then try to reconstruct the original input from that representation.
The good news about this architecture is that it is not a supervised learning technique, so you can feed it a lot of data without having to construct labeled input/output pairs: the input itself serves as the target.
Hopefully, if you train it well, it will learn that the noisy parts you intend to remove are exactly the parts of the image that are not important for the reconstruction.
Note that autoencoders have already been used to reduce noise in audio and images with great success.
Update: here's a quick, not-so-technical explanation of why they work.
In the typical architecture of autoencoder networks, you'll find an input layer, a number of hidden layers that usually have fewer nodes than the input layer, and then a number of hidden layers that gradually grow back until the output layer has the same number of nodes as the input layer.
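As a rough illustration, here is a minimal sketch of that shape in PyTorch (the layer sizes are assumptions for the example, e.g. 784 inputs for a flattened 28x28 grayscale image and a 32-node middle layer, not anything you're required to use):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # Hidden layers shrink toward a narrow middle layer...
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim), nn.ReLU(),   # the narrow middle
        )
        # ...then grow back until the output matches the input size.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```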

While they learn to reproduce the input at the output layer, the data necessarily has to pass through the narrow middle layer -- at that point you can consider the input to have been "compressed", usually in a lossy way, into a representation that can later be used to reproduce the input.
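To make that "compression" view concrete, using the hypothetical sketch above: the encoder half alone maps 784 values down to 32, and the decoder rebuilds an approximation of the original from just that code.

```python
import torch

model = Autoencoder()                  # from the sketch above (assumed sizes)
x = torch.rand(1, 784)                 # stand-in for one flattened image
code = model.encoder(x)                # shape (1, 32): the lossy "compressed" form
reconstruction = model.decoder(code)   # shape (1, 784): rebuilt approximation
```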
(Side note: this compression feature has been tested as a replacement for compression algorithms, but autoencoders are not as good at compressing data as algorithms specifically designed for it.)
While reducing the error between input and output, the network starts forgetting the pieces that add less information than other bits of the data. For instance, if you always train on images of figures on white backgrounds, the encoded representation of the image will likely carry no information about the white background itself, since it does not need to be stored in order to be recreated.
As such, the most common, least informative pieces start dropping out of the representation, and with the right tuning a good amount of noise goes away with them. Uniform noise is easier to remove (for the very same reason as the white-background example), but variable noise can also be handled with the right training.
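For the denoising case specifically, a common setup (sketched here, not prescriptive) is to corrupt the training images yourself and use the clean versions as the reconstruction target. `clean_batches` below is a placeholder for your own data pipeline of flattened, [0, 1]-scaled image tensors, and `Autoencoder` comes from the sketch above:

```python
import torch
import torch.nn as nn

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(10):
    for clean in clean_batches:
        # Add synthetic noise to the input, but keep the clean image as target.
        noisy = (clean + 0.1 * torch.randn_like(clean)).clamp(0.0, 1.0)
        recon = model(noisy)
        loss = loss_fn(recon, clean)   # reconstruct the *clean* version
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```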