But is it actually implemented that way, so that it performs some operations in parallel?
The published speed tests for Salsa20 assume a single core.
On the other hand, those speed tests predate AVX-512. I expect that, if the code were rewritten to take advantage of those instructions, it would go significantly faster.
In addition, Salsa20 effectively encrypts in counter mode: each 64-byte keystream block is computed from the key, the nonce and a block counter, independently of every other block. Hence it can be parallelized, with separate cores encrypting separate parts of the plaintext. Assuming you have a long plaintext message to encrypt, you can use as many cores as you think appropriate. I personally suspect that, unless the message is absolutely huge, the time needed to synchronize the various threads would defeat the parallelization gain, and you'd be better off using those threads for different tasks.
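To make the counter-mode point concrete, here is a minimal sketch in Python. Note the assumptions: `keystream_block` is a stand-in built on BLAKE2b, not the real Salsa20 block function, and the names `xor_range`, `encrypt_sequential` and `encrypt_parallel` are made up for illustration. The only property it relies on is the one that matters here: block number `n` of the keystream depends only on the key, the nonce and `n`, so any worker can start anywhere in the message.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64  # Salsa20 also emits keystream in 64-byte blocks


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for the Salsa20 block function (this is NOT Salsa20):
    # any keyed function of (nonce, counter) shows the same structure.
    return hashlib.blake2b(nonce + counter.to_bytes(8, "little"),
                           key=key, digest_size=BLOCK).digest()


def xor_range(key: bytes, nonce: bytes, data: bytes, first_block: int) -> bytes:
    # XOR `data` with the keystream starting at block number `first_block`.
    out = bytearray(len(data))
    for i in range(0, len(data), BLOCK):
        ks = keystream_block(key, nonce, first_block + i // BLOCK)
        chunk = data[i:i + BLOCK]
        out[i:i + len(chunk)] = bytes(a ^ b for a, b in zip(chunk, ks))
    return bytes(out)


def encrypt_sequential(key: bytes, nonce: bytes, data: bytes) -> bytes:
    return xor_range(key, nonce, data, 0)


def encrypt_parallel(key: bytes, nonce: bytes, data: bytes,
                     workers: int = 4, blocks_per_chunk: int = 1024) -> bytes:
    # Split the message on block boundaries; each chunk is processed
    # independently, starting at its own counter value.
    size = BLOCK * blocks_per_chunk
    offsets = range(0, len(data), size)

    def work(off: int) -> bytes:
        return xor_range(key, nonce, data[off:off + size], off // BLOCK)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so joining is safe.
        return b"".join(pool.map(work, offsets))
```

Both functions produce identical output, and since encryption is just an XOR with the keystream, running either one on the ciphertext gives back the plaintext. In CPython the threads above mostly illustrate the structure rather than deliver a speedup, because of the global interpreter lock; a real implementation would do this in native code, and the thread start-up and join cost is exactly the synchronization overhead mentioned above.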