Locality sensitive hashing is a great tool for this problem.
Pick n random 400-dimensional vectors. (Be careful how you generate them, or not all directions will be equally likely; drawing each component from a standard Gaussian gives a uniformly random direction.) Each one defines a hyperplane through the origin that cuts your space in half. The sign of the dot product between one of these vectors and a new vector tells you which side of its hyperplane the new vector falls on. Computing n dot products therefore gives n 0/1 bits, which together make an n-bit hash.
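For concreteness, here's a minimal sketch of that hashing step in Python. The function names and NumPy usage are my own illustration, not any particular library's API:

```python
import numpy as np

def make_hyperplanes(n_bits, dim, seed=0):
    # One random hyperplane per hash bit; drawing each component from a
    # standard Gaussian makes every direction equally likely.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def hash_vector(planes, v):
    # Which side of each hyperplane v falls on, packed into an integer hash.
    bits = (planes @ v) > 0
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h

planes = make_hyperplanes(n_bits=16, dim=400)
# h = hash_vector(planes, some_400_dimensional_vector)
```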
Any new vector hashing to the same value must lie in the same narrow sliver of space extending from the origin, and those are exactly the vectors with high cosine similarity to each other, since their mutual angles are small. Likewise, anything hashing to almost the same value -- differing in only a few bits -- is likely to be nearby. So you can restrict your search for the most-similar vectors to the contents of one or a few buckets of hashed candidates.
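Continuing the sketch above, bucketing and candidate lookup could look like this; the Hamming radius of 1 is an arbitrary choice for illustration:

```python
from collections import defaultdict
from itertools import combinations

def build_index(planes, vectors):
    # Group vector ids by their n-bit hash.
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[hash_vector(planes, v)].append(i)
    return buckets

def nearby_hashes(h, n_bits, radius):
    # h itself, plus every hash differing from it in at most `radius` bits.
    yield h
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = h
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def candidates(planes, buckets, query, radius=1):
    # Ids worth scoring exactly for this query.
    h = hash_vector(planes, query)
    found = []
    for hh in nearby_hashes(h, planes.shape[0], radius):
        found.extend(buckets.get(hh, []))
    return found
```

You then compute exact cosine similarity only over the returned candidates rather than the whole data set.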
It doesn't directly help with memory, since any particular bucket might be needed to satisfy a request. You also lose some accuracy, since there's no guarantee the most similar vectors lie in the buckets you examine (though it becomes more likely the more buckets you check). Mostly, it lets you trade accuracy for speed. However, you may find you can get away with a caching scheme, where buckets that are rarely if ever accessed don't stay in memory.
You can see an implementation of this in Oryx, which I think is pretty straightforward.
Most of the complexity there comes from letting you specify a target percentage of vectors to evaluate; it works out a suitable hash size from that target and the number of cores on your machine.
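Roughly speaking, more hash bits means more, smaller buckets, and so a smaller fraction of the vectors examined per query. A back-of-the-envelope version of that sizing calculation might look like the following; it assumes vectors are spread evenly over the 2^n buckets and is not Oryx's actual formula:

```python
from math import comb

def choose_num_bits(target_fraction, hamming_radius=1, max_bits=32):
    # Smallest bit count whose probed buckets cover at most the target
    # fraction of vectors, assuming roughly uniform bucket sizes.
    for n in range(1, max_bits + 1):
        probed = sum(comb(n, r) for r in range(hamming_radius + 1))
        if probed / 2 ** n <= target_fraction:
            return n
    return max_bits

# choose_num_bits(0.01) -> 11: probing the exact bucket plus all 1-bit
# neighbors touches 12 of 2048 buckets, about 0.6% of the vectors.
```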