After a whole epoch of multiple update steps, the neural networks in each thread will have diverged to the point where averaging their weights may no longer make sense. Ideally you should combine data for each update step. In turn, that means you will want to avoid making an update on every single example, because the overhead of starting, stopping and combining the threads may eat up most of the benefit.
It is common in neural networks to use mini-batches (larger than a single example, smaller than the whole dataset), both to get more accurate gradient estimates and to allow parallelisation. There is often a sweet spot in terms of learning speed (or sample efficiency) at some mini-batch size. For each mini-batch, you calculate gradients for all of its examples, average them into a mean gradient, then perform a single weight update step.
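As a minimal sketch of what one mini-batch update looks like, assuming a plain linear model with squared-error loss purely for illustration (the model, learning rate and function names here are not from the question, just placeholders):

```python
import numpy as np

def minibatch_step(w, X_batch, y_batch, lr=0.01):
    """One mini-batch update for a linear model with squared-error loss.

    Gradients are computed for every example in the batch, averaged into
    a single mean gradient, and applied in one weight update.
    """
    preds = X_batch @ w                       # predictions for the whole batch
    errors = preds - y_batch                  # per-example errors
    grad = X_batch.T @ errors / len(y_batch)  # mean gradient over the batch
    return w - lr * grad                      # single update step

# Illustrative usage: iterate over a toy dataset in mini-batches of 32.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 4)), rng.normal(size=256)
w = np.zeros(4)
batch_size = 32
for start in range(0, len(y), batch_size):
    w = minibatch_step(w, X[start:start + batch_size],
                       y[start:start + batch_size])
```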
Use your threads to calculate the gradients for a mini-batch, dividing the examples up between the threads, and average the gradients across all threads to make a single shared weight update. Larger mini-batches make more efficient use of multiple threads, while smaller mini-batches can be beneficial because you get to make more weight updates per epoch.
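A rough sketch of that pattern, reusing the toy linear model from above; the thread pool, shard sizes and function names are illustrative assumptions rather than any specific framework's API:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, X_shard, y_shard):
    """Mean gradient of the squared-error loss for one thread's shard."""
    errors = X_shard @ w - y_shard
    return X_shard.T @ errors / len(y_shard)

def parallel_minibatch_step(w, X_batch, y_batch, n_threads=4, lr=0.01):
    """Split one mini-batch across threads, average their gradients,
    then apply a single shared weight update."""
    X_shards = np.array_split(X_batch, n_threads)
    y_shards = np.array_split(y_batch, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        grads = list(pool.map(lambda s: shard_gradient(w, *s),
                              zip(X_shards, y_shards)))
    # Weight each shard's gradient by its size so the combined result
    # equals the mean gradient over the whole mini-batch, even when
    # the shards are not exactly equal in size.
    sizes = np.array([len(s) for s in y_shards])
    mean_grad = sum(g * n for g, n in zip(grads, sizes)) / sizes.sum()
    return w - lr * mean_grad  # one shared update for all threads

# Illustrative usage: one parallel update on a batch of 64 examples.
rng = np.random.default_rng(0)
X_batch, y_batch = rng.normal(size=(64, 4)), rng.normal(size=64)
w = parallel_minibatch_step(np.zeros(4), X_batch, y_batch)
```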