How can we program the Keras library (or TensorFlow) to partition training across multiple GPUs? Let's say that you are on an Amazon EC2 instance that has 8 GPUs and you would like to use all of them to train faster, but your code is written for just a single CPU or GPU.
4 Answers
From the Keras FAQ, below is copy-pasted code to enable 'data parallelism', i.e. having each of your GPUs process a different subset of your data independently.
from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
Note that this appears to be valid only for the TensorFlow backend at the time of writing.
Update (Feb 2018):
Keras now accepts automatic GPU selection using multi_gpu_model, so you don't have to hardcode the number of GPUs anymore. Details in this Pull Request. In other words, this enables code that looks like this:
try:
    model = multi_gpu_model(model)
except:
    pass
But to be more explicit, you can stick with something like:
parallel_model = multi_gpu_model(model, gpus=None)
Bonus:
To check whether you really are utilizing all of your GPUs (NVIDIA ones, specifically), you can monitor your usage in the terminal using:
watch -n0.5 nvidia-smi
- Does multi_gpu_model(model, gpus=None) work in the case where there is only 1 GPU? It would be cool if it automatically adapted to the number of GPUs available. – CMCDragonkai Aug 30 '18 at 06:25
- Yes, I think it works with 1 GPU (see https://github.com/keras-team/keras/pull/9226#issuecomment-361692460), but you might need to be careful that your code is adapted to run on a multi_gpu_model instead of a simple model. For most cases it probably wouldn't matter, but if you're going to do something like take the output of some intermediate layer, then you'll need to code accordingly (see the sketch after this thread). – weiji14 Sep 06 '18 at 03:35
- You mean something like https://github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/index.md? – weiji14 Sep 10 '18 at 04:17
- That reference was great, @weiji14. However, I'm also interested in how this works for inference. Does Keras somehow split batches equally or round-robin schedule on available model replicas? – CMCDragonkai Dec 10 '18 at 06:00
- I reported the batch size issue during multi-GPU inference: https://github.com/keras-team/keras/issues/11844 – CMCDragonkai Dec 12 '18 at 01:31
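A minimal sketch of the intermediate-layer caveat from the thread above, assuming a built Keras model named model and an input array x; the layer name 'dense_1' is a hypothetical example:

from keras.models import Model
from keras.utils import multi_gpu_model

# Train on the multi-GPU wrapper...
parallel_model = multi_gpu_model(model, gpus=None)

# ...but take intermediate outputs from the original template model,
# since its internal layers are not exposed on the wrapper.
intermediate = Model(inputs=model.input,
                     outputs=model.get_layer('dense_1').output)
features = intermediate.predict(x)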
- For TensorFlow:
Here is sample code showing how device placement works; for each task, you specify the device (or the list of devices) with tf.device:
import tensorflow as tf

# Creates a graph with one matmul per GPU.
c = []
for d in ['/gpu:2', '/gpu:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
# Sums the per-GPU results on the CPU.
with tf.device('/cpu:0'):
    total = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(total))
TensorFlow will use a GPU for computation by default, even for ops that could run on the CPU (provided a supported GPU is present). So you can simply loop over your devices; note that device indices are zero-based, so an 8-GPU instance exposes '/gpu:0' through '/gpu:7'. Each with tf.device(d) block then places the work it encloses on that GPU, covering all of your instance's GPU resources.
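As a concrete sketch of that loop, assuming an 8-GPU instance like the one in the question:

import tensorflow as tf

# Build one matmul per GPU; each op is placed on the device named in the
# enclosing tf.device() block.
c = []
for d in ['/gpu:%d' % i for i in range(8)]:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))

# Combine the per-GPU results on the CPU and run.
with tf.device('/cpu:0'):
    total = tf.add_n(c)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(total))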
Scaling Keras Model Training to Multiple GPUs
- Keras:
For Keras with the MXNet backend, you can pass a context built from args.num_gpus, where num_gpus is the number of GPUs you want to use:
def backend_agnostic_compile(model, loss, optimizer, metrics, args):
    if keras.backend._backend == 'mxnet':
        gpu_list = ["gpu(%d)" % i for i in range(args.num_gpus)]
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=metrics,
                      context=gpu_list)
    else:
        if args.num_gpus > 1:
            print("Warning: num_gpus > 1 but not using MxNet backend")
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=metrics)
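A hypothetical usage sketch of the function above; the only assumptions are that args carries a num_gpus attribute (argparse is one way to provide it) and that model has already been built:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--num_gpus', type=int, default=8)
args = parser.parse_args()

# Compiles against 8 MXNet GPU contexts, or falls back to a plain compile.
backend_agnostic_compile(model,
                         loss='categorical_crossentropy',
                         optimizer='rmsprop',
                         metrics=['accuracy'],
                         args=args)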
- horovod.tensorflow:
On top of all that, Uber recently open-sourced Horovod, and I think it's great:
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
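Note that Horovod runs one copy of the script per GPU, so you launch it through MPI rather than with plain python. A hedged example, assuming a single machine with 8 GPUs and a script named train.py:

mpirun -np 8 python train.py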

Basically, you can follow the example below. All you need to do is specify the CPU and GPU device counts after importing Keras.
import keras
import tensorflow as tf

config = tf.ConfigProto(device_count={'GPU': 1, 'CPU': 56})
sess = tf.Session(config=config)
keras.backend.set_session(sess)
After that, you would fit the model:
model.fit(x_train, y_train, epochs=epochs, validation_data=(x_test, y_test))
Finally, you can decrease these device counts if you don't want to run at the upper limits of your machine.

- Session seems to be available under compat in TensorFlow 2.0, but not in the main namespace. – EngrStudent Jan 06 '20 at 20:13
Here is a simple example of how we can access multiple GPUs with Horovod and Keras: see the Keras MNIST Example with Horovod on GitHub.
Plus, please go to the link for further info: Horovod with Keras
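For convenience, here is a minimal sketch of the pattern used in that linked MNIST example (assuming the standalone keras package, horovod.keras, and pre-loaded x_train/y_train; not a verbatim copy of the example):

import keras
import tensorflow as tf
import horovod.keras as hvd

# Initialize Horovod and pin each process to one GPU.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

model = ...  # build your Keras model here

# Wrap the optimizer so gradients are averaged across workers; the
# learning rate is scaled by the number of workers.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0 * hvd.size()))
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

# Broadcast initial weights from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=128, epochs=10, callbacks=callbacks)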

- And that is? I will try it like that :) – Hector Blandin Oct 19 '17 at 02:36