High Performance Distributed Deep Learning with multiple GPUs on Google Cloud Platform — Part 2 | by Sam | Dec, 2020
A short primer on scaling up your deep learning to multiple GPUs
In this multipart article, I outline how to scale your deep learning to multiple GPUs and multiple machines using Horovod, Uber’s distributed deep learning framework.
Read part one here:
Unsurprisingly, getting distributed training to work correctly isn’t entirely straightforward. You can follow along with my steps to get your experiment loaded and training on GCP:
- Package/restructure your application (see github repo here for an example)
- Create a Docker image and push that image to Google Container Registry
- Create an instance and run your training job
If everything is configured correctly, you’ll end up with an easy-to-follow recipe for parallelizing your deep learning on GCP.
Horovod essentially runs a copy of your training script in multiple processes (typically one per GPU) and averages the gradients across those processes at every training step. If you’re interested in how data parallelism and model parallelism work for distributed deep learning, I suggest reading this article, which is a nice summary:
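To make gradient averaging concrete, here is a minimal sketch in plain Python. It is not Horovod’s implementation: it uses a one-parameter linear model with a mean squared error loss, and four simulated workers that each compute the gradient on an equal-sized shard of the batch. The data and function names are made up for illustration.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def averaged_worker_gradient(w, xs, ys, num_workers):
    """Give each simulated worker a strided shard, then average the gradients."""
    if len(xs) % num_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    worker_grads = []
    for rank in range(num_workers):
        shard_x = xs[rank::num_workers]
        shard_y = ys[rank::num_workers]
        worker_grads.append(mse_gradient(w, shard_x, shard_y))
    return sum(worker_grads) / num_workers


# Toy data following y = 2x, with a deliberately wrong starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
w = 0.5

full = mse_gradient(w, xs, ys)
averaged = averaged_worker_gradient(w, xs, ys, num_workers=4)
print(full, averaged)
```

With equal-sized shards, the averaged per-worker gradient matches the gradient over the whole batch, which is why every replica can apply the same update and stay in sync.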
(Thanks to Google for upping my GPU quota)
It’s good practice to structure your project so it’s simple to run and containerize. If you’re interested in examples of how to do this, check out this repo that contains the template I use for my research projects:
You’ll need to structure your application to run as a script. This means putting all downloading and preprocessing of your data in the script. When you use PyTorch’s DistributedSampler (torch.utils.data.distributed.DistributedSampler), each process automatically receives its own, non-overlapping share of the data.
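The idea behind that sharding can be sketched in plain Python. This is an illustration of the strided split, not PyTorch’s code, and the function name shard_indices is hypothetical. Every process shuffles with the same seed and then takes every N-th index starting at its own rank. One caveat the sketch makes visible: when the dataset size is not divisible by the number of processes, a few samples are repeated as padding so that all shards have equal length (PyTorch’s sampler does the same by default).

```python
import math
import random


def shard_indices(dataset_size, num_replicas, rank, epoch=0, shuffle=True):
    """Return the sample indices that one process would train on."""
    if dataset_size < num_replicas:
        raise ValueError("this sketch assumes at least one sample per process")
    indices = list(range(dataset_size))
    if shuffle:
        # Same seed on every rank, so all processes agree on the ordering.
        random.Random(epoch).shuffle(indices)
    per_replica = math.ceil(dataset_size / num_replicas)
    total = per_replica * num_replicas
    # Pad by repeating the leading samples so every shard has equal length.
    indices += indices[: total - dataset_size]
    return indices[rank:total:num_replicas]


shards = [shard_indices(12, num_replicas=4, rank=r) for r in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

In a real Horovod script you would not write this yourself; you would pass num_replicas=hvd.size() and rank=hvd.rank() to DistributedSampler and hand the sampler to your DataLoader.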
Depending on how complex your data pipeline is, it’s nice to fit this all into the same script. However, that’ll depend on your project. Some nice examples for structuring your train script are here:
Horovod launches training like so (here, four processes on the local machine, one per GPU):
horovodrun -np 4 -H localhost:4 python app/run_train.py
For more information on how to run Horovod, see: https://horovod.readthedocs.io/en/stable/running_include.html
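For orientation, here is a minimal sketch of what a Horovod-enabled PyTorch training function might look like. It is not the code from my repo: the toy dataset, the linear model, and the function names are assumptions for illustration, and the learning-rate scaling is a common heuristic rather than a requirement. The Horovod and PyTorch imports sit inside train() so the file can be imported on a machine without GPUs.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Linear scaling heuristic: the effective batch grows with the worker count."""
    return base_lr * num_workers


def train(num_epochs=2, base_lr=0.01, batch_size=32):
    import torch
    import horovod.torch as hvd
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    hvd.init()                               # horovodrun starts one process per slot
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its own GPU

    # Toy stand-in data; a real script would download and preprocess here.
    torch.manual_seed(0)                     # every worker builds the same dataset
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    model = torch.nn.Linear(10, 1).cuda()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=scaled_learning_rate(base_lr, hvd.size())
    )
    # Wrap the optimizer so gradients are averaged across workers on each step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start every worker from identical weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    loss_fn = torch.nn.MSELoss()
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)             # reshuffle consistently across workers
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:                  # log from a single process only
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")


# In app/run_train.py you would call train() at the bottom of the file,
# so that each process started by horovodrun executes it.
```

Compared with a single-GPU script, the Horovod-specific parts are the init call, the GPU pinning, the sampler, the wrapped optimizer, and the two broadcasts.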
Once you’ve tested your training script locally (you should do this), you can build your container. Make sure the container has CUDA installed. To make this simple, I’ve provided an example project here; feel free to use the attached Dockerfile, which has been tested to work with this example.
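As a rough sketch of the build-and-push step, the helper below assembles the two Docker commands for an image hosted on Container Registry, which uses names of the form gcr.io/PROJECT_ID/IMAGE:TAG. The project ID, image name, and tag are hypothetical placeholders, and the sketch only prints the commands rather than running them. It also assumes Docker has already been authorized for the registry (for example with gcloud auth configure-docker).

```python
def registry_image(project_id, image, tag="latest", host="gcr.io"):
    """Full image name for a Container Registry repository."""
    return f"{host}/{project_id}/{image}:{tag}"


def build_and_push_commands(project_id, image, tag="latest", context="."):
    """Return the docker build and docker push commands as argument lists."""
    name = registry_image(project_id, image, tag)
    return [
        ["docker", "build", "-t", name, context],
        ["docker", "push", name],
    ]


# Hypothetical project and image names; substitute your own.
commands = build_and_push_commands("my-gcp-project", "horovod-train", "v1")
for cmd in commands:
    print(" ".join(cmd))
```

Returning argument lists keeps the commands easy to pass to subprocess.run later if you want to automate the step.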
The instance details that I’ve provided are below. In order to use GPUs, you’ll need to select a compatible instance type:
If you’re deploying to multiple machines, the main machine will need to be able to SSH into each worker without a prompt. I may cover this at a later date.
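For reference, the launch command generalizes to several machines by listing each host with its number of GPU slots after -H and setting -np to the total. The small helper below builds that command from a mapping of hosts to slots; the host names server1 and server2 are hypothetical, and the sketch only constructs the command without running it.

```python
def horovodrun_command(hosts, script, python="python"):
    """Build a horovodrun command from a {hostname: gpu_slots} mapping."""
    if not hosts:
        raise ValueError("at least one host is required")
    total_processes = sum(hosts.values())
    host_arg = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return ["horovodrun", "-np", str(total_processes), "-H", host_arg, python, script]


single = horovodrun_command({"localhost": 4}, "app/run_train.py")
multi = horovodrun_command({"server1": 4, "server2": 4}, "app/run_train.py")
print(" ".join(single))
print(" ".join(multi))
```

The single-machine case reproduces the command shown earlier; the two-machine case asks for eight processes, four on each host.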
When launching your instance, specify the source image as the image you’ve uploaded. This will automatically download the image and run it, saving you precious time with a potentially expensive machine.
Total training time (1 GPU): ~29 minutes
Total training time (4 GPUs): ~6 minutes
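That’s roughly a 4.8x speedup (29/6), slightly better than the 4x that perfectly linear scaling across 4 GPUs would give, although both timings are approximate.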
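The speedup and scaling efficiency implied by those two timings can be worked out directly. This is simple arithmetic on the rounded numbers above, so treat the result as approximate; the function names are just for illustration.

```python
def speedup(baseline_minutes, parallel_minutes):
    """How many times faster the parallel run finished."""
    return baseline_minutes / parallel_minutes


def scaling_efficiency(baseline_minutes, parallel_minutes, num_gpus):
    """Speedup divided by the number of GPUs; 1.0 means perfectly linear scaling."""
    return speedup(baseline_minutes, parallel_minutes) / num_gpus


s = speedup(29, 6)
e = scaling_efficiency(29, 6, num_gpus=4)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")
```

An efficiency above 100% most likely reflects the rounded timings and the different convergence behaviour of the 4-GPU run rather than faster individual steps.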
Because Horovod uses data parallelism, the time per epoch remains roughly the same. See statistics for each device in the image above.
However, since the training data is distributed to each GPU and the gradients are averaged, the number of epochs needed to converge scales roughly inversely with the number of GPUs.
So, previously, if we needed 100 epochs to converge with 1 GPU, we can estimate that it’ll only take ~25 (100/4) epochs to converge with 4 GPUs to around the same accuracy.
It’s not a law, so there may be some unexpected variation here. The training dynamics are definitely different. In this case, I noticed a “steeper curve,” where the error was much higher initially, and converged much faster. You’ll need to monitor the training as always to ensure there aren’t any issues with convergence.
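The rule of thumb above fits in a one-line estimator. It is only a starting point for planning a run, under the assumption that the number of epochs scales inversely with the number of GPUs; the function name is hypothetical.

```python
import math


def estimated_epochs(single_gpu_epochs, num_gpus):
    """Rough epoch budget for num_gpus, assuming inverse scaling with GPU count."""
    return math.ceil(single_gpu_epochs / num_gpus)


budget = estimated_epochs(100, 4)
print(budget)
```

Rounding up keeps the estimate on the cautious side when the division is not exact.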
If you’re looking for a monitoring tool and haven’t used Weights & Biases yet, give it a try.
Thanks for reading!