We are in the era of Big Data, so dealing with very large datasets should no longer be frightening. Even when the available data is not massive, scaling comes in handy because it reduces the computational time of our experiments: being able to run a test in less time means we can perform more experiments.
There are mainly two kinds of scaling techniques: vertical and horizontal scaling. The former splits the computational cost across several GPUs on the same host, so there is typically an upper bound on the scaling factor. The latter splits the computational cost across several machines, so in principle there is no upper bound at all.
As in the previous post, we will build an Autoencoder Neural Network for collaborative filtering, but this time we will train it on multiple GPUs simultaneously.
Background
First of all, it is worth spending a few words on how the model is distributed across GPUs in TensorFlow. The computational graph is replicated on every GPU so that it can be executed simultaneously on different devices. Each replica is called a tower; more generally, a tower is a function that computes inference and gradients for a single model replica. The rationale behind replicating the model is that the training data can be split across the GPUs, each tower computes the gradient on its own share of the data, and those gradients are finally combined into a single one by averaging them.
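As a toy illustration of graph replication (assuming two visible GPUs; the constants and the reduction are placeholders, not part of the original model), the same computation can be pinned to different devices and evaluated in a single session run:

```python
import tensorflow as tf

# Toy example: the same computation replicated on two devices.
replicas = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        x = tf.constant([1.0, 2.0, 3.0])
        replicas.append(tf.reduce_sum(tf.square(x)))

# allow_soft_placement falls back to the CPU if a GPU is not available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(replicas))  # both replicas are evaluated in one call
```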
Model definition
Let’s define an inference and loss function for our model.
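A minimal sketch of what these two functions might look like for a dense autoencoder is shown below. The layer sizes, the sigmoid activation and the plain mean-squared-error loss are assumptions made for illustration; only the name model_loss is taken from the rest of the post.

```python
N_ITEMS = 1000      # assumed input dimensionality (number of items to reconstruct)
HIDDEN_UNITS = 128  # assumed size of the bottleneck layer

def model(ratings):
    """Inference: a single-hidden-layer autoencoder that reconstructs its input."""
    with tf.variable_scope('autoencoder', reuse=tf.AUTO_REUSE):
        encoded = tf.layers.dense(ratings, HIDDEN_UNITS,
                                  activation=tf.nn.sigmoid, name='encoder')
        decoded = tf.layers.dense(encoded, N_ITEMS, name='decoder')
    return decoded

def model_loss(ratings):
    """Loss: mean squared error between the input and its reconstruction."""
    return tf.reduce_mean(tf.square(ratings - model(ratings)))
```

The reuse=tf.AUTO_REUSE flag is what lets every tower share the same weights when the graph is replicated in the next section.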
Towers
The following function computes the loss of a single tower by calling model_loss and adds it to the "losses" collection. This collection gathers every loss belonging to the current tower, so the tower's total loss can be computed by summing them with tf.add_n.
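The post's exact implementation is not reproduced here; the sketch below shows the usual pattern, assuming the model_loss defined above. The scope argument is the tower's name scope, so that only the losses registered by the current tower are collected.

```python
def tower_loss(scope, ratings):
    """Compute the total loss for a single tower."""
    loss = model_loss(ratings)
    tf.add_to_collection('losses', loss)
    # Sum every loss registered under this tower's scope
    # (e.g. the reconstruction loss plus possible regularization terms).
    return tf.add_n(tf.get_collection('losses', scope), name='total_loss')
```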
Averaging gradients
Every GPU holds its own replica of the model and its own batch of data, so the gradients differ from GPU to GPU because they are computed on different data. By averaging them we obtain an approximation of the gradient we would get by running the whole batch on a single GPU.
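A gradient-averaging helper in the style of TensorFlow's classic multi-GPU examples could look like this; the name average_gradients and its exact structure are assumptions. It receives one list of (gradient, variable) pairs per tower, as returned by Optimizer.compute_gradients, and returns a single averaged list.

```python
def average_gradients(tower_grads):
    """Average the gradients computed by the towers, variable by variable."""
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars pairs the same variable across all towers:
        # ((grad_gpu0, var), (grad_gpu1, var), ...)
        grads = tf.stack([g for g, _ in grad_and_vars], axis=0)
        grad = tf.reduce_mean(grads, axis=0)
        # The variable is shared by every tower, so the first copy is enough.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads
```

Note that this is a synchronous scheme: every tower must finish its backward pass before the averaged gradient can be applied.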
Training
To train the model on multiple GPUs, we split the data into as many batches as there are GPUs. At every step each model replica computes the gradient on its own batch, and a single gradient is then obtained by averaging all the towers' gradients, as shown in the sketch below.
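Putting the pieces together, a training loop could look like the following sketch. The constants, the random stand-in data and the session configuration are illustrative assumptions; the overall structure (split the batch, build one tower per GPU, average the gradients, apply them once) follows the steps described above.

```python
import numpy as np

NUM_GPUS = 4
BATCH_SIZE = 250       # assumed to be the per-GPU batch size
LEARNING_RATE = 0.03
STEPS = 10000

ratings = tf.placeholder(tf.float32, [BATCH_SIZE * NUM_GPUS, N_ITEMS])
optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE)

# One slice of the input per GPU.
splits = tf.split(ratings, NUM_GPUS, axis=0)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i) as scope:
            loss = tower_loss(scope, splits[i])
            tf.get_variable_scope().reuse_variables()  # share weights across towers
            tower_grads.append(optimizer.compute_gradients(loss))

# Combine the towers' gradients into a single update.
train_op = optimizer.apply_gradients(average_gradients(tower_grads))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(STEPS):
        # Random stand-in data; in practice this would be a real batch of ratings.
        batch = np.random.rand(BATCH_SIZE * NUM_GPUS, N_ITEMS).astype(np.float32)
        # loss here is the last tower's loss, reported only for monitoring.
        _, loss_value = sess.run([train_op, loss], feed_dict={ratings: batch})
```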
Conclusions
The following table shows the training time and final loss of the same model deployed on different numbers of GPUs. The model is trained for 10000 steps with a learning rate of 0.03 and a batch size of 250, using the RMSProp optimizer. As for the hardware, training was performed on GeForce GTX 970 GPUs.
GPUs    Time (s)    Final loss
1       857.74      0.0281
2       562.51      0.0283
3       384.39      0.0262
4       263.56      0.0258
As the table shows, using more GPUs reduces the total training time while achieving essentially the same final loss.