We are now in the era of Big Data, so dealing with huge quantities of data should no longer be frightening. Even when the available data is not huge, scaling up comes in handy because it reduces the computational time of our experiments: being able to run an experiment in less time means we can run more experiments.
There are mainly two kinds of scaling techniques: vertical and horizontal. The former means that the computational cost is split across multiple GPUs on the same host; typically there is an upper bound on the scaling factor. The latter means that the cost is split across several machines, with virtually no upper bound at all.
We will build the autoencoder neural network for collaborative filtering shown in the previous post, but this time we will train it on multiple GPUs simultaneously.
First of all, it is worth saying a few words about how a model is distributed across GPUs in TensorFlow. The computational graph has to be replicated on every GPU so that it can be executed simultaneously on different devices. This abstraction is called a tower: more generally, a tower is a function that computes inference and gradients for a single model replica. The basic rationale behind replicating the model is that the training data can be split across the GPUs, each tower computes its own gradient, and those gradients are finally combined into a single one by averaging them.
Let's define the inference and loss functions for our model.
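The original TensorFlow snippet is not reproduced here, but the shape of these two functions can be sketched in numpy. The one-hidden-layer architecture, the `tanh` activation, and all parameter names below are assumptions for illustration; only the idea of a separate `inference` (forward pass) and `model_loss` (reconstruction error) comes from the text.

```python
import numpy as np

def inference(x, w_enc, b_enc, w_dec, b_dec):
    """Forward pass of a toy one-hidden-layer autoencoder:
    encode the input, then decode it back to a reconstruction."""
    h = np.tanh(x @ w_enc + b_enc)   # hidden representation
    return h @ w_dec + b_dec         # reconstruction of the input

def model_loss(x, x_hat):
    """Mean squared reconstruction error between input and reconstruction."""
    return np.mean((x - x_hat) ** 2)
```

In the real model these would be built from TensorFlow ops so that gradients can flow through them; the numpy version only shows the data flow.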
The following function computes the loss for a tower using model_loss and adds the tower's loss to the collection "losses". This collection holds all the losses produced by the current tower and lets us compute the total loss efficiently with the element-wise summation tf.add_n.
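The mechanics of the "losses" collection can be sketched without TensorFlow. In the real code the collection is a graph collection (populated with `tf.add_to_collection` and summed with `tf.add_n`); here a plain dict plays that role, and the `reconstruct` callback and the mean-squared-error loss are illustrative assumptions.

```python
import numpy as np

# Stand-in for TensorFlow's graph collections.
collections = {"losses": []}

def model_loss(x, x_hat):
    # Hypothetical reconstruction loss: mean squared error.
    return np.mean((x - x_hat) ** 2)

def tower_loss(x_batch, reconstruct):
    """Compute this tower's loss, record it in the 'losses' collection,
    and return the total loss as the sum over the collection
    (the role tf.add_n plays in the TensorFlow version)."""
    loss = model_loss(x_batch, reconstruct(x_batch))
    collections["losses"].append(loss)
    return np.sum(collections["losses"])
```

Summing a collection rather than threading every loss term by hand is convenient when regularization losses are also added to the same collection.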
Every GPU has its own replica of the model and its own batch of data, so the gradients computed on different GPUs differ from each other, since they are computed on different data. We can obtain an approximation of the gradient that would be computed on a single GPU over the whole batch by averaging the per-tower gradients.
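The averaging step can be sketched as follows. In the TensorFlow version this function typically operates on lists of (gradient, variable) pairs; this numpy sketch keeps only the gradients, and the function name `average_gradients` is an assumption.

```python
import numpy as np

def average_gradients(tower_grads):
    """tower_grads: a list with one entry per tower; each entry is a list of
    per-variable gradient arrays (same order of variables in every tower).
    Returns a single list with the element-wise average gradient per variable."""
    return [np.mean(np.stack(grads_per_var), axis=0)
            for grads_per_var in zip(*tower_grads)]
```

`zip(*tower_grads)` regroups the gradients by variable instead of by tower, so each variable's gradients can be stacked and averaged in one step.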
In order to train the model on different GPUs, we have to split our data into as many batches as there are GPUs. Once every model replica has processed its batch, a single gradient is computed by averaging all the towers' gradients.
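Putting the two pieces together, one data-parallel training step can be sketched end to end in numpy. The toy loss, the linear "model", and all names below are assumptions; the point is that with equally sized sub-batches, the average of the per-tower gradients equals the gradient over the full batch.

```python
import numpy as np

def split_batches(x, num_gpus):
    # One sub-batch per GPU / tower.
    return np.array_split(x, num_gpus)

def tower_gradient(w, x_batch):
    # Gradient of the toy loss mean((x @ w)**2) w.r.t. w, for one tower.
    return 2 * x_batch.T @ (x_batch @ w) / len(x_batch)

def train_step(w, x, num_gpus, lr):
    """Split the batch across towers, compute one gradient per tower,
    average them, and apply a single update."""
    grads = [tower_gradient(w, xb) for xb in split_batches(x, num_gpus)]
    return w - lr * np.mean(grads, axis=0)
```

A quick sanity check is that `train_step` with one tower and with four towers produces the same update when the batch size divides evenly, which is exactly why the averaged gradient approximates single-GPU training.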
The following table shows the training time and final loss of the same model deployed on different numbers of GPUs. The model is trained for 10000 iterations with a learning rate of 0.03 and a batch size of 250, using the RMSProp optimizer. As for the hardware, the model has been trained on GeForce GTX 970 GPUs.
As the table shows, using more GPUs reduces the total training time while achieving the same results.