Florin Rusu - SGD on Highly-Parallel Architectures

Stochastic Gradient Descent on Highly-Parallel Architectures

In this work, we perform the first comprehensive study of parallel SGD that investigates the combined impact of three axes - computing architecture, model update strategy, and data sparsity - on three measures - hardware efficiency, statistical efficiency, and overall time to convergence. We allocate a significant part of the study to the design of a novel asynchronous SGD algorithm on GPU - a missing topic in the existing literature - that explores exhaustively possible data organization, access, and replication alternatives. To this end, we introduce an optimized asynchronous SGD algorithm for GPU that leverages architectural characteristics such as warp shuffling and cache coalescing for data and model access.

Exploratory axes. On the computing architecture axis, we consider multi-core CPUs with Non-Uniform Memory Access (NUMA) and many-core GPUs with wide Single Instruction Multiple Data (SIMD) processing units. The model update strategies we consider are synchronous and asynchronous. Synchronous updates follow a transactional semantics and allow a single thread to update the model. While this strategy limits the range of parallel execution inside the SGD algorithm, it is suitable for batch-oriented high-throughput GPU processing. In the asynchronous strategy, multiple threads update the model concurrently. Our focus is on the Hogwild algorithm which ignores any synchronization to the shared model. Data sparsity represents the third axis. At one extreme, we have dense data in which there is a non-zero entry for each feature in every training example. This allows for a complete dense 2-D matrix representation. When the model is large, it is often the case that the examples have only a few non-zero features. A sparse matrix format, e.g., Compressed Sparse Row (CSR), is the only alternative that fits in memory in this case. The majority of the GPU solutions implement synchronous model updates over dense data, while the CPU implementations use asynchronous Hogwild which is suited for sparse data. In this paper, we explore the complete space and map the remaining combinations. We design an efficient Hogwild algorithm for GPU that is carefully tuned to the underlying hardware architecture, the SIMD execution model, and the deep GPU memory hierarchy. The considerably larger number of threads available on the GPU and their complex memory access pattern pose significant data management challenges in achieving an efficient implementation with optimal convergence. The Hogwild GPU kernel considers data access path and replication strategies for data and the model to identify the optimal configuration for a given task-dataset input. Moreover, we introduce specific optimizations to enhance the coalesced memory access for SIMD processing. Synchronous SGD on CPU turns out to be a simplification of the GPU solution. Instead of executing the linear algebra kernels on the GPU, invoke functions on the CPU. This is easily achieved by using computational libraries with dual CPU and GPU support. Since the CPU functions do not require data transfer, they have the potential to outperform a sequence of GPU kernels that are not cross-optimized. The massive number of powerful threads in our testbed CPU provides an additional boost in performance.

Performance axes. The hardware efficiency measures the average time to do a complete pass - or iteration - over the training examples. Ideally, the larger the number of physical threads, the shorter an iteration takes since each thread has less data to work on - thus, higher hardware efficiency. In practice, though, this holds only when there is no interaction between threads - even then, the size and location of data can be limiting factors. In asynchronous SGD, however, the model is shared by all (or a group of) the threads. This poses a difficult challenge both in the CPU and GPU case. For CPU, the implicit cache coherency mechanism across cores can decrease the hardware efficiency dramatically. Non-coalesced memory accesses inside a SIMD unit have the same effect on GPU. The statistical efficiency measures the number of passes over the data until a certain value of the loss function is achieved, e.g., within 1% of the minimum. This number is architecture-independent for synchronous model updates which are executed at the end of each pass. In the case of asynchronous model updates during a data pass, however, the number and order of updates may have a negative impact on the statistical efficiency. The third performance axis is represented by the time to convergence. This is, essentially, the product between the hardware and statistical efficiency. The reason we include it as an independent axis is because there are situations when two algorithms have reversed hardware and statistical efficiency - algorithm A has better hardware efficiency than algorithm B and worse statistical efficiency - and only the time to convergence allows for a full comparison. Such a case is common for synchronous and asynchronous updates and for CPU and GPU execution, respectively.

Summary of results.

For synchronous SGD, GPU is always faster than parallel CPU in time per iteration and, thus, in time to convergence. The gap between GPU and parallel CPU is larger for complex deep nets models (5X on average), while super-linear speedup (>400X) is achieved over sequential CPU for LR and SVM. These results are on par or better than the synchronous solutions from TensorFlow and BIDMach. However, the speedup is nowhere close to the degree of parallelism gap between CPU and GPU.
Asynchronous SGD on CPU always outperforms GPU in time to convergence, even when GPU has a speedup larger than 10X in hardware efficiency. The gap between CPU and GPU is higher than 5X on sparse data and deep nets.
While GPU is the optimal architecture for synchronous SGD and CPU is optimal for asynchronous SGD, choosing the better of synchronous GPU and asynchronous CPU is task- and dataset-dependent. Thus, CPU should not be easily discarded. From a financial perspective, though, GPUs are likely the more cost-effective alternative.

This project is supported by a DOE Early Career Award. The results are published in the following papers:

Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous? by Y. Ma, F. Rusu, and M. Torres
Stochastic Gradient Descent on Highly-Parallel Architectures by Y. Ma, F. Rusu, and M. Torres

The code, experimental data, and configuration parameters are available on Yujing Ma's Github account.

UC Merced | EECS | Home | SGD

Last updated: Friday, July 26, 2019