The use of Graphical Processing Units (GPU) for scientific computing has been an area of active research since as early as 1978 [1]. For the past few years, the demand for more realistic games in an ever growing market has lead GPU manufactures to produce high end graphics cards that are able to render complex scenes at very fast frame rates. As graphics problems are embarrassingly parallel in nature, GPUs have been designed as massively parallel architectures. Furthermore, to deal with the developments in graphics programming and the increasingly more complex processing, GPUs have gradually made a transition from fixed-function pipeline devices that are only able to perform fixed operations, to devices that include programmable components. As the instruction sets of these components grew in complexity, they offered a viable platform for use in scientific computing.
Describing a scientific problem in terms of graphics primitives is a daunting task that task that requires a strong understanding of graphics programming. For that reason research into this field began by graphics hardware research groups. Higher level but still domain specific languages were designed to make programming GPUs easier. These languages would compile at runtime to a lower level language suitable for the targeted device. They also led to further development of general languages that finally introduced GPU programming to scientist that have no graphics backgrounds. Yet, while writing GPGPU programs for GPUs became easier, the hardware itself hampered GPGPU by not supporting many of the features used in general purpose programming such as scatter operations, indirect addressing, thread synchronization and shared memory.
The large commodity GPU market and ruthless competition have allowed the production of relatively low cost devices when compared to HPC products with very high floating point operations per second (FLOPS) ratings. In fact, the gap between general purpose Central Processing Units (CPU) and GPUs are increasing in terms of peak Floating Point Operations per Second (FLOPS). GPUs, due to their graphics nature are Single Instruction Multiple Data architectures and are massively multi threaded. This allows the GPU manufactures to save huge amounts of die space by requiring a lot less control logic (SIMD) and a lot less cache (large amounts of threads mask memory latencies). The saved space can then be filled with Arithmetic Logic Units (ALU). While current generation GPUs are currently only able to perform single precision floating point arithmetic natively, there are still many applications in the scientific computing field that would consider single precision floating point arithmetic acceptable. For many other applications, numerical methods can be used to produce high precision results [2] and still preform better on the GPU then the CPU [3]. Furthermore next generation GPUs will be able to perform double precision arithmetic natively.
NVIDIA, having seen the potential for General-Purpose computation on GPUs (GPGPU) released a platform named CUDA to facilitate GPGPU on their new devices. One of their GPUs supported by this platform is the G8800GTX. This GPU was evaluated for use in Statistical Machine Learning (SML) applications. SML applications attempt to infer a functional dependency between some empirical data set and results. Many of these application rely on a iterative matrix vector products to produce results. Matrix vector products are easily parallizable and good candidate for migration to the GPU.
CUDA provides a BLAS library that is utilized to evaluate matrix vector products on the GPU. As many datasets are very sparse (99%) in nature, sparse matrix vector products are developed and evaluated on the GPU.
One of the key objectives in Machine Learning (ML) is, given some patterns xi, such as pictures of apples and oranges, and corresponding labels yi, such as the information whether xi is an apple or an orange, to find some function f which allows us to estimate y from x automatically. See e.g. [4] for an introduction. In this quest, convex optimization is a key enabling technology for many problems. For instance, Teo et al [5] proposed a scalable convex solver for such problems. It is an iterative algorithm that involves guessing a solution vector w, using this to evaluate a loss function l(x,y,w) and its derivative g = dw l(x,y,w), and then updating w accordingly. This process is repeated until a desired level of convergence is achieved. As mentioned above the majority of time is spent evaluating the matrix vector products, and the elements of matrix (X) do not change between iterations.
Many ML datasets are very sparse (99% and above). Exploiting the sparsity decreases the memory footprint of the matrix as well as the the number of floating point operations required for the matrix vector product. Unfortunately it also introduces random memory access patterns and indirect addressing, which is likely to result in less efficient utilization of a GPU's hardware.
The Compute Unified Device Architecture (CUDA), is a hardware and software architecture that enables the issuing and managing of computations on the GPU as a data-parallel device without the need to map the computations to a graphics API. Instead of having separate vertex and fragment processor, NVIDIA have introduced a unified architecture. The new architecture is based on a massively multi-threaded design where vertex and fragment shaders are executed on the same hardware.
Since synchronization and data sharing is only available between threads running on the same SM, CUDA organizes threads into blocks. A block will always run o the same SM. CUDA also allows defining a block as a 1, 2 or 2 dimensional array of threads. This is beneficial when the data is not uni dimensional in nature as it allows each thread to have a index in the x, y and z coordinates. All threads in a block have the same view of shared memory and can use a synchronization barriers together.
Multiple blocks form a grid (all threads for a single kernel). A grid can be defined as 1 or 2 dimensional group of blocks to better suit the data. CUDA recommends hundreds of blocks to be defined as multiple blocks can run on the same SM concurrently if the SM has enough free resources.
Threads are executed in warps (32 threads), though current generation hardware can only execute 16 threads per cycle. The is achieved by having the SMs running at twice the speed of the rest of the GPU so each SP can perform 2 operations in one GPU cycle.
Memory is one of the more complex parts of CUDA and is organized a follows:
In addition CUDA introduces texture references that allow memory accesses to be cached. Texture cache is optimized for 2D spacial locality so accessing memory references that are closer together will result in better performance. With texture references a cache hit will lower the pressure on DRAM but will not lower fetch latency. Random access patterns will preform better with texture references than with other memory access methods.
Provided with CUDA are Basic Linear Algebra Subprograms (BLAS) and Fast Fourier Transform (FFT) implementations, however NVIDIA only provides a C/C++ API for these. A programming guide [6] providing additional information is available from NVIDIA (http://www.nvidia.com)
| 1 | J. N. England. A system for interactive modeling of physical curved surface objects. SIGGRAPH Comput. Graph., 12(3):336-340, 1978. |
| 2 | Takeshi Ogita, Siegfried M. Rump, and Shin'ichi Oishi. Accurate sum and dot product. |
| 3 | Dominik Göddeke, Robert Strzodka, and Stefan Turek. Accelerating double precision FEM simulations with GPUs. In Proceedings of ASIM 2005 - 18th Symposium on Simulation Technique, Sep 2005. |
| 4 | Valdimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998. |
| 5 | Choon Hui Teo, Alex Smola, S. V.N. Vishwanathan, and Quoc Viet Le. A scalable modular convex solver for regularized risk minimization. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 727-736, New York, NY, USA, 2007. ACM. |
| 6 | NVIDIA. NVIDIA CUDA Programming Guide, 1.1 edition, November 2007. |