eScholarship
Open Access Publications from the University of California

UCLA Electronic Theses and Dissertations

An Energy-Efficient Sparse-BLAS Coprocessor using STT-MRAM

Abstract

Sparse linear algebra arises in a wide variety of computational disciplines, including medical imaging, 3D graphics, compressive sensing, neural networks, bioinformatics, and various optimization problems. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for performing sparse linear algebra in high-performance computing (HPC) environments. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, largely because the compression formats required to store sparse matrices efficiently are a poor match for traditional computing architectures. This is a problem particularly in mobile environments, where consumer demand for smartphones and tablets dictates ever-increasing computational performance on a limited energy budget.

To address this issue, we carefully modeled the computational efficiency of sparse algorithms on CPUs and GPUs to identify the computational and memory bottlenecks in their architectures. Using this model, we developed a sparse linear algebra kernel that scales to efficiently utilize the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T field-programmable gate array (FPGA) prototype demonstrates an average computational efficiency of 91.85% and a peak of 99.8%, a >50x improvement over state-of-the-art CPUs and a >300x improvement over state-of-the-art GPUs. In addition, the sparse linear algebra FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, for an overall 23-30x improvement in energy efficiency.

An ASIC implementation of the sparse linear algebra kernel, in a 40nm 1P10M CMOS process, achieves a maximum performance of 4.12 GFLOP/s. At the minimum energy point (190.31 GFLOP/s/W at 0.6V and 160MHz), it improves energy efficiency by more than 3,073x, 2,262x, and 66.6x over the CPU, GPU, and FPGA implementations, respectively. Additionally, a data stream reordering scheme eliminates over 99% of data hazards across 14 test matrices, for an average 20% gain in computational efficiency over the FPGA implementation. Further energy-efficiency improvements could be made by replacing the on-chip SRAM with spintronic memories; fabrication results from three STT-MRAM chips and two MeRAM chips are also reported.
