BLISlab: A Sandbox for Optimizing GEMM

Abstract

Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course, yet in practice important enough that its implementation on computers continues to be an active research topic. This note describes a set of exercises that use this operation to illustrate how high performance can be attained on modern CPUs with hierarchical memories (multiple caches). It does so by building on the insights that underlie the BLAS-like Library Instantiation Software (BLIS) framework, exposing a simplified “sandbox” that mimics the implementation in BLIS. As such, it also becomes a vehicle for the “crowd sourcing” of the optimization of BLIS. We call this set of exercises BLISlab.

Introduction

Matrix-matrix multiplication (GEMM) is frequently used as a simple example with which to raise awareness of how to optimize code on modern processors. The reason is that the operation is simple to describe, challenging to fully optimize, and of practical importance. In this document, we walk the reader through the techniques that underlie the currently fastest implementations for CPU architectures.

Basic Linear Algebra Subprograms (BLAS)

The Basic Linear Algebra Subprograms (BLAS) form an interface for a set of linear algebra operations upon which higher-level linear algebra libraries, such as LAPACK and libflame, are built. The idea is that if someone optimizes the BLAS for a given architecture, then all applications and libraries that are written in terms of calls to the BLAS will benefit from such optimizations.

The BLAS are divided into three sets: the level-1 BLAS (vector-vector operations), the level-2 BLAS (matrix-vector operations), and the level-3 BLAS (matrix-matrix operations). The last set benefits from the fact that, if all matrix operands are $ n \times n $ in size, $ O( n^3 ) $ floating point operations are performed with $ O( n^2 ) $ data, so that the cost of moving data between memory layers (main memory, the caches, and the registers) can be amortized over many computations. As a result, high performance can in principle be achieved if these operations are carefully implemented.
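To make this concrete, consider the simplified operation $C := A B + C$ with all operands $ n \times n $. A rough count (the exact constants do not matter) gives about $ 2 n^3 $ flops, while $ A $, $ B $, and $ C $ must be read and $ C $ written, touching roughly $ 4 n^2 $ elements, so that $$\frac{\mbox{floating point operations}}{\mbox{memory operations}} \approx \frac{2 n^3}{4 n^2} = \frac{n}{2}.$$ Each element brought into fast memory can thus, in principle, be reused on the order of $ n $ times.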

Matrix-matrix multiplication

In particular, GEMM with double precision floating point numbers is supported by the BLAS with the (Fortran) call

      dgemm( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc )

which, by appropriately choosing transa and transb, computes $$C := \alpha A B + \beta C; \quad C := \alpha A^T B + \beta C; \quad C := \alpha A B^T + \beta C; \quad \mbox{or } C := \alpha A^T B^T + \beta C.$$ Here $ C $ is $ m \times n $ and $ k $ is the “third dimension”. The parameters lda, ldb, and ldc are explained later in this document.
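The same functionality is also exposed to C programs through the CBLAS interface. The following is a minimal sketch, assuming a CBLAS implementation is installed and declared in cblas.h, and assuming column-major storage; the wrapper name gemm_via_cblas is ours.

    #include <cblas.h>   /* CBLAS declarations, including cblas_dgemm */

    /* Compute C := alpha * A * B + beta * C (no transposition), where
       A is m x k, B is k x n, and C is m x n, all stored in
       column-major order with leading dimensions lda, ldb, and ldc. */
    void gemm_via_cblas( int m, int n, int k,
                         double alpha, const double *A, int lda,
                                       const double *B, int ldb,
                         double beta,        double *C, int ldc )
    {
        cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                     m, n, k, alpha, A, lda, B, ldb, beta, C, ldc );
    }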

In our exercises, we consider the simplified version of GEMM, $$C := A B + C,$$ where $ C $ is $ m \times n $, $ A $ is $ m \times k $, and $ B $ is $ k \times n $. If one understands how to optimize this particular case of dgemm, then one can easily extend this knowledge to all level-3 BLAS functionality.
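As a starting point, and as a correctness reference for the exercises that follow, this simplified operation can be written as three nested loops. The sketch below (the name dgemm_ref is ours) assumes column-major storage, so that element $ (i,j) $ of $ A $ is found at A[ j*lda + i ]; this is also what the leading dimensions encode.

    /* Unoptimized reference implementation of C := A B + C.
       A is m x k, B is k x n, and C is m x n, all stored in
       column-major order with leading dimensions lda, ldb, and ldc,
       so element (i,j) of A is A[ j*lda + i ], and similarly for B and C. */
    void dgemm_ref( int m, int n, int k,
                    const double *A, int lda,
                    const double *B, int ldb,
                          double *C, int ldc )
    {
        for ( int j = 0; j < n; j++ )            /* columns of C          */
            for ( int p = 0; p < k; p++ )        /* the "third" dimension */
                for ( int i = 0; i < m; i++ )    /* rows of C             */
                    C[ j*ldc + i ] += A[ p*lda + i ] * B[ j*ldb + p ];
    }

Understanding why this straightforward implementation leaves most of the performance of a modern CPU untapped, and how to close that gap, is the subject of the steps below.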

High-performance implementation

The intricacies of high-performance implementations are such that implementation of the BLAS in general, and GEMM in particular, was often relegated to unsung experts who develop numerical libraries for the hardware vendors, for example as part of IBM’s ESSL, Intel’s MKL, Cray’s LibSci, and AMD’s ACML libraries. These libraries were typically written (at least partially) in assembly code and highly specialized for a specific processor.

A key paper [@IBM:P2] showed how an “algorithms and architectures” approach to designing architectures, compilers, and algorithms hand in hand allowed the BLAS to be written in a high-level language (Fortran) for the IBM Power architectures, and explained the intricacies of achieving high performance on those processors. The Portable High Performance ANSI C (PHiPAC) [@PHiPAC97] project subsequently provided guidelines for writing high-performance code in C and suggested how to autogenerate and tune GEMM written this way. The Automatically Tuned Linear Algebra Software (ATLAS) [@ATLAS; @ATLAS_journal] built upon these insights and made autotuning and autogeneration of BLAS libraries mainstream.

As part of this document we discuss more recent papers on the subject, including the paper that introduced the Goto approach to implementing GEMM [@Goto:2008:AHP] and the BLIS refactoring of that approach [@BLIS1], as well as other papers that are of more direct relevance.

Other similar exercises

There are others who have put together exercises based on GEMM. Recent efforts relevant to this paper are those by Michael Lehn at Ulm University and a wiki on optimizing GEMM that we ourselves put together.

We need you!

The purpose of this paper is to guide you towards high-performance implementations of GEMM. Our ulterior motive is that our BLIS framework for implementing the BLAS requires a so-called micro-kernel to be highly optimized for various CPUs. In teaching you the basic techniques, we are hoping to identify “The One” who will contribute the best micro-kernel. Think of it as our version of “HPC’s Got Talent”. Although our description focuses on optimization for the Intel Haswell architecture, the setup can easily be modified to instead help you (and us) optimize for other CPUs. Indeed, BLIS itself supports architectures that include AMD and Intel’s x86 processors, IBM’s Power processors, ARM processors, and Texas Instruments DSP processors [@BLIS2; @BLIS3; @BLIS-TI].

Step 1: The Basics

02step1

Step 2: Blocking

03step2

Step 3: Blocking for Multiple Levels of Cache

04step3

Step 4: Parallelizing with OpenMP

05step4

Conclusion

06conclusion

Useful links

07links

Acknowledgments

We thank the other members of the Science of High-Performance Computing (SHPC) team for their support. This research was partially sponsored by the National Science Foundation grant ACI-1148125/1340293 and Intel through its funding of the SHPC group as an Intel Parallel Computing Center.

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).