How to optimize cache usage when multiplying sparse matrices in parallel
First, let’s start with some theory.
I’m assuming that you, the reader, already know what a matrix is, since it’s a really common concept in mathematics. What you may not know is the definition of a sparse matrix. Surprisingly enough, there’s none! No formal one, at least, only rules of thumb. This is because sparsity is not a “strong property” of a matrix like, say, diagonality. So, the definition that I like the most is this one: “A matrix is defined as sparse when its number of non-zero elements is comparable to its number of rows or columns”. For example, a matrix with 1,000 rows and 1,000 columns could contain up to 1,000,000 non-zero elements but, in order to be called sparse, it should contain only around 10,000 of them, less than 1% of the total.
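The arithmetic above can be checked in a couple of lines (the dimensions and non-zero count here are just the illustrative numbers from the paragraph):

```python
# Rule-of-thumb sparsity check: a sparse n-by-n matrix has O(n) non-zeros.
rows, cols = 1000, 1000
nnz = 10_000  # number of non-zero elements we allow for "sparse"

density = nnz / (rows * cols)
print(density)  # 0.01, i.e. 1% of the 1,000,000 possible entries
```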
When performing matrix calculations with computers, it’s really common to work with much bigger matrices, so big in fact that they can only be stored in their entirety on disk and not in RAM. For this reason, the sparsity of these matrices can be exploited to save space. Several specialized storage formats have been developed over the past decades, and RSB, the one my project is about, is among the best in terms of cache efficiency. But let’s go in order.
The first one to be invented, and also the simplest, is the COO format, an awkward acronym for COOrdinate list. As the name suggests, the sparse matrix is represented as a list, or array, of triplets, one per non-zero element, each containing: the row index of the element, its column index and its value. This format is already pretty effective at its job, since the amount of memory it requires does not depend on the nominal size of the matrix but only on the number of non-zero elements inside it, the actual information we want to keep.
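Here’s a minimal sketch of the idea in plain Python (the small dense matrix is just an example I made up for illustration):

```python
# COO sketch: keep one (row, col, value) triplet per non-zero element.
dense = [
    [5, 0, 0],
    [0, 0, 2],
    [0, 3, 0],
]

coo = [(i, j, v)
       for i, row in enumerate(dense)
       for j, v in enumerate(row)
       if v != 0]

print(coo)  # [(0, 0, 5), (1, 2, 2), (2, 1, 3)]
```

Note that the list has 3 entries regardless of the matrix being 3×3 or 3,000×3,000: only the non-zeros cost memory.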
After COO came the CSR and CSC formats, which stand for Compressed Sparse Row/Column. These two are an improved version of the first one since they try, as the name suggests, to compress the row (or column) indices. In CSR, for instance, the per-element row indices are replaced by a much shorter array of row pointers, one offset per row marking where that row’s elements start, removing even more redundancy from the matrix representation.
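A quick sketch of CSR construction, again on a made-up 3×3 example: instead of repeating a row index for every element, we store one offset per row into the `values`/`col_idx` arrays.

```python
# CSR sketch: row_ptr[i] is the offset of row i's first non-zero in
# values/col_idx; row_ptr[-1] is the total number of non-zeros.
dense = [
    [5, 0, 0],
    [0, 0, 2],
    [0, 3, 0],
]

values, col_idx, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)
            col_idx.append(j)
    row_ptr.append(len(values))

print(values)   # [5, 2, 3]
print(col_idx)  # [0, 2, 1]
print(row_ptr)  # [0, 1, 2, 3]
```

Row `i`’s non-zeros are then the slice `values[row_ptr[i]:row_ptr[i+1]]`, which is why CSR is so convenient for row-wise operations like matrix-vector products.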
The RSB format, which stands for Recursive Sparse Blocks, is a bit more complicated, to say the least. It is a hybrid sparse matrix storage format which combines three data structures at three different levels to obtain both very good memory usage and really high cache efficiency. At the highest level, the root level, submatrices are stored in Z-Morton order. This is a space-filling curve (you can see an example in this post’s image), similar to the Hilbert curve, that helps preserve the spatial locality of bidimensional data points, meaning that each block is stored close to its neighbors in memory. At the intermediate level, submatrices/blocks are organized in a quad-tree data structure. This part is crucial because it allows matrix operations to be parallelized better: each block is subdivided based on an estimate of its size in memory. This way the RSB format ensures that each block can be stored entirely in the CPU cache, thus improving the overall performance. Finally, at the lowest level, the leaf level, each submatrix/block is stored using the common COO, CSR or CSC formats, depending on how many non-zero elements are present and how they are distributed within the block.
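To give a feel for the root level, the Z-Morton index of a block can be computed by simply interleaving the bits of its row and column coordinates. The function below is my own illustrative sketch, not librsb code:

```python
# Z-Morton sketch: interleave the bits of (row, col) so that sorting
# blocks by the resulting index keeps 2D neighbors close in memory.
def morton_index(row: int, col: int, bits: int = 16) -> int:
    z = 0
    for b in range(bits):
        z |= ((row >> b) & 1) << (2 * b + 1)  # row bits at odd positions
        z |= ((col >> b) & 1) << (2 * b)      # col bits at even positions
    return z

# Sorting a 4x4 grid of blocks by this index traces the Z-shaped curve:
order = sorted(((r, c) for r in range(4) for c in range(4)),
               key=lambda rc: morton_index(*rc))
print(order[:4])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The first four blocks in the ordering form the top-left 2×2 quadrant, which is exactly the locality property the quad-tree subdivision builds on.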
As you can probably tell by the length of this post, the RSB format is pretty complicated, and so are the algorithms that use it. It will surely be a challenge to port this code to CUDA, and we’ll see how it turns out in the next post.