Fastest sorting algorithm for distributed systems (Parallel Radix Sort) [Difficulty: Medium]
The time has come to talk about parallelization and MPI, the heart of our topic and my project. First, the problem the parallel Radix Sort can solve will be presented. Then, we will see the algorithm before talking about my implementation. Lastly, it will be interesting to expose performance results.
Problem the parallel Radix Sort can solve
Every day we collect more and more data to be analyzed. With nowadays large-scale data and large-scale problems, parallel computing is a powerful ally and important tool. While processing data, one recurrent step is the sorting. In my second blog post, I have highlighted the importance of sorting algorithms. Parallel sorting is one of the important component in parallel computing. It allows us to reduce the sorting time and to sort more data, amounts that can’t be sorted serially. Indeed, it is possible when we want to sort a huge amount of data that it can’t fit on one single computer because computers are memory bounded. Or, it may take too much time using only one computer, a time we can’t afford. Thus, for memory and time motivations, we can use distributed systems and have our data distributed across several computers and sort them in that way. But how? Let’s define properly the context and what we want before.
What we have
- P resources able to communicate between themselves, store and process our data. They can be CPUs, GPUs, computers… Each resource has a unique rank between 0 and P-1.
- N data, distributed equally across our resources. This means that each resource has the same amount of data.
- The data are too huge to be stored or processed by only one resource.
What we want
- Sort the data across the resources, in a distributed way. After the sort, resource 0 should have the lower part of the sorted data, resource 1 the next and so on.
- We want to be as fast as possible and consume the lowest memory possible on each resource.
Like all distributed and parallel systems, an issue we have to be careful of is communications between the resources. They have to be as minimal as possible to make parallel algorithms efficient. Otherwise, we spend too much time in communications instead of treating the problem itself. Through communications, we can exchange data and information between the resources. The more amount of data we exchange, the more it will take time.
There is a bunch of parallel sorts and among them sample sort, bitonic sort, column sort, etc. Each of them has its pros and cons. But, as far as I know, they are not so many satisfying all of our requirements above. Often, they need to gather all the data or a huge part of them on one resource, at one point or another, to be efficient. This is not suitable. They can be managed without gathering the data on one resource, but, most of the time, they are not efficient enough anymore because of the communication overhead involved. The parallel Radix Sort is one of those which meets our requirements while being efficient. It is currently known as the fastest internal sorting method for distributed-memory multiprocessors.
Now, we will use the word processor instead of resource because it is the word often used in HPC.
Parallel Radix Sort
I recommend to read my two previous blog posts where I have detailed the serial Radix Sort because the parallel version is entirely based on it. So if you don’t know the Radix Sort, it is better to read them first. The notations used here have been introduced in the previous posts.
In general, parallel sorts consist of multiple rounds of serial sort, called local sort, performed by each processor in parallel, followed by movement of keys among processors, called the redistribution step. Local sort and data redistribution may be interleaved and iterated a few times depending on the algorithms used. The parallel Radix Sort also follows this pattern.
We assume that each processor has the same amount of data otherwise processors workload would be unbalanced because of the local sort. Indeed, if one processor has more data than the others, it will take more time to achieve its local sort and the total sorting time will be greater. However, if the data are equally distributed, the processors will take the same time to achieve their local sort and none of them will have to wait for another to finish. As a result, the total sorting time will be shorter.
The idea of the parallel Radix Sort is, for each key digit, to first sort locally the data on each processor according to the digit value. All the processors do that concurrently. Then, to compute which processors have to send which portions of their local data to which other processors in order to have the distributed list sorted across processors and according to the digit. After iterating for the last key digit, the distributed array will be sorted as we want. Below is the parallel algorithm.
Input: rank (rank of the processor), L (portion of the distributed data held by this processor) Output: the distributed data sorted across the processors with the same amount of data for each processor 1. for each keys digit i where i varies from the least significant digit to the most significant digit: 2. use Counting Sort or Bucket Sort to sort L according to the i’th keys digit 3. share information with other processors to figure out which local data to send where and what to receive from which processors in order to have the distributed array sorted across processors according to the i’th keys digit 4. proceed to these exchanges of data between processors
Each processor runs the algorithm with its rank and its portion of the distributed data. This is a “high-level” parallel Radix Sort algorithm. It describes what to do but not how to do it because there are so many ways of doing the steps 3. and 4. depending on many parameters like the architecture, environment and communication tool used. Let’s go through an example and then I will explain how I have chosen to implement it.
We start with the distributed unsorted list above to run our parallel Radix Sort. P equals 3 because there are 3 processors and N equals 15. We will run the example in base 10 for simplicity but don’t forget that we can use any number base we want and in practice, base 256 is used as explained in my previous posts. Also for simplicity, we deal with unsigned integer data and the sorting keys are the numbers themselves. To sort signed integers and other data types, please refer to my previous article.
The keys having less digits (2, 1 and 5) have been extended to the same number of digits than the maximum in the list (two here) by adding zero as MSD. If you don’t understand why, please refer to my second article. It doesn’t change the value of the keys. The challenging part when implementing can be the redistribution step. We have to think about, for a given processor, what information from other processors is required to figure out where to send the local data. It is not complicated, we want the distributed array to be sorted according to one particular key digit (the i’th digit). In our example, the value of a digit is between 0 and 9. We have sorted locally the data on each processor, therefore, each processor knows how many data there are having their i’th key digit equals to a given digit value. By sharing this information with the other processors and getting their information too, we can determine which data each processor has to send and receive. In the example above, all the processors (from the rank 0 to P-1) have first to send their local data having their key LSD equals to 0 to the processor p0 until this one can’t receive data anymore because it is full. There is no such data so we continue with the next digit value which is 1. The processor p0 keeps its data having their key LSD equals to 1, then receive those from the processor p1, and finally those from the processor p3 but it doesn’t have. The processors order really matters and has to be respected. Once done, we repeat the same with the value 2, then 3 and so on until 9. When p0 is full, we continue by filling p1 and so on. Careful, the redistribution step has to be stable too, like the sort used in each iteration of the Radix Sort and for the same reasons. This is why we have said the order matters and has to be respected, otherwise it doesn’t sort.
The algorithm is still correct if we swap the local sort step and the data redistribution step. But, in practice, it is not suitable because often, to send data efficiently, we need the data to be contiguous in memory. Two data having their i’th digit identical will most probably be sent to the same processor, so it is a good point to sort the data locally before.
We still have one iteration to do according to the algorithm. Let’s finish.
We have done the same thing but according to the last digit this time. The algorithm is now finished and, good news, our distributed array is sorted as we wanted.
I have presented here the parallel LSD Radix Sort. But, like the serial, there is a variant called parallel MSD Radix Sort which is the parallelization of the serial MSD. I have implemented both to sort int8_t data type and my performance results were better with my LSD version. This is why I have continued with the LSD to generalize it to other integer sizes. It is also the reason why since the beginning I am focusing on presenting the LSD version and I don’t really go further into detail with the MSD.
MPI is used for communications between processors.
I have used the Counting Sort and not the Bucket Sort because my implementation with the Bucket Sort had an extra charge due to memory management. Indeed, unless making a first loop through the keys to count before moving them into the buckets, we don’t know in advance the length of the buckets. Therefore, each of my buckets was a std::vector and although they are very well implemented and optimized I still had a lack of performance due to memory management. The problem is absolutely not due to the std::vector class, it comes from the fact that on each processor, each bucket has a different size, depending on the key characteristics, and we can’t predict them. So, instead of making a first loop to count and find out the length of each bucket before creating them with appropriate sizes, I opted for the Counting Sort which is finally almost the same because we also count before moving the data. The difference is we don’t use buckets anymore, we use a prefix sum instead.
To do step 3. of the algorithm, I save the local counts from the Counting Sort on each processor and share it with other processors via MPI_Allgather. In that way, each processor knows how many keys having their i’th byte equals to a given value there are on each processor, and from that, it is easy to figure out where to send which data as explained in the Parallel Radix Sort section example (just above this “My implementation” section).
Step 4. is managed using MPI one-sided communications instead of send and receive which are two-sided communications. I have tried with both and most of the time either performances were similar or better with one-sided communications. MPI one-sided communications are more efficient when the hardware supports RMA operations. We can use them in step 4. of parallel Radix Sort because we don’t need synchronization for each data movement, but only once they are all done. The data movements are totally independent in step 4., they can be made in any order as long as we know where each element has to go.
In terms of memory, on each processor, I am using a static array (std::array) of 256 integers for the local counts. In addition to that, I have the input array “duplicated”. The original local array is used to receive the data from other processors in step 4. Thus, at the end, the array is sorted in this same input buffer. The copy is used to store the local sorted data and send them to the right processor. It is possible to implement it without duplicating the array, but a huge amount of communications and synchronizations will be necessary and the communications won’t be independent anymore. In my opinion, it is too much time lost for the memory gain.
As said previously, there are several ways of doing steps 3. and 4. It is also possible for example, to build the global counts across processors (for step 3.) using other MPI collective communications like MPI_Scan. To send the data in step 4. we can also use MPI_Alltoallv instead of one-sided communications but it requires to sort again the data locally after receiving. I tried several alternatives and what I have exposed here is the one giving me the best performances.
As hardware, I have used the ICHEC Kay cluster to run all the benchmarks presented in this section. The framework used to get the execution times is google benchmark. The numbers to sort are generated with std::mt19937, a Mersenne Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits. For each execution time measure, I use ten different seeds (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73) to generate ten different arrays (of a same length) to sort. Then, the execution times mean is registered as the execution time for the length. These ten seeds have been chosen randomly. I proceed like that because the execution time also depends on the data, so it allows us to have a more accurate estimation.
Strong scaling is defined as how the solution time varies with the number of processors for a fixed total problem size.
We can notice that my implementation seems good because, in this case, it is better than std::sort and boost::spreadsort whatever the number of processors used. Plus, it seems to be very close to the perfect scalability. My implementation here is faster than boost::spreadsort even in serial (with only one processor) because when the array is not distributed, I am using ska_sort_copy instead of boost::spreadsort. ska_sort_copy is a great and very fast implementation of the serial Radix Sort. Actually, it is the fastest I have found. This is why it is my reference to compute the perfect scalability.
The problem with the previous graph is the scale. It is difficult to distinguish the perfect scalability and the dimer::sort curves because the times are too low contrary to std::sort execution time. This is why we often use a log-log scale to plot execution times in computer science.
With a log-log scale we can see better what is happening. Now, we are able to tell that until 8 processors, the perfect scalability and dimer::sort curves are the same. It means until 8 processors, theoretically, we can’t do better than what we already have to sort 1GB int8_t. But, from 8 processors it is not the case anymore and it is most probably due to communication overhead because there are more and more processors. Let’s verify it with another graph.
Distribution of execution time
Indeed, when the number of processors increases, the communication part in the total execution time increases too. The local sort step (in blue) contains zero communication. All communications are done during the data redistribution step (in orange). These communications are one MPI_Allgather and several MPI one-sided communications.
Until now, we have seen performance results only for int8_t. It is good but limited because the int8_t value range is too small and not representative of real-world problems and data. What about int32_t which is the “normal” integer size?
int32_t, sorting time as a function of the array length
The performances are also satisfying. Using dimer::sort we go faster than anything else when the array is enough big and this is what we expect when we use HPC tools. For small array sizes, we are not really expecting an improvement because they can be easily well managed serially and enough quickly. Plus, we add extra steps to treat the problem in a parallel way, therefore, when the problem size is small, it is faster serially because we don’t have these extra steps. It becomes faster in parallel when these extra steps are a small workload compare to the time needed to sort serially.
Let’s finish by visualizing my implementation! We easily distinguish the two different steps of the algorithm in this visualization.