#oneGBSortChallenge and complements about Radix Sort [Difficulty: Medium]
Hi dear readers!
Today, as promised, I will present and explain my challenge called oneGBSortChallenge! Then, complements about Radix Sort will be given, in particular how to deal with negative and real numbers but also with other kinds of data types. Lastly, I will explain what I have done so far and expose some performance results about the Radix Sort I have implemented.
If you have not read my previous blog post, Read this if you want your computer to run your applications faster [Difficulty: Easy], I suggest to read it first because this one is the continuation. It will take you only a few minutes and it is interesting!
Now, let’s talk about the challenge!
What is it?
Beat my time record on sorting 1GB of data using MPI and C++. There are Amazon gift cards to win and it is also a good opportunity to practice or discover HPC and MPI. Everything is guided and made simple to allow also beginners to take part. So no hesitation!
How to take part
Go to the challenge’s git and follow the very simple and short instructions. Everything is detailed there.
If you read my blog posts, you should have some hints on how to defeat me…
Back to Radix Sort
If you have paid attention in the previous blog post, when we have presented the Radix Sort, all examples were with positive integers (unsigned integers). What if we want to sort a list containing reals or both positive and negative numbers? Let’s see an example with signed integers.
Sort signed integers
Usually, to implement the Radix Sort, the base 256 is used and one digit corresponds to one byte. The reasons for that have been explained in my previous post. As soon as we work with the binary representation of numbers for the Radix Sort, whatever the base, we encounter the upcoming issue to sort signed integers.
What is the issue?
For the example, in order to make it faster and simple, we will treat the integers in base 16 and we want to sort int8_t numbers (signed integers stored on one byte). In base 16, one digit corresponds to 4 bits. Indeed, if you take 4 consecutive bits (half a byte), the value you will end up with is between 0 and 15 inclusive (between 0 and F within base 16 notation). Base 16 is named hexadecimal or hex often.
One byte is composed of 8 bits, so there are two base 16 digits in one byte. Thus, the Radix Sort d parameter we have seen in the previous post equals 2 here because all the integers we want to sort have two digits in the chosen base. Besides, we will use the Bucket Sort in the example but notice that we have the same issue and the solutions are the same with the Counting Sort. The Bucket Sort is just more handy to illustrate. Below is the list we want to sort.
The first row is our list in base 10, the last row is the same list but represented in binary signed 2’s complement (like they are really represented and stored in a computer). The middle row is the corresponding base 16 representation. According to the Radix Sort algorithm we have seen in the previous post, we have to first sort the elements based on the last digit (Least Significant Digit). Then the result will be again sorted by the second digit which is the last in this case.
The empty buckets are not represented here to save space. This example also allows revising how the Bucket Sort works. As d equals 2, we still have one iteration to do with the next digit (the Most Significant Digit here).
Now the algorithm is finished and our list is not sorted. So, let’s understand why with unsigned integers it works fine but not with signed integers. The MSB (Most Significant Bit) of unsigned integers, not to be confused with the MSD, is entirely part of the number, helps to code the value of the integer, unlike signed integers where it is used to store the sign. And this makes the difference. If we pay attention, in the final list above, we have the positive integers first and then the negative ones and, whatever the length of the list, we always will have this partition. This is due to the value of the MSB because 0 is used as MSB for positives and 1 for negatives. As we always treat each digit as an unsigned integer, the MSD value of negative integers will always be greater than those of positive ones because of the MSB. Thus, in the final iteration, when we sort according to the MSD, positive integers will always come before the negatives as the value of their MSD is lower. The very good point is that if we look only at the positive section of the list, it is sorted and the same with the negative section. So, the only thing we have to do is to find a way to swap these two parts and we will end up with the desired sorted list. There are several ways to achieve this.
How to fix it?
We could for example loop through the final list and find at which index the negative section begins and then proceed to multiple swaps. However, this solution is not suitable because it requires at least one more loop through the entire list and, as a consequence, increases the complexity of the sort.
A more clever and suitable solution is to sort the list still with the Radix Sort but, according to another key. Until now, we have sorted the numbers according to their bitwise representation, which is perfectly correct and fine for unsigned integers but not for the signed. Plus, the main inconvenient of the Radix Sort is that it can sort only data having a bitwise representation according to how it proceed. Thus, to sort signed integers, we have to find a suitable binary representation of signed integers such that the Radix Sort, executed using that representation as key to sort, ends up with the correct sorted list. Looks maybe complex but actually, it is fairly simple. If we take the bitwise representation of the numbers, invert their MSB and use the output as a key to sort, our problem is solved. The reason is straightforward, we have seen just above, it is because the MSB of negative integers equals 1 and those of positives equals 0 that there is this partition at the end. By inverting the value of this bit, we invert the final placement of the positive and negative sections in the list. Let’s go back to our example to illustrate.
We still want to sort the same list, the numbers are unchanged and we can read them in the first row. The last row contains the keys as explained and the middle row contains the keys within base 16 representation. As the LSD of the numbers have not changed, the result of the first Bucket Sort will be the same as previously (Figure 2). Thus, we can take this result (Figure 2) to continue the Radix Sort with the second and last Bucket Sort.
This time the list is sorted. In practice, there are several ways to invert the MSB of a signed integer and it can also depend on the programming language. Unfortunately, we are not going to go further into details with this aspect today because otherwise, the post would be too long. However, the issue has been highlighted and a solution has been provided.
There are other possibilities to face the problem. I have presented this one because it is the first I have found, succeeded to implement and for me, the simplest. Plus, this solution doesn’t need an additional loop through the list because we can extract the keys when needed during the Bucket Sort (or Counting Sort) via a function dedicated to that. In that way, no need to store them neither. We can extract or get the key of a given signed integer in a negligible constant time so, we are not increasing the complexity.
Careful! If we use this process to sort also unsigned integers, it doesn’t work. By doing that with unsigned integers, we are losing information about the value of the numbers because the MSB is not a sign bit anymore. The keys with unsigned integers have to be their binary representation without any transformation. In other words, the numbers themselves.
Ok, now we know how to sort signed and unsigned integers, but what about reals and other kind of data types?
Sort reals and other data types
Explain in detail how to sort float and double using the Radix Sort would be too much for the purpose here. That is not really complex but requires knowledge about floating-point representation, which is not so straightforward. It is a bit more complex and tricky than with integers but the idea remains the same concerning the Radix Sort. Still a question of dealing and playing around with binary representations.
Sort other data types like dates or hours or whatever, is simple with comparison-based sorting algorithms. We can implement a predicate which takes two instances of our data type and returns if or not the first one is greater than the second. Then we provide this predicate as entry to the sort function accompanied by the list to sort and it is done. But, we can’t generalize that to non-comparison based sorts because, by definition, they don’t compare the elements between themselves. Besides, as said, the Radix Sort needs a binary representation to sort. This is why the Radix Sort has a limited scope, this constraint can be embarrassing. Dates and hours, for example, don’t have a native binary representation, they are not native data types in computers. Often, we use strings or two to three integers to represent and store them. So, to sort them using the Radix Sort, the rule is the following. We have to find a suitable binary representation of our data type such that the Radix Sort, executed using that representation as key to sort, ends up with the desired sorted list. By binary representation here, we have to understand integers, strings or reals because they are the native data types and they have a native binary representation. Finally, in that case, instead of providing a predicate to the sort, we can provide a function allowing to get the binary representation of an instance of the data type. This function is often called a functor or a callback.
If you want more details about sorting float, double, other data types and the Radix Sort in general, I recommend this YouTube video C++Now 2017: M. Skarupke “Sorting in less than O(n log n): Generalizing and optimizing radix sort”.
What I have done until now
I have started first to play around and implement some serial versions of the Radix Sort to be familiar with the algorithm and its variants. The goal was not to implement an optimize serial Radix Sort ready-to-use because we are not going to reinvent the wheel. Indeed, there already are some very good libraries with that, like the boost library and its spreadsort function. I will use one of them when needed. I remind that my project is to implement a reusable C++ library with a clean interface that uses Radix Sort to sort a distributed (or not) array using MPI (if distributed). So, when the array is not distributed I can use, for example, the boost’s spreadsort. My project focuses on distributed arrays using MPI and, so far, I have a great parallel version able to sort all integer types. It is my third parallel version, I have started with a simple version and then optimized it little by little until this third version. My current version also allows the users to provide a function returning a bitwise representation of their data to sort other data types than the native ones. Since for now it can sort only integers, the bitwise representation should be in the form of an integer, but this already offers flexibility and a huge range of possibilities. Besides, I have also profiled my versions and compared them to the std::sort with more than 2 000 nodes using the ICHEC Kay cluster. We will talk about that in the next post!
See you soon!