## Blog: Mr. Mathiesen

Ramón Soto Mathiesen and Johan Thillemann
4th MirageOS hack retreat, Marrakesh
rsm λ spisemisu . com and joe λ celo . io

Abstract

Let A be an array of unique numbers of size n, where the minimum and maximum numbers in the array are min and max respectively.

In this blog post we introduce a parallel algorithm, bfsort, based on Bloom Filters, that can sort the array A in O(log₂ n) time and O(m) space where m is the size of: { min, min + 1, min + 2, … , max - 1, max }, while the sequential version, can sort the array A in O(m) time and O(n) space. We argue that if merge operations of empty arrays that could be stored in CPU-cache, given unlimited cores, then the memory overhead wouldn’t have that big of an impact on the parallel version of the algorithm.

As the algorithm is based on Bloom Filters, both the parallel and sequential versions sort the array A with a collision probability of 0.00001%. The algorithm is parameterized on the collision probability, which can be adjusted for optimal results based on the size of the array.

### 1. Introduction

In this section, we will describe a few data structures and concepts that will ease the understanding of the following sections:

• Sorting a sequence of values could be defined with the following constraint: { a₀ ≤ a₁ ≤ … ≤ aₙ } where a number is less or equal to it’s successor in the sequence. When working with sorting algorithms, it’s important to understand the underlying data structure that stores the values. Arrays allow additions in linear time, as a new array +1 needs to be allocated on another spot in memory, while indexed access are in constant time. Lists allow constant time additions and linear time lookups.

• Bloom Filter is a space-efficient data structure as it saves the necessary information in bits, see Figure 0, which differs from HashTables which will need at least a boolean, normally implemented as a byte, where only one bit is set and the rest are unused, to store the information. With regard to test if an element is in the set, the Bloom Filter can only warranty that an element definitely is not in set while if an element is in the filter, it still has to make a lookup on the underlying data structure. There is a probability that an element is in the filter but it isn’t in the underlying data structure, false positives. The false positives are due to the set of hashes can set the same bit for different words. This is why it’s very important to choose the most space-efficient size of bits, m, in relation to the probability of false positives, p, while choosing the optimal number of hash functions, k, that uniformly distribute the values across the filter [cao]: Figure 0: Example of a Bloom filter by [eppstein], 15 August 2007.
• Asymptotic performance measures:

• DAG: A directed acyclic graph, G = (V,E), in this blog post represents the computations of an algorithm. The vertices, V, represent the instructions to be computed and the edges, E, represent the dependencies between the instructions. If there is a edge, (u,v) ∈ E, the instructions in the vertex v cannot be executed before the instructions in the vertex u.

• Work: The total time of computing all instructions of the algorithm executed on a computer with a single processor, see Figure 1. Work can be mapped to the DAG as all the vertices, V, in the graph. The work law states that if k work can be done in Tₖ(n) time on a computer with k processors and the amount of work is only T(n), where k · Tₖ(n) ≥ T(n), then the execution time of the algorithm with k processors, is limited to the amount of work divided by the amount of processors.

• Span: The execution time of the longest path in a DAG, also know as critical path, see Figure 1. The asymptotic time complexity of the algorithm is dominated by the critical path. The span law is defined as follows: no parallel computer with a limited amount of k processors can execute a parallel algorithm faster than a computer with an unlimited amount of cores. The computer with unlimited amount of processors can always just execute the parallel algorithm with k processors: Tₖ(n) ≥ T∞(n).

### 2. Sorting with Bloom Filters

As mentioned in the previous section when describing the Bloom Filters, in order to uniformly distribute the values across the filter, we must find the optimal relation between bits, probability of false positives and number of hash functions.

For this algorithm, the primary concern is the probability of false positives and not so much the space efficiency or the number of hash functions. For this reason, we choose a very low probability as default value, 0.0000001, even though it means that the amount of bits and hash functions will be slightly bigger in size than normal. Even though it’s a very low probability number, the chosen number ensures that the size of the underlying array in the Bloom Filters doesn’t becomes larger than the length of the array, which could be translated in term of asymptotic complexity as the algorithm would use n extra space, where n is the length of the array.

Once we have decided the parameters for the Bloom Filters, we can then instantiate it and be ready to populate it with the values of the array.

While we populate the Bloom Filter with the values in linear time. It’s worth noticing that even though we calculate several hashes and perform several updates on the Bloom Filter, those operations are still in constant time. We can also find the minimum and maximum values in the array, also in linear time, more specifically in the same traversal of the array as the one that performs the updates in the Bloom Filters.

After the bits of the Bloom Filter are populated with the values of the array, we can now begin to execute the part of the algorithm that will produce a sorted array.

As we have found the minimum and the maximum values in the array, we can now recursively iterate through all the numbers in the range from the minimum number to the maximum number.

The technique we use is to exclude numbers which we definitely know are not in the set. And because we have chose a very little probability of false positives, we can easily argue that numbers that aren’t excluded, most likely are in the initial array.

Therefore we can state that the work is in Θ(m), where m is the length of the interval from the minimum value to the maximum value. As this is a sequential algorithm, the span will also match the work and therefore be in O(m).

It’s worth noticing that we only need n extra space, where n is the length of the array to be sorted.

For an example of the algorithm applied to a list of bytes, please see Figures 2 - 6.

### 3. Parallelizing the algorithm

Readers familiar with Bloom Filters will recognize that there are certain operations in the algorithm described in the previous section, that can be parallelized.

In this section we will write them down and show how we can lower the span, critical path, if an a computer with an unlimited amount of cores is provided and a fork/join approach is used, see pseudocode from Figure 7, which will reduce the global asymptotic time complexity of the algorithm:

#### 3.1 Initialization

In the sequential algorithm, this step is done in linear time by iterating over each of the elements of the array.

• Allocate underlying Bloom Filter array: Allocating the array in memory can be done in constant time while populating it with initial zero values can be done in logarithmic time, O(h), where h = log₂ n, when using a similar parallel algorithm as described in the pseudocode from Figure 7 in this section.

• Seeds and Hashes: As the size of the bits of the underlying Bloom Filter array much greater-than the size values of the seeds and the hashes, therefore none of these operations will have an impact on the time complexity. It’s worth noticing that the random generation of seeds as well as the instantiation of the hashes based on the seeds can be done in parallel as well.

The new span for this step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the array.

#### 3.2 Populating

In the sequential algorithm, this step is done in linear time by iterating over each of the elements of the array.

• Updating the Bloom Filter with the hash values: As described in the sequential algorithm, even though we calculate several hash functions for each value, these operations don’t change the asymptotic time complexity of this step as the operations are in constant time. Therefore, as we are able to parallelize the update of the Bloom Filter with an approach as described in pseudocode from Figure 7 in this section. The new span for this step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the array. It’s worth noticing that Bloom Filter are built on top of the or operation, which is the most suitable operation to be parallelized. As we don’t work with bytes we need to ensure that in order to ensure no race conditions, an architecture with support for concurrent atomic CPU operations [atomic], must be chosen.

• Minimum and maximum: As above, we can also easily calculate these values by storing locally the value of the comparison between two numbers and bobble them all the way to the root of the recursive tree. The new span for this step is reduced from O(n) to O(h) where h = log₂ n is the length of the array. If the architecture supports concurrent atomic CPU operations, then the extra space needed can be reduced to O(1).

As both steps are reduced from O(n) to O(h) where h = log₂ n is the length of the array, we can conclude that the new span for this step is also reduced from O(n) to O(h).

#### 3.3 Sorting

In the sequential algorithm, the span, O(m), is bound to the work which is in Θ(m), where m is the length of the interval from the minimum value to the maximum value.

• Querying the Bloom Filter with the hash values We can easily parallelize the queries of the values contained in interval from the minimum value to the maximum value with a fork/join approach, reducing the span from O(n) to O(h’) where h’ = log₂ m. In order to make this value more understandable, we will showcase the following example with 32-bit unsigned integers. Lets assume that our list contains some 32-bit uints where the minimum value equals 0 and the maximum value equal 4.294.967.295 (2³² - 1). Then our span would be O(32) as log₂ 4.294.967.295 = 32.

• Merging Once we know if a value is definitely not in set then we can return an empty array for that case while for the other case we will return an array of size one containing that element. The merge process will consist of combining two arrays into one and copying the elements, in parallel into the newly created array, with the same approach of fork/join as mentioned in this section. The initialization and merge of the top array will be bounded to n, where n is the size of the array to sort, therefore reducing the span of this process O(lg n) as we can create a new array in constant time and populate it in O(h) where h = log₂ n.

As the second step dominates the first, we can argue that we have reduced from O(m) to O(h) where h = log₂ n is the length of the array, we can conclude that the new span for this step is also reduced from O(n) to O(h).

Sadly, we will have an overhead in memory usage as we are storing empty arrays, for at most m values. But, since the merge of empty arrays, which are minimalistic in size, could be stored on the unlimited cores CPU-cache, then the memory overhead wouldn’t have that big of an impact on the algorithm.

#### 3.4 Summary

By combining the time complexities from the different steps:

Tₖ(n) = O(h) + O(h) + O(h) = 3 · O(h) = O(h)

where h = log₂ n and n is the length of the array. Therefore, we can conclude that the work is Θ(m) where m is the length of the interval from the minimum value to the maximum value in the array to be sorted and the span is in logarithmic time.

As mentioned in the previous section, there will be an overhead in memory usage as we are storing empty arrays for at most m values. If the merge operations of empty arrays, which are minimalistic in size, could be stored on each of the unlimited cores CPU-cache, then the memory overhead wouldn’t have that big of an impact on the algorithm.

### 4. Implementation

The algorithms are implemented in F#, which could easily be described as the OCaml implementation for .NET, as we already had a working native version of the Bloom Filter data structure. Here are a few implementation details that are worth noticing:

The hash algorithm chosen is MurmurHash3 [smhasher] as it has support for a seed which can be used to generate different hashes for the same key by re-using the same algorithm. In order to get a better seeds and hereby ensuring a more uniform distribution of values across the filter, see Figure 8 for a graphical representation, the RNGCryptoServiceProvider is used for random number generation instead of System.Random. See Hash and Random modules for more information. Figure 8: Graphical representation of values uniformly distributed across the filter.

Usage of Interlocked.CompareExchange [interlocked] which allows concurrent atomic CPU operations to perform updates on data values with minimal overhead, see Figure 9. In order to use these operations with the space-efficient data structure, we will need to instantiate our underlying array with ref ‘a values. In case the reader is not convinced that the minimum and maximum numbers can be found in parallel with these operations, it can easily be refactored to a binary tree of height h = log₂ n by using a (fork/join) approach, where the value of the comparisons are stored locally. It’s easier to see that the span, critical path, is also O(lg n) time but with a bit more space overhead O(n) as described in the previous section. Please look into the Atomic and BloomFilter modules for more information. Figure 9: Several processors reading from the same shared memory but only one can write atomically in a clock cycle.

In the parallel versions of the algorithm, by using true parallelism Array.Parallel (system threads) compared to using concurrency with Async.StartChild (lightweight threads) in a fork/join approach, shows how there is an extra memory overhead in the first as the operations require to allocate all the values in the range from minimum to maximum in an array in order to filter out values that are definitely not in set, while in the second approach there is less memory overhead as we can use the same approach as described in the previous paragraph when finding in parallel the minimum and maximum numbers in an array to merge lists by using lazy list that allows concatenation in constant time O(1). See the BloomFilter and List.Lazy modules for more information.

A full implementation can be seen in Appendix A: Source Code.

### 5. Concluding remarks

We have presented both a sequential as well as a parallel algorithm that uses the Bloom Filter data structure as an underlying component in order to sort an array with a very small collision probability of 0.00001%, which can be incremented and decremented for optimal results based on the size of the array.

We argue that the sequential algorithms asymptotic complexity is in Θ(m) time and space as the work is in Θ(m), where m is the length of the interval from the minimum value to the maximum value. As this is a sequential algorithm, the span will also match the work. Also it’s worth noticing that we only need n extra space, where n is the length of the array to be sorted.

And we argue that the parallel algorithms asymptotic complexity is in logarithmic time time as the dominating operation is bounded to the span O(lg n) where h = log₂ n is the length of the array. We also argue that the work is is Θ(m) where m is the length of the interval from the minimum value to the maximum value in the array and therefore we will need m extra space.

#### 5.1 Notes:

• In order to ease the understanding of the span for the parallel algorithm, the following example was used: Lets assume that our list contains some 32-bit uints where the minimum value equals 0 and the maximum value equal 4.294.967.295 (2³² - 1). Then our span would be O(32) as log₂ 4.294.967.295 = 32.

• In order to limit the amount of work, it would be ideal to use this sorting algorithm with data that is to not scattered, as the amount of work is defined by the interval from minimum and maximum values of the array.

• As this is a probabilistic sorting algorithm, you will have a small chance that the given array is not sorted after executing the algorithm. Therefore this algorithm shouldn’t be used critical operations.

• It can easily be shown if an array is sorted in O(lg n), where n is the size of the array, by using the fork/join parallel approach mentioned in section 3. Parallelizing the algorithm, where you just need to ensure that the head of the merged leaf array is lower or equal to the head of the right array and bubble that boolean value up to the root of the tree. In case the array is not sorted, the sorting algorithm could be executed again but by providing an even lower probability which would result in a bit more of memory overhead and work.

 [bloom] Bloom, Burton H. (1970), “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692 [eppstein] David_Eppstein/Gallery https://commons.wikimedia.org/wiki/User:David_Eppstein/Gallery Accessed: 2017-12-03 [cao] Bloom Filters - the math http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html Accessed: 2017-12-03 [smhasher] SMHasher is a test suite designed to test the distribution, collision, and performance properties of non-cryptographic hash functions. https://github.com/aappleby/smhasher Accessed: 2017-12-03 [atomic] Atomic Operations. https://software.intel.com/en-us/node/506090 Accessed: 2017-12-03 [interlocked] Interlocked Operations. https://docs.microsoft.com/en-us/dotnet/standard/threading/interlocked-operations Accessed: 2017-12-03

### Bibtex

If you want to cite this blog post, you can use the following BibTeX information: