Ramón Soto Mathiesen and Johan Thillemann
4th MirageOS hack retreat, Marrakesh
rsm λ spisemisu . com and joe λ celo . io


Abstract

Let A be an array of unique numbers of size n, where the minimum and maximum numbers in the array are min and max respectively.

In this blog post we introduce a parallel algorithm, bfsort, based on Bloom Filters, that can sort the array A in O(log n) time and O(m) space, where m is the size of { min, min + 1, min + 2, … , max - 1, max }, while the sequential version can sort the array A in O(m) time and O(n) space. We argue that if the merge operations of empty arrays could be kept in CPU cache, given unlimited cores, then the memory overhead wouldn't have that big of an impact on the parallel version of the algorithm.

As the algorithm is based on Bloom Filters, both the parallel and sequential versions sort the array A with a collision probability of 0.00001%. The algorithm is parameterized on the collision probability, which can be adjusted for optimal results based on the size of the array.

1. Introduction

In this section, we will describe a few data structures and concepts that will ease the understanding of the following sections:

  • Sorting a sequence of values can be defined by the following constraint: { a₀ ≤ a₁ ≤ … ≤ aₙ }, where each number is less than or equal to its successor in the sequence. When working with sorting algorithms, it's important to understand the underlying data structure that stores the values. Arrays allow additions in linear time, as a new array of size n + 1 needs to be allocated at another spot in memory, while indexed accesses are in constant time. Lists allow constant-time additions and linear-time lookups.

  • Bloom Filter is a space-efficient data structure as it stores the necessary information in bits, see Figure 0, in contrast to a HashTable, which needs at least a boolean, normally implemented as a byte where only one bit is set and the rest are unused, to store the same information. With regard to testing if an element is in the set, the Bloom Filter can only guarantee that an element definitely is not in the set; if an element appears to be in the filter, a lookup on the underlying data structure is still required, as there is a probability that an element is in the filter but not in the underlying data structure: a false positive. False positives occur because the set of hash functions can set the same bits for different words. This is why it's very important to choose the most space-efficient size in bits, m, in relation to the probability of false positives, p, while choosing the optimal number of hash functions, k, that uniformly distribute the values across the filter [cao]: m = -n · ln p / (ln 2)² and k = (m / n) · ln 2.

Figure 0: Example of a Bloom filter by [eppstein], 15 August 2007.
  • Asymptotic performance measures:

    • DAG: A directed acyclic graph, G = (V,E), in this blog post represents the computations of an algorithm. The vertices, V, represent the instructions to be computed and the edges, E, represent the dependencies between the instructions. If there is an edge, (u,v) ∈ E, the instructions in vertex v cannot be executed before the instructions in vertex u.

    • Work: The total time to compute all instructions of the algorithm executed on a computer with a single processor, see Figure 1. Work can be mapped to the DAG as all the vertices, V, in the graph. The work law states that a computer with k processors can do at most k units of work per time step, so if the total amount of work is T(n), then k · Tₖ(n) ≥ T(n): the execution time of the algorithm with k processors is bounded from below by the amount of work divided by the number of processors.

    • Span: The execution time of the longest path in the DAG, also known as the critical path, see Figure 1. The asymptotic time complexity of the algorithm is dominated by the critical path. The span law states that no parallel computer with a limited number of k processors can execute a parallel algorithm faster than a computer with an unlimited number of processors, since the latter can always just execute the parallel algorithm with k of its processors: Tₖ(n) ≥ T∞(n).

Figure 1: Representation of sequential vs parallel computations as DAGs.

2. Sorting with Bloom Filters

As mentioned in the previous section when describing the Bloom Filters, in order to uniformly distribute the values across the filter, we must find the optimal relation between bits, probability of false positives and number of hash functions.

[|136; 165; 6; 149; 68; 30; 2; 187; 106; 1; 219; 90; 119; 14; 72; 124; 222; 25;
179; 190; 237; 118; 177; 162; 146; 199; 195; 182; 0; 234|]
Figure 2: Random array of 31 unique bytes.

For this algorithm, the primary concern is the probability of false positives and not so much the space efficiency or the number of hash functions. For this reason, we choose a very low probability as the default value, 0.0000001, even though it means that the number of bits and hash functions will be slightly larger than normal. Even though it's a very low probability, the chosen value ensures that the size of the underlying array in the Bloom Filter doesn't become larger than the length of the array, which can be translated, in terms of asymptotic complexity, to the algorithm using n extra space, where n is the length of the array.

(31u, 1e-07, 23,
[|2632944486u; 176026093u; 2853052920u; 534382327u; 2173353107u; 840630065u;
4181145030u; 878918235u; 33641635u; 1504134878u; 136075251u; 3384303498u;
2194677699u; 768245312u; 3208224496u; 2410081556u; 219847018u; 2055198962u;
3112189164u; 4023218364u; 1356830802u; 3285656209u; 449204162u|])
Figure 3: Info parameters { number of bits: 32 * 32, probability of collision: 0.00001%, number of hashes: 23 and list of hashes: ... }.
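The parameters in Figure 3 can be reproduced with a short sketch. The following Python transcription of the parameter selection is a hedged sketch, not the implementation itself (the actual code is the F# `init` function in Appendix A); the rounding of the bit count down to a whole number of 32-bit words matches it:

```python
import math

def bloom_parameters(n, p):
    """Optimal Bloom Filter parameters for n elements and
    false-positive probability p [cao]."""
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    m = int(-(n * math.log(p)) / (math.log(2) ** 2))
    # Round down to a whole number of 32-bit words.
    m = (m >> 5) << 5
    # Optimal number of hash functions: k = (m / n) * ln 2
    k = math.ceil((m / n) * math.log(2))
    return m, k

# 31 unique bytes with the default probability of 1e-7:
m, k = bloom_parameters(31, 1e-7)  # 1024 bits (32 words), 23 hashes
```

For the array of Figure 2 this yields exactly the 32 words of 32 bits shown in Figures 4 and 5 and the 23 seeded hashes of Figure 3.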

Once we have decided the parameters for the Bloom Filters, we can then instantiate it and be ready to populate it with the values of the array.

[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
[|"□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"; "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□"|]
Figure 4: Empty instantiated Bloom Filter.

We populate the Bloom Filter with the values in linear time. It's worth noticing that even though we calculate several hashes and perform several updates on the Bloom Filter per value, those operations are still in constant time. We can also find the minimum and maximum values in the array in linear time, more specifically in the same traversal of the array as the one that performs the updates on the Bloom Filter.

[|"■□□■■□□□□□■□■■■□□□■■□■□□□□□□■□□□"; "■■□□□□□□□□□■■□■■□■■□□■□■□■□■■■□□"|]
[|"■□□□□□□□□□■□■□■■□□□□■■■□□■■■□□■□"; "■□■■■■□□■■□■□■■■■□□■□■■□□■■□■□■□"|]
[|"□□■■□□■■■■■■■□■□□■□■■■□■■□■□■■□■"; "□■■□■□□□■□■□□□□□■■□□■□□■■■□□□□□□"|]
[|"■□■□■□■■□□■■■□□□■□■□■□■■■□□□■■□■"; "□■□□□□□■■□■□■■□■□□■■■□■□□□■■■■■■"|]
[|"□□□■■□□□□□□■□□■□■□■□■□■■□□■■□■□□"; "□□□□□■□■□■□■□■□■□□■■■■■□■□■■■□■■"|]
[|"■□■□□□■□□■□■■□■■□□□□■■■□■■■□■□■□"; "□■■■□■□■□■□□□□□■■□□□□■■□□■■□□□□■"|]
[|"□□□□■□□□□□□□□□□□■□□□□■□■■■□□■■□■"; "■□□■□□■■■□□■■□□■□□□■■□□■□■■■■□■□"|]
[|"□■□□■■□■■■■■■□■■■■□□□■□■□■□■■□■□"; "■■■□□□■□■■□■■□■■■■□□■■■□□■□□■□□■"|]
[|"■□□■■■■□■■■□■□■■□■■■■□□■■□■□■■■□"; "■□■□■■□■□■■■□□■■□■□□■■□■■■■□□□□□"|]
[|"■■■■□□□■□■■■□□■■■■□■■■■■■■□□■■□□"; "□■□■■■■■□□■□■■■□■■□□□■■□□□□■■□■■"|]
[|"□■□■□□□□■□□□■■□■□■□□■■■□■□■□■■■□"; "■□□□□□■■□■■■□□□■□■□□■□■■□■■□■□□■"|]
[|"□□■■■□□□□■□□■□■□■□■■□□■□□□■□■□■□"; "□□□□□□■□□□□■□■□■□□■■■□■□□■□□□■□□"|]
[|"□■□■■□□□■■□■■■■□□□■□■□□□□■□□□■■□"; "■□□■□□□□■■■■□■□□■□■■□■□□□■■■□□□□"|]
[|"□■□■■□□■■■■■■■□■■■■■■■□□■■□■□□□□"; "■■■■■□■□■□■□■■■□□■□■□■□■■□□■□□■□"|]
[|"□□□■■■□■□■■□■□■■□□■□□□□□■■■■□■■■"; "□■□■■□□□■□□□□□□□■□■□■■□■□□■■■□□□"|]
[|"■□■■■□■■■■□■□□□□■■■□□□■■□□□□■■□■"; "□■□■■■□□■□□□■■■□□■■□■■□■■□■□□■■■"|]
Figure 5: Bloom Filter bits populated with the unique bytes from the array.
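The populating pass can be sketched as below. This is a hedged Python sketch, not the actual implementation: the real code uses seeded MurmurHash3, while here the k hash functions are derived from SHA-256 with a seed prefix, and m and k are taken as given:

```python
import hashlib

def make_hashes(k, m):
    """k hash functions mapping a key to a bit index in [0, m).
    (A SHA-256 stand-in for the seeded MurmurHash3 of the real code.)"""
    def h(seed):
        return lambda key: int(
            hashlib.sha256(f"{seed}:{key}".encode()).hexdigest(), 16) % m
    return [h(s) for s in range(k)]

def populate(xs, m, k):
    """One linear traversal: set k bits per value and track the
    minimum and maximum in the same pass."""
    bits, hashes = bytearray(m // 8), make_hashes(k, m)
    lo, hi = xs[0], xs[0]
    for x in xs:
        lo, hi = min(lo, x), max(hi, x)
        for h in hashes:
            i = h(str(x))
            bits[i // 8] |= 1 << (i % 8)
    return bits, hashes, lo, hi
```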

After the bits of the Bloom Filter are populated with the values of the array, we can now begin to execute the part of the algorithm that will produce a sorted array.

As we have found the minimum and the maximum values in the array, we can now recursively iterate through all the numbers in the range from the minimum number to the maximum number.

The technique we use is to exclude numbers which we definitely know are not in the set. And because we have chosen a very small probability of false positives, we can easily argue that numbers that aren't excluded most likely are in the initial array.

[|0; 1; 2; 6; 14; 25; 30; 68; 72; 90; 106; 118; 119; 124; 136; 146; 149; 162;
165; 177; 179; 182; 187; 190; 195; 199; 219; 222; 234; 237|]
Figure 6: Array is sorted after algorithm execution.

Therefore we can state that the work is in Θ(m), where m is the length of the interval from the minimum value to the maximum value. As this is a sequential algorithm, the span matches the work and is therefore also in O(m).

It’s worth noticing that we only need n extra space, where n is the length of the array to be sorted.

For an example of the algorithm applied to a list of bytes, please see Figures 2 - 6.
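Putting the pieces together, a complete sequential sketch in Python might look as follows. It is hedged: the hash functions are SHA-256 stand-ins for the seeded MurmurHash3 of the real implementation, and the parameter formulas are the ones from [cao]:

```python
import hashlib
import math

def bfsort(xs, p=1e-7):
    """Sequential bfsort sketch: populate a Bloom Filter with xs in O(n),
    then scan the range [min, max] in O(m) keeping every value the
    filter may contain."""
    n = len(xs)
    m = (int(-(n * math.log(p)) / (math.log(2) ** 2)) >> 5) << 5
    k = math.ceil((m / n) * math.log(2))
    hashes = [(lambda s: lambda key: int(
        hashlib.sha256(f"{s}:{key}".encode()).hexdigest(), 16) % m)(s)
        for s in range(k)]
    bits = bytearray(m // 8)
    lo, hi = min(xs), max(xs)
    for x in xs:                        # populate, O(n)
        for h in hashes:
            i = h(str(x))
            bits[i // 8] |= 1 << (i % 8)
    def member(v):                      # query, constant time per value
        return all((bits[h(str(v)) // 8] >> (h(str(v)) % 8)) & 1
                   for h in hashes)
    # Scan the interval, O(m'), where m' = hi - lo + 1.
    return [v for v in range(lo, hi + 1) if member(v)]
```

Being probabilistic, the result is only sorted up to the chosen false-positive probability, as discussed in the concluding remarks.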

3. Parallelizing the algorithm

Readers familiar with Bloom Filters will recognize that there are certain operations in the algorithm described in the previous section that can be parallelized.

let rec aux f lower high = parallel {
  if lower < high then
    let mid = (lower + high) >>> 1
    let! fork = spawn <| aux f lower mid
    let! main = aux f (mid+1) high
    let! join = fork
    return Util.merge join main
  else
    return if f lower then [lower] else []
}
Figure 7: Pseudocode of a fork/join implementation in F#, an ML-family programming language.

In this section we will write them down and show how we can lower the span, the critical path, if a computer with an unlimited number of cores is provided and a fork/join approach is used, see the pseudocode in Figure 7, which will reduce the global asymptotic time complexity of the algorithm:

3.1 Initialization

In the sequential algorithm, this step is done in linear time by iterating over each of the elements of the array.

  • Allocate underlying Bloom Filter array: Allocating the array in memory can be done in constant time, while populating it with initial zero values can be done in logarithmic time, O(h), where h = log₂ n, using a parallel algorithm similar to the pseudocode in Figure 7.

  • Seeds and Hashes: As the size in bits of the underlying Bloom Filter array is much greater than the size of the seeds and the hashes, none of these operations will have an impact on the time complexity. It's worth noticing that the random generation of the seeds, as well as the instantiation of the hashes based on the seeds, can be done in parallel as well.

The new span for this step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the array.

3.2 Populating

In the sequential algorithm, this step is done in linear time by iterating over each of the elements of the array.

  • Updating the Bloom Filter with the hash values: As described for the sequential algorithm, even though we calculate several hash functions for each value, these operations don't change the asymptotic time complexity of this step as they are in constant time. Therefore, we are able to parallelize the update of the Bloom Filter with the approach described in the pseudocode in Figure 7. The new span for this step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the array. It's worth noticing that Bloom Filters are built on top of the or operation, which is the most suitable operation to be parallelized. As we don't work with single bytes, in order to ensure no race conditions, an architecture with support for concurrent atomic CPU operations [atomic] must be chosen.

  • Minimum and maximum: As above, we can also easily calculate these values by storing locally the result of the comparison between two numbers and bubbling them all the way to the root of the recursion tree. The new span for this step is reduced from O(n) to O(h) where h = log₂ n and n is the length of the array. If the architecture supports concurrent atomic CPU operations, then the extra space needed can be reduced to O(1).

As both steps are reduced from O(n) to O(h), where h = log₂ n and n is the length of the array, we can conclude that the new span for this phase is also reduced from O(n) to O(h).
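The min/max step can be sketched as a fork/join reduction. The following is a minimal Python sketch, written sequentially: the two recursive calls are independent, so with unlimited cores they run concurrently and the recursion depth log₂ n bounds the span:

```python
def minmax(xs, lo, hi):
    """Fork/join minimum and maximum over xs[lo..hi]: each node compares
    the results of its two halves and bubbles them towards the root."""
    if lo == hi:
        return xs[lo], xs[lo]
    mid = (lo + hi) >> 1
    lmin, lmax = minmax(xs, lo, mid)      # fork: left half
    rmin, rmax = minmax(xs, mid + 1, hi)  # right half, independent
    return min(lmin, rmin), max(lmax, rmax)  # join: bubble to the root
```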

3.3 Sorting

In the sequential algorithm, the span, O(m), is bound to the work which is in Θ(m), where m is the length of the interval from the minimum value to the maximum value.

  • Querying the Bloom Filter with the hash values: We can easily parallelize the queries of the values contained in the interval from the minimum value to the maximum value with a fork/join approach, reducing the span from O(m) to O(h′) where h′ = log₂ m. To make this value more tangible, consider the following example with 32-bit unsigned integers. Let's assume that our list contains some 32-bit uints where the minimum value equals 0 and the maximum value equals 4,294,967,295 (2³² - 1). Then our span would be O(32) as log₂ 4,294,967,295 ≈ 32.

  • Merging: Once we know that a value is definitely not in the set, we can return an empty array for that case, while for the other case we return an array of size one containing that element. The merge process consists of combining two arrays into one and copying the elements, in parallel, into the newly created array, with the same fork/join approach as mentioned in this section. The initialization and merge of the top array is bounded by n, where n is the size of the array to sort, therefore reducing the span of this process to O(lg n), as we can create a new array in constant time and populate it in O(h) where h = log₂ n.

As the second step dominates the first, we can argue that the span is reduced from O(m) to O(h), where h = log₂ n and n is the length of the array, and conclude that the new span for this phase is also in O(h).

Sadly, we will have an overhead in memory usage as we are storing empty arrays for at most m values. But since the merges of empty arrays, which are minimal in size, could be kept in the CPU caches of the unlimited cores, the memory overhead wouldn't have that big of an impact on the algorithm.

3.4 Summary

By combining the time complexities from the different steps:

Tₖ(n) = O(h) + O(h) + O(h) = 3 · O(h) = O(h)

where h = log₂ n and n is the length of the array. Therefore, we can conclude that the work is Θ(m) where m is the length of the interval from the minimum value to the maximum value in the array to be sorted and the span is in logarithmic time.

As mentioned in the previous section, there will be an overhead in memory usage as we are storing empty arrays for at most m values. If the merge operations of empty arrays, which are minimalistic in size, could be stored on each of the unlimited cores CPU-cache, then the memory overhead wouldn’t have that big of an impact on the algorithm.

4. Implementation

The algorithms are implemented in F#, which could easily be described as the OCaml implementation for .NET, as we already had a working native version of the Bloom Filter data structure. Here are a few implementation details that are worth noticing:

The hash algorithm chosen is MurmurHash3 [smhasher] as it supports a seed, which can be used to generate different hashes for the same key while re-using the same algorithm. In order to get better seeds, and thereby ensure a more uniform distribution of values across the filter, see Figure 8 for a graphical representation, the RNGCryptoServiceProvider is used for random number generation instead of System.Random. See the Hash and Random modules for more information.

Figure 8: Graphical representation of values uniformly distributed across the filter.

Usage of Interlocked.CompareExchange [interlocked], which allows concurrent atomic CPU operations to perform updates on data values with minimal overhead, see Figure 9. In order to use these operations with the space-efficient data structure, we need to instantiate our underlying array with ref 'a values. In case the reader is not convinced that the minimum and maximum numbers can be found in parallel with these operations, it can easily be refactored to a binary tree of height h = log₂ n by using a fork/join approach, where the values of the comparisons are stored locally. It's easy to see that the span, the critical path, is still O(lg n) time but with a bit more space overhead, O(n), as described in the previous section. Please look into the Atomic and BloomFilter modules for more information.

Figure 9: Several processors reading from the same shared memory but only one can write atomically in a clock cycle.

In the parallel versions of the algorithm, we compare true parallelism with Array.Parallel (system threads) against concurrency with Async.StartChild (lightweight threads) in a fork/join approach. The first has an extra memory overhead, as the operations require allocating all the values in the range from minimum to maximum in an array in order to filter out the values that are definitely not in the set. The second approach has less memory overhead, as we can use the same approach as described in the previous paragraph for finding the minimum and maximum numbers in parallel, and merge lists by using a lazy list that allows concatenation in constant time, O(1). See the BloomFilter and List.Lazy modules for more information.
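The constant-time concatenation of the lazy lists can be illustrated with Python generators. This is only a sketch of the idea, not the `List.Lazy` module itself, which additionally performs an ordered merge of its two arguments:

```python
from itertools import chain

def merge(left, right):
    """O(1) concatenation: chain only builds a view over the two
    sequences; no element is copied until the result is consumed."""
    return chain(left, right)

# Leaves yield [] or [value]; the joins concatenate without copying.
merged = merge(merge(iter([1]), iter([])), merge(iter([2]), iter([3])))
```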

A full implementation can be seen in Appendix A: Source Code.

5. Concluding remarks

We have presented both a sequential and a parallel algorithm that use the Bloom Filter data structure as an underlying component in order to sort an array with a very small collision probability of 0.00001%, which can be adjusted for optimal results based on the size of the array.

We argue that the sequential algorithm's asymptotic complexity is Θ(m) in time, as the work is in Θ(m), where m is the length of the interval from the minimum value to the maximum value. As this is a sequential algorithm, the span matches the work. It's also worth noticing that we only need n extra space, where n is the length of the array to be sorted.

And we argue that the parallel algorithm's asymptotic complexity is in logarithmic time, as the dominating operation is bounded by the span O(lg n), where n is the length of the array. We also argue that the work is Θ(m), where m is the length of the interval from the minimum value to the maximum value in the array, and therefore we will need m extra space.

5.1 Notes:

  • In order to ease the understanding of the span of the parallel algorithm, the following example was used: Let's assume that our list contains some 32-bit uints where the minimum value equals 0 and the maximum value equals 4,294,967,295 (2³² - 1). Then our span would be O(32) as log₂ 4,294,967,295 ≈ 32.

  • In order to limit the amount of work, it would be ideal to use this sorting algorithm on data that is not too scattered, as the amount of work is defined by the interval from the minimum to the maximum value of the array.

  • As this is a probabilistic sorting algorithm, there is a small chance that the given array is not sorted after executing the algorithm. Therefore this algorithm shouldn't be used for critical operations.

  • It can easily be checked whether an array is sorted, in O(lg n) where n is the size of the array, by using the fork/join parallel approach mentioned in section 3. Parallelizing the algorithm: you just need to ensure that the last element of the merged left array is less than or equal to the head of the right array and bubble that boolean value up to the root of the tree. In case the array is not sorted, the sorting algorithm could be executed again with an even lower probability, which would result in a bit more memory overhead and work.
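The sortedness check from the last note can be sketched with the same fork/join shape. A Python sketch (the two halves deliberately share the split index, so the adjacent pair at the seam is checked too):

```python
def is_sorted(xs, lo, hi):
    """Fork/join check over xs[lo..hi]: leaves compare adjacent elements
    and the booleans bubble up to the root in log2(n) levels."""
    if hi - lo <= 1:
        return hi <= lo or xs[lo] <= xs[hi]
    mid = (lo + hi) >> 1
    left = is_sorted(xs, lo, mid)    # fork: both halves include index mid,
    right = is_sorted(xs, mid, hi)   # so the boundary pair is covered
    return left and right            # join
```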

References

[bloom] Bloom, Burton H. (1970), “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Communications of the ACM, 13 (7): 422–426, doi:10.1145/362686.362692
[eppstein] David_Eppstein/Gallery
https://commons.wikimedia.org/wiki/User:David_Eppstein/Gallery
Accessed: 2017-12-03
[cao] Bloom Filters - the math
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Accessed: 2017-12-03
[smhasher] SMHasher is a test suite designed to test the distribution, collision, and performance properties of non-cryptographic hash functions.
https://github.com/aappleby/smhasher
Accessed: 2017-12-03
[atomic] Atomic Operations.
https://software.intel.com/en-us/node/506090
Accessed: 2017-12-03
[interlocked] Interlocked Operations.
https://docs.microsoft.com/en-us/dotnet/standard/threading/interlocked-operations
Accessed: 2017-12-03

Bibtex

If you want to cite this blog post, you can use the following BibTeX information:

@online{bfsort,
author = {Ramón Soto Mathiesen and Johan Thillemann},
title = {Bloom Filter Sort (bfsort)},
year = 2017,
url = {http://blog.stermon.com/articles/2017/12/06/bloom-filter-sort-bfsort},
urldate = {2017-12-06}
}

Appendix A: Source Code

#!/usr/bin/env fsharpi

(* This construct is for ML compatibility. The syntax '(typ,...,typ) ident'
is not used in F# code. Consider using 'ident<typ,...,typ>' instead. *)

#nowarn "62"

(*
bfsort: Bloom Filter Sort
*)


let inline (>>=) m f = Option.bind f m

[<RequireQualifiedAccess>]
module Assert =

let sorted : int array -> bool =
fun xs ->
let rec aux acc = function
| [] | [ _ ] -> acc
| x :: [ y ] -> (x <= y && acc)
| x :: y :: xs -> aux (x <= y && acc) (y :: xs)
xs |> Array.toList |> aux true

[<RequireQualifiedAccess>]
module Atomic =

open System.Threading

(*
Interlocked Class:

https://msdn.microsoft.com/en-us/library/
system.threading.interlocked(v=vs.110).aspx
*)

let rec private aux : (int -> int -> int) -> int ref -> int -> int =
fun f a b ->
let x = !a
let y = f x b
match x = Interlocked.CompareExchange(a,y,x) with
| true -> y
| false -> aux f a b

let or' : int ref -> int -> int =
fun x y -> aux ( ||| ) x y

let min : int ref -> int -> int =
fun x y -> aux min x y

let max : int ref -> int -> int =
fun x y -> aux max x y

[<RequireQualifiedAccess>]
module Binary =

open System

let mask = function | true -> '■' | false -> '□'

let fromInt : int -> int -> (bool -> char) option -> string option =
fun bits n mask ->
let con = function | '1' -> true | _____ -> false
let def = function | true -> '1' | false -> '0'
let bin =
Convert.ToString(n,2).PadLeft(bits,'0')
|> Seq.map (con >> (defaultArg mask def) >> string)
|> Seq.fold ( + ) ""

match bin |> String.length = bits with
| true -> Some bin
| false -> None

[<RequireQualifiedAccess>]
module Bytes =

[<RequireQualifiedAccess>]
module BigEndian =

open System

let private littleEndian = BitConverter.IsLittleEndian

let toUint32 : byte array -> uint32 =
fun bs ->
let bs' = if littleEndian then bs |> Array.rev else bs
(bs',0) |> BitConverter.ToUInt32

[<RequireQualifiedAccess>]
module Random =

open System

let private r = Random ()

let byte ( ) = r.Next(0 , 1 <<< 8)
let next lb ub = r.Next(lb, ub )

[<RequireQualifiedAccess>]
module Crypto =

open System.Security.Cryptography

let next () =

use rng = new RNGCryptoServiceProvider ()
let bs = Array.Parallel.init 32 (fun _ -> 0uy)

rng.GetNonZeroBytes bs
Bytes.BigEndian.toUint32 bs

[<RequireQualifiedAccess>]
module Hash =

let private string2bytes : string -> byte array =
System.Text.Encoding.UTF8.GetBytes

(*
FVN Hash:

http://isthe.com/chongo/tech/comp/fnv/
*)

let fnv1aHash : string -> uint32 =
fun key ->
(* Example: ./fnv1a32 -s foo => 0xa9f37ed7 *)
let fnvp = (1 <<< 24) + (1 <<< 8) + 0x93 |> uint32
let fnvob = 2166136261u

let b = key |> string2bytes

let rec aux h = function
| i when i < (b.Length) ->
let h' =
h ^^^ (b.[i] |> uint32)
|> fun x -> x * fnvp
aux h' (i+1)
| _ -> h
aux fnvob 0

(*
Murmur Hash:

https://sites.google.com/site/murmurhash/
*)

let murmur3Hash : uint32 -> string -> uint32 =
fun seed key ->
(* Example: ./murmur3 foo => x86_32: b12f489e (seed 42) *)
let rotl x r = (x <<< r) ||| (x >>> (32 - r))
let fmix h =
h
|> fun x -> x ^^^ (x >>> 16)
|> fun x -> x * 0x85ebca6bu
|> fun x -> x ^^^ (x >>> 13)
|> fun x -> x * 0xc2b2ae35u
|> fun x -> x ^^^ (x >>> 16)
let getblock b i =
System.BitConverter.ToUInt32(value = b, startIndex = i)

let data = key |> string2bytes
let len = data |> Array.length
let nblocks = len >>> 2 (* equivalent to len / 4 but faster *)
let h1 = seed
let c1 = 0xcc9e2d51u
let c2 = 0x1b873593u

(* body *)
let rec body h = function
| i when i < (nblocks) ->
let k1 =
getblock data (i * 4)
|> fun x -> x * c1
|> fun x -> rotl x 15
|> fun x -> x * c2
let h' =
h ^^^ k1
|> fun x -> rotl x 13
|> fun x -> x * 5u + 0xe6546b64u
body h' (i+1)
| _ -> h
let h1' = body h1 0

(* tail *)
let tail = nblocks * 4
let rec tail' (k,h) = function
| 0 -> h
| 1 ->
let k' =
k ^^^ (uint32 data.[tail])
|> fun x -> x * c1
|> fun x -> rotl x 15
|> fun x -> x * c2
let h' = h ^^^ k'
tail' (k',h') (0)
| i ->
let k' =
(uint32 data.[tail + (i - 1)]) <<< (1 <<< (i + 1))
|> fun x -> k ^^^ x
tail' (k',h) (i-1)
let h1'' = tail' (0u,h1') (len &&& 3)

(* finalization *)
h1'' ^^^ (uint32 len)
|> fun x -> x |> fmix

[<RequireQualifiedAccess>]
module Logarithm =

(*
ISO 31-11:

https://en.wikipedia.org/wiki/
ISO_31-11#Exponential_and_logarithmic_functions
*)

let ln : double -> double = log
let lb : double -> double = fun x -> System.Math.Log (x,2.)
let lg : double -> double = fun x -> System.Math.Log10 (x )

[<RequireQualifiedAccess>]
module List =

[<RequireQualifiedAccess>]
module Lazy =

type 'a lazylist = Nil | Cons of 'a * (unit -> 'a lazylist)

let rec merge : 'a lazylist * 'a lazylist -> 'a lazylist = function
| xs, Nil | Nil, xs -> xs
| Cons (x,xs), Cons (y,ys) ->
let x' = min x y
let y' = max x y
Cons (x', fun () -> merge (xs (), Cons (y', fun () -> ys())))

let rec toList : 'a lazylist -> 'a list = function
| Nil -> []
| Cons (x, xs) -> x :: (toList (xs()))

let toArray : 'a lazylist -> 'a array = fun xs -> toList xs |> List.toArray

[<RequireQualifiedAccess>]
module BloomFilter =

type bloomFilter = private Config of config
and private config =
{ parameters : parameters
bits : int ref array
seeds : uint32 array
hashes : (string -> uint32) array
}
and private parameters =
{ n : uint32
p : double
k : int
m : uint32
}

let init : uint32 -> double -> uint32 array option -> bloomFilter option =
fun n p seeds ->
let m =
(* optimal number of bits: m = -(n * ln p) / (ln 2)^2 *)
-(float n * Logarithm.ln p) / (Logarithm.ln 2. ** 2.)
(* round to nearest 32 bits *)
|> uint32 >>> 5 <<< 5
let k =
(*
Optimal number of hashes:

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
*)

(float m / float n) * Logarithm.ln 2.
|> (ceil >> int)
let bs = Array.Parallel.init (m >>> 5 |> int) (fun _ -> ref 0)

match seeds with
| Some ss ->
if ss |> Array.length = k then
Config
{ parameters =
{ n = n
p = p
k = k
m = m
}
bits = bs
seeds = ss
hashes = ss |> Array.Parallel.map Hash.murmur3Hash
}
|> Some
else
None
| None ->
let ss = Array.Parallel.init k (fun _ -> Random.Crypto.next ())
Config
{ parameters =
{ n = n
p = p
k = k
m = m
}
bits = bs
seeds = ss
hashes = ss |> Array.Parallel.map Hash.murmur3Hash
}
|> Some

let info : bloomFilter -> (uint32 * double * int * uint32 array) option =
fun (Config { parameters = { n = n; p = p; k = k }; seeds = ss; })->
(n,p,k,ss) |> Some

let private bytebit hash =
let byte = hash >>> 5
let bit = hash - (byte <<< 5)
byte,bit

let private sbit : int -> int = fun n -> 1 <<< (31-n)
let private gbit : int -> int -> int =
fun i32 n ->
System.Convert.ToString(i32,2).PadLeft(32,'0').[n] |> string |> int

let add : string -> bloomFilter -> unit option =
fun key (Config { parameters = { m = m; }; hashes = hs; bits = bs; }) ->
hs
|> Array.Parallel.map (fun f -> f key % m |> (int >> bytebit))
|> Array.Parallel.iter(fun (x,y) -> Atomic.or' bs.[x] (sbit y) |> ignore)
|> Some

let query : string -> bloomFilter -> bool option =
fun key (Config { parameters = { m = m; }; hashes = hs; bits = bs; }) ->
hs
|> Array.Parallel.map (fun f -> f key % m |> (int >> bytebit))
|> Array.fold(fun a (x,y) -> (gbit !bs.[x] y = 1) && a) true
|> Some

let print : bloomFilter -> string array option =
fun (Config { bits = bs; }) ->
let masked = fun n -> n |> Binary.fromInt 32 <| Some Binary.mask
bs
|> Array.Parallel.map (fun n -> masked !n)
|> Array.Parallel.choose id
|> Some

let length : bloomFilter -> int option =
fun (Config { bits = bs; }) ->
bs
|> Array.length
|> Some

let sort : int array -> double option -> int array =
fun xs probability ->
let size = xs |> Array.length
let minimum = System.Int32.MaxValue |> ref
let maximum = System.Int32.MinValue |> ref

(*
Probability of collisions:

0.1 -> 10%
0.01 -> 1%
0.001 -> 0.1%
0.0001 -> 0.01%
0.00001 -> 0.001%
0.000001 -> 0.0001%
0.0000001 -> 0.00001% (still with 2*n space)
*)

let collision = defaultArg probability 1.e-7

(*
Create & Update Bloom Filter, Min & Max values in parallel
*)

let bf = init (uint32 size) collision None
xs
|> Array.Parallel.iter (
fun k ->
Atomic.min minimum k |> ignore
Atomic.max maximum k |> ignore
bf >>= add (string k) |> ignore
)

(*
Create a new array with sorted values, sequentially.

Note: Extra space is bounded by the size of the array
*)

let rec aux : int list -> int -> int array =
fun acc n ->
let acc' =
if bf >>= query (string n) = Some true then
n :: acc
else
acc
match n with
| i when i = !minimum -> List.toArray acc'
| i -> aux acc' (i-1)
aux [] !maximum

[<RequireQualifiedAccess>]
module Parallel =

let sort : int array -> double option -> int array =
fun xs probability ->
let size = xs |> Array.length
let minimum = System.Int32.MaxValue |> ref
let maximum = System.Int32.MinValue |> ref

(*
Probability of collisions:

0.1 -> 10%
0.01 -> 1%
0.001 -> 0.1%
0.0001 -> 0.01%
0.00001 -> 0.001%
0.000001 -> 0.0001%
0.0000001 -> 0.00001% (still with 2*n space)
*)

let collision = defaultArg probability 1.e-7

(*
Create & Update Bloom Filter, Min & Max values in parallel
*)

let bf = init (uint32 size) collision None
xs
|> Array.Parallel.iter (
fun k ->
Atomic.min minimum k |> ignore
Atomic.max maximum k |> ignore
bf >>= add (string k) |> ignore
)

(*
Create a new array with sorted values in parallel.

Note: Parallel version would require more than O(N) of extra space
*)

Array.Parallel.init (!maximum - !minimum + 1) (
fun i ->
let n = !minimum + i
if bf >>= query (string n) = Some true then
Some n
else
None
)
|> Array.Parallel.choose id

[<RequireQualifiedAccess>]
module Async =

[<RequireQualifiedAccess>]
module Parallel =

let merge : 'a array -> 'a array -> 'a array =
fun xs ys ->
let n,m = Array.length xs, Array.length ys
Array.Parallel.init (n+m) (
fun i -> if i < n then xs.[i] else ys.[i-n]
)

let sort : int array -> double option -> int array =
fun xs probability ->
let size = xs |> Array.length
let minimum = System.Int32.MaxValue |> ref
let maximum = System.Int32.MinValue |> ref

(*
Probability of collisions:

0.1 -> 10%
0.01 -> 1%
0.001 -> 0.1%
0.0001 -> 0.01%
0.00001 -> 0.001%
0.000001 -> 0.0001%
0.0000001 -> 0.00001% (still with 2*n space)
*)

let collision = defaultArg probability 1.e-7

(*
Create & Update Bloom Filter, Min & Max values in parallel
*)

let bf = init (uint32 size) collision None
xs
|> Array.Parallel.iter (
fun k ->
Atomic.min minimum k |> ignore
Atomic.max maximum k |> ignore
bf >>= add (string k) |> ignore
)

(* This implementation is slower than when using lazy lists

let rec aux low high =
async {
if low < high then
let mid = low + high >>> 1
let! fork = Async.StartChild (aux low mid)
let! main = aux (mid+1) high
let! join = fork
return merge join main
else
let condition =
if bf >>= query (string low) = Some true then
[| low |]
else
[| |]
return condition
}

aux !minimum !maximum
|> Async.RunSynchronously

*)


let rec aux low high =
async {
if low < high then
let mid = low + high >>> 1
let! fork = Async.StartChild (aux low mid)
let! main = aux (mid+1) high
let! join = fork
return List.Lazy.merge (join, main)
else
let condition =
if bf >>= query (string low) = Some true then
List.Lazy.Cons (low, fun () -> List.Lazy.Nil)
else
List.Lazy.Nil
return condition
}

aux !minimum !maximum
|> Async.RunSynchronously
|> List.Lazy.toArray

(* Explanation with the finite byte Set: {0} ∪ ℕ<0xFF *)
let explanation () =

let pprint : 'a array -> unit =
fun xs ->
xs |> Array.chunkBySize 2 |> Array.iter (printfn "%A")

(* Generate unique random values in a specific range *)
let upper = System.Byte.MaxValue |> int
let lower = System.Byte.MinValue |> int
let size = (upper - lower) >>> 3
let randoms =
Array.Parallel.init size (fun _ -> Random.next lower upper)
|> Array.distinct

printfn "Random array of unique bytes:"
randoms |> printfn "%A"

(* Init with constraints *)
let bf = BloomFilter.init (uint32 size) 0.0000001 <| None

(* Info *)
bf >>= BloomFilter.info >>= (printfn "Info:\n%A" >> Some) |> ignore

(* Before *)
printfn "Before:"
bf >>= BloomFilter.print >>= (pprint >> Some) |> ignore

(* Populate BF and narrow the Set by setting minimum and maximum values *)
let min' = System.Byte.MaxValue |> int |> ref
let max' = System.Byte.MinValue |> int |> ref
let populate =
randoms
|> Array.Parallel.iter(
fun b ->
Atomic.min min' b |> ignore
Atomic.max max' b |> ignore
bf >>= BloomFilter.add (string b) |> ignore
)

(* After *)
printfn "After:"
bf >>= BloomFilter.print >>= (pprint >> Some) |> ignore

(* Sort with a collision probability of 0.0000001 *)
let xs' =
let rec aux lower high =
async {
if lower < high then
let mid = lower + high >>> 1
let! fork = Async.StartChild (aux lower mid)
let! main = aux (mid+1) high
let! join = fork
return List.Lazy.merge (join, main)
else
let condition =
if bf >>= BloomFilter.query (string lower) = Some true then
List.Lazy.Cons (lower, fun () -> List.Lazy.Nil)
else
List.Lazy.Nil
return condition
}

aux !min' !max'
|> Async.RunSynchronously
|> List.Lazy.toArray

printfn "Array sorted: %b" <| Assert.sorted xs'
xs' |> printfn "%A"

(* Example of usage of the sorting algorithm *)
let examples () =
(* Generate unique random values in a specific range *)
let upper = 355_000
let lower = 115_050
let size = (upper - lower) >>> 1
let randoms =
Array.Parallel.init size (fun _ -> Random.next lower upper)
|> Array.distinct

let xs = randoms |> BloomFilter.sort <| None
let ys = randoms |> BloomFilter.Parallel.sort <| Some 1.e-7
let zs = randoms |> BloomFilter.Async.Parallel.sort <| Some 1.e-7

printfn "Assert arrays are sorted: %b"
(Assert.sorted xs && Assert.sorted ys && Assert.sorted zs)

explanation ()
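For readers without an F# toolchain handy, the core of the sequential sort above can be sketched in a few lines of Python. This is an illustrative sketch, not a port: a plain `set` stands in for the Bloom Filter (so the false-positive probability drops from p to 0), and the name `bf_sort` is made up for the example.

```python
def bf_sort(xs):
    # Membership structure standing in for the Bloom Filter; a real
    # filter would use O(m) bits and k hash functions instead.
    bf = set(xs)
    # The Atomic.min / Atomic.max step.
    lo, hi = min(xs), max(xs)
    # The sequential reconstruction: one pass over { lo, ..., hi },
    # keeping every candidate the filter reports as present.
    return [n for n in range(lo, hi + 1) if n in bf]

print(bf_sort([42, 7, 19, 3, 25]))  # [3, 7, 19, 25, 42]
```

The loop makes the O(m) time / O(n) space trade-off of the sequential version visible: time is spent walking the whole range { min, …, max }, while the extra space is only the recorded membership.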

Files

mon@razerRamon:~/tmp/haskell/granulate-effects$ ll
total 12K
-rwxr-xr-x 1 mon mon 3.2K Nov 24 20:57 GranulateEffects.hs*
mon@razerRamon:~/tmp/haskell/granulate-effects$

Haskell Code Snippet

#!/usr/bin/env stack
{- stack
--resolver lts-8.21
--install-ghc
runghc
--package http-client
--package http-client-tls
--
-Wall -Werror
-}


module GranulateEffects (main) where

import qualified Data.ByteString.Lazy.Char8 as L8

{- https://haskell-lang.org/library/http-client -}
import Network.HTTP.Client
{- https://hackage.haskell.org/package/http-client-tls -}
import Network.HTTP.Client.TLS (tlsManagerSettings)

{- As Kris Jenkins (@krisajenkins) says: "IO is the SUDO of side-effects" -}
getHtml :: String -> IO String
getHtml url = do
{- To get secure HTML we will need: TLS manager, request and response -}
manager <- newManager tlsManagerSettings
request <- parseRequest url
response <- httpLbs request manager
let html = L8.unpack $ responseBody response

{- Because we aren't limiting effects, we can also log to a file -}
writeFile "./logging.txt" html

return html

{- Let's granulate our effects to what we really want -}
class Monad m =>
MmanagerTLS m where
newManager' :: ManagerSettings -> m Manager

instance MmanagerTLS IO where
newManager' = newManager

class Monad m =>
Mrequest m where
parseRequest' :: String -> m Request

instance Mrequest IO where
parseRequest' = parseRequest

class Monad m =>
Mresponse m where
httpLbs' :: Request -> Manager -> m (Response L8.ByteString)

instance Mresponse IO where
httpLbs' = httpLbs

{- We update the signature and add our granulated prime functions -}
getHtml' :: (MmanagerTLS m, Mrequest m, Mresponse m) => String -> m String
getHtml' url = do
{- To get secure HTML we will need: TLS manager, request and response -}
manager <- newManager' tlsManagerSettings
request <- parseRequest' url
response <- httpLbs' request manager
let html = L8.unpack $ responseBody response

{- And we are no longer allowed to log to a file, only retrieve HTML -}
-- writeFile "./logging.txt" html
{- error:
• Couldn't match type ‘m’ with ‘IO’
‘m’ is a rigid type variable bound by
the type signature for:
getHtml' :: forall (m :: * -> *).
(MmanagerTLS m, Mrequest m, Mresponse m) =>
String -> m String
at GranulateEffects.hs:70:13
Expected type: m ()
Actual type: IO ()
• In a stmt of a 'do' block: writeFile "./logging.txt" html
In the expression:
do { manager <- newManager' tlsManagerSettings;
request <- parseRequest' url;
response <- httpLbs' request manager;
let html = L8.unpack $ responseBody response;
.... }
In an equation for ‘getHtml'’:
getHtml' url
= do { manager <- newManager' tlsManagerSettings;
request <- parseRequest' url;
response <- httpLbs' request manager;
.... }
• Relevant bindings include
getHtml' :: String -> m String (bound at GranulateEffects.hs:71:1)
-}


return html

{- And of course it's easier to just put everything inside the same monad -}
class Monad m =>
HtmlHttpsM m where
newManager'' :: ManagerSettings -> m Manager
parseRequest'' :: String -> m Request
httpLbs'' :: Request -> Manager -> m (Response L8.ByteString)

instance HtmlHttpsM IO where
newManager'' = newManager
parseRequest'' = parseRequest
httpLbs'' = httpLbs

{- We update the signature and add our granulated prime prime functions -}
getHtml'' :: (HtmlHttpsM m) => String -> m String
getHtml'' url = do
{- To get secure HTML we will need: TLS manager, request and response -}
manager <- newManager'' tlsManagerSettings
request <- parseRequest'' url
response <- httpLbs'' request manager
let html = L8.unpack $ responseBody response

{- And we are still not allowed to log to a file, only retrieve HTML -}
-- writeFile "./logging.txt" html
{- error:
• Couldn't match type ‘m’ with ‘IO’
‘m’ is a rigid type variable bound by
the type signature for:
getHtml'' :: forall (m :: * -> *).
HtmlHttpsM m =>
String -> m String
at GranulateEffects.hs:111:14
Expected type: m ()
Actual type: IO ()
• In a stmt of a 'do' block: writeFile "./logging.txt" html
In the expression:
do { manager <- newManager'' tlsManagerSettings;
request <- parseRequest'' url;
response <- httpLbs'' request manager;
let html = L8.unpack $ responseBody response;
.... }
In an equation for ‘getHtml''’:
getHtml'' url
= do { manager <- newManager'' tlsManagerSettings;
request <- parseRequest'' url;
response <- httpLbs'' request manager;
.... }
• Relevant bindings include
getHtml'' :: String -> m String
(bound at GranulateEffects.hs:112:1)
-}


return html

main :: IO ()
main = do
html0 <- getHtml "https://www.google.dk/"
html1 <- getHtml' "https://www.google.dk/"
html2 <- getHtml'' "https://www.google.dk/"
print $ html0
print $ html1
print $ html2

Haskell Code output:

mon@razerRamon:~/tmp/haskell/granulate-effects$ ./GranulateEffects.hs
<!doctype html> ... </script></div></body></html>
<!doctype html> ... </script></div></body></html>
<!doctype html> ... </script></div></body></html>
mon@razerRamon:~/tmp/haskell/granulate-effects$
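The same granulation idea can be approximated in Python with `typing.Protocol`: hand the function only the capability it needs, and a type checker will reject any attempt to reach for the file system. This is a hedged sketch with invented names (`HttpGet`, `FakeHttp`, `get_html`), not a translation of the type classes above.

```python
from typing import Protocol

class HttpGet(Protocol):
    # The one and only capability get_html is granted.
    def get(self, url: str) -> str: ...

def get_html(http: HttpGet, url: str) -> str:
    # Only http.get is in scope here; unlike the first Haskell
    # version, there is no handle left to express the logging
    # side effect with.
    return http.get(url)

class FakeHttp:
    # A pure test double standing in for the TLS manager + request.
    def get(self, url: str) -> str:
        return "<!doctype html> ... </html>"

print(get_html(FakeHttp(), "https://www.google.dk/"))
```

The test double also shows a side benefit the Haskell post hints at: once effects are granulated, a pure fake instance is enough to exercise the function.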

Files

mon@razerRamon:~/tmp/haskell/granulate-effects$ ll
total 68K
-rwxr-xr-x 1 mon mon 3.2K Nov 24 20:57 GranulateEffects.hs*
-rw-r--r-- 1 mon mon 46K Nov 24 21:29 logging.txt
mon@razerRamon:~/tmp/haskell/granulate-effects$


Huffman Code Snippet

#!/usr/bin/env fsharpi

(* This construct is for ML compatibility. The syntax '(typ,...,typ) ident'
is not used in F# code. Consider using 'ident<typ,...,typ>' instead. *)

#nowarn "62"

[<RequireQualifiedAccess>]
module Huffman =

type tree =
| Leaf of frecuency * char
| Node of frecuency * tree * tree
and frecuency = int

let frecuency : tree -> frecuency * tree = function
| Leaf (f,_) as t -> f,t
| Node (f,_,_) as t -> f,t

let encoding : string -> tree =
fun text ->
let rec aux : tree Set -> tree =
fun acc0 ->
if Set.isEmpty acc0 then
failwith "Set is empty"
else if Set.count acc0 = 1 then
Set.minElement acc0
else
let (f1,x) = acc0 |> Set.minElement |> frecuency
let (acc1) = acc0 |> Set.remove x
let (f2,y) = acc1 |> Set.minElement |> frecuency
let (acc2) = acc1 |> Set.remove y

acc2 |> Set.add (Node (f1+f2,x,y)) |> aux
text
|> Seq.groupBy id
|> Seq.map (fun (x,xs) -> xs |> Seq.length,x)
|> Set.ofSeq
|> Set.map Leaf
|> aux

let tree2charmap : tree -> (char, int list) Map =
fun t ->
let rec aux acc : tree -> (char, int list) Map = function
| Leaf (_,c) -> Map.empty |> Map.add c (List.rev acc)
| Node (_,l,r) ->
let ml = aux (0::acc) l
let mr = aux (1::acc) r
ml |> Map.fold(fun a k v -> a |> Map.add k v) mr
aux [] t

let table : (char, int list) Map -> string =
fun map ->
map
|> Map.toList
|> List.map (fun (k,v) -> sprintf "'%c': %A\n" k v)
|> List.fold (fun a x -> a + x) ""

let tree2codemap : tree -> (int list, char) Map =
fun t ->
let rec aux acc : tree -> (int list, char) Map = function
| Leaf (_,c) -> Map.empty |> Map.add (List.rev acc) c
| Node (_,l,r) ->
let ml = aux (0::acc) l
let mr = aux (1::acc) r
ml |> Map.fold(fun a k v -> a |> Map.add k v) mr
aux [] t

let compress : (char, int list) Map -> string -> string =
fun map text ->
text
|> Seq.map (fun c -> map |> Map.find c)
|> Seq.collect id
|> Seq.fold (fun a x -> a + (string x)) ""

let decompress : (int list, char) Map -> string -> string =
fun map code ->
let rec aux acc code = function
| [ ] -> acc |> List.rev
| x :: xs ->
let code' = code @ [x]
let copt = map |> Map.tryFind code'
let acc',code'' =
match copt with
| Some c -> c::acc, [ ]
| None -> acc, code'
aux acc' code'' xs
let xs = code |> Seq.map (string >> int) |> Seq.toList
aux [] [] xs |> List.fold (fun a x -> a + (string x)) ""

Huffman Code output:

module Huffman = begin
type tree =
| Leaf of frecuency * char
| Node of frecuency * tree * tree
and frecuency = int
val frecuency : _arg1:tree -> frecuency * tree
val encoding : text:string -> tree
val tree2charmap : t:tree -> Map<char,int list>
val table : map:Map<char,int list> -> string
val tree2codemap : t:tree -> Map<int list,char>
val compress : map:Map<char,int list> -> text:string -> string
val decompress : map:Map<int list,char> -> code:string -> string
end

Execution Code Snippet

let text = """this is an example of a huffman tree"""

let encoding =
text
|> Huffman.encoding

let tree2charmap =
encoding
|> Huffman.tree2charmap

let tree2codemap =
encoding
|> Huffman.tree2codemap

let table =
tree2charmap
|> Huffman.table

let compressed =
text
|> Huffman.compress tree2charmap

let decompressed =
compressed
|> Huffman.decompress tree2codemap

printfn "* Text:\n«%s»\n" text
printfn "* Table:\n%s" table
printfn "* Compressed:\n%s\n" compressed
printfn "* Decompressed:\n«%s»\n" decompressed
mon@razerRamon:~/tmp$ ./HuffmanCoding.fsx

Execution Code output:

* Text:
«this is an example of a huffman tree»

* Table:
' ': [1; 1; 1]
'a': [0; 0; 1]
'e': [1; 1; 0]
'f': [0; 0; 0]
'h': [1; 0; 1; 1; 0]
'i': [1; 0; 1; 1; 1]
'l': [1; 0; 0; 0; 0]
'm': [0; 1; 0; 0]
'n': [0; 1; 0; 1]
'o': [1; 0; 0; 0; 1]
'p': [1; 0; 0; 1; 0]
'r': [1; 0; 0; 1; 1]
's': [0; 1; 1; 0]
't': [0; 1; 1; 1]
'u': [1; 0; 1; 0; 0]
'x': [1; 0; 1; 0; 1]

* Compressed:
0111101101011101101111011101101110010101111110101010010100100101000011011110001000111001111101101010000000001000010101111011110011110110

* Decompressed:
«this is an example of a huffman tree»
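The round trip above is easy to reproduce in Python, using `heapq` where the F# version leans on `Set.minElement` to pick the two least frequent trees. An illustrative sketch with invented names; tie-breaking differs from the Set ordering, so individual codes may differ from the table above even though the round trip still holds.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Merge the two least frequent trees until one remains,
    # mirroring the recursive `aux` over the F# Set.
    heap = [(f, i, c) for i, (c, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    fresh = len(heap)  # unique tiebreaker so trees are never compared
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, fresh, (left, right)))
        fresh += 1
    codes = {}
    def walk(node, acc):
        if isinstance(node, tuple):            # internal Node
            walk(node[0], acc + "0")
            walk(node[1], acc + "1")
        else:                                  # Leaf
            codes[node] = acc or "0"
    walk(heap[0][2], "")
    return codes

def compress(codes, text):
    return "".join(codes[c] for c in text)

def decompress(codes, bits):
    inverse, out, acc = {v: k for k, v in codes.items()}, [], ""
    for b in bits:
        acc += b
        if acc in inverse:                     # prefix-free, so greedy decoding works
            out.append(inverse[acc])
            acc = ""
    return "".join(out)

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
assert decompress(codes, compress(codes, text)) == text
```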


Real World Code Snippet

module RealWorld : sig

type world = private World
type α pure = private Pure of α
type α io = world → (α pure * world)

val bind : α io → β io → β io
val lift : α io → (α pure → β io) → β io

val ( >> ) : α io → β io → β io
val ( >>= ) : α io → (α pure → β io) → β io

val unit : unit pure

val effect : (α → β) → α pure → β io
val eval : unit io → (unit pure * world)

end

= struct

type world = World
type α pure = Pure of α
type α io = world → (α pure * world)

let bind : α io → β io → β io =
λ action1 action2 world0 →
let (a,world1) = action1 world0 in
let (b,world2) = action2 world1 in
(b,world2)

let lift : α io → (α pure → β io) → β io =
λ action1 action2 world0 →
let (a,world1) = action1 world0 in
let (b,world2) = action2 a world1 in
(b,world2)

let ( >> ) : α io → β io → β io = bind
let ( >>= ) : α io → (α pure → β io) → β io = lift

let unit : unit pure = Pure ()

let effect : (α → β) → α pure → β io =
λ f (Pure a) →
λ world → Pure (f a), world

let eval : unit io → (unit pure * world) =
λ main → main World

end

Real World Code output:

module RealWorld :
sig
type world = private World
type 'a pure = private Pure of 'a
type 'a io = world -> 'a pure * world
val bind : 'a io -> 'b io -> 'b io
val lift : 'a io -> ('a pure -> 'b io) -> 'b io
val ( >> ) : 'a io -> 'b io -> 'b io
val ( >>= ) : 'a io -> ('a pure -> 'b io) -> 'b io
val unit : unit pure
val effect : ('a -> 'b) -> 'a pure -> 'b io
val eval : unit io -> unit pure * world
end

Utils Code Snippet

module Util = struct                                

open RealWorld

let (!) : α → α io =
λ a → effect (λ _ → a) unit

let readLn : string io =
effect read_line unit

let putStr : string pure → unit io =
effect print_string

let putStrLn : string pure → unit io =
effect print_endline

let sample : unit io =
! "What is your name?"
>>= putStrLn
>> readLn
>>= λ a →
! "How old are you?"
>>= putStrLn
>> readLn
>>= λ b →
putStr a
>> ! ": "
>>= putStr
>> putStrLn b

end

Utils Code output:

module Util :
sig
val ( ! ) : 'a -> 'a RealWorld.io
val readLn : string RealWorld.io
val putStr : string RealWorld.pure -> unit RealWorld.io
val putStrLn : string RealWorld.pure -> unit RealWorld.io
val sample : unit RealWorld.io
end

Execution Code Snippet

let _ = Util.sample |> RealWorld.eval 
mon@razerRamon:~/tmp/ocaml$ ocaml real_world.ml

Execution Code output:

What is your name?
John Doe
How old are you?
42
John Doe: 42
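The world-passing trick is not tied to OCaml. A minimal Python sketch (all names invented) shows the mechanics: every action is a function from an opaque world token to a (value, world) pair, and `bind` threads the token through so actions are forced into sequence.

```python
WORLD = object()  # the single, opaque World value

def pure(a):
    # Lift a value into an action that leaves the world untouched.
    return lambda world: (a, world)

def bind(action, f):
    # Run `action`, feed its value to `f`, and thread the world
    # through -- the >>= / lift from the module above.
    def run(world0):
        a, world1 = action(world0)
        return f(a)(world1)
    return run

def effect(f):
    # Wrap an ordinary function as value -> (world -> (result, world)).
    return lambda a: lambda world: (f(a), world)

def eval_io(main):
    return main(WORLD)

# A tiny pipeline in the shape of Util.sample:
value, _ = eval_io(bind(pure("what is your name?"), effect(str.upper)))
print(value)  # WHAT IS YOUR NAME?
```

Because the only way to obtain a result is to hand the world token from one action to the next, evaluation order is pinned down exactly as in the signature `unit io → (unit pure * world)`.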


Real World Code Snippet

module RealWorld =

type World = private | World
and 'a Pure = private | Pure of 'a
and 'a IO = World -> ('a Pure * World)

let bind : 'a IO -> 'b IO -> 'b IO =
fun a1 a2 world0 ->
let (a,world1) = a1 world0 in
let (b,world2) = a2 world1 in
(b,world2)

let lift : 'a IO -> ('a Pure -> 'b IO) -> 'b IO =
fun a1 a2 world0 ->
let (a,world1) = a1 world0 in
let (b,world2) = a2 a world1 in
(b,world2)

let ( >> ) : 'a IO -> 'b IO -> 'b IO = bind
let ( >>= ) : 'a IO -> ('a Pure -> 'b IO) -> 'b IO = lift

let unit : unit Pure = Pure ()

let effect : ('a -> 'b) -> 'a Pure -> 'b IO =
fun f (Pure a) ->
fun world -> Pure (f a), world

let eval : unit IO -> unit Pure * World =
fun main -> main World

[<AutoOpen>]
module Don =

[<Sealed>]
type DonBuilder () =
member __.Yield (()) : unit IO = fun world -> unit,world
[<CustomOperation("bind")>]
member __.Bind' (a1, a2) = bind a1 a2
[<CustomOperation("lift")>]
member __.LiftM (a1, a2) = lift a1 a2

let don = new DonBuilder ()

Real World Code output:

module RealWorld = begin
type World = private | World
and 'a Pure = private | Pure of 'a
and 'a IO = World -> 'a Pure * World
val bind : a1:'a IO -> a2:'b IO -> world0:World -> 'b Pure * World
val lift :
a1:'a IO -> a2:('a Pure -> 'b IO) -> world0:World -> 'b Pure * World
val ( >> ) : ('a IO -> 'b IO -> 'b IO)
val ( >>= ) : ('a IO -> ('a Pure -> 'b IO) -> 'b IO)
val unit : unit Pure = Pure ()
val effect : f:('a -> 'b) -> 'a Pure -> 'b IO
val eval : main:unit IO -> unit Pure * World
module Don = begin
type DonBuilder =
class
new : unit -> DonBuilder
member Bind' : a1:'c IO * a2:'d IO -> 'd IO
member LiftM : a1:'a IO * a2:('a Pure -> 'b IO) -> 'b IO
member Yield : unit -> unit IO
end
val don : DonBuilder
end
end

Utils Code Snippet

module Util =

open RealWorld

module Random =
let private r = new System.Random ()
let next ( ) = r.Next ()

module Read =
let readInt () = System.Console.Read ()
let readKey () = System.Console.ReadKey ()
let readLine () = System.Console.ReadLine ()

let (!) : 'a -> 'a IO =
fun a -> effect (fun _ -> a) unit

let getRand : int IO =
effect Random.next unit

let putInt : int Pure -> unit IO =
effect <| printf "%i"

let readLn : string IO =
effect Read.readLine unit

let putStr : string Pure -> unit IO =
effect <| printf "%s"

let putStrLn : string Pure -> unit IO =
effect <| printfn "%s"

let sample1 : unit IO =
! "What is your name?"
>>= putStrLn
>> readLn
>>= fun a ->
! "How old are you?"
>>= putStrLn
>> readLn
>>= fun b ->
putStr a
>> ! ","
>>= putStr
>> putStrLn b

let sample2 : unit IO = don {
bind ! "What is your name?"
lift putStrLn
bind readLn

lift (fun a -> don {
bind ! "How old are you?"
lift putStrLn
bind readLn

lift (fun b -> don {
bind (putStr a)
bind ! ","
lift putStr
bind (putStrLn b)

})
})
}

Utils Code output:

module Util = begin
module Random = begin
val private r : System.Random
val next : unit -> int
end
module Read = begin
val readInt : unit -> int
val readKey : unit -> System.ConsoleKeyInfo
val readLine : unit -> string
end
val ( ! ) : a:'a -> 'a RealWorld.IO
val getRand : int RealWorld.IO
val putInt : (int RealWorld.Pure -> unit RealWorld.IO)
val readLn : string RealWorld.IO
val putStr : (string RealWorld.Pure -> unit RealWorld.IO)
val putStrLn : (string RealWorld.Pure -> unit RealWorld.IO)
val sample1 : unit RealWorld.IO
val sample2 : unit RealWorld.IO
end

Execution Code Snippet

let _ = Util.sample1 |> RealWorld.eval, Util.sample2 |> RealWorld.eval
mon@razerRamon:~/tmp/realWorld$ ./RealWorld.fsx

Execution Code output:

What is your name?
John Doe
How old are you?
42
John Doe,42
What is your name?
John Doe
How old are you?
42
John Doe,42
