Fork me on GitHub

Background

A couple of years ago when I studied a semester abroad at Pisa University I took the following course Algorithm Engineering taught by Paolo Ferragina. On the tenth lecture we learned about Bloom filters, a space-efficient probabilistic data structure, that is used to test whether an element is a member of a set.

The data-structure is space-efficient as it saves the necessary information in bits where HashTables will store the information in at least a bool, but will only use one of the bits while the rest are unused. See my previous code snippet F# Storage boolean vs. bits/bytes where I point out the memory waste.

With regard to test if an element is in the set, the bloom filter can only warranty that an element definitely is not in set while if an element is in the filter, it still has to make a lookup on the underlying data-structure. There is a probability that an element is in the filter but it isn’t in the underlying data-structure (False positives). The False positives are due to the set of hashes can set the same bit for different words. This is why it’s very important to choose a set of hashes that uniformly distribute the values across the filter:

All

For more information on how to choose the most space-efficient size of bits, m, in relation to the probability of false positives, p, while choosing the optimal number of hash functions, k, please read the Bloom filters article.

Implementation

Due to we soon at work are going to make native apps for several devices where we need to synchronize data in order to use the apps when they are offline and because we are going to use F#/C#, I thought to myself:“Why not implement a usable Bloom filter in F#?”, just for fun of course :-)

At Pisa we only studied the theory behind the data structure, but we never really implemented it so I didn’t had to much to start from. A search on Google brought me to Wikipedia and Bill Mills Bloom Filters by Example. I saw Bill implemented both the FNV-1 and MurmurHash2 hashes. In order to make this task challenging, I decided to firstly implement the FNV-1a as the author recommends to use this instead of the FNV-1 due to the final octet is not as well dispersed. And afterwards, I soon found out that I would also need to implement the MurmurHash3 algorithm as it has support for a seed which can be used to generate different hashes for the same key by re-using the same algorithm. A few more searches on Google, Google’s CityHash and Which hashing algorithm is best for uniqueness and speed?, and I was convinced that MurmurHash3 was the right choice (CityHash uses MurMurs Magic numbers).

After implementing the hash functions I knew that I also was going to need some functions that could get/set a bit in a byte. Very straightforward and fun task. I tried to make a byte pretty-printer, see my code snippet F# Storage boolean vs. bits/bytes but I soon saw it would not be usable when the sizes of the arrays would become big enough. So I implemented a bits2png function, I got a bit inspired by Which hashing algorithm is best for uniqueness and speed? on the way he represents the distribution of the hashes in pictures.

The task of implementing the Bloom filter was straight forward once the other bit/byte and hash functions were implemented. I added a ceilPow function in order to ceil up to nearest power of 2. This way &&& (n-1) can be used instead of % n as modulo is a division operation and should be more expensive than a bitwise operation. I out-commented it as this wouldn’t give the optimal size of m, but if size isn’t important and performance is, please out-comment the lines

Finally the last task, was to implement some IO functions that could count the lines of a file and find a specific word in a file.

Byte/bit functions:

 1 let gbit = function
 2   | (b,n) when n < 8 && n >= 0 ->
 3     System.Convert.ToString(0uy+b, 2).PadLeft(8,'0').[n] |> string |> int
 4   | _ -> failwith "There are only 8-bits in a byte"
 5 
 6 let sbit = function
 7   | (b,n) when n < 8 && n >= 0 -> b ||| byte (1 <<< (7-n))
 8   | _ -> failwith "There are only 8-bits in a byte"
 9 
10 let pbyte' byte a b =
11   System.Convert.ToString(0uy+byte, 2).PadLeft(8,'0')
12   |> Seq.map(fun x -> x |> function | '0' -> a | _ -> b)
13 let pbyte byte = (byte,"□","■") |||> pbyte' |> Seq.reduce(+)
14 
15 let html2color html = System.Drawing.ColorTranslator.FromHtml(html)
16 
17 let bits2png (bytes:byte[]) =
18   let black  = System.Drawing.Color.Black
19   let green = "#66ED11" |> html2color
20   let ts = System.DateTime.Now.ToString("o").Replace(':','.')
21 
22   let n = bytes.Length * 8 |> float |> fun x -> x / 2048. |> ceil |> int
23   use bmp = new System.Drawing.Bitmap(2048,n)
24   let w,h = bmp.Width-1,bmp.Height-1
25   
26   let rec background color = function
27     | (0,0) -> ()
28     | (x,0) -> bmp.SetPixel(x,0,color); background color (x-1,h)
29     | (x,y) -> bmp.SetPixel(x,y,color); background color (x,y-1)
30   (w,h) |> background black
31   
32   let chunkpixel i =
33     let chunk = i >>> 11
34     let pixel = i - (chunk <<< 11)
35     chunk,pixel
36     
37   bytes
38   |> Array.map(fun x -> (x,black,green) |||> pbyte')
39   |> Array.iteri(
40     fun i xs -> i * 8 |> chunkpixel |> fun (y,z) ->
41     xs |> Seq.iteri(fun j x -> bmp.SetPixel(z+j,y,x)))
42   
43   bmp.Save(ts + @"_bits2png.png", System.Drawing.Imaging.ImageFormat.Png)

Hash functions:

 1 let stb s = System.Text.Encoding.UTF8.GetBytes(s = s)
 2 
 3 // [ mon@mbai7 fnv ] ./fnv1a32 -s foo => 0xa9f37ed7
 4 let fnv1aHash key =
 5   let fnvp  = (1 <<< 24) + (1 <<< 8) + 0x93 |> uint32
 6   let fnvob = 2166136261u
 7   
 8   let b = key |> stb
 9   
10   let rec fnv1Hash' h = function
11     | i when i < (b.Length) ->
12       let h'  =
13         h ^^^ (b.[i] |> uint32)
14         |> fun x -> x * fnvp
15       fnv1Hash' h' (i+1)
16     | _ -> h
17   fnv1Hash' fnvob 0
18 
19 // [ mon@mbai7 murmur3-master ] ./example foo => x86_32: b12f489e (seed 42)
20 let murmur3Hash seed key =
21   let rotl x r = (x <<< r) ||| (x >>> (32 - r))
22   let fmix h =
23     h
24     |> fun x -> x ^^^ (x >>> 16)
25     |> fun x -> x * 0x85ebca6bu
26     |> fun x -> x ^^^ (x >>> 13)
27     |> fun x -> x * 0xc2b2ae35u
28     |> fun x -> x ^^^ (x >>> 16)
29   let getblock b i = System.BitConverter.ToUInt32(value = b, startIndex = i)
30   
31   let data    = key  |> stb
32   let len     = data |> Array.length
33   let nblocks = len >>> 2 // equivalent to len/4 but faster
34   let h1      = seed
35   let c1      = 0xcc9e2d51u
36   let c2      = 0x1b873593u
37   
38   // body
39   let rec body h = function
40     | i when i < nblocks ->
41       let k1 =
42         getblock data (i * 4)
43         |> fun x -> x * c1
44         |> fun x -> rotl x 15
45         |> fun x -> x * c2
46       let h' =
47         h ^^^ k1
48         |> fun x -> rotl x 13
49         |> fun x -> x * 5u + 0xe6546b64u
50       body h' (i+1)
51     | _ -> h
52   let h1' = body h1 0
53   
54   // tail
55   let tail = nblocks * 4
56   let rec tail' (k,h) = function
57     | 0 -> h
58     | 1 -> 
59       let k' =
60         k ^^^ (uint32 data.[tail])
61         |> fun x -> x * c1
62         |> fun x -> rotl x 15
63         |> fun x -> x * c2
64       let h' = h ^^^ k'
65       tail' (k',h') (0)
66     | i ->
67       let k' =
68         (uint32 data.[tail + (i - 1)]) <<< (1 <<< (i + 1))
69         |> fun x -> k ^^^ x
70       tail' (k',h) (i-1)
71   let h1'' = tail' (0u,h1') (len &&& 3)
72   
73   // finalization
74   h1'' ^^^ (uint32 len)
75   |> fun x -> x |> fmix

Bloom filter:

 1 let rand  = System.Random()
 2 
 3 // http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2Float
 4 let ceilPow (x:uint32) =
 5   x
 6   |> fun x -> x-1u
 7   |> fun x -> x ||| (x >>> 1)
 8   |> fun x -> x ||| (x >>> 2)
 9   |> fun x -> x ||| (x >>> 4)
10   |> fun x -> x ||| (x >>> 8)
11   |> fun x -> x ||| (x >>> 16)
12   |> fun x -> x+1u
13 
14 // ISO 31-11
15 let ln x = x |> log
16 let lb x = System.Math.Log(x,2.)
17 let lg x = System.Math.Log10(x)
18 
19 type BloomFilter(n:uint32, p) =
20   let m = // ceil up to nearest 2^n in order to use &&& (n-1) instead of % n
21     let n'  = n |> float
22     let lnp = ln p
23     let ln2 = ln 2.
24     -(n' * lnp) / (ln2 ** 2.)
25     |> fun x -> x |> uint32
26     //|> fun x -> x |> ceilPow
27     //|> fun x -> x |> int 
28     |> fun x -> x >>> 3 <<< 3 // round to nearest byte/bit
29   let bits = Array.init (m >>> 3 |> int) (fun i -> 0uy)
30   let seeds =
31     let m'  = m |> float
32     let n'  = n |> float
33     let ln2 = ln 2.
34     let k   = (m'/n') * ln2
35     let k'  = k |> ceil |> int
36     Array.init (k') (fun i -> rand.Next() |> uint32)
37   let hashes = seeds |> Array.map(fun x -> x |> murmur3Hash)
38   let bytebit hash =
39     let byte = hash >>> 3
40     let bit = hash - (byte <<< 3)
41     byte,bit
42   member x.add key =
43     hashes
44     |> Array.map(fun f -> f(key) % m |> int |> bytebit)
45     |> Array.iter(fun (x,y) -> bits.[x] <- (bits.[x],y) |> sbit)
46   member x.query key =
47     hashes
48     |> Array.map(fun f -> f(key) % m |> int |> bytebit)
49     |> Array.fold(fun a (x,y) -> (((bits.[x],y) |> gbit) = 1) && a) true
50   member x.print = bits |> bits2png
51   member x.info =
52     printfn "p: %e" p
53     printfn "m: %u" ((bits |> Array.length) |> uint32 |> fun x -> x * 8u)
54     printfn "k: %i" (hashes |> Array.length)
55     printfn "seeds: %A" (seeds)

File functions (IO):

 1 let lines file = System.IO.File.ReadLines(file)
 2 
 3 let count file = file |> lines |> Seq.length
 4 
 5 let find word file =
 6   use reader = System.IO.File.OpenText(file)
 7   
 8   let rec find' = function
 9     | true -> false
10     | false ->
11       match reader.ReadLine().Equals(word) with
12         | true -> true
13         | false -> find' reader.EndOfStream
14   find' reader.EndOfStream

An English dictionary with +235.000 words

Once the code is implemented, let’s use the built-in English dictionary on my Mac with the File (IO) functions to test the Bloom Filter:

 @"/usr/share/dict/words" |> count;;
(@"Zyzzogeton",@"/usr/share/dict/words") ||> find;;
(@"gentauro",@"/usr/share/dict/words") ||> find;;

For a file with +235.000 lines, both the count and the find functions are very fast:

> #time;; 
> Real: 00:00:00.066, CPU: 00:00:00.068, GC gen0: 2, gen1: 0
val it : int = 235886
> Real: 00:00:00.054, CPU: 00:00:00.054, GC gen0: 3, gen1: 0
val it : bool = true
> Real: 00:00:00.050, CPU: 00:00:00.050, GC gen0: 2, gen1: 0
val it : bool = false

Now that we have the number of elements, n, let’s choose a probability of 0.1%. For more information on how to choose the probability, check these tables Bloom Filters - the math:

let bloom = new BloomFilter(235886u, 1.e-3);; // Ex: 1.e-3 for a 0.1 %
bloom.info;;
bloom.printf;;

The optimal number of hashes, k, is set to 10. The seeds values used for the hash functions are also printed out in order to reproduce results. And finally the initial set of bits is printed out as an .png file. Initially it will be empty.

val bloom : BloomFilter
p: 1.000000e-003
m: 3391464
k: 10
seeds: [|209312772u; 1554706806u; 1919816475u; 2123757442u; 1979313534u;
  1239494964u; 585019070u; 384168856u; 1683270745u; 1807577208u|]
val it : unit = ()

All

@"Zyzzogeton" |> bloom.query;;

If we query the data-structure for a word we know exists in the dictionary, it will return false as it’s initially empty:

val it : bool = false

But if we populate the Bloom filter with all the words contained in the dictionary, and we then query for the same word, we can see that know the query will return true. For a word that we for sure know that is not in the dictionary and hereby not in the filter, we can see that the result is false:

@"/usr/share/dict/words" |> lines |> Seq.iter(fun x -> x |> bloom.add);;
@"Zyzzogeton" |> bloom.query;;
@"gentauro" |> bloom.query;;
bloom.print;;
val it : unit = ()
val it : bool = true
val it : bool = false
val it : unit = ()

Now when we print again the bits to a .png file, it’s possible to view how the different values are evenly distributed across the data-structure:

All

Correct way to handle values (including False Positives):

As stated at the beginning, just because a key returns true, it doesn’t mean that the underlying data-structure will contain the item (False positives). The correct way to handle the return values is to write the following code:

@"Zyzzogeton"
|> fun x ->
  x |> bloom.query |> function
    | false -> false
    | true -> (x,@"/usr/share/dict/words") ||> find
@"gentauro"
|> fun x ->
  x |> bloom.query |> function
    | false -> false
    | true -> (x,@"/usr/share/dict/words") ||> find
val it : bool = true
val it : bool = false

Where the first function will need to check the value against the file and the second function doesn’t need to, as the key is definitely is not in set.

When size matters (+152.000.000 leaked e-mails in a 3.2 GB file)

You might argue that the lookups to the dictionary file aren’t that expensive right? Why all the headache implementing this extra data-structure/filter on top of the original data-structure. If you still aren’t convinced of why Bloom filters are awesome, lets go through the following example. Inspired by a Company Friday Talk @ Delegate A/S where one of my co-workers showed how somebody had stored the password on Windows Azure and provided a website to check whether an e-mails was compromised or not.

Like anything on the Internet, once it’s out there, it’s very easy to get access to it Reddit - Hacking. I downloaded the file and made a few grep/sed operations in order to only have e-mails in the file. I then appended an e-mail to the file in order for the linear search to take the most time and also avoid using a real e-mail in this blog post. The result are the following when applying the same File (IO) functions from before:

 @"leaks/emails.txt" |> count;;
(@"bloom.filter@mail.co.dk",@"leaks/emails.txt") ||> find;;
(@"u.mad@mail.co.dk",@"leaks/emails.txt") ||> find;;

We can now see that all the call are about one minute. We can all agree that this is a very expensive call right?

> #time;;
> Real: 00:00:59.585, CPU: 00:00:58.653, GC gen0: 2215, gen1: 0
val it : int = 152448315
> Real: 00:01:03.203, CPU: 00:01:01.542, GC gen0: 2216, gen1: 0
val it : bool = true
> Real: 00:00:55.632, CPU: 00:00:55.446, GC gen0: 2216, gen1: 0
val it : bool = false

Now lets instantiate the Bloom filter with the given n and a probability of 1% (any value here less than one percent, crashes my application, but the one percent is OK for this example):

let bloom = new BloomFilter(153u*1000u*1000u, 1.e-2);;
bloom.info;;
@"leaks/emails.txt" |> lines |> Seq.iter(fun x -> x |> bloom.add);;

As before we populate the filter with the file content.

val bloom : BloomFilter
p: 1.000000e-002
m: 1466513928
k: 7
seeds: [|2143730668u; 2132815355u; 1369875737u; 1908004139u; 496837579u;
  406257522u; 1163911365u|]
val it : unit = ()
val it : unit = ()

Remark: Good luck populating a HashTable with 3.2 GB worth of data

@"bloom.filter@mail.co.dk" |> bloom.query;;
@"u.mad@mail.co.dk" |> bloom.query;;

Now we can make queries as before and we can see that both are constant calls, O(1). We now have a service that can tell a user almost instantly if his e-mail isn’t leaked. Pretty cool right?

> #time;;
> Real: 00:00:00.002, CPU: 00:00:00.006, GC gen0: 0, gen1: 0
val it : bool = true
> Real: 00:00:00.000, CPU: 00:00:00.000, GC gen0: 0, gen1: 0
val it : bool = false

Remark: Once again, as stated at: “Correct way to handle values (including False Positives)”, only the last value can be used. The first one still has to make a lookup to the underlying data-structure. A few ideas on how to optimize the file calls could be to split up the three major e-mail providers: Gmail, Yahoo and Hotmail into separate files and the rest in a fourth file. This approach with an async process that would return the second call would make the user experience more smooth for the end-user.

Code Snippet:

 1 let pbyte b =
 2   System.Convert.ToString(0uy+b, 2).PadLeft(8,'0')
 3   |> Seq.map(fun x -> x |> function | '0' -> "□" | _ -> "■")
 4   |> Seq.reduce(+)
 5 
 6 let rand  = System.Random()
 7 
 8 sizeof<bool>;;
 9 sizeof<byte>;;
10 
11 // Array of random booleans
12 Array.init (1<<<6) (
13   fun _ -> System.Convert.ToBoolean(rand.Next(0,2))
14            |> fun x -> System.Convert.ToByte(x) |> pbyte);;
15 
16 // Array of random bytes
17 Array.init (1<<<6) (fun _ -> byte(rand.Next(0,256)))
18 |> Array.map(fun x -> x |> pbyte);;

Code output:

> val it : int = 1
> val it : int = 1
> val it : string [] =
  [|"□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□□";
    "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□";
    "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□■";
    "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■";
    "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□□";
    "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■";
    "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□";
    "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□■";
    "□□□□□□□■"; "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□□"; "□□□□□□□■";
    "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■";
    "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"; "□□□□□□□■"|]
> val it : string [] =
  [|"■■□■□■■□"; "■□■□■□■■"; "□□□■■□■■"; "■■■■■■■■"; "■■□■□■□■"; "■■■■□□□□";
    "■■■■□■□■"; "□■■□■□■■"; "■■■□□□□■"; "□■■■□■□□"; "□■■■□■■■"; "□■□□■□□■";
    "■□□■□■□■"; "□□■□□■■□"; "□□□□□□□■"; "□□■■■□□□"; "■■□■□□■■"; "□□□■■■■■";
    "□■□□□■□□"; "□■□■■□■□"; "■■□□■□■■"; "□■■■□□■□"; "■□□□■■■□"; "■■■■□■□□";
    "■□■□■■■■"; "■□□■■□■■"; "■□■□■□■□"; "■□□■■■□□"; "■□■□□□■■"; "■■■□■■■□";
    "□□□■□□■□"; "□□■□■□□□"; "□■■■□□□□"; "■□□□□□□■"; "■□■■■■□■"; "■■■■□■□■";
    "□□□■■□□□"; "□■□■□■□□"; "■□■□■□□□"; "■□□□■■■□"; "■□■□□□■□"; "□□□■■■■□";
    "□□■□■■■□"; "■■□□□■■□"; "□□□□□□□□"; "□□□□■■■□"; "□■■□□□■■"; "■■■□■□□■";
    "□□■■□■■□"; "■■■■□■□■"; "□□□□□■■■"; "□□□□□□□□"; "□■□■□□■■"; "■■□□■■□■";
    "■■□■□■□■"; "■□■□■■■□"; "□□■■■□■■"; "■□□■■□■□"; "□■□□□□■□"; "■■■□□□□□";
    "■■□■□■■■"; "■□■□□□□■"; "■■□□□□■■"; "□■□■□□■□"|]

Last year the Human Rights Campaign created the Marriage Equality logo:

All

I thought to myself, why not make a logo for the ƒuntional people who want to create ƒunctional applications but aren’t allowed to do it because of the old fashioned and conservative software industry. So I made this logo:

All

Well I made the logo but forgot to write a blog post about it. But now and coinciding with the 1 year anniversary of the Marriage Equality logo, 2013-03-25, why not bring up the issue with this logo so people who want to make ƒuntional software can do it without being worried or persecuted by the masses.

If you have a little ƒuntional programmer in you, now it’s the time to come out of the closet :-)

Bret Victor The Future of Programming