### Background

A couple of years ago when I studied a semester abroad at Pisa University I took the following course Algorithm Engineering taught by Paolo Ferragina. On the tenth lecture we learned about Bloom filters, a space-efficient probabilistic data structure, that is used to test whether an element is a member of a set.

The data-structure is space-efficient as it saves the necessary information in
`bits`

where `HashTables`

will store the information in at
least a `bool`

, but will only use one of the `bits`

while
the rest are unused. See my previous code snippet F# Storage boolean
vs. bits/bytes where I point out the memory waste.

With regard to test if an element is in the set, the bloom filter can only
warranty that an element *definitely is not in set* while if an element is in
the filter, it still has to make a lookup on the underlying data-structure. There
is a *probability* that an element is in the filter but it isn’t in the
underlying data-structure (*False positives*). The *False positives* are due to
the *set of hashes* can set the same bit for different words. This is why it’s
very important to choose a *set of hashes* that uniformly distribute the values
across the filter:

For more information on how to choose the most space-efficient size of bits,
**m**, in relation to the probability of false positives, **p**, while choosing
the optimal number of hash functions, **k**, please read the Bloom
filters article.

### Implementation

Due to we soon at work are going to make native apps for several devices where
we need to synchronize data in order to use the apps when they are offline and
because we are going to use **F#**/**C#**, I thought to myself:*“Why not
implement a usable Bloom filter in F#?”*, just for *fun* of course :-)

At Pisa we only studied the theory behind the data structure, but we never
really implemented it so I didn’t had to much to start from. A search on
*Google* brought me to *Wikipedia* and Bill Mills Bloom Filters by
Example. I saw Bill implemented both the FNV-1 and
MurmurHash2 hashes. In order to make this task challenging, I decided
to firstly implement the FNV-1a as the author recommends to use this
instead of the FNV-1 due to the final octet is not as well
dispersed. And afterwards, I soon found out that I would also need to implement
the MurmurHash3 algorithm as it has support for a *seed* which can be
used to generate different hashes for the same key by re-using the same
algorithm. A few more searches on *Google*, Google’s CityHash and
Which hashing algorithm is best for uniqueness and speed?, and I
was convinced that MurmurHash3 was the right choice (CityHash uses
MurMurs *Magic numbers*).

After implementing the *hash functions* I knew that I also was going to need
some functions that could *get/set* a `bit`

in a
`byte`

. Very straightforward and *fun* task. I tried to make a
`byte`

pretty-printer, see my code snippet F# Storage boolean
vs. bits/bytes but I soon saw it would not be usable when the sizes
of the arrays would become big enough. So I implemented a `bits2png`

function, I got a bit inspired by Which hashing algorithm is best for
uniqueness and speed? on the way he represents the distribution of
the hashes in pictures.

The task of implementing the *Bloom filter* was straight forward once the other
bit/byte and hash functions were implemented. I added a `ceilPow`

function in order to `ceil`

up to nearest power of 2. This way
`&&& (n-1)`

can be used instead of `% n`

as modulo is a
division operation and should be more expensive than a *bitwise* operation. I
out-commented it as this wouldn’t give the optimal size of **m**, but if size
isn’t important and performance is, please out-comment the lines

Finally the last task, was to implement some IO functions that could *count the
lines of a file* and *find a specific word in a file*.

#### Byte/bit functions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

let gbit = function
| (b,n) when n < 8 && n >= 0 ->
System.Convert.ToString(0uy+b, 2).PadLeft(8,'0').[n] |> string |> int
| _ -> failwith "There are only 8-bits in a byte"
let sbit = function
| (b,n) when n < 8 && n >= 0 -> b ||| byte (1 <<< (7-n))
| _ -> failwith "There are only 8-bits in a byte"
let pbyte' byte a b =
System.Convert.ToString(0uy+byte, 2).PadLeft(8,'0')
|> Seq.map(fun x -> x |> function | '0' -> a | _ -> b)
let pbyte byte = (byte,"□","■") |||> pbyte' |> Seq.reduce(+)
let html2color html = System.Drawing.ColorTranslator.FromHtml(html)
let bits2png (bytes:byte[]) =
let black = System.Drawing.Color.Black
let green = "#66ED11" |> html2color
let ts = System.DateTime.Now.ToString("o").Replace(':','.')
let n = bytes.Length * 8 |> float |> fun x -> x / 2048. |> ceil |> int
use bmp = new System.Drawing.Bitmap(2048,n)
let w,h = bmp.Width-1,bmp.Height-1
let rec background color = function
| (0,0) -> ()
| (x,0) -> bmp.SetPixel(x,0,color); background color (x-1,h)
| (x,y) -> bmp.SetPixel(x,y,color); background color (x,y-1)
(w,h) |> background black
let chunkpixel i =
let chunk = i >>> 11
let pixel = i - (chunk <<< 11)
chunk,pixel
bytes
|> Array.map(fun x -> (x,black,green) |||> pbyte')
|> Array.iteri(
fun i xs -> i * 8 |> chunkpixel |> fun (y,z) ->
xs |> Seq.iteri(fun j x -> bmp.SetPixel(z+j,y,x)))
bmp.Save(ts + @"_bits2png.png", System.Drawing.Imaging.ImageFormat.Png)

#### Hash functions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75

let stb s = System.Text.Encoding.UTF8.GetBytes(s = s)
// [ mon@mbai7 fnv ] ./fnv1a32 -s foo => 0xa9f37ed7
let fnv1aHash key =
let fnvp = (1 <<< 24) + (1 <<< 8) + 0x93 |> uint32
let fnvob = 2166136261u
let b = key |> stb
let rec fnv1Hash' h = function
| i when i < (b.Length) ->
let h' =
h ^^^ (b.[i] |> uint32)
|> fun x -> x * fnvp
fnv1Hash' h' (i+1)
| _ -> h
fnv1Hash' fnvob 0
// [ mon@mbai7 murmur3-master ] ./example foo => x86_32: b12f489e (seed 42)
let murmur3Hash seed key =
let rotl x r = (x <<< r) ||| (x >>> (32 - r))
let fmix h =
h
|> fun x -> x ^^^ (x >>> 16)
|> fun x -> x * 0x85ebca6bu
|> fun x -> x ^^^ (x >>> 13)
|> fun x -> x * 0xc2b2ae35u
|> fun x -> x ^^^ (x >>> 16)
let getblock b i = System.BitConverter.ToUInt32(value = b, startIndex = i)
let data = key |> stb
let len = data |> Array.length
let nblocks = len >>> 2 // equivalent to len/4 but faster
let h1 = seed
let c1 = 0xcc9e2d51u
let c2 = 0x1b873593u
// body
let rec body h = function
| i when i < nblocks ->
let k1 =
getblock data (i * 4)
|> fun x -> x * c1
|> fun x -> rotl x 15
|> fun x -> x * c2
let h' =
h ^^^ k1
|> fun x -> rotl x 13
|> fun x -> x * 5u + 0xe6546b64u
body h' (i+1)
| _ -> h
let h1' = body h1 0
// tail
let tail = nblocks * 4
let rec tail' (k,h) = function
| 0 -> h
| 1 ->
let k' =
k ^^^ (uint32 data.[tail])
|> fun x -> x * c1
|> fun x -> rotl x 15
|> fun x -> x * c2
let h' = h ^^^ k'
tail' (k',h') (0)
| i ->
let k' =
(uint32 data.[tail + (i - 1)]) <<< (1 <<< (i + 1))
|> fun x -> k ^^^ x
tail' (k',h) (i-1)
let h1'' = tail' (0u,h1') (len &&& 3)
// finalization
h1'' ^^^ (uint32 len)
|> fun x -> x |> fmix

#### Bloom filter:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55

let rand = System.Random()
// http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2Float
let ceilPow (x:uint32) =
x
|> fun x -> x-1u
|> fun x -> x ||| (x >>> 1)
|> fun x -> x ||| (x >>> 2)
|> fun x -> x ||| (x >>> 4)
|> fun x -> x ||| (x >>> 8)
|> fun x -> x ||| (x >>> 16)
|> fun x -> x+1u
// ISO 31-11
let ln x = x |> log
let lb x = System.Math.Log(x,2.)
let lg x = System.Math.Log10(x)
type BloomFilter(n:uint32, p) =
let m = // ceil up to nearest 2^n in order to use &&& (n-1) instead of % n
let n' = n |> float
let lnp = ln p
let ln2 = ln 2.
-(n' * lnp) / (ln2 ** 2.)
|> fun x -> x |> uint32
//|> fun x -> x |> ceilPow
//|> fun x -> x |> int
|> fun x -> x >>> 3 <<< 3 // round to nearest byte/bit
let bits = Array.init (m >>> 3 |> int) (fun i -> 0uy)
let seeds =
let m' = m |> float
let n' = n |> float
let ln2 = ln 2.
let k = (m'/n') * ln2
let k' = k |> ceil |> int
Array.init (k') (fun i -> rand.Next() |> uint32)
let hashes = seeds |> Array.map(fun x -> x |> murmur3Hash)
let bytebit hash =
let byte = hash >>> 3
let bit = hash - (byte <<< 3)
byte,bit
member x.add key =
hashes
|> Array.map(fun f -> f(key) % m |> int |> bytebit)
|> Array.iter(fun (x,y) -> bits.[x] <- (bits.[x],y) |> sbit)
member x.query key =
hashes
|> Array.map(fun f -> f(key) % m |> int |> bytebit)
|> Array.fold(fun a (x,y) -> (((bits.[x],y) |> gbit) = 1) && a) true
member x.print = bits |> bits2png
member x.info =
printfn "p: %e" p
printfn "m: %u" ((bits |> Array.length) |> uint32 |> fun x -> x * 8u)
printfn "k: %i" (hashes |> Array.length)
printfn "seeds: %A" (seeds)

#### File functions (IO):

1
2
3
4
5
6
7
8
9
10
11
12
13
14

let lines file = System.IO.File.ReadLines(file)
let count file = file |> lines |> Seq.length
let find word file =
use reader = System.IO.File.OpenText(file)
let rec find' = function
| true -> false
| false ->
match reader.ReadLine().Equals(word) with
| true -> true
| false -> find' reader.EndOfStream
find' reader.EndOfStream

### An English dictionary with +235.000 words

Once the code is implemented, let’s use the built-in English dictionary on my
**Mac** with the **File (IO) functions** to test the **Bloom Filter**:

For a file with **+235.000** lines, both the count and the find functions are
very fast:

Now that we have the number of elements, **n**, let’s choose a probability of
**0.1%**. For more information on how to choose the probability, check these
tables Bloom Filters - the math:

The optimal number of hashes, **k**, is set to **10**. The seeds values used for
the hash functions are also printed out in order to reproduce results. And
finally the initial set of bits is printed out as an **.png** file. Initially it
will be empty.

If we query the data-structure for a word we know exists in the dictionary, it will return false as it’s initially empty:

But if we populate the **Bloom filter** with all the words contained in the
dictionary, and we then query for the same word, we can see that know the query
will return **true**. For a word that we for sure know that is not in the
dictionary and hereby not in the filter, we can see that the result is **false**:

Now when we print again the bits to a **.png** file, it’s possible to view how
the different values are evenly distributed across the data-structure:

#### Correct way to handle values (including *False Positives*):

As stated at the beginning, just because a **key** returns **true**, it doesn’t
mean that the underlying data-structure will contain the item (**False
positives**). The correct way to handle the return values is to write the
following code:

Where the first function will need to check the value against the file and the
second function doesn’t need to, as the **key** is *definitely is not in set*.

### When size matters (+152.000.000 leaked e-mails in a 3.2 GB file)

You might argue that the lookups to the dictionary file aren’t that expensive
right? Why all the headache implementing this extra data-structure/filter on top
of the original data-structure. If you still aren’t convinced of why **Bloom
filters** are awesome, lets go through the following example. Inspired by a
**Company Friday Talk @ Delegate A/S** where one of my co-workers
showed how somebody had stored the password on **Windows Azure** and provided a
website to check whether an e-mails was compromised or
not.

Like anything on the Internet, once it’s out there, it’s very easy to get access
to it Reddit - Hacking. I downloaded the file and made a few
**grep/sed** operations in order to only have e-mails in the file. I then
appended an e-mail to the file in order for the linear search to take the most
time and also avoid using a real e-mail in this blog post. The result are the
following when applying the same **File (IO) functions** from before:

We can now see that all the call are about one minute. We can all agree that this is a very expensive call right?

Now lets instantiate the **Bloom filter** with the given **n** and a probability
of **1%** (any value here less than one percent, crashes my application, but the
one percent is OK for this example):

As before we populate the filter with the file content.

**Remark:** Good luck populating a **HashTable** with **3.2 GB** worth of data

Now we can make queries as before and we can see that both are constant calls,
**O(1)**. We now have a service that can tell a user almost instantly if his
e-mail isn’t leaked. Pretty cool right?

**Remark:** Once again, as stated at: *“Correct way to handle values (including
False Positives)”*, only the last value can be used. The first one still has
to make a lookup to the underlying data-structure. A few ideas on how to
optimize the file calls could be to split up the three major e-mail providers:
**Gmail, Yahoo and Hotmail** into separate files and the rest in a fourth
file. This approach with an `async`

process that would return the
second call would make the user experience more smooth for the end-user.