1 ! Copyright (C) 2009 Alec Berryman.
2 ! See http://factorcode.org/license.txt for BSD license.
3 USING: accessors arrays bit-arrays fry kernel kernel.private
4 layouts locals math math.functions math.order math.private
5 multiline sequences sequences.private typed ;
6 FROM: math.ranges => [1,b] ;
14 - The false positive rate is 10x what it should be, based on
15 informal testing. Better object hashes or a better method of
16 generating extra hash codes would help. Another way is to
17 increase the number of bits used.
19 - Try something smarter than the bitwise complement for a
22 - http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
23 makes a case for http://murmurhash.googlepages.com/ instead
24 of enhanced double-hashing.
26 - Be sure to adjust the test that asserts the number of false
27 positives isn't unreasonable.
29 - Could round bits up to next power of two and use wrap instead
30 of mod. This would cost a lot of bits on 32-bit platforms,
31 though, and limit the bit-array to 8MB.
33 - Should allow user to specify the hash codes, either as inputs
34 to enhanced double hashing or for direct use.
36 - Support for serialization.
38 - Wrappers for combining filters.
40 - Should we signal an error when inserting past the number of
41 objects the filter is sized for? The filter will continue to
42 work, just not very well.
47 { #hashes fixnum read-only }
48 { bits bit-array read-only }
49 { capacity fixnum read-only }
52 ERROR: invalid-size size ;
53 ERROR: invalid-error-rate error-rate ;
54 ERROR: invalid-capacity capacity ;
58 :: bits-to-satisfy-error-rate ( hashes error objects -- size )
59 objects hashes * neg error hashes recip ^ 1 swap - log /
62 ! 100 hashes ought to be enough for anybody.
63 : #hashes-range ( -- range )
67 : identity-configuration ( -- 2seq )
68 0 max-array-capacity 2array ;
70 : smaller-second ( 2seq 2seq -- 2seq )
71 [ [ second ] bi@ <= ] most ;
73 ! If the number of hashes isn't positive, we haven't found
74 ! anything smaller than the identity configuration.
75 : check-hashes ( 2seq -- 2seq )
76 dup first 0 <= [ invalid-size ] when ;
78 ! The consensus on the tradeoff between increasing the number of
79 ! bits and increasing the number of hash functions seems to be
80 ! "go for the smallest number of bits", probably because most
81 ! implementations just generate one hash value and cheaply
82 ! mangle it into the number of hashes they need. I have not
83 ! seen any usage studies from the implementations that made this
84 ! tradeoff to support it, and I haven't done my own, but we'll
86 : size-bloom-filter ( error-rate number-objects -- number-hashes number-bits )
87 [ #hashes-range identity-configuration ] 2dip '[
88 dup _ _ bits-to-satisfy-error-rate
90 ] reduce check-hashes first2 ;
92 : check-capacity ( capacity -- capacity )
93 dup 0 <= [ invalid-capacity ] when ;
95 : check-error-rate ( error-rate -- error-rate )
96 dup [ 0 after? ] [ 1 before? ] bi and
97 [ invalid-error-rate ] unless ;
101 : <bloom-filter> ( error-rate capacity -- bloom-filter )
102 [ check-error-rate ] [ check-capacity ] bi*
103 [ size-bloom-filter <bit-array> ] keep
109 ! See "Bloom Filters in Probabilistic Verification" by Peter C.
110 ! Dillinger and Panagiotis Manolios, section 5.2, "Enhanced
112 ! http://www.ccs.neu.edu/home/pete/research/bloom-filters-verification.html
113 ! http://www.cc.gatech.edu/~manolios/research/bloom-filters-verification.html
114 : combine-hashcodes ( index hash0 hash1 -- hash )
115 { fixnum fixnum fixnum } declare
116 [ [ [ 3 ^ ] [ - ] bi 6 /i ] keep ]
117 [ fixnum*fast ] [ fixnum+fast ] tri* + abs ;
119 : double-hashcodes ( object -- hash0 hash1 )
120 hashcode >fixnum dup most-positive-fixnum bitxor >fixnum ;
122 : increment-count ( bloom-filter -- )
123 [ 1 + ] change-count drop ; inline
125 : #hashes-and-length ( bloom-filter -- #hashes length )
126 [ #hashes>> ] [ bits>> length ] bi ; inline
128 : relevant-indices ( object bloom-filter -- n quot: ( elt -- n ) )
129 [ double-hashcodes ] [ #hashes-and-length ] bi*
130 [ -rot ] dip '[ _ _ combine-hashcodes _ mod ] ; inline
134 TYPED: bloom-filter-insert ( object bloom-filter: bloom-filter -- )
137 [ bits>> [ [ t ] 2dip set-nth-unsafe ] curry ]
138 tri compose each-integer ;
140 TYPED: bloom-filter-member? ( object bloom-filter: bloom-filter -- ? )
142 [ bits>> [ nth-unsafe ] curry ]
143 bi compose all-integers? ;