1 ! Copyright (C) 2009 Alec Berryman.
2 ! See http://factorcode.org/license.txt for BSD license.
3 USING: accessors arrays bit-arrays fry infix kernel layouts locals math
4 math.functions multiline sequences ;
7 FROM: math.ranges => [1,b] [0,b) ;
8 FROM: math.intervals => (a,b) interval-contains? ;
14 - The false positive rate is 10x what it should be, based on informal testing.
15 Better object hashes or a better method of generating extra hash codes would
16 help. Another way is to increase the number of bits used.
18 - Try something smarter than the bitwise complement for a second hash code.
20 - http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
21 makes a case for http://murmurhash.googlepages.com/ instead of enhanced
24 - Be sure to adjust the test that asserts the number of false positives isn't
27 - Could round bits up to next power of two and use wrap instead of mod. This
28 would cost a lot of bits on 32-bit platforms, though, and limit the bit-array
31 - Should allow user to specify the hash codes, either as inputs to enhanced
32 double hashing or for direct use.
34 - Support for serialization.
36 - Wrappers for combining filters.
38 - Should we signal an error when inserting past the number of objects the filter
39 is sized for? The filter will continue to work, just not very well.
44 { n-hashes fixnum read-only }
45 { bits bit-array read-only }
46 { maximum-n-objects fixnum read-only }
47 { current-n-objects fixnum } ;
49 ERROR: capacity-error ;
50 ERROR: invalid-error-rate ;
51 ERROR: invalid-n-objects ;
55 ! infix doesn't like ^
59 :: bits-to-satisfy-error-rate ( hashes error objects -- size )
60 [infix -(objects * hashes) / log(1 - pow(error, (1/hashes))) infix]
63 ! 100 hashes ought to be enough for anybody.
64 : n-hashes-range ( -- range )
68 : identity-configuration ( -- 2seq )
69 0 max-array-capacity 2array ;
71 : smaller-second ( 2seq 2seq -- 2seq )
72 [ [ second ] bi@ <= ] most ;
74 ! If the number of hashes isn't positive, we haven't found anything smaller than the
75 ! identity configuration.
76 : validate-sizes ( 2seq -- )
77 first 0 <= [ capacity-error ] when ;
79 ! The consensus on the tradeoff between increasing the number of bits and
80 ! increasing the number of hash functions seems to be "go for the smallest
81 ! number of bits", probably because most implementations just generate one hash
82 ! value and cheaply mangle it into the number of hashes they need. I have not
83 ! seen any usage studies from the implementations that made this tradeoff to
84 ! support it, and I haven't done my own, but we'll go with it anyway.
86 : size-bloom-filter ( error-rate number-objects -- number-hashes number-bits )
87 [ n-hashes-range identity-configuration ] 2dip
88 '[ dup [ _ _ bits-to-satisfy-error-rate ]
89 call 2array smaller-second ]
94 : validate-n-objects ( n-objects -- )
95 0 <= [ invalid-n-objects ] when ;
97 : valid-error-rate-interval ( -- interval )
100 : validate-error-rate ( error-rate -- )
101 valid-error-rate-interval interval-contains?
102 [ invalid-error-rate ] unless ;
104 : validate-constraints ( error-rate n-objects -- )
105 validate-n-objects validate-error-rate ;
109 : <bloom-filter> ( error-rate number-objects -- bloom-filter )
110 [ validate-constraints ] 2keep
111 [ size-bloom-filter <bit-array> ] keep
117 ! See "Bloom Filters in Probabilistic Verification" by Peter C. Dillinger and
118 ! Panagiotis Manolios, section 5.2, "Enhanced Double Hashing":
119 ! http://www.cc.gatech.edu/~manolios/research/bloom-filters-verification.html
120 :: enhanced-double-hash ( index hash0 hash1 -- hash )
121 [infix hash0 + (index * hash1) + ((pow(index, 3) - index) / 6) infix] ;
123 : enhanced-double-hashes ( hash0 hash1 n -- seq )
125 [ '[ _ _ enhanced-double-hash ] ] dip
128 ! Make sure it's a fixnum here to speed up double-hashing.
129 : hashcodes-from-hashcode ( n -- n n )
130 dup most-positive-fixnum bitxor ;
132 : hashcodes-from-object ( obj -- n n )
133 hashcode abs hashcodes-from-hashcode ;
135 : set-indices ( indices bit-array -- )
136 [ [ drop t ] change-nth ] curry each ;
138 : increment-n-objects ( bloom-filter -- )
139 [ 1 + ] change-current-n-objects drop ;
141 : n-hashes-and-length ( bloom-filter -- n-hashes length )
142 [ n-hashes>> ] [ bits>> length ] bi ;
144 : relevant-indices ( value bloom-filter -- indices )
145 [ hashcodes-from-object ] [ n-hashes-and-length ] bi*
146 [ enhanced-double-hashes ] dip '[ _ mod ] map ;
150 : bloom-filter-insert ( object bloom-filter -- )
151 [ increment-n-objects ]
153 [ bits>> set-indices ]
156 : bloom-filter-member? ( object bloom-filter -- ? )
157 [ relevant-indices ] keep
158 bits>> nths [ ] all? ;