1 ! Copyright (C) 2009 Alec Berryman.
2 ! See http://factorcode.org/license.txt for BSD license.
3 USING: accessors arrays bit-arrays kernel kernel.private
4 layouts math math.functions math.order math.private
5 ranges multiline sequences sequences.private typed ;
13 - The false positive rate is 10x what it should be, based on
14 informal testing. Better object hashes or a better method of
15 generating extra hash codes would help. Another way is to
16 increase the number of bits used.
18 - Try something smarter than the bitwise complement for a
21 - http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
22 makes a case for http://murmurhash.googlepages.com/ instead
23 of enhanced double-hashing.
25 - Be sure to adjust the test that asserts the number of false
26 positives isn't unreasonable.
28 - Could round bits up to next power of two and use wrap instead
29 of mod. This would cost a lot of bits on 32-bit platforms,
30 though, and limit the bit-array to 8MB.
32 - Should allow user to specify the hash codes, either as inputs
33 to enhanced double hashing or for direct use.
35 - Support for serialization.
37 - Wrappers for combining filters.
39 - Should we signal an error when inserting past the number of
40 objects the filter is sized for? The filter will continue to
41 work, just not very well.
46 { #hashes fixnum read-only }
47 { bits bit-array read-only }
48 { capacity fixnum read-only }
51 ERROR: invalid-size size ;
52 ERROR: invalid-error-rate error-rate ;
53 ERROR: invalid-capacity capacity ;
57 :: bits-to-satisfy-error-rate ( hashes error objects -- size )
58 objects hashes * neg error hashes recip ^ 1 swap - log /
61 ! 100 hashes ought to be enough for anybody.
62 : #hashes-range ( -- range )
66 : identity-configuration ( -- 2seq )
67 0 max-array-capacity 2array ;
69 : smaller-second ( 2seq 2seq -- 2seq )
70 [ [ second ] bi@ <= ] most ;
72 ! If the number of hashes isn't positive, we haven't found
73 ! anything smaller than the identity configuration.
74 : check-hashes ( 2seq -- 2seq )
75 dup first 0 <= [ invalid-size ] when ;
77 ! The consensus on the tradeoff between increasing the number of
78 ! bits and increasing the number of hash functions seems to be
79 ! "go for the smallest number of bits", probably because most
80 ! implementations just generate one hash value and cheaply
81 ! mangle it into the number of hashes they need. I have not
82 ! seen any usage studies from the implementations that made this
83 ! tradeoff to support it, and I haven't done my own, but we'll
85 : size-bloom-filter ( error-rate number-objects -- number-hashes number-bits )
86 [ #hashes-range identity-configuration ] 2dip '[
87 dup _ _ bits-to-satisfy-error-rate
89 ] reduce check-hashes first2 ;
91 : check-capacity ( capacity -- capacity )
92 dup 0 <= [ invalid-capacity ] when ;
94 : check-error-rate ( error-rate -- error-rate )
95 dup [ 0 after? ] [ 1 before? ] bi and
96 [ invalid-error-rate ] unless ;
100 : <bloom-filter> ( error-rate capacity -- bloom-filter )
101 [ check-error-rate ] [ check-capacity ] bi*
102 [ size-bloom-filter <bit-array> ] keep
108 ! See "Bloom Filters in Probabilistic Verification" by Peter C.
109 ! Dillinger and Panagiotis Manolios, section 5.2, "Enhanced
111 ! http://www.ccs.neu.edu/home/pete/research/bloom-filters-verification.html
112 ! http://www.cc.gatech.edu/~manolios/research/bloom-filters-verification.html
113 : combine-hashcodes ( index hash0 hash1 -- hash )
114 { fixnum fixnum fixnum } declare
115 [ [ [ 3 ^ ] [ - ] bi 6 /i ] keep ]
116 [ fixnum*fast ] [ fixnum+fast ] tri* + abs ;
118 : double-hashcodes ( object -- hash0 hash1 )
119 hashcode >fixnum dup most-positive-fixnum bitxor >fixnum ;
121 : increment-count ( bloom-filter -- )
122 [ 1 + ] change-count drop ; inline
124 : #hashes-and-length ( bloom-filter -- #hashes length )
125 [ #hashes>> ] [ bits>> length ] bi ; inline
127 : relevant-indices ( object bloom-filter -- n quot: ( elt -- n ) )
128 [ double-hashcodes ] [ #hashes-and-length ] bi*
129 -rotd '[ _ _ combine-hashcodes _ mod ] ; inline
133 TYPED: bloom-filter-insert ( object bloom-filter: bloom-filter -- )
136 [ bits>> [ [ t ] 2dip set-nth-unsafe ] curry ]
137 tri compose each-integer ;
139 TYPED: bloom-filter-member? ( object bloom-filter: bloom-filter -- ? )
141 [ bits>> [ nth-unsafe ] curry ]
142 bi compose all-integers? ;