Home Features Demos Download Installation User Manual Developer Manual Relation function Credits

Login

Bloom filter

Bloom filters use a hash to index the text of the records. Bloom filters are space efficient at the cost of false positive.

To index, the text is converted to the URL-representation of the text, then each trigram is hashed using the FNV algorithm to create three values which are then set on a bitmap of the size 1024.

To check a term against the index, it is itself converted to its URL-representation and then each trigram hashed three times.

The bitmap of all records are saved as binary in the bloom.raw file. The strucure is optimised for reading: Blocks of 1024 rows (the hash values) and 8192 columns can handle 65532 records at a time.

A block has therefore a size of 1024 * 8192 bytes = 8 MB.

The term may be in the record, if all bits of the hashed term are present in the bitmap of the record.

1.7.0 adds the bloom filter to replace trigram files.