March 25, 2010

Nice Bloom filter application

Today I accidentally found a couple of interesting files in one of Google Chrome folders:
  • Safe Browsing Bloom
  • Safe Browsing Bloom Filter 2
Conclusion: Chrome uses Bloom filters to make a preliminary decision whether a particular web site is malicious or safe. Cool idea!

Let me explain how it works:
  • Web site URL is converted to some canonical form (or a set of such forms - e.g. by sequentially stripping all the sub-domains; in this case check is performed for each of such URLs).
  • N of its "long" hashes are computed (likely, there are 64-bit hashes).
  • The value of each hash is multiplied by scale factor. That's how N addresses of bits in bit array called Bloom filter are computed.
  • The filter is designed so that if all these bits are set, the site is malicious with very high probability (in this case Chrome performs its precise verification - likely, by issuing a request to the corresponding Google service); otherwise it is guaranteed to be safe.
  • More detailed description of Bloom filters is here.
The benefits of this method:
  • The size of such a filter is considerably smaller than size of any other structure providing precise "yes/no" answers to similar questions. For example, if it is a set (data structure), its data length should be nearly equal to the total length of all the URLs of malicious sites it "remembers". But a Bloom filter providing false positive responses with a probability of 1% (note that it can't provide false negative response) would require just 9.6 bits per each URL of malicious web site it "remembers", i.e. nearly 1 byte! Taking into account that size of Chrome Bloom filters is about 18Mb and assuming they are really built for such a false positive response probability, this means they contain information about ~ 1 million of malicious web sites!
  • Bloom filter allows Chrome to use precise verification service practically only when the user actually goes to a malicious web site. Isn't it wonderful? ;)
P.S. I paid attention to these files mainly because our Xtensive.Indexing project contains excellent implementation of Bloom filters, and Xtensive.Core contains an API allowing to calculate 64-bit hashes. These hashers  are used by our Bloom filters to obtain the necessary number of 64-bit hash values (or, more precisely, the necessary number of values of unique 64-bit hash functions). Our hashing API provides 64-bit hashers for base CLR types and allows you to write them for your own types (you should just implement a Hasher for each of your own types; this is pretty simple, because hashers for field values are already available).