- Safe Browsing Bloom
- Safe Browsing Bloom Filter 2
Let me explain how it works:
- Web site URL is converted to some canonical form (or a set of such forms - e.g. by sequentially stripping all the sub-domains; in this case check is performed for each of such URLs).
- N of its "long" hashes are computed (likely, there are 64-bit hashes).
- The value of each hash is multiplied by scale factor. That's how N addresses of bits in bit array called Bloom filter are computed.
- The filter is designed so that if all these bits are set, the site is malicious with very high probability (in this case Chrome performs its precise verification - likely, by issuing a request to the corresponding Google service); otherwise it is guaranteed to be safe.
- More detailed description of Bloom filters is here.
- The size of such a filter is considerably smaller than size of any other structure providing precise "yes/no" answers to similar questions. For example, if it is a set (data structure), its data length should be nearly equal to the total length of all the URLs of malicious sites it "remembers". But a Bloom filter providing false positive responses with a probability of 1% (note that it can't provide false negative response) would require just 9.6 bits per each URL of malicious web site it "remembers", i.e. nearly 1 byte! Taking into account that size of Chrome Bloom filters is about 18Mb and assuming they are really built for such a false positive response probability, this means they contain information about ~ 1 million of malicious web sites!
- Bloom filter allows Chrome to use precise verification service practically only when the user actually goes to a malicious web site. Isn't it wonderful? ;)
P.S. I paid attention to these files mainly because our Xtensive.Indexing project contains excellent implementation of Bloom filters, and Xtensive.Core contains an API allowing to calculate 64-bit hashes. These hashers are used by our Bloom filters to obtain the necessary number of 64-bit hash values (or, more precisely, the necessary number of values of unique 64-bit hash functions). Our hashing API provides 64-bit hashers for base CLR types and allows you to write them for your own types (you should just implement a Hasher for each of your own types; this is pretty simple, because hashers for field values are already available).