Storing Hashes instead of full tuples
2005-01-15 by billstewart2002a
There was a discussion a while back about storing a hash of each tuple instead of the tuple itself. That makes the live database much smaller, by as much as a factor of 10, and it also protects against attacks such as overly long source or destination addresses, since every stored key has the same fixed size. The initial proposals suggested MD5, a formerly popular cryptographic hash. For this application, though, MD5 is overkill, because you aren't worried about the sender trying to invert the hash. A simple CRC is much faster to calculate, and you could also choose a shorter hash, e.g. 64 bits instead of 128.

A shorter hash does have an increased chance of collisions: the birthday problem says a 64-bit hash has a good chance (roughly 40%) of at least one collision once you've got about 2**32 tuples in your database, which is pretty unlikely except maybe at very big ISPs. But an occasional hash collision isn't a big problem. The worst consequence is that a new spam message collides with a real message's tuple, so that spam gets in, while the real tuple is still whitelisted safely. One extra spam in a few billion messages isn't going to break the system.

Some of the discussion suggested that a hash wouldn't work with wildcarding or with destination-based or source-based whitelisting, but you can do those things before calculating the hash. There's also the issue of logfile analysis, but the logfile and the live database are really separate issues: the logfile can be a regular file that gets appended to, and you can store the hash with the rest of each record if it makes sense (e.g. if you want to check how often a given source or destination's tuples get expired).
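
To make the idea concrete, here is a rough sketch in Python, not anybody's actual implementation. It assumes a tuple is (source IP, sender, recipient), uses the CRC-64 polynomial from ECMA-182 as one possible 64-bit CRC, and wildcards the source to its first three octets before hashing. The names `crc64`, `tuple_key` and `collision_probability`, the NUL separator and the /24 policy are all made up for illustration.

```python
import math

CRC64_POLY = 0x42F0E1EBA9EA3693  # ECMA-182 polynomial (one possible choice)
MASK64 = (1 << 64) - 1


def crc64(data: bytes) -> int:
    """Bitwise, MSB-first CRC-64 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 56
        for _ in range(8):
            if crc & (1 << 63):
                crc = ((crc << 1) ^ CRC64_POLY) & MASK64
            else:
                crc = (crc << 1) & MASK64
    return crc


def tuple_key(source_ip: str, sender: str, recipient: str) -> int:
    """Return a fixed-size 64-bit key for a tuple (hypothetical layout)."""
    # Wildcarding happens BEFORE hashing: keep only the first three
    # octets of an IPv4 source (an assumed policy, not from the post).
    network = ".".join(source_ip.split(".")[:3])
    # NUL-separate the fields so ("ab", "c") and ("a", "bc") differ.
    raw = "\0".join((network, sender, recipient)).encode("utf-8")
    return crc64(raw)


def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday approximation: P(at least one collision among n keys)."""
    return -math.expm1(-n * (n - 1) / 2 / 2**bits)


if __name__ == "__main__":
    key = tuple_key("192.0.2.10", "a@example.org", "user@example.net")
    print(f"key = {key:016x}")
    print(f"P(collision) at 2**32 tuples = {collision_probability(2**32):.3f}")
```

However long the sender or recipient address is, the stored key stays 8 bytes, and two sources in the same /24 map to the same key because the wildcard is applied before the hash is taken. A table-driven CRC would be faster than this bit-at-a-time loop; the sketch just keeps it short.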