Visualizing String Hashes
Curiosity has led me once again down the road of the weekend project… I was doing some reading on different string hashing algorithms and felt the urge to see if I could visualize the grouping and collision tendencies of them, so I fired up Visual Studio and wrote a quick and dirty app.
How it Works:
The appropriately named “Buckets” is actually very simple. It works by taking a 600×600 image and creating a hash bucket out of each pixel in the image. It then generates a series of either incremental or random strings and runs those strings against the selected hashing algorithm. After the hash is computed, it finds the appropriate bucket and increments a value in the bucket to denote that a hash value was given to that bucket. This generated data is then used to create either a gradient map or three dimensional surface to visualize the different hash frequencies.
The results were actually surprising. In my first tests I only used randomly generated data (I was hesitant at first about simply using a generator, as some generators can produce patterns of their own, but a good hash should evenly distribute even non random data) and could not, despite my best efforts, produce any sort of discernible patterns even though I knew that some of the less capable algorithms should have tendencies. I’m still not fully sure this is correct, and if anyone has any ideas please feel free to let me know, but I eventually concluded that the set of all possible keys was just so large that I would need to try an extremely large number of keys before I began to see patterns emerge.
Next I decided to simply use predictable keys. I chose two different selectable methods for this. First was to simply loop an integer upward and hash the integer as a string. This had the benefit of having no alphabetic characters generated so it offered a different look on the algorithms performance. The second method was to simply walk a character string and convert the integer value into a corresponding string value. These methods began to yield much better results.
For the simpler hashes such as the additive hash or exclusive-or(XOR) hash the results were pitiful but the Bernstein and SDMB hashes began to show some very interesting patterns.
Its clear how the SAX hash is VERY dependent on key length and fails to produce well distributed hashes with smaller keys. I think it is also notable that under no circumstance was I able to get Google’s MurmurHash2 or MurmurHash3 to produce any kind of pattern. Maybe with a bit more computing power and running more hashes I can get some pattern to bubble up.
There are too many different possibilities for me to show here, but if you would like to see and play with the outputs yourself I’ve uploaded both the binary and sources for the app.
You can find the source for the project on the Disruption Theory github:
We welcome features and fixes so if you have one, please feel free to send a pull request.
Or download the compiled executable: