In a previous post, we looked at setting up and using SDHASH. After comparing modified files and getting a high similarity score, we started wondering how well fuzzy hashing works on images of different sizes. So today, we have a little experiment.
First, we have 4 images. One original, and 3 smaller versions of the original.
Original: 75K, MD5 6d5663de34cd53e900d486a2c3b811fd
1/2 Original: 44K, MD5 87ec8d4b69293161bca25244ad4ff1ac
1/4 Original: 14K, MD5 978f28d7da1e7c1ba23490ed8e7e8384
1/8 Original: 3.6K, MD5 3e8e0d049be8938f579b04144d2c3594
So, if we have an original image, we can take the hash like so:
$sdhash kitty_orig.jpeg > kitty_orig.sdbf
Now, we want to take the hashes of the other files (manual way):
$sdhash kitty_2.jpeg >> kitties.sdbf
$sdhash kitty_4.jpeg >> kitties.sdbf
$sdhash kitty_8.jpeg >> kitties.sdbf
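With more than a handful of files, the manual appends above could be wrapped in a small loop. This is only a sketch, assuming a POSIX shell with sdhash on the PATH; the function name hash_into is made up for illustration and is not part of sdhash.

```shell
# hash_into OUTFILE FILE...
# Append the sdhash digest of each FILE to OUTFILE, one sdhash call per file,
# mirroring the manual ">>" commands. (hash_into is a hypothetical helper.)
hash_into() {
    out=$1
    shift
    for f in "$@"; do
        # Stop at the first file sdhash fails on.
        sdhash "$f" >> "$out" || return 1
    done
}

# Example call (not run here):
#   hash_into kitties.sdbf kitty_2.jpeg kitty_4.jpeg kitty_8.jpeg
```

Like the manual commands, this appends to the output file, so delete an old kitties.sdbf first if you want a fresh set.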
Now we can compare the hashes of the smaller versions to the hash of the original. Note: set the threshold to negative one (-t -1) if you want to see results below 1.
$sdhash -t -1 -c kitty_orig.sdbf kitties.sdbf
Unfortunately, but as expected, the results were not so good. Feature selection for the hash is done at the byte level, and those features do not carry over to the smaller files: a resized JPEG is re-encoded, so it has fewer bytes and its byte sequences no longer match those of the original.
kitty_2.jpeg|kitty_orig.jpeg|000
kitty_4.jpeg|kitty_orig.jpeg|000
kitty_8.jpeg|kitty_orig.jpeg|000
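Each result line has three fields separated by "|": the two file names and a three-digit score. With longer listings it can help to keep only the lines at or above some score. A rough sketch using awk, assuming the field layout shown above; min_score is a hypothetical helper name, not an sdhash feature.

```shell
# min_score N
# Read sdhash comparison lines (file1|file2|score) on stdin and print only
# those whose score is at least N. (min_score is a hypothetical helper.)
min_score() {
    awk -F'|' -v min="$1" 'NF == 3 && $3 + 0 >= min + 0'
}

# Example call (not run here):
#   sdhash -t -1 -c kitty_orig.sdbf kitties.sdbf | min_score 1
```

For the three results above, a minimum of 1 would print nothing, since every score is 000.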
If you were working with more images, and you wanted to hash and compare at the same time, you could use the -g switch. For example:
$sdhash -t -1 -g *
The output of which (in this case) would be:
kitty_2.jpeg|kitty_4.jpeg|000
kitty_2.jpeg|kitty_8.jpeg|000
kitty_2.jpeg|kitty_orig.jpeg|000
kitty_4.jpeg|kitty_8.jpeg|000
kitty_4.jpeg|kitty_orig.jpeg|000
kitty_8.jpeg|kitty_orig.jpeg|000
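The six lines are what you would expect if -g compares every file against every other file exactly once: n files give n(n-1)/2 pairs. A quick arithmetic check in the shell (pair_count is a name made up here):

```shell
# pair_count N: number of unordered pairs among N files, N*(N-1)/2.
pair_count() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

pair_count 4   # the four kitty images; prints 6
```

The count grows quickly, so for a large directory the all-pairs output gets long.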
So, in conclusion, sdhash's feature selection does not allow for comparison of picture files that differ greatly in size. Note that a text file would be quite different, and would probably produce better results.