
A few days ago, Hexacorn released a blog post taking a look at the NSRL RDS hash set. I’m a total fan of hash sets. I think they are one of the easiest ways to capture and reuse institutional knowledge. As such, I use RDS a lot.

Hexacorn’s post made me realize that 1. I’d never really questioned the RDS before, and 2. I was wasting valuable CPU cycles! Both of those end today! The goal is to explore a bit deeper and hopefully create a more efficient NSRL for specific #DFIR use cases.

Use Case

My primary use case for the NSRL and similar hash sets is filtering known-good files out of particular views in Autopsy and similar tools. As such, this statement jumped out at me:

…what we believed to be just large file hashset is actually a mix of files hashes and hashes of sections of executable files. (Hexacorn)

Sections of executable files might be relevant for binary/malware analysis, but I rarely use them. What’s more, the filtering tools that I use don’t do partial hashing. It’s the whole file or nothing. So this set of partials is a complete waste and will be our main target.

Hexacorn seems most interested in executable file types. I’m interested in any whole-file, known-good hash. I don’t want to see system files.


I’m using NSRL Modern RDS (minimal) v2.75. Note that v3 uses SQLite instead of a flat file. We will have to look into that later.


Number of hashes:

$ cat NSRLFile.txt | wc -l

Number of operating systems:

$ cat NSRLOS.txt | wc -l

Number of Windows OS versions:

$ cat NSRLOS.txt | grep -i windows | wc -l

Number of (likely) Java embedded .class files:

$ cat NSRLFile.txt | grep -i "\.class" | wc -l

Quick search for “text”:

$ cat NSRLFile.txt | grep -i text

Quick search for “1”:

$ cat NSRLFile.txt | grep -i "\"1\"" | wc -l

Entries that start with “.”:

$ cat NSRLFile.txt | grep -P "\"\..*?\"" | wc -l

Entries that start with “__”:

$ cat NSRLFile.txt | grep -P "\"__.*?\"" | wc -l

Check top system codes:

$ cat NSRLFile.txt | awk -F"," '{print $7}' | sort | uniq -c | sort -rn | head
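The awk one-liner relies on the RDS 2.x flat-file layout, where each record is quoted CSV in the order SHA-1, MD5, CRC32, FileName, FileSize, ProductCode, OpSystemCode, SpecialCode, so field 7 is the OS code. A quick sketch of that parse in Python, using a made-up record rather than real NSRL data:

```python
import csv
import io

# Illustrative, made-up record in the RDS 2.x flat-file layout -- not real NSRL data.
sample = '"' + "A" * 40 + '","' + "B" * 32 + '","CCCCCCCC","example.dll","1024","1","362",""\n'

# csv handles the quoting for us; field 7 (index 6) is OpSystemCode,
# the same field the awk one-liner prints.
row = next(csv.reader(io.StringIO(sample)))
sha1, md5, crc32, file_name, size, product, os_code, special = row
print(file_name, os_code)  # example.dll 362
```

Unlike the naive -F"," split, the csv module also survives filenames that contain commas.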

We also found system code 362 ("362","TBD","none","1006"), which a lot of entries use. "TBD" presumably means to be determined… meaning they don’t know the OS? The entries look like Microsoft stuff, plus some of our code segments.


A LOT of entries are listed under 362. So many that we should filter them, but also save them into an “other” category.

Files that are not 362:

$ cat NSRLFile.txt | awk -F"," '$7 != "\"362\""' | wc -l

Ah, well, that’s not good. All entries have an OS category of 362, meaning OS categories, for now, are totally worthless.


Based on this exploration, there are a few things we might want to do.

  1. Entries that start with __ and do not have an extension are probably section hashes. Exclude.
  2. Entries that start with . are also probably section hashes, though for Linux and macOS this may not be true. Exclude.
  3. Entries with .text, text, and 1 have many entries, but it’s unclear why. Exclude?
  4. Split the sets by general OS instead of all-in-one, and select a set based on needs. (No OS category info is available yet.)
  5. Filter out very old OSs.
  6. .class files are most likely embedded in Java. Exclude?

Testing speed assumptions

All of this assumes that reducing the hash set has some reduction in CPU cycles / time. Let’s test that.

Take half of NSRLFile.txt.

$ head -n 20925181 NSRLFile.txt > HalfNSRLFile.txt
$ wc -l HalfNSRLFile.txt 
20925181 HalfNSRLFile.txt
$ wc -l NSRLFile.txt 
41850362 NSRLFile.txt

Create an hfind index for both. Note we’re using the nsrl-sha1 index type: matching with MD5 is faster, but collisions are too easy to come by. Plus, by filtering the NSRL we can be faster and more accurate.

$ hfind -i nsrl-sha1 NSRLFile.txt 
Index created
$ hfind -i nsrl-sha1 HalfNSRLFile.txt 
Index created

Create some SHA1 hashes from a test dataset.

$ sha1deep -r ~/Desktop/Cellebrite_Reader/ > testHashes.sha1
$ wc -l testHashes.sha1 
109950 testHashes.sha1
$ cat testHashes.sha1 | awk '{print $1}' > hashes.sha1

Measure time of full DB lookup.

$ time hfind -f hashes.sha1 NSRLFile.txt >/dev/null

real	0m37.942s
user	0m0.580s
sys	0m3.391s

Clear cache and measure time of half DB lookup.

$ sudo sysctl vm.drop_caches=3
$ time hfind -f hashes.sha1 HalfNSRLFile.txt >/dev/null

real	0m9.707s
user	0m0.319s
sys	0m1.306s

Unexpectedly, half the NSRL hashes took approximately 25% of the (real) time of the whole set, and 38% of the system time. This was only for 100k file hashes. In a normal case, we will see some big improvements by reducing the set as much as possible. This assumes you were dumping the full NSRL into your tools (like I was!).

Building a filter

First, we get the Windows codes.

$ cat NSRLOS.txt | grep -i Windows | grep -v "X Windows" | awk -F"," '{print $1}' > temp.txt

Most OSs filtered out easily. Unix/Linux gave some trouble, of course. At this stage, we also removed some of the much older OSs.

$ cat NSRLOS.txt | egrep -vi "(windows|android|ios|mac|msdos|ms dos|amstrad|netware|nextstep|aix|compaq|dos|dr dos|amiga|os x|at&t|apple)" | awk -F"," '{print $1}' > temp.txt

This gives us the codes for Windows, Mac, Android, iOS, Unix/Linux, and an "other" category of 362. Note that since everything is currently coded 362, filtering by OS isn’t useful at this time.
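With the codes in hand, the join back to NSRLFile.txt is just a membership test on OpSystemCode. A minimal sketch, assuming the RDS 2.x field order described earlier (the rows here are made up, not real NSRL data), ready for whenever the OS codes are actually populated:

```python
def filter_by_os(rows, keep):
    """Yield only rows whose OpSystemCode (field 7, index 6) is in `keep`."""
    for row in rows:
        if row[6] in keep:
            yield row

# Made-up rows in the RDS 2.x field order -- not real NSRL data.
rows = [
    ["A" * 40, "B" * 32, "C" * 8, "example.dll", "1024", "1", "362", ""],
    ["D" * 40, "E" * 32, "F" * 8, "other.so",    "2048", "2", "189", ""],
]
keep = {"362"}  # e.g. the codes harvested into temp.txt
print([r[3] for r in filter_by_os(rows, keep)])  # ['example.dll']
```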

Filename filter

Filenames are pretty much our only indicator of whether an entry is a section hash or not. Given that, we can match entries whose filenames start with __ or .

Not ideal, but unless we parse out and somehow rationally filter on product, this is as good as it gets.

if FN.startswith("__") or FN.startswith("."): continue

This is easy to expand with regular expressions if other meaningful filename patterns are found.
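In context, FN is the quoted FileName field (field 4, index 3, in the layout assumed above). A fuller sketch of the check, run against made-up rows:

```python
def is_section_hash(row):
    """Heuristic from above: a FileName starting with "__" or "." is
    treated as a likely section hash."""
    fn = row[3]  # FileName field, RDS 2.x layout assumed
    return fn.startswith("__") or fn.startswith(".")

# Made-up rows -- not real NSRL data.
whole = ["A" * 40, "B" * 32, "C" * 8, "example.dll", "1024", "1", "362", ""]
section = ["D" * 40, "E" * 32, "F" * 8, ".text", "512", "1", "362", ""]
print(is_section_hash(whole), is_section_hash(section))  # False True
```

Swapping the startswith test for a compiled regular expression later is a one-line change.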

Filtering end result

As of this writing, we’re basically just removing files that start with __ or a period. The filter as-is can be replicated much faster in bash:

$ cat NSRLFile.txt | grep -v "\"__" | grep -v "\"\." > ENSRL.txt
$ wc -l ENSRL.txt 
34976616 ENSRL.txt

This keeps about 84% of the original hash values. Even so, we can expect some processing-time improvement, with a low likelihood of losing many filtering opportunities.

A filtering tool and Efficient-NSRL

The ENSRL can be found here: https://github.com/DFIRScience/Efficient-NSRL.

Future work will try to fine-tune based on the file name and might try to use NSRLProd.txt. It would be nasty, but that seems like the only useful categorization data.

Hit me up on Twitter if you have any filtering recommendations. Pull requests also welcome.