Update (20.10.2022)
After publishing this post, we were notified that Joe Desimone (@dez_ on Twitter) used a very similar approach for hashing the TypeRef table in his ClrGuard (Code on GitHub) in 2017. He also presented his work at DerbyCon 2017 (Video on YouTube).
Introduction
The ImpHash was introduced in 2014 by FireEye [1]. It has since been used by many malware analysts and implemented in tools like VirusTotal to identify similar malware samples by their imports. The underlying assumption is that programs with the same imports were built from similar source code.
.NET samples usually import only mscoree.dll, so there is only a handful of distinct ImpHashes across all .NET binaries. The ImpHash therefore cannot be used to tell them apart. This motivated us to find an alternative, the TypeRefHash (TRH). To show the imported DLLs, functions and the TypeRef table, we used the online tool penet.io.
.NET files store the names and namespaces of their referenced types in a metadata table, the TypeRef table. We can use these to construct an identifier like the ImpHash. Similar to the combination of DLL and function name in the Import table, the TypeRef table contains a list of type names and their corresponding namespaces. For example, a .NET binary may import the type DebuggerBrowsableState from the namespace System.Diagnostics.
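Why identical imports lead to identical ImpHashes can be illustrated with a small Python sketch. It is a simplified stand-in for the ImpHash, not the FireEye implementation: it assumes every import is by name (no ordinals) and strips any file extension from the library name. All sample import lists are hypothetical.

```python
import hashlib


def simplified_imphash(imports):
    """MD5 over lowercase 'library.function' pairs, comma-joined in import order.

    Simplification (assumption): imports by ordinal are not handled, and any
    file extension of the library name is dropped.
    """
    parts = []
    for dll, function in imports:
        stem = dll.lower().rsplit(".", 1)[0]  # "mscoree.dll" -> "mscoree"
        parts.append(f"{stem}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()


# Two hypothetical, unrelated .NET executables: both import only _CorExeMain.
dotnet_rat = [("mscoree.dll", "_CorExeMain")]
dotnet_calculator = [("mscoree.dll", "_CorExeMain")]
# A hypothetical native program with its own import list.
native_tool = [("kernel32.dll", "CreateFileW"), ("ws2_32.dll", "connect")]

print(simplified_imphash(dotnet_rat) == simplified_imphash(dotnet_calculator))
print(simplified_imphash(dotnet_rat) == simplified_imphash(native_tool))
```

The two .NET import lists are byte-for-byte identical, so any hash over them must be identical as well, regardless of what the programs actually do.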
Calculation
To calculate the TRH, we extract the TypeRef table and resolve the indices to the corresponding strings. Then we apply the following steps:
- Order the entries by TypeNamespace and then by TypeName.
- Concatenate each TypeNamespace with its TypeName, separated by a dash. If the namespace is empty, the concatenated string starts with the dash.
- Join all strings with commas and calculate the SHA256 hashsum of the resulting UTF-8 byte string.
We use SHA256 instead of MD5, which is used for the ImpHash, as we already see MD5 collisions on our data sets. We order the entries in the table to prevent attacks where a different TypeRefHash could be created for a sample by just reordering the table. A similar attack was shown for the ImpHash by Balles and Sharfuddin [2]. We chose a dash and a comma as the separators, as they are not valid in namespaces and type names in .NET.
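The three steps can be sketched in a few lines of Python. This is a minimal illustration and not the PeNet reference implementation: the function name `type_ref_hash` is made up for this sketch, the input is assumed to be already resolved (namespace, name) string pairs, and sorting by Unicode code point is an assumption about the exact ordering rule.

```python
import hashlib


def type_ref_hash(type_refs):
    """Compute a TRH-style digest from (namespace, name) string pairs."""
    # Tuples sort by their first element (namespace), then by the second (name).
    ordered = sorted(type_refs)
    # An empty namespace yields an entry that starts with the dash.
    entries = [f"{namespace}-{name}" for namespace, name in ordered]
    joined = ",".join(entries)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical TypeRef entries, given in two different table orders.
refs = [
    ("System", "Object"),
    ("System.Diagnostics", "DebuggerBrowsableState"),
    ("", "DebuggingModes"),
]
reordered = list(reversed(refs))

# Reordering the table does not change the hash, because entries are sorted first.
print(type_ref_hash(refs) == type_ref_hash(reordered))
```

Sorting before hashing is what makes the digest independent of the physical order of the table rows.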
Imagine we have a .NET sample with the following simplified TypeRef table:
# | TypeName (Resolved) | TypeNamespace (Resolved) |
0 | CompilationRelaxationsAttribute | System.Runtime.CompilerServices |
1 | RuntimeCompatibilityAttribute | System.Runtime.CompilerServices |
2 | TargetFrameworkAttribute | System.Runtime.Versioning |
3 | DebuggingModes | |
4 | AssemblyFileVersionAttribute | System.Reflection |
This results in the following ordered and concatenated strings. Note that TypeRefs with an empty namespace are sorted to the beginning of the list.
-DebuggingModes |
System.Reflection-AssemblyFileVersionAttribute |
System.Runtime.CompilerServices-CompilationRelaxationsAttribute |
System.Runtime.CompilerServices-RuntimeCompatibilityAttribute |
System.Runtime.Versioning-TargetFrameworkAttribute |
These are joined into the following final string:
-DebuggingModes,System.Reflection-AssemblyFileVersionAttribute,System.Runtime.CompilerServices-CompilationRelaxationsAttribute,System.Runtime.CompilerServices-RuntimeCompatibilityAttribute,System.Runtime.Versioning-TargetFrameworkAttribute
The resulting TRH is the SHA256 hashsum of the above string.
63AE8074B4C2EF8E36FE3272BE23B445CEAB495E14877935C457E75CFB5E5A1E
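The worked example can be retraced with a short Python sketch that prints each intermediate result for the five rows of the example table. The sort uses Python's code-point ordering, which is an assumption; the printed digest is only comparable to the value above if the reference implementation orders and encodes the entries in exactly the same way.

```python
import hashlib

# (TypeName, TypeNamespace) rows in table order, copied from the example table.
rows = [
    ("CompilationRelaxationsAttribute", "System.Runtime.CompilerServices"),
    ("RuntimeCompatibilityAttribute", "System.Runtime.CompilerServices"),
    ("TargetFrameworkAttribute", "System.Runtime.Versioning"),
    ("DebuggingModes", ""),
    ("AssemblyFileVersionAttribute", "System.Reflection"),
]

# Step 1: order by namespace, then by type name (code-point order assumed).
ordered = sorted(rows, key=lambda row: (row[1], row[0]))

# Step 2: concatenate namespace and type name with a dash.
entries = [f"{namespace}-{name}" for name, namespace in ordered]

# Step 3: join with commas and hash the UTF-8 bytes with SHA256.
final_string = ",".join(entries)
trh = hashlib.sha256(final_string.encode("utf-8")).hexdigest().upper()

for entry in entries:
    print(entry)
print(final_string)
print(trh)
```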
You can find the TRH reference implementation in the PeNet library here.
Evaluation
How well can a TypeRefHash identify a certain malware family? To answer this, we evaluated .NET samples that we received from mid-May to mid-June 2020 and looked at the corresponding hashes for seven families. We chose these families because we were able to collect enough samples of each for a meaningful evaluation.
We looked at the following families:
Malware Family | # Samples |
AsyncRAT | 558 |
Blackshades | 5035 |
Bladabindi | 7793 |
DiscordTokenGrabber | 159 |
Nanocore | 1335 |
QuasarRAT | 517 |
RevengeRAT | 276 |
We inspected the distribution of TypeRefHashes within each of those families. In the following figures, the blue sections depict the most common TRH for that family. TRHs shared by five or fewer samples are aggregated in the shaded areas to keep the charts readable.
In most cases, one TypeRefHash dominates a family. Blackshades in particular could be identified very reliably: its two most common TRHs cover 97% of all analysed samples.
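The aggregation of rare TRHs into a single chart segment could be done along these lines. This is a sketch with made-up placeholder hashes and counts, not the script used for the evaluation; the function name `chart_slices` and the label `"other"` are hypothetical.

```python
from collections import Counter


def chart_slices(trhs, threshold=5):
    """Count samples per TRH and merge TRHs with <= threshold samples into 'other'."""
    slices = {}
    other = 0
    for trh, count in Counter(trhs).most_common():
        if count <= threshold:
            other += count  # rare TRH: goes into the shaded area
        else:
            slices[trh] = count
    if other:
        slices["other"] = other
    return slices


# Hypothetical TRHs of one family (shortened placeholders instead of real digests).
family_trhs = ["aaa"] * 40 + ["bbb"] * 12 + ["ccc"] * 5 + ["ddd"] * 2 + ["eee"]
print(chart_slices(family_trhs))  # {'aaa': 40, 'bbb': 12, 'other': 8}
```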
Next, we checked which families the most common TypeRefHash of each family actually matches. The most common TRH per family is listed in the following table:
Malware Family | Most common TRH |
AsyncRAT | 4807b5cd7256fad54967dfe3c394c27d16bad1ac95b0306911a3546025bd6ccf |
Blackshades | 306db7dcdf4dd7bbf2eaa054a8c050fb97cbe84c0da87528c6e508ac5e11607b |
Bladabindi | 695409c18e59ff8a2c04f5572f61d35157ea1ce34e6f3db4975dfbaeb5d7e07f |
DiscordTokenGrabber | 6f917770f111b5e0f6bd7b1ccd3adf491fbc756bf031fe107233d3b19d4737d |
Nanocore | 31feea84c77a972ebe0bfc87ac90630ad824e91965b664c47d0d2b0761b30d16 |
QuasarRAT | 03d72f6a261029edbd5028d814b27b075f5c3c62219dbfe8a349998909d07b9a |
RevengeRAT | faaf850b8f9ce7eeed4c9d18b2fbd70ef1c9dde8d920c6e333829f3150d9ca08 |
The distribution can be seen in the following figures.
For five families, the most common TypeRefHash matched only samples of the right family. For the most common TypeRefHash of QuasarRAT, we additionally found one CardinalRAT sample. Only for RevengeRAT are the results less accurate: its most common TRH also matched 15 Bladabindi samples, one AsyncRAT sample and two samples known to be clean. The TypeRefHash therefore cannot be used effectively for some malware families, such as RevengeRAT.
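The check described above amounts to taking the most common TRH of a family and counting which families the samples with that TRH belong to. A minimal Python sketch, using hypothetical family names and placeholder hashes rather than the evaluation data:

```python
from collections import Counter


def most_common_trh(samples, family):
    """Return the TRH that occurs most often among the samples of one family."""
    counts = Counter(trh for fam, trh in samples if fam == family)
    return counts.most_common(1)[0][0]


def families_matching(samples, trh):
    """Count, per family, how many samples carry the given TRH."""
    return Counter(fam for fam, sample_trh in samples if sample_trh == trh)


def precision(samples, family):
    """Share of samples with the family's most common TRH that belong to it."""
    hits = families_matching(samples, most_common_trh(samples, family))
    return hits[family] / sum(hits.values())


# Hypothetical (family, TRH) pairs.
samples = (
    [("FamilyA", "trh1")] * 8 + [("FamilyA", "trh2")] * 2
    + [("FamilyB", "trh3")] * 6 + [("FamilyB", "trh1")] * 2
)

print(precision(samples, "FamilyA"))  # 0.8, because trh1 also hits two FamilyB samples
print(precision(samples, "FamilyB"))  # 1.0, because trh3 only occurs in FamilyB
```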
Summary
As the ImpHash cannot be used with .NET binaries, we developed a similar method called TypeRefHash (TRH). The TRH is a SHA256 hashsum over the imported .NET namespaces and types. This is similar to the ImpHash, which is an MD5 hashsum over the imported DLLs and their functions.
Our evaluation showed that the TRH can identify malware families with a precision similar to that of the ImpHash for non-.NET files. Depending on the family, a TRH can be unique to one malware family or occur in multiple families.
You can find the reference implementation in the PeNet library here.
You can find a list with the samples used for the evaluation with their corresponding family name and TRH here.
A command line tool to compute the TRH on Windows and Linux can be found here.
References
[1]: https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html (accessed: 17.06.2020)
[2]: Balles, C. and Sharfuddin, A., 2019. Breaking Imphash. https://arxiv.org/ftp/arxiv/papers/1909/1909.07630.pdf (accessed: 17.06.2020)
Disclaimer: The PeNet library and penet.io are both projects by one of the authors of this blog entry (Stefan Hausotte).