File modifications happen for a number of reasons, the most innocuous being data corruption and inadvertent partial downloads. Both scenarios often result in non-working files. Attackers and viruses, however, manipulate original files in such a way that they still work but additionally execute malicious code. In some cases the malicious code is not even there anymore because the files have been cleaned by antivirus software, but the indications of manipulation remain.
Regardless of the reason that these manipulations occur, being able to identify them is important to avoid instability, less secure systems and system infections.
1. Obtain the original and look out for winking bytes
The best way to detect manipulation is a direct comparison with the original file. There are repositories that enable you to obtain original files from Windows systems, e.g., VanillaWindowsReference by Andrew Rathbun. In case the suspect is not a Microsoft file, the original file can often be downloaded from the developer's website. Just make sure to get the exact same version as noted in the version information of the file.
A first comparison step is done via hashes like SHA-256. If these do not match, binary diff tools like VBinDiff will show which specific bytes are different and allow for closer inspection. Colleagues of mine named these "winking" bytes because when switching tabs to-and-fro between the original and suspect file in a hex editor view, the divergent bytes create something resembling a wink.
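The two comparison steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a replacement for a binary diff tool; the function names `sha256_hex`, `differing_offsets` and `compare` are made up for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def differing_offsets(original: bytes, suspect: bytes, limit: int = 100) -> list:
    """Return up to `limit` offsets at which the two byte strings differ.

    Only the overlapping part is compared; a size difference is
    reported separately by compare().
    """
    offsets = []
    for i, (a, b) in enumerate(zip(original, suspect)):
        if a != b:
            offsets.append(i)
            if len(offsets) >= limit:
                break
    return offsets


def compare(original: bytes, suspect: bytes) -> dict:
    """Hash comparison first, then the offsets of the 'winking' bytes."""
    return {
        "same_hash": sha256_hex(original) == sha256_hex(suspect),
        "size_delta": len(suspect) - len(original),
        "diff_offsets": differing_offsets(original, suspect),
    }
```

In practice the two inputs would come from something like `open(path, "rb").read()` for the original and the suspect file. The reported offsets are a starting point for closer inspection in a hex editor.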
2. PE Checksums and reproducible builds
PE files may contain several checksums which are useful to detect manipulation. While some of them are well-known, others might be surprising.
The Optional Header checksum is always present, but certainly the least useful. The reason is that Microsoft files built in recent years don't have valid checksums in their Optional Header anymore. However, if you happen to find a file with a valid checksum and one without, the valid one is probably the original file.
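For files where the Optional Header checksum is still meaningful, it can be recomputed and compared with the stored value. The sketch below assumes the commonly described algorithm: a 16-bit ones' complement style sum over the whole file with the checksum field treated as zero, plus the file length. The function names are hypothetical, and the code is an illustration rather than a verified reimplementation of what Windows does.

```python
import struct


def pe_checksum_field_offset(data: bytes) -> int:
    """File offset of the CheckSum field (offset 64 in the Optional Header)."""
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[:2] != b"MZ" or data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    # 4 bytes PE signature + 20 bytes COFF header + 64 bytes into the Optional Header
    return e_lfanew + 4 + 20 + 64


def compute_pe_checksum(data: bytes) -> int:
    """Recompute the checksum with the stored checksum field zeroed out."""
    offset = pe_checksum_field_offset(data)
    buf = bytearray(data)
    buf[offset:offset + 4] = b"\0\0\0\0"
    if len(buf) % 2:
        buf.append(0)  # pad odd-sized files to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("<H", buf):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    total = (total & 0xFFFF) + (total >> 16)
    return (total + len(data)) & 0xFFFFFFFF


def stored_pe_checksum(data: bytes) -> int:
    """Read the checksum value stored in the Optional Header."""
    return struct.unpack_from("<I", data, pe_checksum_field_offset(data))[0]


def checksum_is_valid(data: bytes) -> bool:
    """True if the stored checksum matches the recomputed one."""
    return stored_pe_checksum(data) == compute_pe_checksum(data)
```

A stored value of zero simply means that no checksum was set, which is common and not suspicious on its own.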
@namazso made me aware that Microsoft files created within the last few years generally use the /BREPRO flag for the linker. That flag is used to create so-called reproducible builds, which means that different builds based on the same source will have the same hash for the resulting binary. To accomplish this, PE-related timestamps and the Optional Header checksum are replaced with deterministic values. This leads not only to unusable timestamps but also to an invalid Optional Header checksum. The /BREPRO flag adds an entry of the debug type IMAGE_DEBUG_TYPE_REPRO (16, i.e. 0x10) to the debug directory, and this entry in turn contains a SHA-256 hash value. As there is no documentation or name for this hash yet, it will henceforth be referenced as repro hash.
For all files that were created with /BREPRO, manipulations can be detected by checking the validity of the repro hash value. I am not aware of any tool that can validate the repro hash at present, but I am going to add this validity check as a feature to PortexAnalyzer for the upcoming release. The algorithm is described in more detail by @sixtyvividtails.
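Validating the repro hash is out of scope here, but checking whether a file carries one at all is straightforward. The following sketch walks the debug directory and returns the raw data of the first IMAGE_DEBUG_TYPE_REPRO entry. It assumes the raw data layout given in Microsoft's PE format documentation for this debug type (a four-byte length followed by the hash value, or no data at all). The helper names are invented for this example, and there is no bounds checking for malformed files.

```python
import struct

IMAGE_DEBUG_TYPE_REPRO = 16   # decimal 16 in the PE format specification
DEBUG_DIRECTORY_INDEX = 6     # index of the debug entry in the data directory


def _headers(data: bytes):
    """Return (optional header offset, optional header size, section count)."""
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    coff = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    return coff + 20, opt_size, num_sections


def rva_to_offset(data: bytes, rva: int):
    """Translate an RVA into a file offset via the section table."""
    opt, opt_size, num_sections = _headers(data)
    table = opt + opt_size
    for i in range(num_sections):
        vsize, vaddr, rsize, rptr = struct.unpack_from("<IIII", data, table + i * 40 + 8)
        if vaddr <= rva < vaddr + max(vsize, rsize):
            return rptr + (rva - vaddr)
    return None


def find_repro_entry(data: bytes):
    """Return the raw data of the REPRO debug entry, or None if absent."""
    opt, _, _ = _headers(data)
    (magic,) = struct.unpack_from("<H", data, opt)
    dirs = opt + (112 if magic == 0x20B else 96)  # PE32+ vs. PE32
    rva, size = struct.unpack_from("<II", data, dirs + 8 * DEBUG_DIRECTORY_INDEX)
    start = rva_to_offset(data, rva) if rva else None
    if start is None:
        return None
    for i in range(size // 28):  # each IMAGE_DEBUG_DIRECTORY entry is 28 bytes
        fields = struct.unpack_from("<IIHHIIII", data, start + i * 28)
        debug_type, data_size, raw_pointer = fields[4], fields[5], fields[7]
        if debug_type == IMAGE_DEBUG_TYPE_REPRO:
            return data[raw_pointer:raw_pointer + data_size]
    return None


def repro_hash(raw: bytes):
    """Split the raw entry data into its hash, assuming length-prefixed layout."""
    if len(raw) < 4:
        return None
    (length,) = struct.unpack_from("<I", raw)
    digest = raw[4:4 + length]
    return digest if len(digest) == length else None
```

A result of `None` from `find_repro_entry` suggests that the file was not built with /BREPRO, so this particular check does not apply to it.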
The checksum of the certificate is only present if the file was signed. This checksum is called authentihash and is shown on VirusTotal. An invalid checksum leads to an invalid signature. VirusTotal as well as tools like Sysinternals sigcheck.exe show the validity of the certificate. However, the authentihash does not cover the certificate table, a gap that malware authors have abused. Some files have encoded malware in the certificate section, see also Code Signing: How Malware Gets a Free Pass. VirusTotal currently shows such manipulated files as having invalid certificates, whereas sigcheck.exe does not.
The Rich Header checksum is the same as the xor key which is used to encrypt the Rich Header. If this checksum is invalid, it means the Rich Header or DOS stub has been manipulated after linking. Some packers do this to the files that they pack but malware developers may also manipulate this area on purpose.
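The Rich Header check can be sketched as follows. The checksum algorithm used here is the one described in public reverse-engineering write-ups, not in official documentation: start with the offset of the header, add every preceding byte rotated left by its offset (with e_lfanew treated as zero), then add every comp ID rotated left by its count. The function names are hypothetical.

```python
import struct


def rol32(value: int, count: int) -> int:
    """Rotate a 32-bit value left by count bits."""
    count &= 31
    value &= 0xFFFFFFFF
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


def parse_rich_header(data: bytes):
    """Return (offset of 'DanS', xor key, [(comp_id, count), ...]) or None."""
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    rich = data.find(b"Rich", 0x40, e_lfanew)
    if rich < 0:
        return None
    (key,) = struct.unpack_from("<I", data, rich + 4)
    dwords = []
    pos = rich - 4
    # Walk backwards, decrypting dwords until the "DanS" start marker appears.
    while pos >= 0x40:
        value = struct.unpack_from("<I", data, pos)[0] ^ key
        if value == 0x536E6144:  # "DanS" read as a little-endian dword
            break
        dwords.append(value)
        pos -= 4
    else:
        return None
    dwords.reverse()
    body = dwords[3:]  # skip the three padding dwords after "DanS"
    return pos, key, list(zip(body[0::2], body[1::2]))


def rich_checksum(data: bytes, dans_offset: int, entries) -> int:
    """Recompute the checksum from the DOS header, DOS stub and entries."""
    total = dans_offset
    for i in range(dans_offset):
        if 0x3C <= i < 0x40:  # e_lfanew is treated as zero
            continue
        total += rol32(data[i], i)
    for comp_id, count in entries:
        total += rol32(comp_id, count)
    return total & 0xFFFFFFFF


def rich_checksum_is_valid(data: bytes):
    """True/False if a Rich Header exists, None if there is none."""
    parsed = parse_rich_header(data)
    if parsed is None:
        return None
    dans_offset, key, entries = parsed
    return rich_checksum(data, dans_offset, entries) == key
```

Because the DOS header and DOS stub feed into the checksum, a mismatch does not tell you which of the two areas was changed, only that something in front of the PE header was touched after linking.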
3. Original files have higher distribution numbers
Manipulated files are generally not as widespread as the original version. You can use this to your advantage: search for similar files with the same version number and compare their distribution numbers. E.g., VirusTotal has a similar samples search, and VirusTotal's submitters value is an indicator of said distribution.
The result below shows one file with 600 submitters and four other files with fewer than five submitters each. The executable with 600 submitters is in stark contrast to the other files, indicating higher distribution numbers. Additionally, this is the only validly signed file, so we can conclude that the file with 600 submitters is most likely the original one.
4. Patches in the Section Table
The best way to identify PE file manipulation is a static PE parser that shows the header contents. Suitable tools are, e.g., PortexAnalyzer (CLI) and PEStudio (GUI).
Many viruses change the section characteristics, so they can place their own code inside a file. Even if such an infected file is cleaned by an antivirus program, the modified headers will still be present because the cleaning tool cannot know what the header looked like before the virus changed it.
A prominent red flag for non-packed files is the presence of write and execute characteristics in a section, especially if that is the section containing the entry point. Most of the time write and execute characteristics do not appear together in a section in non-packed files, whereas it is rather typical for packed files. The presence of both means the code itself can be changed dynamically. The section characteristics are shown by PE parsers in the section table.
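The red flag described above is easy to check programmatically. This sketch reads the section table and reports every section that is both writable and executable, together with whether it contains the entry point. The flag values come from the PE format specification; the names `read_sections` and `writable_executable_sections` are made up for this example.

```python
import struct
from typing import NamedTuple

IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_WRITE = 0x80000000


class Section(NamedTuple):
    name: str
    virtual_size: int
    virtual_address: int
    raw_size: int
    raw_pointer: int
    characteristics: int


def read_sections(data: bytes):
    """Return (entry point RVA, list of Section) for a PE file."""
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    coff = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    opt = coff + 20
    (entry_point,) = struct.unpack_from("<I", data, opt + 16)
    sections = []
    for i in range(num_sections):
        hdr = opt + opt_size + i * 40  # each section header is 40 bytes
        name = data[hdr:hdr + 8].rstrip(b"\0").decode("ascii", "replace")
        vsize, vaddr, rsize, rptr = struct.unpack_from("<IIII", data, hdr + 8)
        (flags,) = struct.unpack_from("<I", data, hdr + 36)
        sections.append(Section(name, vsize, vaddr, rsize, rptr, flags))
    return entry_point, sections


def writable_executable_sections(data: bytes):
    """Return [(section name, contains entry point)] for W+X sections."""
    entry_point, sections = read_sections(data)
    findings = []
    for s in sections:
        if s.characteristics & IMAGE_SCN_MEM_EXECUTE and s.characteristics & IMAGE_SCN_MEM_WRITE:
            # Approximate the mapped size by the larger of the two size fields.
            end = s.virtual_address + max(s.virtual_size, s.raw_size)
            findings.append((s.name, s.virtual_address <= entry_point < end))
    return findings
```

A non-empty result is only a hint. As noted above, packed files routinely have such sections, so the finding matters most for files that otherwise do not look packed.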
5. Fractionated imports
Another virus-typical behaviour is the introduction of fractionated imports. Generally all imports are placed in one section, but if they are spread over different sections, they are called fractionated. Some viruses add imports deliberately to make sure they can use certain system APIs. The fractioning is an unintended side effect of placing the imports at a virus-convenient location that is usually not near the imports of the original file.
The tool PortexAnalyzer automatically detects such imports and marks them as an anomaly.
Fractionated imports are also common among certain packed files, but in such cases the imports are often scattered throughout the whole file as in the image below. These are not indicative of virus modifications.
6. Linker mismatch
If the Rich Header is present, a mismatch between the Rich Header linker version and the Optional Header linker version indicates manipulation of the headers. Since the Rich Header is a means to attribute samples to threat actors, malware authors might get the idea to swap the DOS Stub and Rich Header with those of other threat actors' samples. In those cases the Rich Header checksum will still be valid but the linker versions might not match anymore.
VirusTotal highlights linker mismatches using a warning, but said warning is not always accurate. There seem to be issues with newer or unknown Visual Studio versions, so I suggest comparing the linker versions yourself instead. PE parsers like PortexAnalyzer and PEStudio are able to interpret the linker version in the Rich Header. Compare this value with the Optional Header major linker version and minor linker version, e.g., by using PortexAnalyzer to get a textual representation of the version numbers.
7. Truncated files
Files may be truncated, e.g., from partial downloads or because threat actors cut the invalid certificate table from the end of the file. Depending on how much has been cut, this does not always result in file corruption. But it might show up in the section table or data directory.
For truncated files, the section table may define sections whose raw pointer + size values point to somewhere beyond the end of the file. The data directory may point to structures outside of the file as well. Both are treated as anomalies in PortexAnalyzer reports.
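A minimal version of this check compares the file size with the end of each section's raw data and with the end of the certificate table. The certificate table is a convenient special case because its data directory entry holds a file offset rather than an RVA. Other data directory entries would need an RVA translation first and are left out here. The function name `truncation_anomalies` is made up for this sketch.

```python
import struct

CERTIFICATE_TABLE_INDEX = 4  # data directory index of the certificate table


def truncation_anomalies(data: bytes):
    """Return [(description, expected end offset)] for data beyond end of file."""
    file_size = len(data)
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    coff = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (opt_size,) = struct.unpack_from("<H", data, coff + 16)
    opt = coff + 20
    (magic,) = struct.unpack_from("<H", data, opt)
    dirs = opt + (112 if magic == 0x20B else 96)  # PE32+ vs. PE32

    findings = []
    for i in range(num_sections):
        hdr = opt + opt_size + i * 40
        name = data[hdr:hdr + 8].rstrip(b"\0").decode("ascii", "replace")
        # SizeOfRawData and PointerToRawData sit at offsets 16 and 20.
        raw_size, raw_pointer = struct.unpack_from("<II", data, hdr + 16)
        if raw_size and raw_pointer + raw_size > file_size:
            findings.append(("section " + name, raw_pointer + raw_size))

    # NumberOfRvaAndSizes is the dword right before the data directory.
    (num_dirs,) = struct.unpack_from("<I", data, dirs - 4)
    if num_dirs > CERTIFICATE_TABLE_INDEX:
        offset, size = struct.unpack_from("<II", data, dirs + 8 * CERTIFICATE_TABLE_INDEX)
        if size and offset + size > file_size:
            findings.append(("certificate table", offset + size))
    return findings
```

An empty list does not prove that the file is complete. It only means that nothing the headers point to, as far as this sketch looks, lies beyond the end of the file.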
The repro hash needs more attention
Of all these ways to identify file manipulation, the most surprising for me was the repro hash. Apart from the tweet by @sixtyvividtails, the hashing algorithm has not been documented yet. But once it is built into tools like PortexAnalyzer and services like VirusTotal, it will be a great asset for malware analysts.
Thank you for your contributions regarding the repro hash, @namazso and @sixtyvividtails.