Why is it necessary to use machine learning in the fight against malware?
In our everyday work we keep coming across the same malware, but it is constantly being repackaged. So there are a great many new samples - in the case of Emotet, for example, around 30,000 in the first half of the year. However, the number of malware families is limited. The fundamental question is therefore: how can we identify the malware despite the changed packaging? One technical solution is deep analysis of the main memory on the customer's computer. However, this is very resource-intensive, so we cannot apply it to all processes all of the time.
Therefore, in order to optimise the deep memory analysis, we considered developing a smart pre-filter. This is where we opted for machine learning. After all, we often know what malware looks like. We used this knowledge to train a perceptron - a neural network - with the aim of detecting malware more quickly.
It was not so much a matter of using this pre-filter to clearly identify files as malicious; rather, it was about finding deviations from the norm that are potentially suspicious. Machine learning is a good lever for comparing a variety of properties - such as file size - between clean files and malware files, and separating good files from potentially malicious ones. A classifier flags interesting (i.e. potentially dangerous) files. Only then does a memory analysis of the potentially dangerous files and their associated processes take place.
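The pre-filter idea described here can be illustrated in a few lines of code. The following is a minimal sketch, assuming a hypothetical feature set (file size and byte entropy) and scikit-learn's Perceptron; the actual features and model details used in DeepRay are not spelled out in the interview.

```python
# Minimal sketch of the pre-filter idea: a perceptron trained on cheap static
# properties flags "interesting" files, and only those are handed to the
# expensive deep memory analysis. Features and thresholds are hypothetical.
import math
from collections import Counter

import numpy as np
from sklearn.linear_model import Perceptron


def static_features(path: str) -> np.ndarray:
    """Extract a few cheap static properties of a file (illustrative only)."""
    data = open(path, "rb").read()
    size = len(data)
    counts = Counter(data)
    # Byte entropy: packed or encrypted payloads tend to have high entropy.
    entropy = -sum((c / size) * math.log2(c / size) for c in counts.values()) if size else 0.0
    return np.array([size, entropy], dtype=float)


def train_prefilter(X_train: np.ndarray, y_train: np.ndarray) -> Perceptron:
    """X_train: feature rows for known files, y_train: 0 = clean, 1 = malware."""
    clf = Perceptron(max_iter=1000, tol=1e-3)
    clf.fit(X_train, y_train)
    return clf


def scan(path: str, clf: Perceptron) -> None:
    """Run the cheap classifier first; only flagged files get the deep analysis."""
    if clf.predict([static_features(path)])[0] == 1:
        deep_memory_analysis(path)


def deep_memory_analysis(path: str) -> None:
    # Placeholder for the resource-intensive in-memory analysis.
    print(f"deep analysis triggered for {path}")
```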
Which machine learning approach did you select? Were there ready-made solutions already in existence?
No, we had to take action here ourselves. The research team adapted an existing, simple neural network - a perceptron - for malware identification. This proved to be a major challenge during the course of the project: active attackers naturally try to ensure that their malware always looks statically different.
That's why we opted for a combined method. Our developers tried different approaches during this phase. We initially worked with only one percent of the data, which greatly shortened the testing and development phase. Then we used the entire database. Even though the analysis of the files ultimately takes place on the customer's computer, the perceptron is trained in our specially developed backend. We had to seriously upgrade our own hardware and make changes to our existing backend systems.
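The staged workflow described here - iterate on a small sample first, then train on the full database in the backend - might look something like this in outline. The split sizes and function names are assumptions made for illustration, not G DATA's actual setup.

```python
# Hypothetical sketch of the staged training workflow: iterate quickly on a small
# stratified sample, then retrain on the full labelled database in the backend.
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split


def quick_iteration(X, y, fraction=0.01, seed=0):
    """Train on roughly one percent of the data to keep the develop/test loop short."""
    X_small, _, y_small, _ = train_test_split(
        X, y, train_size=fraction, stratify=y, random_state=seed
    )
    return Perceptron(max_iter=1000).fit(X_small, y_small)


def full_training(X, y):
    """Final run on the entire labelled database."""
    return Perceptron(max_iter=1000).fit(X, y)
```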
Were there moments when you were truly surprised?
We achieved interesting results right away with the initial training sessions. However, some of the data came as quite a surprise to us, and we couldn't explain the deviations immediately. In the course of our further work we noticed that the feature extraction had a small error. Discovering such errors in machine learning data is one of the key challenges in the development process. Overall, however, the detection rate in the training sessions was consistently above 98 percent.
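Bugs of the kind mentioned here often show up as implausible values in the extracted features. A generic sanity check over the feature matrix - not DeepRay's actual validation code - could look like this:

```python
# Generic sanity checks on an extracted feature matrix, run before training, to
# surface feature-extraction bugs such as constant or out-of-range columns.
import numpy as np


def check_features(X: np.ndarray, feature_names: list[str]) -> list[str]:
    """Return human-readable warnings about suspicious feature columns."""
    warnings = []
    for i, name in enumerate(feature_names):
        col = X[:, i]
        if np.isnan(col).any():
            warnings.append(f"{name}: contains NaN values")
        if col.min() == col.max():
            warnings.append(f"{name}: constant value {col.min()} (extraction broken?)")
        if (col < 0).any():
            warnings.append(f"{name}: negative values (unexpected for this feature)")
    return warnings
```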
We have defined appropriate criteria to improve this ratio further. If, for example, executable files are downloaded from the Internet, we automatically classify them as interesting and examine them in detail - regardless of what the perceptron says. Here we rely on the Windows zone identifier, the marker that triggers the prompt asking whether a downloaded file should really be executed.
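The zone identifier referred to here is the Zone.Identifier alternate data stream (the "Mark of the Web") that Windows attaches to downloaded files on NTFS. A minimal sketch of how a pre-filter could honour it, assuming a Windows/NTFS environment and hypothetical function names:

```python
# Treat any executable carrying an Internet zone marker as "interesting",
# regardless of the perceptron's verdict. Windows/NTFS only.
def downloaded_from_internet(path: str) -> bool:
    """Check the Zone.Identifier stream; ZoneId=3 means the Internet zone."""
    try:
        with open(path + ":Zone.Identifier", "r", encoding="utf-8", errors="ignore") as ads:
            return "ZoneId=3" in ads.read()
    except (FileNotFoundError, OSError):
        return False  # no stream: the file was not marked as downloaded


def is_interesting(path: str, perceptron_says_interesting: bool) -> bool:
    # Downloaded executables are always examined in detail, whatever the model says.
    return perceptron_says_interesting or downloaded_from_internet(path)
```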
How many malware families did you use in the training phase?
The training sessions are a dynamic process. For each training phase, we look back over a defined period of time from its start, and all malware families that were active during this period are included in the training. At the same time, we try to stay as up-to-date as possible, so almost as soon as one training phase comes to an end, a new training session starts. From this, we then derive what amounts of data are required and how historical data can help. Often, things didn't go quite as we expected in the training sessions, and we then had to find new answers as to why the system weighted certain features differently than expected. This is certainly due to the large number of parameters.
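The rolling look-back window described here could be expressed roughly as follows; the window length and the sample record layout are assumptions made purely for illustration.

```python
# Select every sample whose malware family was active within a fixed look-back
# period before the start of the training run.
from datetime import datetime, timedelta


def select_training_samples(samples, training_start: datetime, lookback_days: int = 90):
    """samples: iterable of (family, last_seen: datetime, features) tuples."""
    samples = list(samples)
    window_start = training_start - timedelta(days=lookback_days)
    active_families = {fam for fam, last_seen, _ in samples if last_seen >= window_start}
    return [s for s in samples if s[0] in active_families]
```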
Can you say what was the most interesting learning experience for you in the development phase?
From a development perspective, working together for the first time in this configuration proved interesting. A total of five teams at G DATA were involved in the development of DeepRay. I really enjoyed this cross-team collaboration.
As a new project without any special dependencies or legacy issues, we were able to develop a fully functional and easily expandable prototype in a very short time using rapid prototyping and test-driven development. This enabled us to thoroughly evaluate the effectiveness of DeepRay against real malware. Through this approach we were able to react very early on - even during development - to the latest and experimental techniques used by malware authors, and to ensure that DeepRay could withstand every challenge on its release.
What about the use of DeepRay in everyday life? Does DeepRay now assume an early warning function?
Thanks to DeepRay's proactive component, detection is now much faster even in the traditional reactive components. When DeepRay categorises a file as potentially dangerous, we immediately check it in our analysis backend and add files detected as malware to our blacklist. Several components work closely together to do this - and all within a few minutes.
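In outline, the feedback loop described here - flag, re-check in the backend, blacklist the hash - might look like the following sketch; the function names are illustrative and not G DATA's internal APIs.

```python
# A file flagged as potentially dangerous is re-checked in the analysis backend
# and, if confirmed as malware, its hash is added to the blacklist so that the
# classic reactive components benefit as well.
import hashlib

blacklist: set[str] = set()


def handle_flagged_file(path: str) -> None:
    sha256 = hashlib.sha256(open(path, "rb").read()).hexdigest()
    if backend_confirms_malware(path):   # deep analysis in the backend
        blacklist.add(sha256)            # reactive components pick this up within minutes


def backend_confirms_malware(path: str) -> bool:
    # Placeholder for the automated backend analysis mentioned in the interview.
    return False
```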
We know that cyber criminals are also testing new packaging against the popular antivirus products available on the market - such as our Total Security - until the static analysis no longer detects it. On the Darknet, you can even book services that do nothing but package malware accordingly. That was a decisive reason for creating DeepRay: with purely static systems, you can no longer cope with dynamic attacks.
What is your conclusion on the development process?
We already had a clear plan at the start of the project, and we implemented it. The cross-team development certainly contributed a lot to the success, because everyone pulled together. In the end, we implemented a very simple idea quickly and integrated it cleanly into our systems. The chosen perceptron is just one piece of the puzzle - albeit a very critical one. The results in the training sessions confirmed from the beginning that the chosen path was leading to the desired goal. It was less a revolution than an evolutionary development.
Are there any plans for further development of DeepRay?
Phase 1 has been completed. The backend training sessions run largely autonomously and we have established a system that works independently. Of course, we regularly check the training progress and the detection rate in the training sessions. And, for sure, there is still some potential for optimisation here and there that we are working on. But basically the system works all by itself. In parallel, the research team is also working on ideas to further develop DeepRay, for example to optimise the detection rate further or keep improving performance.