Last week, on August 10, a security researcher who goes by the handle "zerosum0x0" posted an interesting image to Twitter: the code behind a debug build of an executable.
The code was 'Hello World' – the training example used to teach new coders. When the executable was submitted to VirusTotal, several firms flagged it as a problem.
Salted Hash wanted to learn why training code was deemed malicious, so we asked the vendors to explain it. Here's what we learned.
The buzz around machine learning and artificial intelligence has grown over the last year or so in the security world. The vendors leveraging it are doing what they can to cash in and improve performance so that they can be the undisputed champions of the market.
The experiment by "zerosum0x0" gained attention because the vendors flagging the code are advanced defense systems that promote their usage of machine learning.
The code in question can be seen in images on Twitter. Again, this is training code, something all novice coders use. Why then is such basic code being flagged as suspicious, harmful, or outright malicious by notable vendors such as Cylance, Sophos, McAfee, SentinelOne, CrowdStrike, and Endgame?
The example by "zerosum0x0" was just a single test, with seven detections. Others ran variations, including a user under the handle "_hugsy_", who removed the 'printf' function and was still flagged. This time, "_hugsy_" had eleven more vendors report that "Hello World" was either unsafe, malicious, or a Trojan. Others repeated the test, with similar results.
Salted Hash asked "zerosum0x0" if they had attempted to test with a non-debug sample. They said they had, but noted that detection was hit or miss. "Theoretically machine learning can extrapolate more info from a debug build."
Again: why? Why are these advanced offerings from well-known, established security vendors flagging such basic, harmless code as malicious?
As it turns out, for some that's exactly what's supposed to happen. For others, it's because VirusTotal doesn't use the whole product. The default, it seems, is to flag as suspicious.
Before we get to the vendor explanations, it's worth noting that VirusTotal has always maintained that it's not the right tool to perform comparative analysis on security products. That isn't the point of this article, but we were curious about the results posted.
Salted Hash reached out to the vendors that flagged "Hello World" in some way, including Cylance, Sophos, McAfee, SentinelOne, CrowdStrike, Cyren (and their consumer product F-Prot), Endgame, F-Secure, and Bitdefender.
We asked for comments to explain why "Hello World" was flagged, and asked for details on what they're doing to keep false positive rates low for customers. All but three vendors responded by deadline.
McAfee, F-Secure, and CrowdStrike did not respond to the initial requests for comment. When this fact was mentioned in public, F-Secure and McAfee reached out to us directly, but didn't provide comment by the time this story went live. Update: After this article was published CrowdStrike, F-Secure, and McAfee responded to questions.
Below are the comments from the vendors who responded. Some of their answers have been edited for space.
Ryan Permeh, Cylance:
"The Cylance engine is not an antivirus engine. Unlike AV, it doesn’t have a bias toward letting everything run. The technology doesn't assume a file is good until it’s evaluated. Our approach is to measure and decide on each and every file individually, and if it doesn't fit into our model of good, it leans towards bad.
"Without a bunch of data to base a decision on, and without any real patterns of goodness to identify it as such, the engine leaned heavily on the structural bits that are odd and drew a line towards bad in this case.
"When we train models, we train on hundreds of millions of good and hundreds of millions of bad files (samples). We look at several million potential data points (features) in each file...
"...In general, a piece of code can become "bad" by doing things that lean towards bad. But it can also lean towards bad by not doing things that lean towards good. So in the most basic example provided (hello world in debug build):
"The sample was small. It didn't show any bad, but it didn't show any good either; one-function programs are almost always malware; debug builds are statistically weird; using MinGW rather than Visual Studio is statistically weird. The output binary is 'odd.'"
Hyrum Anderson, Endgame:
"Before Twitter caught ablaze with these “hello world” samples, our own internal research indicated that our and other models were susceptible to these toy samples. Let’s explain why.
"Endgame’s machine learning malware detection uses static features to determine before a customer executes a file whether it is likely malicious or benign. The machine learning model is an imperfect summarization of tens of millions of malicious and benign software on which the model was trained.
"As an imperfect model, it can obviously be wrong, but still extremely useful in detecting never before seen malware, far more useful than approaches which rely on signatures for already known malware families.
"For the case of our model and other machine learning models based on static features, the model can be wrong in this case because, in the training dataset, the model has seen:
"Lots of real malware samples that are small unsigned binaries; lots of real malware samples where the entry point (.text) section is small, like droppers unpacking stubs; lots of real malware samples that attempt to hide their imports from static analysis by some method, so that their import table looks very small.
"On the contrary, there are very few “useful” benign files that are small, certainly too few to contradict the above experience.
"It’s important to note that machine learning is actually quite good for prevention and detection of malware, both novel samples and the more well known. Endgame was one of only a few to get NotPetya in VirusTotal, for example. That said, all machine learning models have blind spots (false negatives) and they can mistakenly call things bad (false positives). In fact, we’ve shown in our published research that for some machine learning models, these vulnerabilities can be quite convenient to exploit...
"...At Endgame, we employ a strategy of layered protections that align with a large number of commonly seen attacker actions. Our MalwareScore engine (released standalone in VirusTotal) represents only a single slice of that layered protection paradigm. The layers work in concert to alert our customers of potential threats (reducing FNs), and working together to build a complete story of a potential threat (reducing FPs).
"Fortunately, the samples highlighted on Twitter are interesting corner cases, but are extremely esoteric for our customer base. Nevertheless, we continually are doing more research to improve our detection ratio and reduce our false positive rate. This involves data gathering to increase our model’s understanding of the universe of benign and malicious software as well as a huge amount of experimentation effort to maximize our model’s performance. We put a great amount of attention on addressing known false positives seen by our customers. As a result of these efforts, we regularly release models to our customers and to VirusTotal. And, we continue to work with 3rd parties to validate our model’s performance on real files."
Dr. Sven Krasser, CrowdStrike:
"There are two important aspects to understand. First, the machine learning models for static file analysis we use at CrowdStrike are optimized to detect malware, especially novel families that bypass signature-based approaches, while avoiding interference with legitimate business applications. However, unusual and artificially constructed files fitting into neither of these two categories are occasionally detected as well.
"For this reason, we expose confidence values and allow customers to set their own thresholds. While in this instance our file analysis engine was arguably too aggressive, generally this behavior is by design: if a file does not look like a legitimately useful application while also exposing unusual traits, then the sound call is to prevent it from executing. Avoiding odd-looking yet potentially benign objects should be a familiar concept if you have ever opened an office fridge.
"Second, static file analysis alone (i.e. what most vendors provide on VirusTotal) is simply not a sufficient security tool on its own. It is easy to create files that behave benignly yet are detected by both signature and ML-based engines. It is, however, also possible to create malware files that bypass detection. That is trivially possible for signature-based engines, but one can also bypass ML-based static file analysis with some effort. Therefore, CrowdStrike Falcon uses static file analysis as only one of many techniques to detect threats while combining it with several other layers of defense, such as advanced Indicators of Attack."
Raj Rajamani, SentinelOne:
"SentinelOne uses multiple engines for prevention that work holistically on the full SentinelOne Endpoint Protection Platform (EPP) including a static machine learning (ML) engine (the one on VirusTotal) and a dynamic ML engine (which is only available on the agent).
"Each engine uses ML techniques to classify files and events as threats (high-confidence detections) or suspicious (low-confidence detections), and together they enjoy full system context. VT does not support confidence scores, so even low-confidence detections (aka suspicious) are marked as threats/malware in their feed.
"In this case, the binary was detected as suspicious, probably because it was compiled in debug mode. When the same code is compiled in release mode, we do not detect this as suspicious. This is normally not an issue in production environments as customers review suspicious items before mitigating them.
"We take false positives and false negatives very seriously and approach them from a few angles: 1) by making it easy for customers to whitelist suspicious files; 2) we have a team of threat researchers at the ready to assist in hunting and classification of suspicious files and threats; and 3) we ship a new machine learning engine every month. We will continue to work to improve our false positive and false negative detections with every release."
Jarno Niemelä, F-Secure:
"This is actually a thing that pops up every couple of years or so. What is going on is that the AV industry has been using "Next gen" ML sample classification systems for over 10 years already. There is nothing "Next gen" about most of the new vendors.
"And the ML systems we use analyze samples based on the features we extract from the sample; most features are collected statically, and some are collected by running the sample inside an emulator and observing the code as it runs. The features are collected from millions of samples that are already classified as malware, as well as from tens or even hundreds of millions of clean files.
"And what you end up with is a system that will, with rather high accuracy, classify malicious and clean files correctly. New instances of known malware types, and samples that do things that are known for malware, will be classified as malware. And clean files are unlikely to be misclassified.
"These analysis results are then packaged into an "AV signature database" that, depending on the company and product, either resides in the cloud or is shipped to all end users, and may contain either actual signatures or a trained ML database, depending on the company and engine. And many times, in addition to ML-generated detections, the signature database may also contain human-created detections, which are usually much more generic and powerful, although those are frequently omitted or unable to function in VT.
"However, if a sample has almost no features that it could be classified with, these systems are prone to classify it as malicious, due to the fact that it is very rare for a clean file to do nothing and have no features that would be common among other clean files. Which leads to this "Hello world" issue people encounter once in a while.
"This is seen as erring on the side of safety, as the sample in question is very unlikely to be of any significance, and may well contain something malicious that we do not have feature extraction for.
"Especially as this file may never trigger a false alarm among real customers, as our products have false alarm mitigation mechanisms that take other things into account than just the file analysis system result, and thus what is shown as a false alarm in VT does not trigger for an actual user.
"And even if a false alarm triggers for some user, a notification about it will be sent to our cloud, and if we get any metadata that might indicate a false alarm we will automatically analyze the sample more thoroughly. Which means that actual false alarms hitting real end users are much rarer than the ones seen in VT, and are usually very short-lived.
"And then to the particular sample you linked in VT. In this case the false alarm was caused by a licensed component, whose vendor identified and fixed the FA even before our systems picked the file for analysis. And obviously none of our customers queried about the file, so I cannot really say whether we would have been able to mitigate the false alarm automatically before the first customer asked about the file."
Bogdan Botezatu, Bitdefender:
"Bitdefender uses multiple layers of detection, including a solid reputation system to flag unknown files. Such files may include freshly compiled applications, even if they are harmless to the user. The reputation system looks suspiciously at any file that has never been seen in the wild, but other layers of technology in the full Bitdefender product usually go deeper into the file and let it run even if it was initially blocked at stage 1.
"However, as VirusTotal only aggregates some scanning technologies, unlike a regular Bitdefender security suite, some results may be inaccurate or inconclusive. This is why the VirusTotal FAQ strongly advises against benchmarking security solutions based on VirusTotal results.
"Regarding the false positive issue, this kind of detection is not something the user, or the tester, would see in a real life scenario, as we use a complex mix of technologies to get an accurate classification of the file."
Vincent Weafer, McAfee Labs:
"The examples in the article all seem to be associated with McAfee gateway scanner heuristic engine, which is by design set up to detect both malicious and suspicious files passing through enterprise gateway systems. This engine is not part of the enterprise endpoint or consumer solutions, which have different detection profiles compared to a gateway heuristic engine, for a defense in depth solution.
"The full implementation of that heuristic engine in the gateway or cloud includes other trust/reputation technologies that enable filtering out known clean files for false-positive avoidance or performance reasons. In this case the samples tested were simple programs, which were not code-signed and had no other trust indicators, so a heuristic-type engine would likely detect them in this context."
Cyren (also F-Prot):
Note: Cyren confirmed they were initially blocking the Hello World code. On August 10, they were notified "of the possibly erroneous classification the same day it appeared." After an analyst looked at it, the scoring/classification was adjusted on the same day.
"Our massively automated detection network, driven by complex machine learning, does on occasion score an object as suspicious, when it might in fact be benign. We are committed to minimizing false positives and rapidly fixing them once identified – this is as important to us as blocking threats correctly." - Michael Tamir, VP of Support Services, Cyren.
Chet Wisniewski, Sophos:
"SophosLabs uses advanced machine learning technology and expert researchers to categorize suspicious file samples. When a file is incorrectly categorized, our data science team works quickly to rectify it and re-calibrate the algorithms. Our researchers are currently investigating the executable files you reference."