The 5 best malware metrics you can generate

May 22, 2018

Are you asking the right computer security questions? If you can answer these five, you'll know better how to secure your organization.

One of my favorite quotes from Albert Einstein goes:

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than 5 minutes.”

A big problem in the computer security world is that practitioners aren’t skeptical enough, don’t question statements from purported authorities, and often don’t ask the right questions. It’s a theme I see over and over, and it leads defenders to enact the wrong computer security defenses or worry about the wrong metrics.

Many defenders are asked to come up with hundreds of controls and metrics that are supposed to accurately define the security risk of their environment. In most environments, a handful of controls, like those around social engineering and patch management, account for the vast majority of computer security risk. Even then, most defenders get those controls wrong.

For example, defenders often think that they need to do 100 percent patching on all computers, especially concerning the Windows operating system, to be secure. The truth is that patching a few internet browser add-ins on workstations and patching web and database programs on servers provides more risk protection than patching the OS or any of the hundreds of other programs you must worry about. There are outlier attacks, but they are just that: outliers, and they don’t define most of the risk.

Another example is malware statistics. Whether or not you believe the long-heralded “antivirus is dead!” mantra, you’d probably be seen as crazy if you didn’t run a traditional antivirus program or one of the newer endpoint detection and response (EDR) products. Anti-malware products are not perfect, but they catch a lot of the threats headed for our networks and devices, so nearly everyone runs one.

If anyone asks how well your anti-malware product performs, you can point to the vendor’s “100 percent” detection claims or show how many malware programs the product removed over a time period. Many people believe that if the anti-malware program shows an increasing number of detections over time, the product is becoming more accurate. But without knowing how many malicious programs it should have detected versus how many it actually detected, we don’t know how accurate it really is.

However, that’s the wrong question to ask. Like a firewall that drops a blocked packet, once an antimalware program has detected and removed a malware program, the threat is gone. The better question regarding malware detection and removal is how long the malware program was present before it was detected and removed. Nearly all the risk of a malware program lies in what it did, or what the user did, during the time before it was detected and removed.

If the antimalware program detects an incoming malware program as it is being downloaded to a device and removes it, there is almost no risk to the device or network. It’s just like when a firewall drops a packet before it could get inside a device. The real risk is all the time that the malware program (or hackers) went undetected (often called dwell time) before it was stopped and removed.

If it took days and the device or user was performing high-value/high-risk operations, most observers would score the security risk as much higher. Using traditional malware removal metrics, a malware program removed immediately and one that took three weeks to be detected and removed would each be counted as a single removal instance.
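
To make the contrast concrete, here is a minimal Python sketch using two made-up incidents: one removed 30 seconds after it activated, and one removed three weeks later. The timestamps are illustrative assumptions, not real data. A removal count treats the two identically; dwell time does not.

```python
from datetime import datetime, timedelta

# Two hypothetical incidents: (time the malware first activated, time it was removed).
incidents = [
    (datetime(2018, 5, 1, 9, 0, 0), datetime(2018, 5, 1, 9, 0, 30)),  # caught almost immediately
    (datetime(2018, 5, 1, 9, 0, 0), datetime(2018, 5, 22, 9, 0, 0)),  # undetected for three weeks
]

# Traditional metric: each removal counts once, no matter how long the malware was active.
removal_count = len(incidents)

# Dwell-time view: how long each malware program was active before it was removed.
dwell_times = [removed - activated for activated, removed in incidents]

print("removals:", removal_count)
for dwell in dwell_times:
    print("dwell time:", dwell)
```

Both incidents add one to the removal count, while their dwell times differ by a factor of roughly 60,000.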

So, are you asking the right questions of your antimalware programs? Does your antimalware program tell you how long a malware program was active on a PC before it was detected and removed? Does the risk of a particular user or device play into calculating your company’s security risk? If not, here are the five questions and metrics that I think you should be measuring regarding your antimalware program’s success and accuracy.

Here are five antimalware metrics every computer security defender should be tracking.

1. Mean-time to detect

Mean-time to detect measures how long your antivirus or EDR program took to detect a malware program or hacker, starting from the moment it was successfully activated on a device. Most antimalware programs not only can’t give you that report, they don’t even try. Fortunately, in most cases, you can collect the data and generate the report with minimal effort.

First, you need to enable an application control/whitelisting program on each device you own. Microsoft Windows has included a built-in version (Software Restriction Policies) since the days of Windows XP, and improved application control programs are available in many Windows editions (e.g., AppLocker, Windows Defender Application Control). There are a few open source products and myriad excellent commercial options. One of the best things you can do to protect a computer system is to run a whitelisting program, but for our purposes, “audit-only” mode is all you need.

If your whitelisting program allows it (most do), run a “baseline” application audit so that only programs and processes added after the baseline will generate alerts or events. Then collect into a database all events indicating that new programs or processes were added or installed. In that same database, collect all detected and removed malware. Then, whenever a malware program is detected, run a report to find out when that malware program was first activated. The difference between the two dates/times is that incident’s time to detect; averaged across all detections, it is your mean-time to detect.
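
The join described above can be sketched in Python. This is a minimal sketch that assumes both the application control audit events and the antimalware detections can be exported with a host name, a file hash of the executable, and a timestamp; the `Event` record and the function names are hypothetical, not part of any particular product. Detections with no matching first-run event are skipped here, on the assumption that the malware was caught before it ever executed; depending on how you want to report, you might instead count them as zero dwell time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class Event:
    host: str
    file_hash: str       # assumed: the same hash (e.g., SHA-256) appears in both data sources
    timestamp: datetime


def times_to_detect(new_program_events: List[Event],
                    detections: List[Event]) -> List[timedelta]:
    """Per-incident time from first activation to detection, matched on (host, file_hash)."""
    first_seen: Dict[Tuple[str, str], datetime] = {}
    for ev in new_program_events:
        key = (ev.host, ev.file_hash)
        # Keep the earliest activation if the same program ran more than once.
        if key not in first_seen or ev.timestamp < first_seen[key]:
            first_seen[key] = ev.timestamp

    deltas = []
    for det in detections:
        activated = first_seen.get((det.host, det.file_hash))
        if activated is None:
            continue  # no first-run event: assume it was caught before it executed
        # Clamp at zero in case clocks or log ordering disagree.
        deltas.append(max(det.timestamp - activated, timedelta(0)))
    return deltas


def mean_time_to_detect(deltas: List[timedelta]) -> timedelta:
    """Average of the per-incident times to detect (raises if there are none)."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


# Usage with made-up data: one detection after 2 minutes, one after 2 hours 2 minutes.
t0 = datetime(2018, 5, 1, 8, 0)
new_programs = [Event("pc-01", "aaa111", t0), Event("pc-02", "bbb222", t0)]
detections = [Event("pc-01", "aaa111", t0 + timedelta(minutes=2)),
              Event("pc-02", "bbb222", t0 + timedelta(hours=2, minutes=2))]
deltas = times_to_detect(new_programs, detections)
print(mean_time_to_detect(deltas))  # 1:02:00
```

Keeping the per-incident values, not just the mean, lets you see the long-dwell outliers that the average can hide.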

A malware program removed within a minute or two probably isn’t much of a problem. A malware program that took days or weeks to remove is much more problematic. If you see your mean-time-to-detect figure increasing over time, either on a particular device or across the company in aggregate, it’s time to start talking to your antimalware vendor. You’ll be pleasantly surprised by how most antimalware vendors respond when you show them mean-time-to-detect data regarding their product.
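
One way to watch that trend is to bucket the per-incident times to detect by reporting period and check whether the average keeps climbing. A minimal sketch, assuming per-incident times (in seconds) have already been computed and tagged with a `YYYY-MM` period; the "rising in every period" rule in `is_rising` is an assumed threshold, and you may prefer to compare only the latest period with the previous one. To track a single device, filter the samples to that device first.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mttd_by_period(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean time to detect, in seconds, for each reporting period (e.g., '2018-05')."""
    buckets = defaultdict(list)
    for period, seconds in samples:
        buckets[period].append(seconds)
    # 'YYYY-MM' keys sort chronologically as strings.
    return {period: mean(values) for period, values in sorted(buckets.items())}


def is_rising(series: Dict[str, float]) -> bool:
    """True if MTTD went up in every period compared with the one before it."""
    values = list(series.values())
    return len(values) >= 2 and all(later > earlier for earlier, later in zip(values, values[1:]))


# Made-up per-incident times to detect (seconds), tagged with the month they occurred in.
samples = [("2018-03", 60), ("2018-03", 180), ("2018-04", 300), ("2018-05", 900)]
trend = mttd_by_period(samples)
print(trend, is_rising(trend))
```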

2. Individual user risk

It is very useful to keep track of individuals who seem to get infected more often than others. It’s not enough to document each and every malware detection; you should also track which people seem to be infected more than the average user, and why. Is the person doing dumb things? Are they visiting particular websites that are infecting them more often?
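
A simple way to spot those individuals is to count detections per user and flag anyone well above the average. This is a minimal Python sketch; the user names are made up, and the "twice the average" cutoff (`factor=2.0`) is an arbitrary assumption you would tune for your own environment. The output only tells you where to look; answering "why" still takes investigation.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List


def users_above_average(detected_users: Iterable[str],
                        all_users: List[str],
                        factor: float = 2.0) -> Dict[str, int]:
    """Users whose infection count exceeds `factor` times the per-user average."""
    counts = Counter(detected_users)  # one entry per malware detection
    # Include users with zero infections so the average reflects the whole population.
    average = mean(counts[user] for user in all_users)
    return {user: counts[user] for user in all_users if counts[user] > factor * average}


detections = ["alice", "alice", "alice", "bob"]  # hypothetical detection log, by user
staff = ["alice", "bob", "carol", "dave"]
print(users_above_average(detections, staff))
```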

3. Individual device risk

Sometimes it’s not the user; it’s the device or its setup. Does a particular device or common device configuration (software and hardware) seem to get infected more or less often in a particular time period? Your Apple and Chromebook products are probably going to be exploited less than your Microsoft Windows devices, but that’s because Microsoft Windows runs far more software and is targeted far more often. All those devices, including Microsoft Windows devices, can be made secure by patching and educating users about social engineering. If your Windows devices are getting infected all the time, why? Who is messing up? Is the user doing something wrong, or is it the device’s particular configuration? If you can identify a high-risk device configuration, can you do something to lower its risk?
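
Normalizing by the number of devices in each configuration keeps a large fleet from looking worse simply because it is large. A minimal sketch, assuming you can map each device to a configuration label from your asset inventory; the device names and configuration labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def detections_per_device(device_config: Dict[str, str],
                          infected_devices: Iterable[str]) -> Dict[str, float]:
    """Average malware detections per device for each configuration in one period."""
    devices_per_config = Counter(device_config.values())
    # Ignore detections on devices missing from the inventory.
    detections_per_config = Counter(device_config[d] for d in infected_devices
                                    if d in device_config)
    return {config: detections_per_config[config] / count
            for config, count in devices_per_config.items()}


# Hypothetical inventory (device -> configuration label) and detection log (one entry per detection).
inventory = {"pc-01": "win10-standard", "pc-02": "win10-standard",
             "pc-03": "win10-developer", "mac-01": "macos-standard"}
detected_on = ["pc-03", "pc-03", "pc-01"]
print(detections_per_device(inventory, detected_on))
```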

4. What was performed during malware dwell time?

If the malware program was around longer than a minute or two before it was detected and removed, what did the user do during its undetected dwell time? For both of the previous two metrics, individual user and device risk, it can be very useful to determine ahead of time whether the user or device is considered high-value/high-risk.

A high-value/high-risk user is one who performs broader administrative functions than a normal user or has access to the company’s mission-critical “crown jewel” applications or data. For instance, users who are network or local admins are high-risk by definition. Administrators of mission-critical software (e.g., Salesforce or a database), IT security staff, and C-level officers also perform high-risk actions. Is the device a high-risk workstation or server (e.g., a domain controller)?
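
Deciding this ahead of time is easier if the definition is written down as a rule. The following is a minimal Python sketch, assuming users carry role labels and devices carry a type label; the role names, device types, and the manual `flagged_high_risk` override are hypothetical placeholders, not any product’s schema.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical labels; substitute whatever your directory service or asset inventory uses.
HIGH_RISK_ROLES = {"network_admin", "local_admin", "crown_jewel_app_admin",
                   "it_security", "c_level"}
HIGH_RISK_DEVICE_TYPES = {"server", "domain_controller"}


@dataclass
class User:
    name: str
    roles: Set[str] = field(default_factory=set)


@dataclass
class Device:
    hostname: str
    device_type: str                 # e.g., "workstation", "server", "domain_controller"
    flagged_high_risk: bool = False  # e.g., an admin's workstation you have tagged by hand


def is_high_risk_user(user: User) -> bool:
    """High-risk if the user holds any privileged or crown-jewel role."""
    return bool(user.roles & HIGH_RISK_ROLES)


def is_high_risk_device(device: Device) -> bool:
    """High-risk if it is a server-class device or has been explicitly flagged."""
    return device.flagged_high_risk or device.device_type in HIGH_RISK_DEVICE_TYPES
```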

If the user or device is high-risk, you really want to find out what happened during the malware program’s dwell time, regardless of its duration. You can track this by investigating which programs were running while the malware program was active, comparing against application logs, or simply asking the user. At some companies, if the time to detect is longer than a few minutes or the infection occurs on a high-risk user or workstation, the high-risk user or admin is automatically sent an email that describes the infection and asks them to report what actions they performed during the involved dates/times. In some cases, the very highest-risk users and devices, if found with a malware program, are subject to a more detailed forensic inspection.
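
The follow-up policy described above can be written as a small decision function. A minimal sketch, assuming the time to detect and the high-risk flags are already known for each incident; the five-minute threshold, the `FollowUp` actions, and the `highest_risk` flag are illustrative assumptions, and actually sending the email is left to whatever mail or ticketing system you use.

```python
from datetime import timedelta
from enum import Enum


class FollowUp(Enum):
    NONE = "no follow-up needed"
    ASK_USER = "email the user the incident details and ask what they did"
    FORENSICS = "detailed forensic inspection"


# Assumed cutoff for "longer than a few minutes"; tune for your environment.
DWELL_THRESHOLD = timedelta(minutes=5)


def triage(time_to_detect: timedelta, high_risk_user: bool, high_risk_device: bool,
           highest_risk: bool = False) -> FollowUp:
    """Pick a follow-up action for one malware detection."""
    if highest_risk:
        return FollowUp.FORENSICS
    # High-risk users/devices get follow-up regardless of how long the malware was present.
    if high_risk_user or high_risk_device or time_to_detect > DWELL_THRESHOLD:
        return FollowUp.ASK_USER
    return FollowUp.NONE


print(triage(timedelta(seconds=30), high_risk_user=False, high_risk_device=False))
print(triage(timedelta(hours=3), high_risk_user=False, high_risk_device=False))
```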

5. How did the malware program get in?

If you want to put down malware and hackers in your environment, you’ll have to figure out how they got onto your devices. Was it user error, unpatched software, a zero day, social engineering, misconfiguration, a man-in-the-middle (MitM) attack, eavesdropping, or a physical attack? One of the best metrics you can get, although also the most difficult, is how the rogue program got in. Sometimes your detection software can tell you. Other times you might need to simply interview the user, with the incident details provided, and ask how they think it may have happened. It won’t always be accurate, but you’ll be surprised by how few personal interviews it takes to start seeing an actionable trend.
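
Once root causes are being recorded, a simple tally shows which entry vectors dominate. A minimal Python sketch, assuming each incident has been given a free-text root cause; the lowercase normalization and the catch-all "unknown" bucket are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Root-cause categories taken from the list above; anything else is counted as "unknown".
VECTORS = {"user error", "unpatched software", "zero day", "social engineering",
           "misconfiguration", "man-in-the-middle", "eavesdropping", "physical attack"}


def top_vectors(findings: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Tally how each incident got in and return the n most common entry vectors."""
    counts: Counter = Counter()
    for finding in findings:
        vector = finding.strip().lower()
        counts[vector if vector in VECTORS else "unknown"] += 1
    return counts.most_common(n)


# Hypothetical root causes recorded from detection software and user interviews.
findings = ["Social engineering", "unpatched software", "social engineering",
            "zero day", "Unpatched software", "social engineering", "not sure"]
print(top_vectors(findings))
```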

Here’s a bonus metric that is really an amalgam of the previous five.

6. Overall risk trend

Ultimately, senior management wants to see computer security risk go down over time. They really don’t want to know the details. They just want to know that whatever it is that you are doing is making the environment more secure this period over last period, or not. They’d prefer to see that in a pretty picture or a few numbers.

The Holy Grail of computer security metrics is a risk rating for each individual and device, aggregated across departments, business units, and locations, and ultimately combined into a single organizational value. That’s really what management wants!
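
One way to produce such a figure is sketched below. The scoring rule is purely an assumption for illustration: each incident contributes its dwell time in hours, doubled when the user or device is high-risk, and departments and the organization are scored as the average of their members. Lower is better, and the numbers mean most when compared period over period. The names and the weight are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

HIGH_RISK_WEIGHT = 2.0  # assumed multiplier for incidents on high-risk users or devices


def entity_score(incidents: List[Tuple[float, bool]]) -> float:
    """Risk score for one user or device from (dwell_hours, is_high_risk) pairs."""
    return sum((hours * (HIGH_RISK_WEIGHT if high_risk else 1.0)
                for hours, high_risk in incidents), 0.0)


def roll_up(scores: Dict[str, float],
            department_of: Dict[str, str]) -> Tuple[Dict[str, float], float]:
    """Average entity scores per department, then across the whole organization."""
    by_department = defaultdict(list)
    for entity, score in scores.items():
        by_department[department_of[entity]].append(score)
    department_scores = {dept: mean(vals) for dept, vals in by_department.items()}
    organization_score = mean(scores.values())
    return department_scores, organization_score


# Made-up data for one reporting period.
scores = {
    "alice": entity_score([(0.5, False), (72.0, True)]),  # one quick catch, one 3-day dwell while high-risk
    "bob": entity_score([]),                               # no incidents
    "carol": entity_score([(2.0, False)]),
}
departments = {"alice": "finance", "bob": "finance", "carol": "sales"}
print(roll_up(scores, departments))
```

The same per-entity scores support the drill-down described next: sort them to find the people, devices, or configurations pushing the totals up.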

You can use the individual values to dig into why a particular person, device, department, location, business unit, or configuration seems to be more or less susceptible to malware and use what you find to drive the overall risk figure down.

If your company is only reporting on total number of malware programs detected and removed each time period, choosing to implement any of these better metrics will help you reduce risk far faster.

Fight the good fight!


Roger A. Grimes is a contributing editor. Roger holds more than 40 computer certifications and has authored ten books on computer security. He has been fighting malware and malicious hackers since 1987, beginning with disassembling early DOS viruses. He specializes in protecting host computers from hackers and malware, and consults to companies from the Fortune 100 to small businesses. A frequent industry speaker and educator, Roger currently works for KnowBe4 as the Data-Driven Defense Evangelist and is the author of Cryptography Apocalypse.
