Catching a RAT by the tail

Detecting data exfiltration with machine learning


Last month I examined how machine learning could be used to detect low-and-slow insider threats. In this final installment of my trilogy on real-world use cases from the recent Verizon Data Breach Digest, I’ll discuss how remote access threats can be exposed with the machine learning techniques I covered in my two previous posts.

In this example, a manufacturing company experienced a breach of a shared engineering workstation in its R&D department. A phishing email led to a Remote Access Trojan (RAT) backdoor being downloaded onto the system, which enabled the threat actors to escalate privileges and capture the credentials of everyone who had used the system. By the time the breach was discovered, a significant amount of information had been exfiltrated via FTP to a foreign IP address.

Most companies understand that data breaches are inevitable. It’s no surprise that spending on cyber security tools has expanded from traditional “prevent and protect” technologies to include post-breach “detect and respond” solutions in an attempt to control and manage unavoidable cyber-attacks. In addition, cyber-attacks are becoming more targeted, resulting in companies experiencing more damaging compromises that have a bigger impact on their business.


The Identity Theft Resource Center (ITRC) has been tracking security breaches since 2005, looking for patterns, new trends and information to educate businesses and consumers on the importance of protecting identities and personally identifiable information. From 2005 through April 2016, the ITRC recorded 6,079 breaches, covering 862,527,023 identity records. That’s a lot of compromised identities!  

Today’s attacks compromise identity as a primary vector to pull sensitive information from an organization for financial gain or social notoriety. These attacks are more sophisticated, better funded and more organized than ever before, making it imperative for organizations to immediately analyze potential threats and risks related to anomalous and suspicious behavior.

In addition to monitoring how identities are being both used and managed, other critical data sources within an organization’s computing environment should be examined to provide more context beyond who and what. Examples include network access, event and flow data, DLP data, syslogs, vulnerability scanning data and log files from IT applications. In many cases, this data may already be consolidated into a log event management or SIEM solution.

This vast array of data, when combined with information on how identities are being used by both humans and machines, creates a rich source of “context” that can be mined using threat analytics and anomaly detection. When we view identity as a threat plane, hundreds of attributes can be modeled in machine learning algorithms to predict and remediate security threats.

Machine learning is a force multiplier. Rules-based detection alone is unable to keep pace with the increasingly complex demands of threat and breach detection, primarily because rules are based on what (little) we know about the data and generate excessive alerts. Since humans lack the ability to predict what future cyber-attacks will look like, we can’t write rules for these scenarios.

In contrast, machine learning and statistical analysis can find anomalies in data that humans would not otherwise recognize or detect. For example, they can pick up on predictive cues that are too noisy and high-dimensional for humans and traditional software to link together.

Going back to our Verizon Data Breach Digest example, let’s consider how machine learning could detect a RAT. First, let’s clarify what we’re talking about. A Remote Access Trojan (RAT) is malicious software that runs in the background on a computer and gives a hacker unauthorized access so they can steal information or install additional malicious software. Hackers don’t even have to create their own RATs; these programs are available for download from dark areas of the web. Trojans have been around for two decades, yet the term “RAT” is relatively new.

RATs usually start out as executable files downloaded from the Internet, often disguised as another program or bundled with a seemingly harmless application. Once the RAT installs, it runs in system memory and adds itself to system startup directories and registry entries, so each time the computer is started, the RAT starts too.
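To make the persistence behavior concrete, here is a minimal Python sketch of one way such startup entries might be surfaced: by checking how rare each entry is across a fleet of similar machines. This is a simple prevalence heuristic rather than a trained model, and the function name, the `max_prevalence` threshold, the host names and the entry names are all hypothetical choices made for illustration.

```python
from collections import Counter


def rare_autostart_entries(fleet, host, max_prevalence=0.1):
    """Return the autostart entries on `host` that few other machines share.

    fleet: dict mapping host name -> set of autostart entry names
           (assumed to be collected by some inventory tool; hypothetical).
    max_prevalence: fraction of the fleet at or below which an entry
           is treated as rare (an illustrative threshold, not a standard).
    """
    # Count on how many machines each entry appears.
    counts = Counter(entry for entries in fleet.values() for entry in set(entries))
    total = len(fleet)
    # Keep only this host's entries that are rare across the fleet.
    return sorted(
        entry for entry in fleet[host]
        if counts[entry] / total <= max_prevalence
    )


if __name__ == "__main__":
    # Twenty workstations with the same two expected startup programs.
    fleet = {f"ws{i:02d}": {"av_agent.exe", "updater.exe"} for i in range(20)}
    # One workstation has an extra, unfamiliar entry (hypothetical name).
    fleet["ws07"].add("svch0st.exe")
    print(rare_autostart_entries(fleet, "ws07"))
```

An entry that appears on only one machine is not proof of a RAT, but it is the kind of outlier worth a closer look.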

How can machine learning help here? RATs generate anomalous data conditions across several system resources. Machine learning algorithms would flag this activity as atypical, since it involves system services or resources that are not “normally” running. In this case, the algorithms perform anomaly detection on machine-based, rather than user-based, access and activity. Machine learning models can even compare “self versus self” and “self versus peer group” access and activity for machines and users, using historical baselines to determine anomalies with high accuracy. If it’s not a normal condition, it’s an anomaly, and machine learning will uncover it and catch the RAT by the tail!
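As a rough illustration of the “self versus self” and “self versus peer group” comparison, the Python sketch below scores one machine’s daily outbound FTP volume against its own history and against its peers on the same day, using a plain z-score. It is a statistical stand-in for a real model under stated assumptions: the function names, the threshold of 3.0 and all the numbers are hypothetical and do not come from the Verizon case.

```python
from statistics import mean, stdev


def z_score(value, baseline):
    """Number of standard deviations `value` sits from the baseline mean."""
    if len(baseline) < 2:
        return 0.0  # too little data to form a baseline
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is an extreme deviation.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma


def score_machine(today, own_history, peers_today, threshold=3.0):
    """Compare a machine to its own past and to its peer group.

    today:        today's metric for the machine (e.g. outbound FTP megabytes)
    own_history:  the same metric for this machine on previous days
    peers_today:  today's metric for comparable machines
    threshold:    illustrative cut-off in standard deviations (an assumption)
    """
    self_z = z_score(today, own_history)
    peer_z = z_score(today, peers_today)
    return {
        "self_z": self_z,
        "peer_z": peer_z,
        "self_anomaly": self_z > threshold,
        "peer_anomaly": peer_z > threshold,
    }


if __name__ == "__main__":
    own_history = [12, 15, 11, 14, 13, 12, 16]  # made-up daily megabytes
    peers_today = [10, 14, 12, 18, 11]          # made-up peer values
    print(score_machine(950, own_history, peers_today))
```

A machine that deviates from both its own history and its peers is a stronger signal than one that deviates from only one of them, which is why the two scores are reported separately.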

This article is published as part of the IDG Contributor Network.
