In my last post I promised to use some real-world use cases from the recent Verizon Data Breach Digest report to illustrate potential ways that machine learning can be used to detect or prevent similar incidents.
For my first example, I’ve chosen the case of a manufacturer whose designs for an innovative new model of heavy construction equipment were stolen following a social engineering attack. They were tipped off when a primary competitor, located on another continent, introduced a new piece of equipment that looked like an exact copy of a model recently developed by the victim company.
To paraphrase the Verizon report, it went like this. The threat actors identified an employee they suspected would have access to the new product designs they were after: the chief design engineer. They targeted their victim with a spear phishing campaign built on a fictitious LinkedIn profile of a recruiter. The attackers began sending the victim emails containing fictitious employment opportunities; one included an attachment with malware embedded in the document. When the attachment was opened, the malware began beaconing to an external IP address used by the threat actors. The threat actors then installed a backdoor PHP reverse shell on the chief design engineer’s system. The rest, as they say, is history.
As we reflect on this scenario, what intercept points could have been used to uncover the anomalous behavior occurring with the chief engineer’s account? One was the presence and availability of multiple log files containing rich information about what data had been transferred, when, by whom, and to where. These are available from intrusion detection logs, NetFlow data, DLP logs, firewall logs, anti-virus and malware reporting. By underutilizing this critical data, the victim company left itself wide open to several types of compromises.
True, not all organizations are capable of making sense of complex data from multiple sources. The volume and speed at which this data is produced can seem unmanageable. Also, the ability to bring together dissimilar data in a normalized and comparable manner may not be available to an organization. When this situation arises, it’s time for more advanced analysis with sophisticated mathematical support. Yes, I’m speaking of data normalization, analytics, and the application of machine learning.
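As a rough illustration of what normalization means in practice, the sketch below maps records from two differently shaped log sources onto one common event structure so they can be sorted and compared together. Every field name here (`epoch`, `src_user`, `bytes_sent`, `account`, `size_kb` and so on) is a hypothetical stand-in, not the export format of any real firewall or DLP product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Common shape that every log source is mapped onto."""
    timestamp: datetime   # always timezone-aware UTC
    user: str             # lower-cased account name, no domain prefix
    source: str           # which tool produced the record
    destination: str
    bytes_out: int


def from_firewall(record: dict) -> Event:
    # Hypothetical firewall export: epoch seconds, byte count already in bytes.
    return Event(
        timestamp=datetime.fromtimestamp(record["epoch"], tz=timezone.utc),
        user=record["src_user"].lower(),
        source="firewall",
        destination=record["dst_ip"],
        bytes_out=int(record["bytes_sent"]),
    )


def from_dlp(record: dict) -> Event:
    # Hypothetical DLP export: ISO-8601 time, DOMAIN\user account, size in KB.
    return Event(
        timestamp=datetime.fromisoformat(record["time"]).astimezone(timezone.utc),
        user=record["account"].split("\\")[-1].lower(),
        source="dlp",
        destination=record["target"],
        bytes_out=int(record["size_kb"] * 1024),
    )


# Made-up sample records, one per source.
firewall_row = {"epoch": 1456833600, "src_user": "JDoe",
                "dst_ip": "203.0.113.7", "bytes_sent": 52428800}
dlp_row = {"time": "2016-03-01T12:05:00+00:00", "account": "CORP\\JDoe",
           "target": "usb-drive", "size_kb": 2048}

# Once normalized, events from both tools share one timeline per user.
events = sorted([from_dlp(dlp_row), from_firewall(firewall_row)],
                key=lambda e: e.timestamp)
for e in events:
    print(e.timestamp.isoformat(), e.user, e.source, e.destination, e.bytes_out)
```

The point of the common structure is that timestamps, user names and sizes end up in the same units, so downstream analysis does not need to know which tool a record came from.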
Using machine learning can provide a more holistic view of the combined log data, and expose suspicious activity. In addition to revealing malicious command and control traffic, machine learning models can shine a light on who is accessing, storing and using data in “uncharacteristic” ways compared to normal and peer-group behavior. However, according to Sommer and Paxson, detecting account compromise via machine learning poses some unique challenges.
First, security professionals typically expect an extremely low false positive rate from network security tools. That expectation has given rise to the popularity of “whitelist” and “blacklist” approaches, which are too rigid to adapt to account compromise threats like this one. When scaled to an enterprise of 2,000 users, a one percent daily false alarm rate per user translates to 20 false alarms a day. Eventually, a tool that generates this many false positives will be ignored.
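The arithmetic behind that figure is simple enough to check directly. The short sketch below assumes each user triggers false alarms independently at the same daily rate; the function name and the two lower rates are my own additions, included only to show how quickly the daily alert load changes with the per-user rate.

```python
def expected_daily_false_alarms(users: int, daily_rate_per_user: float) -> float:
    """Expected false alarms per day, assuming the same rate for every user."""
    return users * daily_rate_per_user


# 2,000 users, as in the example above; rates are per user per day.
for rate in (0.01, 0.001, 0.0001):
    alarms = expected_daily_false_alarms(2000, rate)
    print(f"{rate:.2%} per user per day: about {alarms:.1f} false alarms a day")
```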
Second, when an account is compromised, bad logins are typically sparse and mixed with good behavior in such a way that an algorithm or human operator may miss bad behavior among the preponderance of good logins. The Expectation Maximization (EM) approach addresses this problem by treating the compromised account as a two-user model, in which sessions may either be produced by the original user or a new user. This approach causes benign sessions to fall out of the likelihood calculations, so that they do not sway a mix of good and bad sessions toward being evaluated as good overall.
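To make the two-user idea more concrete, here is a minimal sketch of EM on a two-component mixture. It is a generic textbook formulation under assumptions of my own, not the specific model referenced above: each session is reduced to a single number (here, log10 of bytes transferred), the original user's behavior is a Gaussian with known mean and spread taken from history, and only the possible second user is estimated. All names and data values are hypothetical.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_user_em(sessions, owner_mean, owner_std, iterations=50):
    """Return (per-session probability of 'new user', estimated new-user share)."""
    # Start the new-user component at the session least typical of the owner.
    new_mean = max(sessions, key=lambda x: abs(x - owner_mean))
    new_std = owner_std
    new_weight = 0.1
    resp = [0.0] * len(sessions)

    for _ in range(iterations):
        # E-step: how likely is it that each session came from the new user?
        resp = []
        for x in sessions:
            p_owner = (1.0 - new_weight) * gaussian_pdf(x, owner_mean, owner_std)
            p_new = new_weight * gaussian_pdf(x, new_mean, new_std)
            total = p_owner + p_new
            resp.append(p_new / total if total > 0.0 else 0.0)

        # M-step: re-estimate only the new user; the owner's profile stays fixed.
        weight_sum = sum(resp)
        if weight_sum < 1e-9:
            break  # nothing looks like a second user
        new_weight = min(max(weight_sum / len(sessions), 1e-6), 1.0 - 1e-6)
        new_mean = sum(r * x for r, x in zip(resp, sessions)) / weight_sum
        variance = sum(r * (x - new_mean) ** 2
                       for r, x in zip(resp, sessions)) / weight_sum
        # Floor the spread so the component cannot collapse onto one point.
        new_std = max(math.sqrt(variance), 0.1 * owner_std)

    return resp, new_weight


# Made-up sessions: mostly typical of the owner, with two large transfers mixed in.
sessions = [5.8, 6.1, 6.3, 5.9, 6.0, 9.1, 6.2, 5.7, 9.3, 6.1]
responsibilities, new_user_share = fit_two_user_em(sessions, owner_mean=6.0, owner_std=0.5)
flagged = [i for i, r in enumerate(responsibilities) if r > 0.5]
print("sessions attributed to a second user:", flagged)
print("estimated share of sessions from a second user:", round(new_user_share, 2))
```

In this toy data, the sessions that look like the owner receive almost no weight in the new-user estimates, so a large number of benign sessions does not dilute the evidence carried by the two unusual ones. A real deployment would need richer session features and a properly learned owner profile.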
In this particular Verizon incident, if the victim company had employed machine learning to analyze the data already in hand, they likely would have been alerted to several suspicious activities including who was accessing the designs, where the files were being stored, how and where they were being moved, non-typical access to sensitive data repositories, and several other possibilities.
Since most organizations have multiple security tools in place which are producing meaningful log data, applying machine learning algorithms to these information sources to profile user access and behavior is a logical next step.
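One simple way to start profiling behavior against a peer group is a robust outlier score. The sketch below uses the modified z-score based on the median and the median absolute deviation, with the commonly cited cutoff of 3.5; the counts are invented, and the choice of "daily design-file accesses" as the feature is an assumption made purely for illustration.

```python
from statistics import median


def modified_z_score(value: float, peers: list) -> float:
    """Distance of `value` from the peer median, scaled by the peers' MAD."""
    med = median(peers)
    mad = median(abs(p - med) for p in peers)
    if mad == 0:
        # All peers behave identically; any deviation at all is extreme.
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


# Hypothetical daily counts of design-file accesses by other engineers.
peer_counts = [12, 15, 9, 14, 11, 13, 10]
account_count = 90  # the account under review on the same day

score = modified_z_score(account_count, peer_counts)
if abs(score) > 3.5:
    print(f"account is far outside peer behavior (score {score:.1f})")
else:
    print(f"account is within peer behavior (score {score:.1f})")
```

The median-based score is used here instead of a mean and standard deviation because a handful of extreme accounts would otherwise inflate the baseline they are being compared against.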
We’re just scratching the surface. In my next post I’ll discuss insider threat and how machine learning can specifically assist with identifying and predicting malicious activities by an organization’s trusted users.
This article is published as part of the IDG Contributor Network.