


Director of Data Science, ProtectWise

Artificial Intelligence: Monitoring the Monitors

Apr 12, 2017 | 5 mins

Time for a sobering moment of truth: very little of what's being positioned as "new" in machine learning is really new math at all. Artificial neural networks (ANNs) are themselves a decades-old concept, and even then the basis of that approach shares much in common with statistical techniques that are decades older still. Consider that the most common task for ANNs today is classification (fraud detection, threat detection, etc.), a task long served by established supervised techniques such as logistic regression and support vector machines that have circulated in the statistical realm for decades. While the computational approaches differ algebraically, they can generally be shown to converge (in the limit) to a small set of solutions that are largely interchangeable, or at least statistically indistinguishable, so long as the models are used appropriately, the underlying data structures are respected by the model choice, and each model is shown the same information.
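The overlap is easy to see in the simplest case. A minimal sketch on synthetic toy data (the two-feature setup and all parameter values are illustrative assumptions): a single "neuron" with a sigmoid activation trained on cross-entropy loss is, term for term, logistic regression fit by gradient descent — same model, two vocabularies.

```python
import math
import random

random.seed(0)

# Hypothetical toy data: two features, label 1 when x0 + x1 > 1.
X = [(random.random(), random.random()) for _ in range(200)]
y = [1 if x0 + x1 > 1.0 else 0 for (x0, x1) in X]

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    # Stochastic gradient descent on cross-entropy loss. This update
    # rule is identical whether you call the model "logistic
    # regression" or "a one-neuron ANN with sigmoid activation".
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        for (x0, x1), t in zip(X, y):
            err = sigmoid(w0 * x0 + w1 * x1 + b) - t  # dLoss/dz
            w0 -= lr * err * x0
            w1 -= lr * err * x1
            b -= lr * err
    return w0, w1, b

w0, w1, b = train(X, y)
preds = [1 if sigmoid(w0 * x0 + w1 * x1 + b) >= 0.5 else 0 for (x0, x1) in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
```

Stacking more of these units into hidden layers is what makes a network "deep," but the learning machinery underneath is the same gradient-based optimization statisticians have used for generations.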

What actually IS new is the ability to perform these modeling computations quickly and easily at today's scale. As software with more powerful computing engines proliferates, it becomes ever easier for a broader set of users to plug data into ever more complex algorithms (multilayer neural networks, complex tree-based decisioning algorithms, iterative maximization algorithms, etc.) and build models. Much of that software hides, black-box fashion, either the parts viewed as "mundane and repetitive" or the parts viewed as "too complicated for most users." Automating the former certainly isn't so bad, but obscuring the details of how data is processed can be critically dangerous. Black-boxing algorithms makes it ever easier, intentionally or unintentionally, to skip the crucial feature design, model validation, and critical-reasoning tasks required to make sure these models are performing accurately and appropriately. Big data is complex; statistical modeling is (often) complex; it doesn't stand to reason that one can simply add these two things together with black-box software and magically make things "simple."

Black-boxing also breeds distrust among human users who are steeped in subject-matter expertise in the fields where the models operate. Those users want to see the details of the needles in the gigantic haystack, and burying those details in the mathematical bowels creates suspicion and conflict. As I discussed in an earlier post on artificial intelligence, the nirvana for machine learning and AI toolkits in InfoSec is finding productive ways to meld them with the human SMEs alongside whom they operate. To be effective at this, there needs to be openness and transparency about how things are modeled and calculated. "Just trust me on this" is never a satisfying answer, and oftentimes the modeled solutions are simply the informed starting point of a deeper investigation.
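The validation work that black-box tooling lets users skip doesn't have to be exotic. A minimal sketch, using simulated alert scores and analyst-vetted labels (the 10% label-noise rate and 0.6 threshold are assumptions for illustration): score held-out data, tabulate a confusion matrix, and report precision and recall — numbers an SME can inspect and argue with directly, rather than trusting an opaque verdict.

```python
import random

random.seed(1)

# Simulated held-out data: model scores in [0, 1] and ground-truth
# labels that mostly follow the score, with 10% label noise.
scores = [random.random() for _ in range(1000)]
labels = []
for s in scores:
    is_threat = s > 0.6
    if random.random() < 0.1:       # simulate labeling noise
        is_threat = not is_threat
    labels.append(1 if is_threat else 0)

def confusion(scores, labels, threshold):
    """Tabulate (tp, fp, tn, fn) at a given alerting threshold."""
    tp = fp = tn = fn = 0
    for s, t in zip(scores, labels):
        pred = s >= threshold
        if pred and t:
            tp += 1
        elif pred and not t:
            fp += 1
        elif not pred and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion(scores, labels, threshold=0.6)
precision = tp / (tp + fp)   # of the alerts raised, how many were real?
recall = tp / (tp + fn)      # of the real threats, how many were caught?
```

Exposing exactly these per-threshold numbers, and the features that drove each score, is the kind of transparency that turns a model's output into a starting point for investigation rather than a verdict to be taken on faith.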

A more fundamental problem arises in the InfoSec space when we explore how supervised modeling concepts are being applied within these "automated" solutions. Most solutions employ unsupervised techniques to group or classify inputs at an intermediate stage, but still depend largely on supervised learning to produce the final classifications. As a common example, a solution might proxy for asset type (laptop, server, printer, etc.) by clustering protocol-classified activity generated over a period of time, then use the resulting clusters to form cohorts that feed threat-specific models, which in turn use labeled data within each cohort to train feature weights. The clustering stage is clear enough, but it leaves one wondering how accurate the classification really is. If the goal is merely overall variance reduction in prediction, perhaps one doesn't care; but since solutions then expose those predicted labels, accuracy would seem to matter. The supervised stage raises an immediate question: who is producing the labels on the data being used to train the models? Security incidents are often complex and need careful review by analysts and responders to determine whether a given alert is a true or false positive. This is, in fact, the small-cardinality labeling problem: vetting the signaled detections means reviewing a MUCH smaller number of items than vetting the non-detections, since detections are a small percentage of overall traffic. But negatives (at least in some quantity) also need to be reviewed and classified as true or false negatives to make the data set complete, and given the much larger cardinality and variation of non-detections, a larger sample is probably required to adequately capture the natural variation that exists.
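That asymmetry can be made concrete. A hypothetical sketch of building an analyst review queue (the 1% alert rate, field names, and 10x sampling multiple are all illustrative assumptions, not any vendor's actual pipeline): every detection is vetted, while non-detections, being vastly more numerous and more varied, are sampled at a deliberately larger multiple of the positives.

```python
import random

random.seed(2)

# Illustrative event stream: roughly 1% of events trigger a detection.
events = [{"id": i, "alerted": random.random() < 0.01} for i in range(100_000)]

alerts = [e for e in events if e["alerted"]]
non_alerts = [e for e in events if not e["alerted"]]

# Vet every detection (small cardinality), but only a sample of the
# non-detections; oversample negatives relative to the alerts because
# their natural variation is far greater.
NEGATIVE_MULTIPLE = 10
sampled_negatives = random.sample(non_alerts, k=NEGATIVE_MULTIPLE * len(alerts))
review_queue = alerts + sampled_negatives
```

Even at a 10:1 ratio, the queue covers only a sliver of the negative space, which is exactly why the sampling strategy, and who does the labeling, deserves as much scrutiny as the model itself.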

The requirement is clear: someone or something has to verify the output of the machine learning machinery against reality, and it can't be the machine itself without creating a feedback loop. If there were another independent machine that actually knew the truth we seek, we'd be better off using that machine as the model in the first place and skipping the intermediate step. Clearly no such machine exists. Human security organizations are still the arbiters of truth and the agents of remediation, so the requirement for their feedback to the models for training purposes is inescapable. Consequently, the need for informed subject-matter experts is not alleviated simply by the presence of AI/ML techniques; these tools can amplify the experts' effect through greater-than-human-scale deployability, but they are helpless to train themselves appropriately without a knowledgeable teacher to inform their learning.


Matt is an experienced analytics professional who uses statistically guided thought processes to find optimal and actionable solutions to problems. At ProtectWise he heads up the Data Science team, which is responsible for analytics & reporting on a petabyte-scale Data Warehouse, as well as algorithmic and threat-detection research with specific focus on anomaly detection methods and threat classification models. Prior to joining ProtectWise, Matt led Data Science and Analytics at several startups and established organizations within the Ad-Tech and eCommerce arenas. Prior to entering the Data Sciences space, Matt was an Equity Analyst with Janus Capital, with primary financial research responsibilities ranging across consumer products and retail, payment processing, auto & auto parts, and several other industries. Matt was a PhD candidate and has a Master’s in Statistics from Harvard University, and graduated Magna Cum Laude with a Bachelor’s in Mathematical Social Sciences from Dartmouth College.
