By John Dix, Editor in Chief

Using people to fight cyber attacks is like bringing a knife to a gunfight

Apr 22, 2015 | 20 mins
Data and Information Security

It’s time to automate security response, says the CSO of a $1.6b company, who swears by a new tool he has deployed

Golan Ben-Oni, CSO and SVP of Network Architecture at IDT Corp., is responsible for protecting the infrastructure of the diversified company’s telecommunications, payment, energy and oil businesses, which employ 1,700 people and include 12,000 endpoints worldwide. Automation is key, he says, because the attackers have upped their game. Ben-Oni shared the story with Network World Editor in Chief John Dix.

Let’s start with a thumbnail description of your company.

IDT and its affiliated firms are involved in telecom, payment services, oil exploration, energy supply and entertainment. We started in telecom, and that’s what IDT is known for, but we have since entered the vertical markets of energy and oil, and banking and finance. We’re doing shale oil in Israel specifically, and on the finance side we own banks in Europe, so we see our share of state-sponsored threats. The energy and oil business was recently spun off as a separate company, but it’s in our building and some of us have shared responsibilities. I’m responsible for the architecture of their security environment, although we’re growing and the companies are getting bigger independently.

So with energy exploration and banking concerns you’re a rich target?

Yes, that’s correct. And the reality is our adversaries are numerous and quite good, which brought us to the need for speed and for automation because I look at it in the context of a battlefield. In the beginning, adversaries were lazy because pretty much anything they did worked. They didn’t have to try very hard, and most of the time they weren’t even noticed. They could live inside an organization for weeks, months or years and just collect more and more intelligence.

Our first order of business was just gaining visibility. What that meant was we had to find best-of-breed vendors, and sometimes two or three that do exactly the same thing. I’ll give you an example. Years ago we brought in FireEye, which told us a bunch of stuff, but of course the adversaries started figuring out ways of defeating specific implementations, so we took a cue from the NSA, which believes you should deploy three of everything, and that’s what we did.

We deployed FireEye, Palo Alto Networks and Fidelis Network Systems, just on the network piece alone. Then we did the same thing on other components of the environment, like the endpoint and the user analytics space.

But we ended up with a lot of product and very little interoperability, so we inserted people in between to deal with the alerts and events and to try to glue it all together into a cohesive story. That took a lot of time from an incident response perspective, but we did gain visibility. Too much visibility, though, can harm you, especially if it’s repetitive.

One of the things we had to work on was getting everything into one place. Traditionally people use SIEM tools for this and we’ve tried many. We started with RSA, moved to Nitro and now we’re very heavily focused on Splunk. One of the key differentiators with Splunk is it’s fast and digests all kinds of information. You don’t have to spend a lot of time in professional services getting it to digest data.

In 2013 my number one pain point was, “How do we gather all this data and do something about it without having to get a person involved?” Many of the alerts were clear: This machine is infected. There’s not a lot of thinking we need to do. We have to investigate the system, pull off forensic data, move it into a remediation network where it won’t harm other components of the environment, then wipe the system and get it back to the user because they’ve got business to do.
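The fixed playbook he describes for a clearly infected machine can be sketched as an ordered sequence of steps. This is purely illustrative: the function and step wording below are invented stand-ins for whatever tooling actually performs each action, not any vendor’s API.

```python
def respond_to_confirmed_infection(host):
    """Sketch of the fixed response sequence for a machine known to be infected.

    Each entry is a placeholder for a real action carried out by the
    security tooling in a given environment.
    """
    return [
        f"isolate {host} into the remediation network",
        f"collect forensic data from {host}",
        f"wipe {host}",
        f"reimage {host} and return it to the user",
    ]

if __name__ == "__main__":
    for step in respond_to_confirmed_infection("ws-0142"):
        print(step)
```

The point of writing it down this way is that the sequence never varies, which is exactly what makes it a candidate for automation.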

So in 2013 we did a very simple use case with Splunk with some help from our friends at Palo Alto, and it was the beginning of what became our automated incident response methodology. It was still in its infancy, but we got a lot of positive feedback from other big organizations that really wanted to do it.

But security is not our core business. We’d rather share our efforts and designs and strategy with a vendor that can go implement it. We know what to tell vendors to do, but we would rather they go and produce something that’s supportable and generally available and so on.

Had you implemented your own tools?

Yes, in the beginning we even had to integrate tools that didn’t have APIs. We were doing crazy things like logging into databases and inserting data in ways that the product was never meant to do, just to get things working. We did it with AccessData, we did it with Mandiant, and Mandiant in particular was saying, “It’s interesting, but our customers aren’t interested in automation.”

I am always on the search for vendors that get the vision, understand its importance irrespective of what their other customers are saying, because we’re pretty sure where things are headed and it goes back to this battle we’re dealing with. Although attackers were initially lazy, they have since started using automated tools. So if you’ve got adversaries with automated tools on the one side and we’re running around on the other side with sneakers on our feet, that’s just not going to work. It’s not a fair battle. It’s very hard to deal with an army of automated robots.

Is there a way of putting the size of the problem in perspective?

We’re looking at everything that goes on in the network, everything that happens at the OS level, any kind of changes that happen in the file system, if there are new files that get dropped or files that get loaded, if there are mobility events. We essentially stream all of this in real time into Splunk, so it very rapidly becomes a big data problem. I think we’re indexing about 500GB a day, but we’re scheduled to go up to 5TB once I get to the level of logging I’d like.

Five terabytes per day?

Yeah. Keep in mind we’re logging literally everything that’s going on. We need complete visibility, and just the Palo Alto firewalls alone contribute 200GB a day. So indicators of compromise get fed in from lots of places, not even including the third-party IOC feeds we get about things happening in other people’s networks.

All of this comes in from the network side, from the endpoint side, from the user analytics side, from the threat side, so the question is, how do you get all this stuff to work together? In the beginning we had to integrate the tools ourselves, and I realized this is not the kind of work we want to do. I would rather present the problem to an organization that can run with it.

Hexadite emerged, and the interesting thing about Hexadite is they understand the kinds of issues we’re dealing with because they come out of the highly targeted state of Israel, with people who used to work in intelligence. They took this thing by the horns and said, “What are the use cases?”

We had already solved the basic use case of what happens when an active threat comes in that we know is bad: We go through this remediation cycle — we get it off the network, pull the forensic data, wipe the machine and restore it. The harder problem is what happens when you’ve got an indicator and don’t know what it means, or maybe it’s a weak indicator.

Maybe you get an indication that something bad flowed into your organization but you don’t know if the endpoint executed it or not. Maybe you saw a credential being misused or that someone tried to log in from a system they don’t normally log in from, but it was very low activity, maybe just once or twice. That’s not the kind of thing a SIEM is necessarily going to bubble up, especially if there are millions of events happening.

When we first turned some of these systems on we were getting 15,000 events a day. That’s not something a human can deal with, so we had to tune it. The point is that everything needs to be investigated, absolutely everything, even the things that may only happen once or twice. They may be your most important indicators but, because you’re trying to do things with people, you’ll never get around to them or won’t even notice them in a rash of events.
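His point that the once-or-twice events may be the most important indicators can be illustrated with a small frequency filter. This is a hypothetical sketch, not his actual tooling: it surfaces the rare alert types that a noise-driven triage queue would bury under thousands of repeats.

```python
from collections import Counter

def rare_indicators(alerts, threshold=2):
    """Return alert types seen at most `threshold` times.

    The idea: rather than triaging by volume (which favors noise),
    single out the low-frequency events for investigation.
    """
    counts = Counter(alerts)
    return {name for name, n in counts.items() if n <= threshold}

# Illustrative alert stream: two noisy signatures drown out one odd login.
alerts = ["av_signature_hit"] * 9000 + ["ids_alert"] * 5000 + ["odd_login"] * 2
print(sorted(rare_indicators(alerts)))  # ['odd_login']
```

A human working a 15,000-event queue sorted by count would likely never reach `odd_login`; a filter like this brings it to the front.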

It’s common, after all, for attackers to run interference. There may be 15 people working on capturing an organization and 10 of them just generating noise to distract your SOC from what they really need to be paying attention to. So this is where the need arises. You have to investigate everything and, if you’re going to use people, you’re never going to get it done.

And Hexadite helps how?

So what Hexadite will do for us is sense something has been triggered. Say we get something from WildFire that says a malicious binary floated into the organization, so an automated investigation is kicked off: Go look at that machine quickly. Find out whether it executed. Find out whether or not there are other things on the machine that shouldn’t be there. If we determine that it was adware or not as malicious as we thought, then we just clean off the system and return it to service.
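The branching he describes, where the severity of the finding determines whether a machine gets cleaned in place or pushed through the full wipe cycle, amounts to a small routing decision. The verdict labels and outcome names below are invented for illustration; they are not WildFire’s or Hexadite’s actual values.

```python
def triage(verdict):
    """Route an investigation outcome to a response track.

    Verdict strings are hypothetical labels for illustration only.
    """
    low_risk = {"adware", "grayware"}
    if verdict in low_risk:
        # Not as malicious as feared: clean the system and return it.
        return "clean-and-return-to-service"
    if verdict == "malware-executed":
        # Confirmed serious: isolate, image, wipe, reimage.
        return "full-incident-response"
    # Unclear or weak indicator: keep investigating automatically.
    return "automated-investigation"
```

The value of encoding this explicitly is consistency: the same verdict always produces the same response, regardless of who is on shift.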

That whole process takes about a minute. In the traditional incident response mode it would probably take 10 to 15 minutes for the correlation rules to kick in in our SIEM, then another eight or nine minutes for an operator to see the alert and try to understand some contextual information before picking up the phone to call the network team or the systems team or whomever to start to deal with isolation or investigation.

As you step through the manual process you go from minutes to hours, days even. A standard investigation done with a person on just one machine, that’s going to take hours. What happens when there are 50 machines in your organization that just got targeted? What are you going to do when the malware is polymorphic and it looks different every time? These are real live challenges that we were faced with and we realized we couldn’t throw people at this problem. That’s not possible; hence, the strong argument for automation.

Sometimes the indicators aren’t clear and they’re just a hint of something, so we’ve got to go into the systems and collect more contextual information about what happened. Maybe the alert isn’t a big deal so we’ll just shift that system into temporary remediation, go do an investigation and return it to service.

But what if it was serious? We’ll start to see things about the way that machine has acted. Maybe it started communicating to things it doesn’t normally communicate with and we’ll need to pull those IPs and go investigate the secondary systems. Or maybe we’ll see a user ID that’s starting to be used unnecessarily or in a way that isn’t normal, so we’ll need to investigate the machines that user ID may have touched.

This is all possible through automation, whereas in the past we were doing this with people, and people are people. You may have a good guy on staff but he may be too busy to get to everything. You may have a new guy on staff, so it’s not consistent either. People certainly can’t be as consistent as automation, and they certainly can’t investigate everything.

The basic idea is to automate what you can, to enlist the services of CPUs that can handle billions of operations per second, and free up the people with the neurons. Then you end up with an operations center that is really world class. That’s the goal in all of this.

Where does Hexadite plug in?

Hexadite is software that can be run on an appliance (we do everything in virtual systems), and it’s reading data out of Splunk, looking for specific things. So Splunk can receive an alert from one of our security tools and initiate an automated investigation based on those alerts. The alert may come from WildFire, so Splunk and WildFire will be combined on that, then Hexadite will come in on the containment side because we’ll initiate a policy change on the firewall to say, “These IPs are implicated, they talk to no one.”

Hexadite then starts pivoting, looking for additional data that may be correlated to that event, looking for additional hosts that need to be investigated. It goes out and runs and starts to generate information based on what it finds. For example, it can install a micro applet on an endpoint if it needs to analyze that system. So whereas you’re talking hours or days traditionally, now within a minute or a minute and a half we’ve accomplished something.
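The containment step he describes, where implicated addresses “talk to no one,” reduces to generating a deny rule per address and pushing it to the firewall. The rule syntax below is invented for illustration; a real deployment would issue policy through the firewall vendor’s own management API rather than raw strings.

```python
def containment_rules(implicated_ips):
    """Produce one deny-all rule per implicated address.

    Duplicates are collapsed and the output is sorted so repeated runs
    generate identical policy (hypothetical rule format).
    """
    return [f"deny ip {ip} any" for ip in sorted(set(implicated_ips))]

# The same IP reported twice still yields a single rule.
print(containment_rules(["10.0.8.17", "10.0.9.4", "10.0.8.17"]))
# ['deny ip 10.0.8.17 any', 'deny ip 10.0.9.4 any']
```

Deduplicating and sorting matters once multiple investigations implicate overlapping hosts: idempotent rule generation keeps the firewall policy from ballooning.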

What types of things will you allow it to do on its own versus stuff that you wouldn’t allow it to do?

That’s really a policy decision. The key is to develop an asset database and then you can have different policies for different assets. For example, if it’s a laptop that belongs to a casual user, you’re going to have a set of policies about that. You may have a separate set of policies about a file server. If there’s only one file server and you take that out, you may affect 600 users, but if there are two and one backs up the other, then you can feel confident about remediating that system.

Here it’s important to have a contextual database, and the way we initially structured it is, at least for Windows systems within the Active Directory infrastructure, we classify servers and hosts by group and then you can have policies that are deployed differently.
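The asset-database idea, classifying hosts by group and attaching different automated-response permissions to each class, can be sketched as a policy lookup. The group names and permission flags here are hypothetical examples of the kind of distinctions he describes (a casual user’s laptop versus a lone file server versus a server with a backup).

```python
# Hypothetical per-asset-class policies, mirroring classification by
# Active Directory group. Flag names are illustrative.
POLICIES = {
    "workstations":      {"auto_isolate": True,  "auto_wipe": True},
    "file_servers":      {"auto_isolate": False, "auto_wipe": False},  # 600 users at risk
    "redundant_servers": {"auto_isolate": True,  "auto_wipe": False},  # a backup exists
}

# Unknown assets get the most conservative treatment.
DEFAULT_POLICY = {"auto_isolate": False, "auto_wipe": False}

def allowed_actions(asset_group):
    """Look up what the automation may do to a member of this group."""
    return POLICIES.get(asset_group, DEFAULT_POLICY)
```

The design choice worth noting is the conservative default: a host the database has never seen should never be wiped automatically.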

Investigation, collecting data, doesn’t harm our systems. We only push automated remediation once we know, with a high degree of confidence, there’s a problem. In the case of adware, we’ll just go in and kill the process, remove the adware and notify AV of hashes so if they appear on other machines they’ll get automatically quarantined. That’s easy. No risk. But if a system gets infected with something serious we have no choice. We’ve got to run the IR investigation and the reality is we have to wipe that system because there’s no guarantee that deleting the processes and quarantining the files is all you’ve got to do.
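The hash-sharing step he mentions, recording a confirmed-bad sample so other machines reporting the same hash get quarantined automatically, is straightforward to sketch with a standard digest. This is a minimal illustration, not the actual AV integration; SHA-256 is assumed here as the hash.

```python
import hashlib

def record_bad_sample(sample_bytes, blocklist):
    """Hash a confirmed-malicious sample and add it to the shared blocklist."""
    digest = hashlib.sha256(sample_bytes).hexdigest()
    blocklist.add(digest)
    return digest

def should_quarantine(sample_bytes, blocklist):
    """Check whether a file seen elsewhere matches a known-bad hash."""
    return hashlib.sha256(sample_bytes).hexdigest() in blocklist

blocklist = set()
record_bad_sample(b"confirmed malicious payload", blocklist)
print(should_quarantine(b"confirmed malicious payload", blocklist))  # True
print(should_quarantine(b"some benign file", blocklist))             # False
```

The limitation he raises later applies here too: exact-hash matching fails against polymorphic malware that looks different every time, which is one more reason the investigation itself has to be automated.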

If you’re dealing with servers in a data center environment that supports virtualization, you have server templates that let you deploy in minutes. So if a server gets affected and you don’t have a standby, you can still insert a new one within a few minutes.

In some sense, servers are easier to deal with because of their API-driven architectures. Endpoints are a bit more difficult because it takes about seven minutes to grab the memory and, depending on how much disk you have to copy, it could be an hour. We try to do this in situ, without requiring the system to be moved. We just inform the user that their system is under remediation and start that process in the background.

When you first turned on automated response were there many problems?

There are always early problems. One of the biggest issues is mischaracterizing an event as serious when it’s not. This is one of my biggest beefs with vendors that deliver threat feeds. There’s no contextual data; it implicates things that aren’t bad. So you have to be careful with false implication.

What you do is start slow. Obviously, overreacting is bad and underreacting is bad. You have to find a happy medium, but we’ve been kicking the tires on this for a long enough time that we’ve gotten iterative and we’ve gotten better at it. The system learns, you learn, the system learns, and that’s the important thing.

Have you been able to turn up the number of things that can be responded to automatically?

Yes. In the beginning we worked with 20 systems, then went to 50, to 500, and then we deployed across the organization. And of course we started in the areas that we felt were most heavily targeted, the ones that were the highest risk.

Sometimes the best thing to do is shut off the system if there’s a problem. And that’s okay. We started first on the enterprise user community, where we weren’t worried about impacting revenue-generating systems. The user community doesn’t necessarily have the safest Internet browsing habits, after all. They’re doing stuff they don’t know they shouldn’t be doing, or maybe their kids grabbed their computer to watch some TV program and their session ends up at some bad website.

We felt early on we could do a lot more with workstations. Worst case, we’ll have an unhappy user for a little while. But we learned a lot. The server environment is much more static. It doesn’t change as often, doesn’t have a lot of users, so the environments are more stable and that way we can say, “Okay, these are assets we may want to back up with a secondary DR unit and if we ever have a problem we can pull them out of service.”

The number of investigations that get automated is orders of magnitude higher in the user environment. That’s just the reality. The bad guys go to LinkedIn or Facebook and find out who they want to target, they send an email, sometimes to their private email accounts so we can’t even scan it. It’s awful.

And when one system gets infected in the organization it becomes a risk to everyone else. The corporate environment is actually a lot riskier than your home environment because you may have 1,000 or 2,000 people sitting in a building sharing data. The analogy is like sharing needles. It’s almost healthier for people to be working at home where they don’t have systems sitting side by side that could be used for lateral movement.

We actually re-architected our corporate environment to combat lateral movement, to be more like the home environment, so that a laptop can only communicate with a few places. A laptop, for example, cannot communicate with another laptop. It’s not allowed by policy. There’s no business reason to allow it. So we started deploying these policies, and we look for attempts to communicate laterally, which can be an indication of an attack vector. If it is only one or two events in 500GB of daily log data, no human could possibly find that needle in the haystack. That’s why automation is so important.
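Because policy forbids laptop-to-laptop traffic outright, finding that needle in the haystack reduces to scanning flow records for any pair where both endpoints are laptops. The record shape and host names below are invented for illustration; in practice this would run as a query over the indexed logs.

```python
def lateral_attempts(flows, laptops):
    """Flag flows where both source and destination are laptops.

    flows:   iterable of (source_host, dest_host) pairs from network logs
    laptops: set of host names classified as laptops
    Any hit is a policy violation and a possible lateral-movement attempt.
    """
    return [(src, dst) for src, dst in flows if src in laptops and dst in laptops]

laptops = {"lt-01", "lt-02", "lt-03"}
flows = [
    ("lt-01", "fileserver"),  # allowed: laptop to server
    ("lt-02", "lt-03"),       # violation: laptop to laptop
    ("srv-a", "srv-b"),       # servers talking, out of scope here
]
print(lateral_attempts(flows, laptops))  # [('lt-02', 'lt-03')]
```

This is the kind of check where automation shines: the rule is trivial, but applying it to hundreds of gigabytes of daily flow data is only feasible by machine.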

So in action an alert shows up in Splunk, then Palo Alto is used to isolate the host, the user is notified their host is being investigated, and files found on that host might get sent to WildFire for additional investigation, then we have to search for extra logs on the host, search for other forms of malicious activity and, if we find any, terminate the processes and quarantine the files. This is all within seconds.

Then there are logs in Splunk we have to look at to see whether this host communicated with anything else, whether the user ID was used anywhere else, whether there were any open processes. And we take all of that and analyze it all through automation. That’s all still done in minutes. That’s the power of this thing. It’s nine hours versus a minute and a half, and believe me, we’re trying to get the time down. I want to get it down to under a minute, get it down to 30 seconds. It’s an iterative process.

Any way to quantify how much bad stuff Hexadite is stopping in its tracks?

It’s doing hundreds of investigations a day, and most are not the most malicious malware, but it can be something really bad. We see really bad things four or five times a year. It’s like a fire. It starts with a spark, so if you remove the fuel immediately, that fire is not going to go anywhere. But if you just let it burn, you could have a much bigger problem on your hands.

We’re leveling the playing field to something that’s more fair, which is automation versus automation. I’m feeling better about it. I still sleep six hours a night, sometimes less, but at least I know that things are happening. I don’t have to light a fire under people, we can feel more confident that the basics are being done all the time, and that’s the key.

Was it hard to deploy Hexadite?

The hardest part was finding them. I was working on this before I even knew they were around. Had I heard about them earlier, it would have saved me time. They know what they’re looking at. Coming from Israeli intelligence, they’ve seen everything. A lot of the rules that are important to us came predefined and we did the low-hanging fruit and then we iterated the harder things, like what happens if we’re not sure about an event? Developing those strategies and basically defining our policy is just sitting down and doing configuration. But the nuts and bolts, they have that down to a T. They really understand APIs and interfacing with new vendors. I told them we need to work with this vendor and boom, they’re off, they’re reading the APIs, they’re getting the stuff working.

I’m not doing development anymore, and even the configuration just starts with a conversation. What would you like to do? Well, this is what I’d like to do. They go and do it. The other thing is we get meaningful intelligence reports out of these things. We don’t get a 500-page splash. It’s meaningful. “This was harmful. This wasn’t harmful. Here’s how many events we had and what we did about them.” That’s all I want to know at the end of the day.

Anything Hexadite doesn’t do that you want to see them address?

There’s a lot we can do but you can’t get to that point until you’ve done the basics. We’re working on privileged identity management containment, and then we’re going to be working on the Microsoft Azure Cloud because we’ve got basic stuff that we’re doing there, so it’s making sure that whatever you do works in multiple environments and not just the Windows desktop environment.