Automating incident response lets IDT take the battle to the enemy

By automating the incident response process, IDT was able to reduce the time before the infection was quarantined, shorten the remediation cycle, reduce investigation time, and free up security staff to go after the bad guys themselves


Two years ago, attackers had Newark-based telecom and payments provider IDT Corp. pinned down.

Security staffers had their hands full dealing with a constant inflow of attacks against the company's infrastructure.

Sorting out real attacks from false positives, cleaning up malware, and ensuring that infections didn't spread could take hours -- or longer -- for a single incident. Meanwhile, every additional minute that an infected machine stayed on the network was that much more opportunity for the attackers to bury themselves deep or to make lateral jumps to other machines.

By automating the incident response process, IDT was able to reduce the time before the infection was quarantined, shorten the remediation cycle, reduce investigation time, and free up security staff to go after the bad guys themselves.

Quicker quarantine

At the end of 2013, it took about 30 minutes to isolate an infected device and remove it from the company's network, said Golan Ben-Oni, IDT's CSO and senior vice president of network architecture.

"Because of the danger of what happens when a compromised asset sits on the network, we wanted that time to be reduced from about 30 minutes to just seconds," he said.

To do this, the company used the application programming interfaces from Palo Alto Networks, its firewall vendor, and Splunk, its big data analytics platform.

Previously, an alert from WildFire, Palo Alto Networks' malware analysis service, would be sent to the company's security information and event management system, at which point a security professional would manually isolate the suspicious host and start looking for the downloaded malware file.

Now, the WildFire alert is delivered to Splunk in about one second. Within seven seconds, Palo Alto isolates the device, the user gets an alert that their machine is being investigated, and the WildFire alert is sent on for analysis.
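The sequence described above (isolate the device, alert the user, pass the alert on for analysis) can be sketched in a few lines of Python. This is only an illustration of the ordering, not IDT's actual integration: the `Alert` fields and the `quarantine_host`, `notify_user` and `forward_for_analysis` callables are hypothetical stand-ins, and no real Palo Alto Networks or Splunk API calls are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Alert:
    """A malware alert; field names are illustrative, not a real WildFire schema."""
    host: str
    user: str
    file_hash: str


@dataclass
class ResponseLog:
    """Records the actions taken so the flow can be inspected afterwards."""
    actions: List[str] = field(default_factory=list)


def handle_alert(
    alert: Alert,
    quarantine_host: Callable[[str], None],
    notify_user: Callable[[str, str], None],
    forward_for_analysis: Callable[[Alert], None],
    log: ResponseLog,
) -> ResponseLog:
    """Run the three automated steps in order: isolate, notify, forward."""
    quarantine_host(alert.host)  # hypothetical hook into the firewall
    log.actions.append(f"quarantined {alert.host}")

    notify_user(alert.user, "Your machine is being investigated.")
    log.actions.append(f"notified {alert.user}")

    forward_for_analysis(alert)  # hypothetical hook into the analysis queue
    log.actions.append(f"forwarded {alert.file_hash}")
    return log


# Example run with do-nothing stubs in place of real integrations.
log = handle_alert(
    Alert(host="laptop-17", user="jdoe", file_hash="abc123"),
    quarantine_host=lambda host: None,
    notify_user=lambda user, message: None,
    forward_for_analysis=lambda alert: None,
    log=ResponseLog(),
)
print(log.actions)
```

Passing the three actions in as callables keeps the ordering logic separate from whatever vendor interfaces would actually carry it out.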

"We might get an alert from a user analytics platform that a user ID was being used improperly, or that malware was detected on an end user device," said Ben-Oni.

Ready remediation

Then the company turned to the remediation process, which previously took more than eight hours of manual labor.

Alerts that score high, meaning they are most likely to be real rather than false positives, are now handled automatically.
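A score-based split of this kind might look like the following sketch. The article does not say how IDT scores alerts or where the cutoff sits, so the 0.8 threshold, the 0-to-1 scale and the `triage` function are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical cutoff; the article does not state how alerts are scored.
AUTO_THRESHOLD = 0.8


def triage(
    scored_alerts: Iterable[Tuple[str, float]],
    threshold: float = AUTO_THRESHOLD,
) -> Tuple[List[str], List[str]]:
    """Split (alert_id, score) pairs into automatic and manual queues."""
    automatic: List[str] = []
    manual: List[str] = []
    for alert_id, score in scored_alerts:
        if score >= threshold:
            automatic.append(alert_id)  # likely real: remediate automatically
        else:
            manual.append(alert_id)     # uncertain: leave for an analyst
    return automatic, manual


auto_queue, manual_queue = triage([("a1", 0.95), ("a2", 0.40), ("a3", 0.80)])
print(auto_queue)    # ['a1', 'a3']
print(manual_queue)  # ['a2']
```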

"We locate all the newly downloaded files and initiate forensics on memory and disk to try to identify more information about the event," he said. "Once we've collected all this, we'll go ahead and image that system to our forensic capture platform, and re-image it, bringing that system to a golden image."

It takes about five minutes to collect the initial round of data, he said, then another 30 minutes to collect all the disk information for deeper forensic analysis.

Then the computer and all user files are restored, and the user can get back to work within about an hour.

The process takes longer if a user is working remotely and doesn't have access to the company's 10 Gigabit network.

"So, for the mobile workforce, we actually do something else," said Ben-Oni. "We'll direct them to a workspace in the cloud."

After the user is back at work, the machine is watched for the next 48 hours.

"We make sure that that host cannot execute any code that we did not install -- a white list -- because there's always an opportunity that the host will get reinfected after you reimage it and reinstall user files back on the system," he said.
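As a rough sketch of that rule, the check below permits only code the company installed while a restored host is inside its watch period. The 48-hour figure comes from the article; identifying code by file hash, the `may_execute` function, and the behavior once the window ends are assumptions, since the article does not describe the mechanism.

```python
from datetime import datetime, timedelta
from typing import Set

# The 48-hour window comes from the article; everything else is illustrative.
WATCH_WINDOW = timedelta(hours=48)


def may_execute(
    file_hash: str,
    installed_hashes: Set[str],
    restored_at: datetime,
    now: datetime,
) -> bool:
    """Allow only company-installed code while the host is still being watched."""
    in_watch_window = now - restored_at < WATCH_WINDOW
    if in_watch_window:
        return file_hash in installed_hashes
    # Assumption: the restriction lifts after the window; the article does not say.
    return True


restored = datetime(2015, 1, 1, 9, 0)
allowlist = {"office-suite-hash"}
print(may_execute("office-suite-hash", allowlist, restored, restored + timedelta(hours=1)))  # True
print(may_execute("unknown-hash", allowlist, restored, restored + timedelta(hours=1)))       # False
```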

For production servers, the process is even faster. If there's a secondary system in the environment, the infected server is simply taken offline and the backup goes to work, with no impact on delivered services.

"In the production environment, we have automation tools on Amazon and VMware to spin up new hosts or change the load balance configuration to direct traffic to backups or hot standbys," he said.
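The failover idea can be modeled with a toy server pool: the infected host leaves rotation and a hot standby takes its place. This is a simplified sketch under stated assumptions, not IDT's tooling. `ServerPool` and `fail_over` are hypothetical names, and no Amazon or VMware interfaces are used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ServerPool:
    """Hosts receiving traffic plus idle hot standbys (a toy load-balancer model)."""
    active: List[str] = field(default_factory=list)
    standby: List[str] = field(default_factory=list)


def fail_over(pool: ServerPool, infected: str) -> ServerPool:
    """Take the infected host out of rotation and promote one standby if available."""
    if infected not in pool.active:
        return pool  # nothing to do; the host is not serving traffic
    active = [host for host in pool.active if host != infected]
    standby = list(pool.standby)
    if standby:
        active.append(standby.pop(0))  # promote the first hot standby
    return ServerPool(active=active, standby=standby)


pool = ServerPool(active=["web1", "web2"], standby=["web3"])
updated = fail_over(pool, "web1")
print(updated.active)   # ['web2', 'web3']
print(updated.standby)  # []
```

Returning a new pool rather than mutating the old one keeps the pre-incident configuration available for comparison.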

Incident investigation

Each high-priority, high-fidelity alert that is processed automatically saves an employee nine hours of work or more.

That is time staff can now spend on alerts that require human investigation.

There were plenty of these alerts coming into IDT every day, alerts that would not normally be considered high-fidelity.

In the past, most of these alerts would have been ignored because there was simply not enough time to handle them.

Over the past couple of years, there have been plenty of news headlines about what happens when alerts go ignored.

"With many of the data breaches -- like Target, Home Depot and others -- security teams were sent alerts but the teams were unable to determine which were the highest risk," said Muddu Sudhakar, co-founder and CEO at security vendor Caspida.

This is the kind of thing that keeps IDT's Ben-Oni up at night.


"When we started to manually investigate them, it became clear that many of these alerts were actually very serious," Ben-Oni said. "Are we seeing everything we need to see? And once we do see things, are we reacting to them appropriately? For us, reacting to them appropriately means reacting to every event, and determining if they are significant."

But even with automated remediation, IDT still didn't have enough resources to investigate 80 to 90 percent of all the alerts coming in.

And the investigations themselves were very time-consuming. Analysts had to pivot between different systems, look at the context of what was happening on the machine and the network, and sandbox and analyze code.
