How secure are your AI and machine learning projects?

Artificial intelligence and machine learning bring new vulnerabilities along with their benefits. Here's how several companies have minimized their risk.


When enterprises adopt new technology, security is often on the back burner. It can seem more important to get new products or services to customers and internal users as quickly as possible and at the lowest cost. Good security can be slow and expensive.

Artificial intelligence (AI) and machine learning (ML) offer all the same opportunities for vulnerabilities and misconfigurations as earlier technological advances, but they also have unique risks. As enterprises embark on major AI-powered digital transformations, those risks may become greater than what we've seen before.

AI and ML require more data, and more complex data, than other technologies. The algorithms used have been developed by mathematicians and data scientists and come out of research projects. Meanwhile, the volume and processing requirements mean that the workloads are typically handled by cloud platforms, which add yet another level of complexity and vulnerability.

High data demands leave much unencrypted

AI and ML systems require three sets of data. First, training data, so that the company can build a predictive model. Second, testing data, to find out how well the model works. Finally, live transactional or operational data, for when the model is put to work.
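
A rough sketch of those three roles, assuming a scikit-learn-style workflow and invented column names (neither taken from any company in this story), might look like this:

```python
# Illustrative sketch only: the data source, column names, and use of
# scikit-learn are assumptions, not details from any company mentioned here.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def build_and_run(historical_df, live_df, feature_cols, label_col):
    X = historical_df[feature_cols]
    y = historical_df[label_col]

    # 1) Training data and 2) testing data come from the same historical pool.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)                                # build the predictive model
    print("holdout accuracy:", model.score(X_test, y_test))    # see how well it works

    # 3) Live transactional or operational data flows through the same model
    #    in production, and it often arrives in cleartext just like the training data.
    return model.predict(live_df[feature_cols])
```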

Those three pools of data create two different problems, each with its own security implications. First, the training data that data scientists collect is typically held in cleartext: anonymizing or tokenizing it makes the model harder to build, and data scientists typically don't have that kind of data security expertise. And when the model has proven itself and is moved to the operational side, it still expects to ingest cleartext data.

That's a major security risk. For Edgewise Networks, encrypting all the data from the start had a cost. "But we knew we'd have to make the investment at the beginning, because we didn't want to be a cybersecurity company leaking PII in the cloud," says John O'Neil, the company's chief data scientist. "We had customers giving us network information, so we started with the idea that any information at rest had to be encrypted."
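
A minimal sketch of that encrypt-everything-at-rest posture, assuming Python's cryptography library and placeholder file names (in a real deployment the key would live in a secrets manager or KMS, not next to the data):

```python
# Hedged sketch: file names and key handling are placeholders; a real
# deployment would pull the key from a KMS or HSM rather than the filesystem.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store this in a secrets manager, not on disk
fernet = Fernet(key)

# Encrypt raw training data before it lands in shared or cloud storage.
with open("training_data.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())
with open("training_data.csv.enc", "wb") as f:
    f.write(ciphertext)

# Decrypt only inside the training job, in memory, just before use.
plaintext = fernet.decrypt(ciphertext)
```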

The second security risk is data that isn't so obviously valuable. While real, live transactional or operational data is clearly a valuable corporate asset that enterprises will try to protect, it can be easy to overlook the pools of training and testing data, which also contain sensitive information.

It gets worse. AI systems don't just want more data. They also want different kinds of data, contextualized data, the kind of data that can dramatically expand a company's exposure risk.

Say, for example, an insurance company wants to get a better handle on the driving habits of its customers. Data sets are available on the market today that offer shopping data, driving data, location data, and much, much more that can easily be cross-correlated and matched up to customer accounts. That new data set can be exponentially richer than the one the company started with, more attractive to hackers, and more devastating to the company's reputation if it is breached.
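
The mechanics of that enrichment are trivial, which is part of the problem. A hedged pandas sketch, with wholly invented data sets and column names, shows how a single join ties purchased location and behavior data back to customer accounts:

```python
# Illustrative only: the data sets and column names are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "policy": ["A-100", "A-200"]})
purchased_data = pd.DataFrame({"customer_id": [1, 2],
                               "home_zip": ["02139", "94107"],
                               "late_night_trips": [12, 3]})

# One merge and the insurer now holds a far richer, and far more
# breach-sensitive, profile than the policy data it started with.
enriched = customers.merge(purchased_data, on="customer_id", how="left")
print(enriched)
```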

One company that has a lot of data to protect is Box, the online file sharing platform. Box uses AI to extract metadata and improve search, classification, and other capabilities. "For example, we can extract terms, renewals and pricing information from contracts," says Lakshmi Hanspal, CISO at Box. "Most of our customers are coming from an era where the classification of their content is either user-defined classification or has been completely ignored. They're sitting on mountains of data that could be useful for digital transformation -- if the content is classified, self-aware, without waiting for human action."

Protecting data is one of the key pillars for Box, she says, and the same data protection standards are applied to AI systems, including training data. "At Box, we believe that it is trust we build, trust we sell, and trust we maintain," she says. "We truly believe that this needs to be bolted into the offerings we provide to our partners and customers, not bolted on."

That means that all systems, including new AI-powered projects, are built around core data security principles, including encryption, logging, monitoring, authentication and access controls. "Digital trust is innate to our platform, and we operationalize it," she says.

Do you know what your algorithms are doing?

At Box, the company has a secure development process in place for both traditional code and the new AI and ML-powered systems. "We're aligned with the ISO industry standards on developing secure products," says Hanspal. "Security by design is built in, and there are checks and balances in place, including penetration testing and red teaming. This is a standard process, and AI and ML projects are no different."

That's not true for all companies. Only a quarter of enterprises today bake security in right from the start, according to David Linthicum, chief cloud strategy officer at Deloitte Consulting LLP. The other 75% are adding it on after the fact. "It's possible to do that," he says, "but the amount of work is going to be one and a half times more than if you built it in systematically, and it's not going to be as secure as it would be if you have designed security into the system."

AI and ML algorithms have been around for a while -- in research labs. Mathematicians and data scientists typically don't worry about potential vulnerabilities when writing code. When enterprises build AI systems, they'll draw on the available open-source algorithms, use commercial "black box" AI systems, or build their own from scratch.

With open-source code, there's the possibility that attackers have slipped in malicious code, or that the code contains vulnerabilities or vulnerable dependencies. Proprietary commercial systems typically build on that same open-source code, plus additional code that enterprise customers usually aren't able to inspect.

Even when companies hire PhDs to create their AI and ML systems, those systems usually wind up being a combination of open-source libraries and newly written code created by people who aren't security engineers. Plus, there are no standard best practices for writing secure AI algorithms, and given the shortage of security experts and the shortage of data scientists, people who are experts in both are in even shorter supply.
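
One low-cost starting point is auditing those open-source pieces for known vulnerabilities. The sketch below assumes the open-source pip-audit tool is installed and simply wraps it so an ML build can fail fast when a flagged dependency turns up; the tool choice and the hard stop are assumptions for illustration, not a practice attributed to anyone quoted here.

```python
# Hedged sketch: assumes the open-source pip-audit tool is installed
# (pip install pip-audit); exact output handling may vary by version.
import subprocess
import sys

def audit_dependencies(requirements_file="requirements.txt"):
    """Fail the build if any pinned dependency has a known vulnerability."""
    result = subprocess.run(
        ["pip-audit", "-r", requirements_file],
        capture_output=True, text=True
    )
    print(result.stdout)
    if result.returncode != 0:   # pip-audit exits non-zero when issues are found
        sys.exit("Vulnerable dependencies found; blocking the ML build.")

if __name__ == "__main__":
    audit_dependencies()
```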

Exabeam uses ML models to detect cybersecurity threats in the log data of its enterprise customers, and its algorithms include both proprietary and off-the-shelf components, says Anu Yamunan, VP of product and research at the company. "We want to make sure there are no vulnerabilities in those tools," he says. That means vulnerability scans and third-party penetration tests.

Need to secure more than just algorithms

Securing AI and ML systems is about more than just securing the algorithms themselves. An AI system isn't just a natural language processing engine or just a classification algorithm or just a neural network. Even if those pieces are completely secure, the system still must interact with users and back-end platforms.

Is the user interface resilient against injection attacks? Does the system use strong authentication and the principles of least privilege? Are the connections to the back-end databases secure? What about the connections to third-party data sources?
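
The injection question, at least, has a well-worn answer. The brief sketch below, using a hypothetical schema, contrasts string-built SQL with a parameterized query when an AI front end passes user input to a back-end database:

```python
# Hedged sketch with a hypothetical schema; the point is the pattern,
# not the specific database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (user_id TEXT, score REAL)")
conn.execute("INSERT INTO predictions VALUES ('alice', 0.91)")

user_supplied = "alice' OR '1'='1"   # hostile input from the model's UI

# Vulnerable: user input concatenated straight into the query string.
# rows = conn.execute(
#     "SELECT score FROM predictions WHERE user_id = '" + user_supplied + "'")

# Safer: a parameterized query, so the driver treats input as data, not SQL.
rows = conn.execute(
    "SELECT score FROM predictions WHERE user_id = ?", (user_supplied,)
).fetchall()
print(rows)   # empty, because no user is literally named that string
```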

Mature enterprises will have a software development process that includes security right from the start as well as static and dynamic code reviews and other testing, but AI systems are often built outside that process, in skunkworks labs and pilot projects, so all those steps are skipped. "Data scientists are great at figuring out how we approach the ML problems, but they're not security experts," says Exabeam's Yamunan. "It's important to have the security experts and the data scientists sitting together, working on the project together."

AI and ML development needs to be aligned with a best practice framework for information security, says Rob McDonald, VP of product management at Virtru, a cybersecurity firm. "You're going to have to include security in this process," he says. "If not, you're setting yourself up for problems -- probably that could have been resolved in the beginning of the design process if you have security oversight in place."

AI algorithms can create bias

When the AI and ML systems are used for enterprise security -- for user behavior analytics, to monitor network traffic, or to check for data exfiltration, for example -- there's another area that can create problems: bias.

Enterprises are already dealing with algorithms creating ethical problems for their companies, such as when facial recognition or recruitment platforms discriminate against women or minorities. When bias creeps into algorithms, it can also create compliance problems, or, in the case of self-driving cars and medical applications, it can kill people.

Biased algorithms can also weaken a company's cybersecurity posture, says Deloitte's Linthicum. This problem requires careful attention to training data sets and continual testing and validation after the initial training.
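
What that continual testing and validation can look like in practice is straightforward to sketch. The example below assumes a scikit-learn-style model and a vetted, access-controlled holdout set; the metric, threshold and alerting behavior are placeholders:

```python
# Hedged sketch: assumes a scikit-learn-style estimator with a .score() method;
# the metric, threshold, and alerting hook are placeholders.
def validate_retrained_model(model, trusted_X, trusted_y,
                             baseline_accuracy, max_drop=0.05):
    """Re-check every retrained model against a vetted holdout set.

    A sudden drop can signal a biased or manipulated batch of training data
    and should block promotion to production until a human investigates.
    """
    accuracy = model.score(trusted_X, trusted_y)
    if accuracy < baseline_accuracy - max_drop:
        raise RuntimeError(
            f"Validation accuracy fell from {baseline_accuracy:.2f} "
            f"to {accuracy:.2f}; holding the model back for review."
        )
    return accuracy
```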

That's a big unknown when enterprises use AI and ML-powered security systems from outside vendors. "If you're not creating the model yourself, then it could have all sorts of issues in it that you're not aware of," Linthicum says. "You have to audit everything and make sure you understand everything. If you get anything pre-baked, you should view it with suspicion unless proven otherwise."

That attitude is especially important if the results are used to prioritize cybersecurity responses, and, even more so when the responses are automated. "That means that they're going to have a force multiplier in their ability to hurt us," Linthicum says.

That bias can be accidental, or it could be caused by a hacker. "How do you know that the attackers haven't introduced false training data to manipulate the algorithms?" asks Brian Johnson, CEO and co-founder at DivvyCloud, a cloud security vendor. "You can potentially retrain the algorithms to not pay attention to the bad things you want to do. If you bet the farm on ML, you might not notice that it's not picking up those things."

There haven't been any public, documented cases yet of attackers deliberately manipulating AI training data, says Ameya Talwalkar, chief product officer and co-founder at Cequence Security, but it's the right time for companies to start thinking about it. "Otherwise, you might have incidents that take people’s lives, such as with driverless cars," he says. "It's a threat that needs to be taken seriously."

The future of AI is cloudy

AI and ML systems require lots of data, complex algorithms, and powerful processors that can scale up when needed. All the major cloud vendors are falling over themselves to offer data science platforms that have everything in one convenient place. That means that data scientists don't need to wait for IT to provision servers for them. They can just go online, fill out a couple of forms, and they're in business.

"That's become a laser focus for many CSOs and CISOs, who have no handle on business units using AI systems and strategies and moving workloads to the cloud," says Suni Munshani, CEO at Protegrity, a cybersecurity firm. "There are lots of cowboys doing projects aggressively, but there are very few controls. The whole idea of enterprise control is just a delusion at this stage of the game."

The cloud vendors promise robust security, and everything looks good to the untrained eye. In general, cloud systems can often be more secure than home-grown, on-premises alternatives, says Bryan Becker, security researcher at WhiteHat Security.

But configuring these systems can be tricky, as the string of recently exposed Amazon S3 buckets has demonstrated. Even Capital One had problems configuring its Amazon web application firewall. "You can have secure infrastructure, but configure it insecurely," says Becker. "That's probably the number one security concern."
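
Catching that kind of misconfiguration is largely a matter of asking the platform. The sketch below, which assumes boto3 is installed and AWS credentials are already configured, flags any S3 bucket that does not fully block public access:

```python
# Hedged sketch: assumes boto3 is installed and AWS credentials are configured;
# real environments would feed the results into a CSPM tool or alerting pipeline.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        config = s3.get_public_access_block(Bucket=name)[
            "PublicAccessBlockConfiguration"]
        fully_blocked = all(config.values())
    except ClientError:
        fully_blocked = False   # no public-access-block configuration at all
    if not fully_blocked:
        print(f"Review bucket {name}: public access is not fully blocked")
```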

These projects then turn into operational systems, and as they scale up, the configuration issues multiply. With the newest services, centralized, automated configuration and security management dashboards may not be available, and companies must either write their own or wait for a vendor to step up and fill the gap.
