Andrada Fiscutean
Freelance writer

Why you can’t trust AI-generated autocomplete code to be secure

Mar 15, 2022 | 9 mins
Artificial Intelligence, DevSecOps

Artificial intelligence-powered tools such as GitHub Copilot and Tabnine offer developers autocomplete suggestions that help them write code faster. How do they ensure this code is secure?


When GitHub launched the code autocomplete tool Copilot in June 2021, many developers were in awe, saying it reads their minds and helps them write code faster. Copilot looks at the variable names and comments someone writes and suggests what should come next. It provides lines of code or even entire functions the developer might not know how to write.

However, accepting unfamiliar suggestions without verifying them can introduce security weaknesses. Researchers at New York University’s Tandon School of Engineering put Copilot to the test and saw that 40% of the code it generated in security-relevant contexts had vulnerabilities.

“Copilot’s response to our scenarios is mixed from a security standpoint, given the large number of generated vulnerabilities,” the researchers wrote in a paper. They checked the code using GitHub’s CodeQL, which automatically looks for known weaknesses, and found that developers often get SQL-injection vulnerabilities or flaws included on the 2021 CWE Top 25 Most Dangerous Software Weaknesses list. Also, when it comes to domain-specific languages, such as Verilog, Copilot struggles to generate code that’s “syntactically correct and meaningful.”
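To make the SQL-injection risk concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not taken from the NYU paper or from Copilot's output. The table, the function names and the payload are hypothetical, chosen only to contrast a concatenated query with a parameterized one.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable pattern: user input is concatenated into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Safer pattern: the value is passed as a bound parameter,
    # so the driver never parses it as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "x' OR '1'='1"
    print(find_user_unsafe(conn, payload))  # the injected condition matches every row
    print(find_user_safe(conn, payload))    # no user has this literal name
```

Both functions look equally plausible as an autocomplete suggestion, which is why a reviewer or a scanner has to catch the difference.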

Many of these issues stem from how Copilot was built. First, the model was trained on code posted by anyone on GitHub, and a large proportion of it has not been vetted. Second, open-source repositories might contain a lot of repeating code patterns that do not implement sufficient bounds and checks on input and behavior. Copilot ends up suggesting such patterns on the assumption that the more frequent a pattern, the more widely used and, thus, secure it is. Third, the code it produces is not compiled and not checked for potential security issues.

Researchers also say that secrets inadvertently left in someone’s repository could surface in code that’s automatically suggested to another person, which raises ethical and intellectual property concerns.

As code autocompletion tools become more popular, such issues start to have implications for both individual developers and organizations. Using these tools securely and finding ways to build new ones that offer safer code could revolutionize programming. Developers and security experts say the industry should pay attention to several things, ranging from training these models on the right set of data to following best practices for writing code and putting security checks into place.

“I think there’s certainly a lot of good to be had,” said Hammond Pearce, research assistant professor at NYU Tandon School of Engineering, who co-authored the paper.

How AI-based autocompletion tools generate code

GitHub Copilot was built on top of Codex, a model developed by OpenAI. Codex is a descendant of GPT-3 and uses a large neural network that predicts human-like text or snippets of code. “There is a difference between GitHub Copilot and OpenAI Codex,” says Pearce. “We’ve used both models quite extensively, and we’ve observed that there’s definitely some additional functionality in there.” (GitHub did not say what this additional functionality is. It declined a request from CSO Online to talk about the security of AI-generated code.)

According to the NYU researchers, some bits of suggestions these two models offer are taken from code already written by random developers who put their work on GitHub. However, the models often come up with code that has never been seen before. “[Copilot] is very good at coming up with novel code that has not necessarily been seen before. But will it reproduce patterns? Yes, it will,” says Pearce.

A big issue with AI-powered code autocompletion tools like Copilot, Tabnine, Debuild, or AlphaCode is that they lack context, which can be problematic from a security standpoint, says Brendan Dolan-Gavitt, a member of NYU Center for Cybersecurity and an assistant professor of computer science and engineering at NYU Tandon. “These models don’t understand code in the sense of knowing this is a variable, and this thing is an integer,” he says. “They understand it at the level of this text I have seen is usually followed by this other text. It doesn’t have a notion of what is good code.”
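The "this text is usually followed by this other text" idea can be illustrated with a toy frequency model. The sketch below is a drastic simplification and an assumption made for illustration only. Codex is a large neural network, not a bigram counter, and the three-line corpus is invented. It shows how a purely statistical predictor repeats whatever pattern it saw most often, secure or not.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    # Count, for every token, which token follows it and how often.
    follows = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def suggest_next(follows, token):
    # Suggest the most frequently seen follower, with no notion of code quality.
    candidates = follows.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical training data: the outdated pattern is simply more common.
corpus = [
    "digest = hashlib.md5 ( data )",
    "digest = hashlib.md5 ( payload )",
    "digest = hashlib.sha256 ( data )",
]
model = train_bigrams(corpus)

if __name__ == "__main__":
    print(suggest_next(model, "="))  # hashlib.md5, because it appears twice
```

Nothing in the model encodes that one hash function is weaker than the other. Frequency alone decides the suggestion.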

Training the programming models

While Copilot is a universal model and can speak any programming language, it tends to be more accurate with Python, JavaScript, TypeScript, Ruby and Go, because more code has been written in these languages and placed on GitHub. The more data there is to train an AI, the better the predictions.

“GitHub has a massive amount of code,” says Brandon Jung, vice president of ecosystem and business development at Tabnine. “There’s a lot of data they can shove into a model. The theory goes, if you get a big enough model and throw enough [computing power] at it, you can have one model to rule them all.”

The problem is not with quantity but with quality. “From a statistical point of view, there are more bad code examples than there are good ones even in popular repositories,” says Lucian Nitescu, red team tech lead at Bit Sentinel.

Even good code can lead to bad outcomes because best practices change over time. According to the NYU researchers, Copilot frequently tried to generate code that was acceptable many years ago but is now considered insecure. For instance, it suggested developers use the MD5 hashing algorithm.

“When [developers] wrote that code, it was good code,” says Pearce, “but now, the world moved on. We learned new strategies for doing things; we learned new strategies for attacking things; we have more powerful machines. As a result, that code is no longer secure from a best practice point of view.”
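As an illustration of how such advice ages, the sketch below contrasts an MD5-based hash with a salted, iterated alternative from Python's standard library. It assumes the hash is being used to store passwords, which the article does not specify. The function names and the iteration count are illustrative choices and should be checked against current guidance before any real use.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password_legacy(password: str) -> str:
    # Outdated pattern: MD5 is fast and unsalted, so it is cheap to brute-force.
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def hash_password(
    password: str, salt: Optional[bytes] = None, iterations: int = 600_000
) -> Tuple[bytes, bytes]:
    # Salted, deliberately slow key derivation (PBKDF2-HMAC-SHA256).
    # The iteration count is an assumption; tune it to current recommendations.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(
    password: str, salt: bytes, expected: bytes, iterations: int = 600_000
) -> bool:
    # Recompute with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)
```

A tool trained mostly on older repositories has no way of knowing that the first function reflects yesterday's practice.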

Such flawed suggestions are inevitable when AI-powered code autocomplete tools are trained on random code rather than on code that is known to follow current best practices and has been audited for security flaws. Organizations should be aware of that, says Jung.

Tabnine’s approach is a little bit different than that of Copilot. Instead of building a big universal model operating across several programming languages, it makes smaller models that are customized to meet teams’ needs. Instead of a Swiss Army knife, it offers clients personalized Phillips head screwdrivers if screwdrivers are all someone needs. According to Jung, it means that the training set, while smaller than that of GitHub, can be more targeted, more up-to-date, and better written. “So, suggestions will be better,” he argues.

Another advantage to smaller models is that they can work anywhere and require less computing power than large ones. They can run on a developer’s laptop without the need to upload code to the cloud, which might make professionals or companies uncomfortable.

Jung says Tabnine updates its universal model on a quarterly basis and individual models as often as needed. It’s not just models that need training. For the human-AI pair to write good code, the person’s access to knowledge should also be facilitated.

Training devs to use AI-generated code

Often, young developers are more likely to accept suggestions coming from an AI, especially when they lack experience in a particular area. The goal should be to also help them learn, not just to help them finish their job quickly. Jung argues that the best way to do that is by offering small chunks of code rather than large snippets.

“The mental overhead for a developer is much, much higher on a huge snippet,” Jung says. “For the industry, it’s really bad because people aren’t learning to actually code. They don’t understand how it works.”

Large snippets look seductive, but they can cause harm in the long run because they lack any explanation or context around them. It can be worse than copying code from Stack Overflow, he argues. “In Stack Overflow, there are comments,” Jung says. “You may not read it, although you should, but at least there’s a chance that people are learning why they would implement something, and how it was implemented.”

Tabnine experimented with large snippets of code a few years ago but ultimately decided against that. Instead, it offers developers only small chunks, which can be better written by the AI and better understood by devs, which ultimately leads to code that’s more likely to follow best practices and be secure.

To educate devs, the company also facilitated the transfer of knowledge from senior to junior members of a team. Every time senior devs write code, that team’s personalized model evolves. It looks at how seniors write code, making recommendations to help junior developers. “I’m not going to pretend it’s as good as someone sitting next to you and helping you, because it’s not,” Jung says, “but it’s dramatically better than where we’ve been in terms of security of code, quality of code, and just getting a dev much faster up to speed in a codebase.”

Security checks on AI-generated code

AI-powered code autocomplete tools have improved in recent years, but there’s still one important thing they lack. Tools like Copilot or Tabnine don’t “look at the state of security of the code they examine, nor the code they produce,” says Greg Young, vice president of cybersecurity at Trend Micro.

The task of checking the code falls to the developers. “The provided code should not be considered fully secure unless we perform an exhaustive security analysis on every function and its uses,” says Nitescu. He adds that developers should perform “automated and manual security code reviews and follow the domain best practices defined within any specific language documentation before using generated code in production environments.”

Nitescu recommends the following standards and documentation:

There are also automatic tools that check the code for potential security issues. One, CodeQL, which the NYU team used when testing Copilot, looks for a wide range of security weaknesses and can identify bugs.
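To give a sense of what an automated check does, here is a deliberately tiny sketch that walks a Python syntax tree and flags calls to weak hash functions. It is not CodeQL and does not use CodeQL's query language. The rule set and the function name are hypothetical, and a real analyzer covers far more weakness classes and tracks data flow.

```python
import ast

# Hypothetical rule set for this illustration only.
WEAK_HASHES = {"md5", "sha1"}


def find_weak_hash_calls(source):
    # Return (line number, function name) for each hashlib.<weak>() call.
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "hashlib"
            and node.func.attr in WEAK_HASHES
        ):
            findings.append((node.lineno, node.func.attr))
    return sorted(findings)


if __name__ == "__main__":
    generated = "import hashlib\nh = hashlib.md5(b'x')\nk = hashlib.sha256(b'x')\n"
    for line, name in find_weak_hash_calls(generated):
        print(f"line {line}: hashlib.{name} is considered weak")
```

A check like this could run on every accepted suggestion before it is committed, but it only catches the patterns someone thought to write a rule for.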

Both Tabnine and Copilot advise developers to use such tools to check their work. Still, they don’t incorporate security checks, although security experts say that such a feature would be nice to have. “It would be great if Copilot today had AI to improve security,” says Young, “but maybe in the future.”

Eventually, AI-powered code generators could evolve to filter out bugs and even suggest secure implementations, but there’s still a long way to go. For now, developers and code maintainers are responsible for the security of the code they submit.