
Shweta Sharma
Senior Writer

Dependencies in LLM packages open apps to vulnerabilities: Report

Jul 19, 2023 | 5 mins
DevSecOps, Threat and Vulnerability Management

Open-source packages with large language model (LLM) capabilities have many dependencies that make calls to security-sensitive APIs, according to a new Endor Labs report.

As applications increasingly include prepackaged software components that take advantage of generative AI capabilities based on large language models (LLMs), the danger of vulnerabilities making it into production looms greater than ever, according to research by cybersecurity company Endor Labs.

A new report dubbed "State of Dependency Management 2023" notes that even though application developers use only a fraction of these packages -- components such as libraries and modules designed for easy use and installation into software programs -- the packages have numerous dependencies and may make risky API calls.

"LLMs are a great support for many day-to-day programming tasks. However, it is important for developers to verify the output provided by LLMs before including it in production code," said Henrik Plate, security researcher at Endor Labs and the author of the report.

For the research, Endor Labs used the Census II data set from the Linux Foundation and Harvard, the company’s in-house API categories and vulnerability database, open source GitHub repositories, and packages published in the npm and PyPI package repositories.

Onslaught of LLM/AI enabled packages

Tracking newly published npm and PyPI packages that make calls to the OpenAI API, Endor Labs found that more than 636 new packages have been created to use the ChatGPT API since its launch in January 2023. Additionally, 276 existing packages added support for the ChatGPT API.

The research also noted that this is just a subset of the total number of ChatGPT-enabled packages, as the number of private projects experimenting with LLMs is even bigger.

When the GitHub repositories of the top 100 AI projects were scanned, they were found to reference, on average, 208 direct and transitive dependencies. Eleven percent of the projects were found to rely on more than 500 dependencies.

Fifteen percent of these GitHub repositories contain 10 or more known vulnerabilities. The Transformers package distributed by Hugging Face (the transformer being the architecture that ChatGPT is based on) has over 200 dependencies, which include four known vulnerabilities.
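To illustrate the difference between direct and transitive dependencies, here is a minimal Python sketch that walks a dependency graph breadth-first. The graph, the package names, and the function `transitive_dependencies` are all hypothetical and are not taken from the Endor Labs report; real tooling would read this information from lockfiles or package metadata.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from `root`, excluding `root` itself.

    `graph` maps a package name to the list of packages it depends on directly.
    """
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        pkg = queue.popleft()
        if pkg in seen or pkg == root:
            continue  # already visited, or a cycle back to the root
        seen.add(pkg)
        queue.extend(graph.get(pkg, ()))
    return seen


# Hypothetical dependency graph, for illustration only.
graph = {
    "app": ["llm-client", "web-framework"],
    "llm-client": ["http-lib", "json-lib"],
    "web-framework": ["http-lib", "template-lib"],
    "http-lib": ["tls-lib"],
}

direct = set(graph["app"])
all_deps = transitive_dependencies(graph, "app")
indirect_only = all_deps - direct

print(f"direct: {len(direct)}, total: {len(all_deps)}")
```

In this toy graph the application declares two dependencies but ends up pulling in six packages, which is the kind of gap the report's figures describe at a much larger scale.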

Dependencies make calls to security-sensitive APIs

Fifty-five percent of applications tracked by Endor make calls to security-sensitive APIs -- programming interfaces that link to critical resources which, if compromised, could affect the security of an asset. That number grows to 95%, however, when the dependencies of software component packages are tracked.

"Every considerable application includes dependencies that call into a big share of JCL's -- Java Class Library, which comprises the core APIs provided by the Java runtime -- sensitive APIs," Plate said.

The research further revealed that 71% of Census II Java packages call five or more categories of security-sensitive APIs when all the dependencies are considered.

"Applications often use only a small portion of the open-source components they integrate, and developers rarely understand the cascading dependencies of components," Plate added. "In order to satisfy transparency requirements while protecting brand reputation, organizations need to go beyond basic SBOMs."

Just knowing which components are included for production isn’t effective anymore -- understanding which functions the components use is critical too, according to the report.

LLMs still bad at malware detection

Endor Labs used LLMs from OpenAI and Google Vertex AI to evaluate how well they can help classify malware. For the evaluation, both models were given identical code snippets as prompts and asked to rate their malicious potential on a scale of 0-9.

"We were interested to learn how consistent their results were for 3,374 test cases. On considering a scoring difference of 0 or 1 to be agreement, we found they agreed on 89% of the cases," the report said.
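The agreement measure described in the report can be sketched in a few lines of Python. The scores below are made up for illustration, and the function name `agreement_rate` is an assumption; only the rule (a scoring difference of 0 or 1 counts as agreement) comes from the report.

```python
def agreement_rate(scores_a, scores_b, tolerance=1):
    """Fraction of cases where two raters' scores differ by at most `tolerance`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must cover the same test cases")
    if not scores_a:
        raise ValueError("at least one test case is required")
    agreeing = sum(
        1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance
    )
    return agreeing / len(scores_a)


# Hypothetical 0-9 maliciousness scores from two models for five snippets.
model_a = [0, 3, 9, 5, 7]
model_b = [1, 3, 6, 4, 9]

rate = agreement_rate(model_a, model_b)
print(f"agreement: {rate:.0%}")
```

Note that this measures only how often the two models agree with each other, not whether either of them is right about the snippet.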

But both models fell considerably short of classifying the malware effectively and produced a huge number of false positives. While OpenAI GPT-3.5 accurately classified 3.4% of the code snippets, Vertex AI text-bison was 7.9% accurate.

"The main culprit was minified/packed JavaScript, i.e., JavaScript code that was more or less heavily changed in order to save space/bandwidth when transmitting it to a user's browser," Plate said. "Unfortunately, minification and packing is pretty common and do not only exist in npm packages but also in Python packages that bundle some sort of UI. Very often, the LLMs classified such code as malicious just because it looks obfuscated."

"With this number of false positives, the feedback of LLMs becomes almost useless, which is a pity because the feedback for non-obfuscated code is oftentimes very good," Plate added.

He further noted that while preprocessing such code can work at times to reduce false positives, the obfuscation example indicates LLMs struggle with complex programming logic.
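One possible preprocessing step, shown here purely as an assumption and not as the approach Endor Labs used, is to flag code that looks minified before it is sent to an LLM, so that it can be unpacked or handled separately. The function name `looks_minified` and both thresholds are hypothetical and untuned.

```python
def looks_minified(source, max_avg_line_len=200, min_whitespace_ratio=0.05):
    """Rough heuristic: minified code tends to have very long lines
    and very little whitespace. Thresholds are illustrative guesses."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return False
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    whitespace_ratio = sum(ch.isspace() for ch in source) / len(source)
    return avg_line_len > max_avg_line_len or whitespace_ratio < min_whitespace_ratio


readable = "function add(a, b) {\n  return a + b;\n}\n"
packed = "var a=function(b,c){return b+c};" * 20  # one long line

print(looks_minified(readable), looks_minified(packed))
```

A heuristic like this only separates minified code from readable code; it says nothing about whether either is malicious.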

Adversaries can exploit this limitation to evade detection, which may lead to false negatives, that is, malware that goes undetected.

"In my opinion, the biggest takeaway from this analysis is that general purpose LLMs, like GPT, shouldn't be relied upon for specialized purposes," said Katie Norton, an analyst at IDC. "It is still a tricky area for using generative AI because oftentimes when identifying malware you are looking for unknowns, which is something you can't train a model on."