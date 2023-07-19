As applications increasingly include prepackaged software components that take advantage of generative AI capabilities based on large language models (LLMs), the danger of vulnerabilities making it into production looms greater than ever, according to research by cybersecurity company Endor Labs.

A new report dubbed “State of Dependency Management 2023” notes that even though application developers use only a fraction of these packages (components such as libraries and modules designed for easy use and installation into software programs), the packages have numerous dependencies and may make risky API calls.

“LLMs are a great support for many day-to-day programming tasks. However, it is important for developers to verify the output provided by LLMs before including it in production code,” said Henrik Plate, security researcher at Endor Labs and the author of the report.

For the research, Endor Labs used the Census II data set from the Linux Foundation and Harvard, the company’s in-house API categories and vulnerability database, open source GitHub repositories, and packages published in the npm and PyPI package repositories.

Onslaught of LLM/AI-enabled packages

While tracking newly published packages uploaded to the npm and PyPI repositories that made calls to the OpenAI API, Endor Labs found that more than 636 new PyPI and npm packages have been created to use the API since the launch of ChatGPT’s API in January 2023. Additionally, 276 existing packages added support for the ChatGPT API.

The research also noted that this is just a subset of the total number of ChatGPT-enabled packages, as the number of private projects experimenting with LLMs is even bigger.

When the GitHub repositories of the top 100 AI projects were scanned, they were found to reference, on average, 208 direct and transitive dependencies.
Eleven percent of the projects were found to rely on 500-plus dependencies.

Fifteen percent of these GitHub repositories contain 10 or more known vulnerabilities. The Transformers package distributed by Hugging Face (transformers being the architecture that ChatGPT is based on) has over 200 dependencies, which include four known vulnerabilities.

Dependencies make calls to security-sensitive APIs

Fifty-five percent of applications tracked by Endor Labs make calls to security-sensitive APIs, programming interfaces that link to critical resources which, if compromised, could affect the security of an asset. That number grows to 95%, however, when the dependencies of software component packages are tracked.

“Every considerable application includes dependencies that call into a big share of JCL’s [Java Class Library, which comprises the core APIs provided by the Java runtime] sensitive APIs,” Plate said.

The research further revealed that 71% of Census II Java packages call five or more categories of security-sensitive APIs when all the dependencies are considered.

“Applications often use only a small portion of the open-source components they integrate, and developers rarely understand the cascading dependencies of components,” Plate added. “In order to satisfy transparency requirements while protecting brand reputation, organizations need to go beyond basic SBOMs.”

Just knowing which components are included in production isn’t sufficient anymore; understanding which functions the components use is critical too, according to the report.

LLMs still bad at malware detection

Endor Labs used LLMs from OpenAI and Google Vertex AI to evaluate how they can be used to help classify malware.
For the evaluation, both LLMs were presented with identical code snippets as prompts and asked to rate their malicious potential on a scale of 0-9.

“We were interested to learn how consistent their results were for 3,374 test cases. On considering a scoring difference of 0 or 1 to be agreement, we found they agreed on 89% of the cases,” the report said.

But both models fell considerably short at effectively classifying malware and produced a huge number of false positives. While OpenAI GPT-3.5 accurately classified 3.4% of the code snippets, Vertex AI text-bison was 7.9% accurate.

“The main culprit was minified/packed JavaScript, i.e., JavaScript code that was more or less heavily changed in order to save space/bandwidth when transmitting it to a user’s browser,” Plate said. “Unfortunately, minification and packing is pretty common and do not only exist in npm packages but also in Python packages that bundle some sort of UI. Very often, the LLMs classified such code as malicious just because it looks obfuscated.”

“With this number of false positives, the feedback of LLMs becomes almost useless, which is a pity because the feedback for non-obfuscated code is oftentimes very good,” Plate added.

He further noted that while preprocessing such code can at times reduce false positives, the obfuscation example indicates that LLMs struggle with complex programming logic.

Adversaries can use this limitation to evade detection, which may lead to undetected malware, that is, false negatives.

“In my opinion, the biggest takeaway from this analysis is that general purpose LLMs, like GPT, shouldn’t be relied upon for specialized purposes,” said Katie Norton, an analyst at IDC. “It is still a tricky area for using generative AI because oftentimes when identifying malware you are looking for unknowns, which is something you can’t train a model on.”
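
For readers who want to see how the agreement figure quoted from the report could be computed, here is a minimal Python sketch. It assumes each model returns one integer score from 0 to 9 per snippet and counts two scores as agreeing when they differ by 0 or 1, as the report describes. The function name `agreement_rate`, its `tolerance` parameter, and the sample scores are hypothetical illustrations, not Endor Labs' code or data.

```python
def agreement_rate(scores_a, scores_b, tolerance=1):
    """Return the fraction of cases where two scores differ by at most `tolerance`."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not scores_a:
        raise ValueError("no cases to compare")
    # A pair counts as agreement when the absolute score difference is 0 or 1.
    agreed = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return agreed / len(scores_a)


# Made-up 0-9 scores for five snippets (illustrative only, not from the report).
model_a_scores = [0, 3, 9, 5, 1]
model_b_scores = [1, 3, 6, 5, 4]

print(agreement_rate(model_a_scores, model_b_scores))  # prints 0.6
```

In this toy input, three of the five pairs fall within one point of each other, so the rate is 0.6. Note that a high agreement rate only says the two models behave alike; as the accuracy figures above show, it says nothing about whether either one is right.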