Large language models (LLMs) still fall short at vulnerability discovery and exploitation tasks.
Many threat actors therefore remain skeptical about using AI tools for such roles.
This is according to new research by Forescout Research – Vedere Labs, which tested 50 current AI models from commercial, open-source and underground sources to evaluate their ability to perform vulnerability research (VR) and exploit development (ED).
The VR tasks asked models to identify a specific vulnerability in a short code snippet, while the ED tasks required them to generate a working exploit for a vulnerable binary.
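The report does not reproduce the actual challenge code, but a VR task of that shape typically resembles the minimal C sketch below: a short, deliberately vulnerable snippet (here a classic stack buffer overflow, chosen purely for illustration and not taken from the Forescout tests) that a model would be asked to spot and explain.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only, not from the Forescout study: a classic
 * stack-based buffer overflow (CWE-121) of the kind short VR
 * challenges often use. */
static void greet(const char *name) {
    char buf[16];
    strcpy(buf, name);      /* no bounds check: input longer than 15 bytes
                               overflows buf and corrupts the stack */
    printf("Hello, %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1) {
        greet(argv[1]);     /* argv[1] is attacker-controlled */
    }
    return 0;
}
```

Passing a VR task of this kind would mean naming the unbounded strcpy as the flaw; the corresponding ED task goes further, requiring input that actually hijacks control flow in a compiled binary.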
Failure rates were high across all models. Around half (48%) failed the first VR task, and 55% failed the second. Around two-thirds (66%) failed the first ED task, and 93% failed the second.
No single model completed all tasks.
Most models were unstable, often producing inconsistent results across runs and occasionally encountering timeouts or errors. In several ED cases, generating a working exploit required multiple attempts over several hours.
Even when models completed ED tasks, they required substantial user guidance, such as interpreting errors, debugging output or manually steering the model toward viable exploitation paths.
“We are still far from LLMs that can autonomously generate fully functional exploits,” the researchers noted.
Cybercriminals Remain Skeptical About AI Capabilities
The study, published on July 10, also analyzed several underground forums to see how cybercriminal communities view the potential of AI.
Experienced threat actors tended to express skepticism or caution, with many of their comments downplaying the current utility of LLMs.
Enthusiasm for AI-assisted exploitation tended to come from less experienced users.
“Despite recent claims that LLMs can write code surprisingly well, there is still no clear evidence of real threat actors using them to reliably discover and exploit new vulnerabilities,” the researchers wrote.
Many threat actors did, however, highlight the usefulness of LLMs for certain technical assistance, such as generating boilerplate code and handling other basic software automation tasks.
Capabilities Vary Across Different AI Models
The Forescout research found that open-source models were the most unreliable for VR and ED, with all 16 models tested performing poorly across all tasks.
These models were available on the Hugging Face platform, which provides thousands of pre-trained AI models to its community.
“Overall, this category remains unsuitable even for basic vulnerability research,” the researchers noted.
Underground models are fine-tuned for malicious use and distributed via dark web forums and Telegram channels. These include customized tools built on publicly available models, such as WormGPT and GhostGPT.
While they performed better than open-source models, these tools were hampered by usability issues, including limited access, unstable behavior, poor output formatting and restricted context length.
General-purpose commercial models from major tech providers, such as ChatGPT, Gemini and Copilot, delivered the best performance, though some were occasionally limited by alignment safeguards. Even in this category, only three models succeeded in producing a working exploit for the most difficult test cases.
AI Capabilities Set to Grow
Despite the findings, the study observed that generative AI showed rapid improvements in both VR and ED over the three-month testing window.
“These results suggest that generative AI hasn’t yet transformed how vulnerabilities are discovered and exploited by threat actors, but that may be about to change. The age of ‘vibe hacking’ is approaching, and defenders should start preparing now,” the researchers added.
Forescout said AI is likely to make exploits more frequent, but not more sophisticated. Core cybersecurity measures such as least privilege, network segmentation and zero trust therefore remain just as relevant in mitigating such attacks.