
July 29, 2025 by NYU Tandon School of Engineering
Collected at: https://techxplore.com/news/2025-07-ai-agent-autonomously-complex-cybersecurity.html
Artificial intelligence agents—AI systems that can work independently toward specific goals without constant human guidance—have demonstrated strong capabilities in software development and web navigation. Their effectiveness in cybersecurity has remained limited, however.
That may soon change, thanks to a research team from NYU Tandon School of Engineering, NYU Abu Dhabi and other universities that developed an AI agent capable of autonomously solving complex cybersecurity challenges.
The system, called EnIGMA, was presented this month at the International Conference on Machine Learning (ICML) 2025 in Vancouver, Canada.
“EnIGMA is about using Large Language Model agents for cybersecurity applications,” said Meet Udeshi, an NYU Tandon Ph.D. student and co-author of the research. Udeshi is advised by Ramesh Karri, Chair of NYU Tandon’s Electrical and Computer Engineering Department (ECE) and a faculty member of the NYU Center for Cybersecurity and NYU Center for Advanced Technology in Telecommunications (CATT), and by Farshad Khorrami, ECE professor and CATT faculty member. Both Karri and Khorrami are co-authors on the paper, with Karri serving as a senior author.
To build EnIGMA, the researchers started with an existing framework called SWE-agent, which was originally designed for software engineering tasks. However, cybersecurity challenges required specialized tools that didn’t exist in previous AI systems. “We have to restructure those interfaces to feed it into an LLM properly. So we’ve done that for a couple of cybersecurity tools,” Udeshi explained.
The key innovation was developing what they call “Interactive Agent Tools” that convert visual cybersecurity programs into text-based formats the AI can understand. Traditional cybersecurity tools like debuggers and network analyzers use graphical interfaces with clickable buttons, visual displays, and interactive elements that humans can see and manipulate.
“Large language models process text only, but these interactive tools with graphical user interfaces work differently, so we had to restructure those interfaces to work with LLMs,” Udeshi said.
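The paper describes these wrappers as Interactive Agent Tools, but the exact implementation is not reproduced here. As a rough sketch of the general idea, assume an interactive command-line tool (a debugger such as gdb, for instance) kept alive as a subprocess, with a text-only `send` method the agent can call; the class name and echo stand-in below are hypothetical:

```python
import subprocess

class InteractiveTool:
    """Minimal sketch (not the authors' implementation) of wrapping an
    interactive command-line tool so an LLM agent can drive it via text.
    The process stays alive between calls; each send() writes one command
    to the tool's stdin and returns its next line of output."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered text streams
        )

    def send(self, command: str) -> str:
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.terminate()

# Hypothetical usage: a stand-in "tool" that acknowledges each command,
# playing the role a debugger would in the real system.
tool = InteractiveTool(
    ["python3", "-u", "-c",
     "import sys\nfor line in sys.stdin: print('ack: ' + line.strip(), flush=True)"]
)
print(tool.send("break main"))  # -> ack: break main
tool.close()
```

The key design point is that the wrapper keeps the tool's session state (breakpoints, open connections) alive between the agent's turns, while exposing only plain text in both directions.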
The team built their own dataset by collecting and structuring Capture The Flag (CTF) challenges specifically for large language models. These gamified cybersecurity competitions simulate real-world vulnerabilities and have traditionally been used to train human cybersecurity professionals.

Figure: EnIGMA is an LM agent fed with CTF challenges from the NYU CTF benchmark. It interacts with the computer through an environment that is built on top of SWE-agent (Yang et al., 2024) and extends it to cybersecurity. New interactive tools assist the agent in debugging and connecting to remote servers. The agent iterates through interactions and feedback from the environment until it solves the challenge. Credit: Talor Abramovich et al.
“CTFs are like a gamified version of cybersecurity used in academic competitions. They’re not true cybersecurity problems that you would face in the real world, but they are very good simulations,” Udeshi noted.
Paper co-author Minghao Shao, an NYU Tandon Ph.D. student and Global Ph.D. Fellow at NYU Abu Dhabi who is advised by Karri and Muhammad Shafique, Professor of Computer Engineering at NYU Abu Dhabi and ECE Global Network Professor at NYU Tandon, described the technical architecture: “We built our own CTF benchmark dataset and created a specialized data loading system to feed these challenges into the model.” Shafique is also a co-author on the paper.
The framework includes specialized prompts that provide the model with instructions tailored to cybersecurity scenarios.
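The paper does not reproduce its prompts verbatim, but the general shape of a challenge-tailored system prompt can be sketched. Everything below, including the template wording, the challenge record schema, and the field names, is an illustrative assumption, not the authors' actual prompt:

```python
# Illustrative sketch only: a CTF-tailored system prompt assembled from
# challenge metadata. The template text and the record schema are
# hypothetical, not taken from the EnIGMA paper.

CTF_SYSTEM_PROMPT = """You are a cybersecurity expert solving a Capture The Flag challenge.
Category: {category}
Challenge name: {name}
Files provided: {files}
Remote server: {server}

Work step by step. Use the available tools to inspect files, debug
binaries, and connect to the remote server. The flag format is {flag_format}.
"""

def build_prompt(challenge: dict) -> str:
    """Fill the template from a challenge record (hypothetical schema)."""
    return CTF_SYSTEM_PROMPT.format(
        category=challenge.get("category", "unknown"),
        name=challenge["name"],
        files=", ".join(challenge.get("files", [])) or "none",
        server=challenge.get("server", "none"),
        flag_format=challenge.get("flag_format", "flag{...}"),
    )

example = {
    "name": "baby_rev",
    "category": "reverse engineering",
    "files": ["chal.bin"],
    "flag_format": "csawctf{...}",
}
print(build_prompt(example))
```

Loading each challenge into such a template is one plausible reading of the “specialized data loading system” Shao describes: the metadata tells the model what category of techniques to reach for before it takes its first action.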
EnIGMA was tested on 390 CTF challenges across four different benchmarks, achieving state-of-the-art results and solving more than three times as many challenges as previous AI agents.
When the research was conducted, approximately 12 months ago, “Claude 3.5 Sonnet from Anthropic was the best model, and GPT-4o was second at that time,” according to Udeshi.
The research also identified a previously unknown phenomenon called “soliloquizing,” where the AI model generates hallucinated observations without actually interacting with the environment, a discovery that could have important consequences for AI safety and reliability.
Beyond this technical finding, the potential applications extend outside of academic competitions. “If you think of an autonomous LLM agent that can solve these CTFs, that agent has substantial cybersecurity skills that you can use for other cybersecurity tasks as well,” Udeshi explained. The agent could potentially be applied to real-world vulnerability assessment, with the ability to “try hundreds of different approaches” autonomously.
For Udeshi, whose research focuses on industrial control system security, the framework opens new possibilities for securing robotic systems and industrial control systems. Shao sees potential applications beyond cybersecurity, including quantum code generation and chip design vulnerability detection.
The researchers acknowledge the dual-use nature of their technology. While EnIGMA could help security professionals identify and patch vulnerabilities more efficiently, it could also potentially be misused for malicious purposes. The team has notified representatives from major AI companies, including Meta, Anthropic, and OpenAI about their results.
More information: EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities: icml.cc/virtual/2025/poster/45428
