- Executive Offense
- Posts
- RSA 2026: Hot Takes on AI, Agents, and Offensive Security Reality Checks
RSA 2026: Hot Takes on AI, Agents, and Offensive Security Reality Checks
Hot takes on the real state of AI in offensive security
Hey everyone!
RSA 2026 is in the rearview mirror and I've got a lot to say. This is going to be a longer one, so buckle up. I want to talk about the conference itself, the current state of AI in offensive security (some genuine signal in the noise), the real cost math nobody is doing out loud, and whether "agentic pen testing" is actually what the headlines say it is.
Let's get into it.
The TL;DR
AI is helping both offense and defense, but offense iterates faster while defense gains unprecedented scale through agents
Running 24/7 AI agents via API costs ~$72K/year, still cheaper than a junior pen tester, but scaling gets expensive fast
98% of the industry is still heavily human in the loop, even the best teams
Frontier models are still 10-20% ahead of open source for offensive security work
Model refusals for cybersecurity are increasing but inconsistent and often bypassable
Bug finding benchmarks need scrutiny, most hype is source-assisted, not true black box discovery
/ RSA: Hacker Spring Break
If DEF CON and Black Hat are "Hacker Summer Camp," then RSA is affectionately known as "Hacker Spring Break." Different vibe entirely. More suits, more vendors, more corporate energy. And honestly? The show floor is kind of a monster. It represents a lot of what people don't love about corporate cybersecurity: booths stacked wall to wall with marketing, swag cannons firing at you from every direction, vendors competing for your badge scan. If that were all RSA was, I'd probably skip it.
But that's not what RSA is. That’s the SHOW FLOOR.
RSA has spent a significant amount of time to bump up the technicality and practicality of the conference talks, with some of the worlds best researchers doing talks at the actual con. They also have the villages which ate top tier outfits. On top of this you can also spend time NOT at the conference itself. Some amazing value of RSA is comes in what happens around it. The dinners. The side events. The rooftop meetups, the hallway conversations, the running into someone you've been meaning to catch up with for six months. Your people are there. The community shows up. And even if the showroom floor makes you feel like you're inside a very aggressive SaaS catalog, the conversations happening around it are genuinely valuable.
(Sponsor)
Threat Actors Mine Stealer Logs. Are You Exposed?
Infostealers harvest session cookies that bypass MFA entirely. No password needed. Flare gives defenders the same visibility as initial access brokers, with auto-remediation through your identity provider when stolen cookies, credentials, or employee data surface in stealer logs, dark web forums, or Telegram channels.
✔️ 25B leaked credentials indexed
✔️ 22K+ Telegram channels covered
Start Your Free Trial
(Jason Note: Flare is the BEST company to work with on this problem of identifying leaked creds for your org. We use them at Arcanum and they have the best research team i’ve ever met)
/ AI and Offensive Security: The State of Things (Opinion Section)
I want to flag upfront that this section is opinion. Informed opinion, based on a lot of conversations with people I trust this week, but still opinion. Take it accordingly.
At RSA, we saw almost every major friend and acquaintance in the offensive security AI space. We did a panel with XBOW and OpenAI specifically on AI capabilities for offensive security, which was a great conversation (I'll link resources from that when they're available).
Here's my honest read on where the industry is right now.
Hybrid Teams Are the Emerging Norm
A whole bunch of people are talking about hybrid testing teams now, meaning they're giving their pen testers access to tools like Claude Code and building out custom skills. These had varying ranges of complexity, from some very simple prompting to some very complex, genuinely impressive engineering. The gap between the top and the bottom of this curve is massive, but results across the board were really good.
The "Agentic Pen Testing" Label Problem
We're also seeing a bunch of smaller consultancies or smaller businesses pivot to say they're doing "advanced agentic-based pen testing." Really, when you look under the hood, they're using an agent framework like Claude Code or Codex skills to run their whole system, which has a lot of frailty to it. If one of the frontier model vendors goes down or changes their API terms overnight, you basically have no business. It's a dependency risk that I don't think everyone has fully priced in.
Is AI Helping Offense or Defense More?
The conversation we had with a lot of people at RSA was the very common question of "Is AI helping offense or defense more?" which is a silly question. I think it's both. But I think offense is iterating a little bit faster, whereas defense will take longer to catch up but gains the momentum of scale that defense has never had before by using agents.
The sub question that comes out of here is that it's really hard to attribute how much both offensive security consultancies (the good guys) and also the bad guys are gaining in efficiency by AI building tools for them. A lot of the stuff I talked about at RSA this week was that we have seen on the dark web an explosion of new tools like phishing frameworks and exploit kits and a bunch of software to make hacking easier. But it's not directly referenced in the way that people mean when they ask that question. It's not like AI agents are doing the hacking; it's that bad guys are making better, faster, and easier-to-use tools. And so are offensive security engineers at every consultancy I'm talking to right now. Pretty much everyone has retrofitted their reporting, their reconnaissance, their initial analysis and web apps, their bypass methodology. Everyone is working on skills that help the human by doing this stuff programmatically, but the first instantiation of building all these tools has been done through AI. It's really hard to attribute how much we are gaining if you include that, because it's never really referenced.
From our Arcanum point of view, we have gained immense efficiency gains through building tools using AI, but we also still count a lot on our human testers who have expertise, and they often find things that AI doesn't. So I'm not completely AI-pilled on this topic.
/ The Cost of Running AI Agents for Offensive Security
This was the most interesting undercurrent at RSA this year. I ended up having this conversation across multiple dinners with friends at frontier AI labs, and I tweeted about it live. The thread blew up a little bit and the replies were great, so I want to address some of them here.
One of the things we heard directly from our frontier AI lab friends was that the premium subscriptions (like the Max subscriptions or the Pro subscriptions) for all of the frontier labs are NOT designed for power users like us who could be running an AI agent framework 24/7. Those plans are built for developers coding maybe 4 to 8 hours a day, a few days a week. Power users like us should be using the APIs, which is a significant cost increase.
The $72K Napkin Math
The napkin math is uncomfortable for a lot of people. Heavy API usage for a single user can run $100 to $200 per day, which translates to roughly $6,000 per month, or $72,000 per year. Compare that to a max subscription stack of $7,200 to $10,000 per year and you can see why people are gravitating toward subscriptions. The problem is that the subscription path sits in a grey area when you're running 24/7 agentic workloads.
Sherrod DeGrippo asked the right question in the thread: "How much does a jr Eng or jr pen tester cost? More than $72k but at a certain point, are humans cheaper?" And the answer is that even at $72,000 per year, you're looking at a number that's still LESS than most junior pentester salaries. So for a consultancy running a single 24/7 agentic framework for their clients, the cost is actually defensible on paper. Matt Brown made the same point about ROI: if you used to do X pentests per Y amount of time and now you can do 2X pentests in that same time, the ROI makes sense.
But the catch is that one framework means your whole team shares it. They're taking turns on that box or remote agenting framework. If you truly want to scale your entire testing team, or maybe you're turning your expert testers into super expert testers by giving everyone their own instance, you might need to double or triple that figure. At that level you're in the $144,000 to $216,000 per year range, and that's where it starts getting into real feasibility questions for smaller shops.
Token Cost Optimization
Adam Chester asked two really interesting questions in the thread: 1) How much of that token cost has led to products, services, or improvements to recoup the expense? and 2) What lessons are being learned to reduce token count and be overall more efficient with tokens to reduce cost?
For the first part, and this is very much me just having a conversation with the readers, it's hard to know what has led to actionable improvements because we are so early in most consultancies adopting AI right now. I think that one API subscription at that cost point could absolutely take a team of 5 and make them operate like a team of 10 based on my personal experience. Whether that means they provide better testing or they can fit more testing in a certain time window is still really up in the air. I tend to think that it is providing better testing (more complete, more accurate) but we still spend a lot of time in reporting and operationalizing things over and over again to make them work right. So I don't know if we're exactly saving a bunch of time; we're mostly just providing better testing.
As for reducing token count, token efficiency is its own engineering discipline at this point. A big lever is how you prompt engineer skills and supply documentation to your AI agents, whether via RAG or context injection. Bigger context windows have unlocked an interesting pattern: pre-loading substantial context at the start of a long-running project. That sounds like more tokens, but it actually prevents a bunch of investigative tool calls downstream because the agent already has the answer. Front-loaded cost saves back-loaded cost.
Beyond that, prompt caching and tuning agent verbosity are areas where there's a lot to learn. Claude Code's source code leaked via a misconfiguration this week, and there was a bunch of interesting stuff in there around verbosity engineering at the implementation level.
The Frontier Model Cost Debate
Some of the replies in the thread (including from Val Smith and Brandon Veiseh) were comments around not being reliant on a frontier model because of cost and building your own models. Val mentioned buying their own hardware and running local models, and Brandon talked about post-training their own LLM to bring frontier red teaming capabilities to a cheaper price point. It has just been my observation that even post-tuning or pre-training your own models off of open-source models with security data, at least at this point in time, is still 10 to 20% behind what a frontier SaaS model like Opus or GPT 5.4 can do right now. That 10 to 20% efficacy gap, you can really feel it, especially in offensive security testing. I agree with the idea of building your own, but it's just not there yet. I hope we'll get there in the future.
Joseph Thacker had an interesting take: he's banking on the frontier labs distilling down to smaller, cheaper-to-run models that are benchmarked just as good. This is a hope of mine as well. I don't believe in the "small language models will save us" rhetoric, but I believe that there are cheaper-to-run options, and we could get there in the future at some point.
One thing I'll say here is that, while we at Arcanum have an agent framework that can route different things to different models in order to save on costs, we do concede that the frontier model vendors, if plugged in to manage the whole thing, do a better job right now with the current models. We were very open about the fact this week when we were talking to people that we have seen some limited success with MiniMax's M2.5 in security testing and our own internal evals and benchmarks. So there's that glimmer of hope. But if we're just being honest with everyone right now, the state-of-the-art models are very far ahead. If you were to replace every single agent in that workflow and didn't have to worry about tokens, and replaced it with Opus or GPT 5.4, it's going to be better output in our experience.
Alex Rad asked about using self-hosted, open-source models but on a platform that does better inference with better hardware, such as Cerebras or Groq. We've taught modules in our classes to talk about these technologies and their iterations in graphics cards to run AI inference. The hardware helps with latency and throughput but doesn't close the capability gap with frontier models. You're still looking at that 10 to 20% performance delta, just served faster.
Justin Elze also made a good point about how lots of people were taking the "make agents do everything" approach because they just assumed costs were around $200 per day, when in many cases you could just have an LLM write the code instead of making agents do everything each time. Worth thinking about. We do this a lot, we prompt agents to build tools rather than rely on the model for analysis or fetching.. but sometimes this is a double edged sword, as you loose some of the “intuition” that a model comes up with in the process of using model intelligence.
/ Bug Finding Benchmarks: What You Should Actually Be Looking At
You'll see a bunch of benchmarks for AI finding bugs, and you'll see a lot of news hype around it too. When you're reading these articles or posts, the first question you should ask is: what type of testing was it doing?
Here's how I'd break down the maturity curve from most to least mature:
Source-Assisted Vulnerability Discovery
Most hype, least novel. A lot of this is source-assisted vulnerability discovery that is getting a lot of hype right now, and we really haven't had a problem finding bugs with source code analyzers a lot of the times. We have a bunch of industry standard tools that already do this. Many people's AI harnesses or frameworks are using industry standard tools just in a scaled way, having AI run them, interpret the outputs, and build the exploits. It's genuinely better than the old way, but it's an evolution, not a revolution. So if you're looking at a news article talking about something that found hundreds of bugs, think about what type of testing it was doing.
AI-Assisted Fuzzing
The next evolution has been AI-assisted fuzzing, which is also pre-built tooling a lot of the times, being handled now by agents to run autonomously and at better scale than we've ever had before. This in my opinion is pretty impressive, and AI is getting really good at it. The ability to configure, run, and analyze fuzzing campaigns at a pace and scale that wasn't practical before is a real capability unlock.
Black Box Web Application Testing
Then you have AI doing black box testing in the web application security or AppSec space. This is where a few people are really good at it, but mostly we see a few large vendors being good in this space and then a whole bunch of really mediocre or not very good vendors. You also see the bug hunting community from Bug Bounty really succeeding in this space by building agent tweaks, comprehensive skills, and just pulling from their private methodologies that have never been in any training data set.
This goes to a point I talked about a lot during the week: there's still a lot that is unknown for the AI models when it comes to hacking, especially in black box testing. When you combine a Bug Bounty researcher who has a custom methodology and you prompt a frontier model, you can get really, really good findings. But if you ask the model just by itself without that extra prompt engineering to just "hack this website," you don't get great output a lot of the time, or you get very generic output. The private methodology is the unlock.
Internal Network Testing
The last category is internal testing, and there are even less people doing work here. There's a few big vendors doing AI agents that do internals work (a lot of Active Directory hacking and pivoting like threat actors on the internal network). There are some vendors doing very much automated or agentic purple teaming and adversarial emulation and simulation, and those are all really cool. But as far as agentic autonomous offense agents that do internals, there's only a few, and this is a place where there's a lot of room for development and iteration.
There are a lot of open source projects that will say they can do this kind of stuff, but when you look into them, they are just an MCP into something like Kali Linux or a whole bunch of hacker tools. There's no defined methodology built into them; it's just access to a whole bunch of tools and a frontier model to run those tool calls, and that also produces generic output. This is the frontier worth watching.
Original Research: The New-New vs. the Known-New
One more thing worth saying here. If you look at things like the PortSwigger yearly web top 10, we are still coming out with new primitives for web vulnerabilities every year. Original research is alive and well. But 99% of autonomous or even human-in-the-loop guided AI offensive agents are not finding new research vulnerabilities. They simply aren't that good yet.
The models, even with chain-of-thought and reasoning, are not good at finding NEW-new things. They are very good at finding known-new things. Even if methodologies exist to find the new-new, those methodologies are so few and far between in the vast amount of training data that it's hard to surface them when you ask models to do that type of work. This is a really important distinction that I think gets lost in the hype. AI is incredible at scaling what we already know how to find. It is not doing original vulnerability research. Not yet.
/ Orchestrating AI Agents: Still Very Much Human in the Loop
One of the most persistent misconceptions I run into is the assumption that agentic pen testing is 100% autonomous. People read the benchmarks, see the "AI found X hundred bugs" headlines, and imagine a robot silently hacking while engineers sleep.
That's not what's happening. Not even close.
Even in our own usage of AI in offensive security, it is heavily human in the loop, even with really, really great platform engineering, AI engineering, and skills engineering from some of the best people on our team. Roughly 98% of what's being done across the industry is still heavily human in the loop.
Shoutout to Arian J. Evans who led a bunch of discussion about evaluating orchestrators and agent frameworks, specifically around topics of source of truth, high-level goals, and other features inside of these things.
What "Human in the Loop" Actually Means
Sometimes "human in the loop" specifically means things like:
Validating false positives. Digging deeper on a vulnerability an agent thinks it's found and discovering it's not real. A false positive in a pen test report is a trust-destroying event.
Adjusting risk ratings with human context. An AI doesn't know that a "low severity" finding in this specific client's environment is actually critical because of their particular compliance posture or threat model.
Noticing what the AI missed. Contextually spotting features, applications, routes, or paths in a test that AI completely did not programmatically test, and finding vulnerabilities there by prompting it to specifically look there.
Nudging an AI agent to do bypasses. Sometimes it takes a human's experience to push the agent in the right direction.
Engineering against early quitting and shortcuts. Models will sometimes take the path of least resistance and declare a test "done" when a human tester knows there are areas they haven't fully explored.
Red-teaming one AI's report with a second AI. This is a genuinely useful pattern: using a fresh model to critique and find gaps in the first model's output.
Dealing with hallucinations. Less frequent than a year ago, but verbatim hallucination still surfaces every once in a while in frontier models.
Business logic abuse that requires threat modeling. AI agents cannot independently discover fraud patterns like coupon stacking, refund arbitrage, or approval flow skips. These require a human who understands the intended system behavior and can model how it can be abused.
Chaining vulnerabilities across system boundaries. A classic example: SSRF to cloud metadata to temporary credentials to privilege escalation via mis-scoped IAM. That requires cross-system reasoning that current agents don't do well on their own. A human sees the chain; the agent sees individual findings.
Scope decisions and environmental awareness mid-test. Is this target in scope? Is this a production database I'm about to run destructive payloads against? Should I back off because I'm about to get IP-banned at 2 AM? Agents don't have the contextual judgment to know when to stop or redirect, and the consequences of getting it wrong can be severe. EVEN EXPLICITLY PROMPTED or GUARDRAILED AIs have gone out of scope in bounty hunts of ours. Just FYI.
The human is the conductor. The AI is a very fast, very capable section of the orchestra. That framing is more accurate than "autonomous AI hacker."
/ Model Refusals and Cybersecurity Evals
The other thing that came up a lot is refusals and evals for models. We do a set of refusal checking and a whole bunch of cybersecurity-based evals on every model that we want to test. It's mostly automated scripts, with an AI-as-judge for output evaluation. Because of the non-determinism, we run several benchmarks on the same query. Most of this is around using tools that are very common inside of the offensive security realm and then also trying to achieve different scenarios on different labs that we've set up.
What we noticed is that while not crazily pervasive yet, both of the big frontier model companies are applying more refusals to cybersecurity testing than ever before from our history of doing this, starting about six months ago. We saw general refusals for a lot of blue team, purple team, and red team work dealing with credentials and the model refusing to do anything around that, including using any tools that are common credential capture tooling. But now we're seeing refusals creep into general web exploitation work, binary exploitation work, reverse engineering, and network pen testing.
It doesn't seem to have any rhyme or reason because of the non-determinism. Often you can re-ask the exact same query in a new session and get zero refusal, or rewrite the ask slightly and it goes through. So while there ARE refusal gates being applied to cybersecurity, they're not as consistently enforced as some people suggest. But there are meaningfully more than there were 8 months ago. Worth keeping your own evals running so you can track the drift over time.
/ Outro
That's it for this one. There was way too much going on at RSA to cover in a single issue, but I wanted to get the big themes down while the conversations were still fresh. The AI and offensive security discussion is moving so fast right now that even the takes I had on Monday evolved by Friday based on new conversations and new things we saw.
If you're building agents for security work, if you're a consultancy trying to figure out how to adopt AI, or if you're a bug bounty hunter wondering if this stuff is going to replace you (it's not), I hope this gave you a more honest picture of where we actually are. Not where the hype says we are. Not where the vendor slides say we are. Where we actually are.
Some FYI’s: our The Bug Hunters Methodology Live Course is April 8-10, and we will be at OWASP Snowfroc in Denver April 16-18. Both worth your time if you can make it. Thanks for reading, and as always, feel free to reply or hit me up on Twitter if something in here sparked a thought. This community is the whole point.
Happy hacking 😎
-Jason
