Killer Agent Framework features

the 16 features you want most in agents...



I've been giving a talk lately on what actually separates useful AI agents from glorified autocomplete. Not which model is smarter; that's a distraction. The real differentiator is the feature set.

This week felt like a good time to write it up.

In the last month, Andrej Karpathy dropped autoresearch: an autonomous ML research agent that designs experiments, runs them overnight, evaluates results, and commits the wins. 333 experiments, 35 parallel agents, no humans in the loop. 37k+ GitHub stars in about a week. Impactful.

Meanwhile, the open-source ecosystem keeps building. Pedram Amini's Maestro is a multi-agent orchestration platform that coordinates different AI coding agents in parallel, with an Auto Run mode that's logged nearly 24 hours of continuous unattended operation. Daniel Miessler's PAI is the most thought-through framework for personal AI infrastructure I've seen: currently at v4.0.3 with deep memory architecture, a full hook system, and a structured approach to agent reasoning.

What makes all of these compelling isn't the models powering them. It's the features they're built around. Here are the 16 that matter.

/ 01

Command Line Access

Think about any task you want to accomplish in the world: scheduling a meeting, running a security scan, parsing a dataset, deploying code, sending an alert. At the base level, almost all of it can be done on the command line. That's the insight. The terminal isn't just a tool; it's the universal substrate that everything else is built on. An agent with real shell access inherits the entire history of human tooling: every language, every service, every utility ever written. An agent without it is limited to whatever its developers thought to include.
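At its simplest, shell access is a single function. A minimal sketch in Python, assuming a framework where tools return structured results the model can then interpret:

```python
import subprocess

def run_shell(command: str, timeout: int = 30) -> dict:
    """Execute a shell command and return a structured result the agent can reason over."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout.strip(),
        "stderr": proc.stderr.strip(),
    }

# One uniform interface to the entire history of human tooling:
result = run_shell("echo hello")
```

Real frameworks add sandboxing and validation around this (see feature 16), but the core is exactly this small.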

/ 02

Spawning Sub-Agents

Sub-agent spawning is what turns a capable assistant into a capable system. Instead of one agent doing five things serially, you spin up five agents doing them in parallel, with isolation (a failure in one doesn't crash the others) and specialization (different agents, different tools, different models for different tasks). But parallelism alone isn't the full story. The sub-agents themselves need to come equipped with genuinely useful tools. Web search is table stakes: not "I'll try to recall from training" search, but real, live retrieval of current information. A sub-agent doing research without quality web access is operating blind. The best frameworks treat this as a first-class requirement.
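The parallelism-plus-isolation pattern is easy to sketch with Python's standard library; here `worker` stands in for whatever actually drives a sub-agent:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def spawn_subagents(tasks, worker, max_workers=5):
    """Run each task in an isolated worker; one failure doesn't crash the rest."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, task): task for task in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                results[task] = fut.result()
            except Exception as exc:  # isolation: record the failure, keep going
                results[task] = f"failed: {exc}"
    return results
```

The design choice that matters is the `try`/`except` inside the loop: a crashed sub-agent becomes a recorded failure, not a crashed system.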

/ 03

Agents via Prompting (Skills)

Claude Code pioneered what you could call the no-code agent-building era. The insight: you don't need to write a plugin or a package to give an agent a new capability. Write a markdown file with clear instructions, defined behaviors, and structured context, and the agent reads it. That shift was genuinely significant. It meant non-developers could extend agent capabilities. Iteration became editing a text file instead of recompiling a plugin. The skill itself became the documentation. That pattern has now spread across the ecosystem. Multiple frameworks have adopted variations of this model, and the breadth of skills available (from security research to content creation to data analysis) continues to grow. The no-code agent-building model is now standard.
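Under the hood, the mechanism is almost embarrassingly simple: gather markdown files from a skills directory and load them into context. A sketch, assuming a flat directory of `.md` skill files:

```python
from pathlib import Path

def load_skills(skills_dir: str) -> str:
    """Concatenate every markdown skill file into a block the agent reads at startup."""
    sections = []
    for path in sorted(Path(skills_dir).glob("*.md")):
        # Each file becomes a named capability; the filename doubles as the skill name.
        sections.append(f"## Skill: {path.stem}\n{path.read_text()}")
    return "\n\n".join(sections)
```

That's why iteration is so fast: editing a skill is editing a text file, and the next session picks it up automatically.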

/ 04

Cross-Model Portability

The real value of a multi-model architecture isn't just flexibility: it's economics. You don't need your most powerful and expensive model doing every task. Use your flagship model for complex reasoning and decision-making. Route web searches, summarization, and routine tool calls to a faster, cheaper model. The result: smarter overall output at a fraction of the cost of running everything through your top model. It's worth being precise about what "cross-model" actually means here, though. Choosing an AI provider at the start of a project is provider switching. True cross-model portability means individual sub-agents within a single workflow can use different models, each task matched to the right level of capability and cost. That distinction separates flexibility from real optimization.
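A routing policy can be as small as a lookup table. A sketch with hypothetical model names (the point is the policy, not the providers):

```python
# Hypothetical model names; substitute whatever your framework exposes.
MODEL_ROUTES = {
    "reasoning": "flagship-model",     # complex decisions: pay for capability
    "web_search": "fast-cheap-model",  # routine retrieval: optimize for cost
    "summarize": "fast-cheap-model",
    "tool_call": "fast-cheap-model",
}

def pick_model(task_type: str) -> str:
    """Match each sub-agent's task to the cheapest model that can handle it."""
    return MODEL_ROUTES.get(task_type, "flagship-model")  # unknown work gets the capable default
```

Defaulting unknown task types to the flagship model is the conservative choice: you overpay occasionally rather than under-deliver silently.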


(Sponsor)

Your SOC is a queueing system. It behaves like one, too.

Most SOC improvement work focuses on what happens after an investigation starts. Faster playbooks, better context, tighter workflows. All useful.

But for a lot of teams, the bigger problem is what happens before anyone even looks at the alert. Alerts come in. Analysts triage and escalate. When the arrival rate exceeds capacity, queues build and wait time spikes.

Prophet Security's new e-book, "The Queue is the Breach," walks through the operational math behind this: alert cycle time, wait time across severity levels, analyst utilization, and what those metrics actually reveal about whether your bottleneck is people, process, or the operating model itself.

Check it out HERE



/ 05

Hooks & Lifecycle Injection

Hooks are invisible middleware that fires at specific points in the agent lifecycle: session start, before and after tool calls, when prompts are submitted, when sub-agents spawn and complete. At each point, you can inject context, run validation, transform data, or trigger side effects that the model never needs to be explicitly told about. The practical effect is dramatic: hooks let you turn a general-purpose model into a context-aware system that automatically loads relevant memory, enforces security policies, and extracts learnings at the end of every session, all without the user doing anything differently. The model's weights haven't changed. But its effective intelligence in your specific context goes up significantly. The best implementations (including pre-built hook libraries designed for exactly this kind of context enrichment) show what's possible when lifecycle injection is treated as a first-class feature rather than an afterthought.
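The core of a hook system is a registry of callbacks keyed by lifecycle event. A minimal sketch; the event names here mirror the lifecycle points above but are illustrative:

```python
from collections import defaultdict

class HookRegistry:
    """Fire registered callbacks at named lifecycle points (session_start, pre_tool, ...)."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, event):
        """Decorator: register a function to run when `event` fires."""
        def register(fn):
            self._hooks[event].append(fn)
            return fn
        return register

    def fire(self, event, context):
        """Run every hook for the event; each may enrich or transform the context."""
        for fn in self._hooks[event]:
            context = fn(context)
        return context

hooks = HookRegistry()

@hooks.on("session_start")
def load_memory(ctx):
    ctx["memory"] = "relevant notes loaded here"  # e.g. read from a memory store
    return ctx
```

The user never calls `load_memory`; the framework fires `session_start` and the context arrives pre-enriched. That's the "invisible middleware" effect.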

/ 06

MCP (Model Context Protocol)

At the base level, your agent framework should support MCP. Think of MCP as a complement to skills: where a skill teaches your agent how to do something through natural language instructions, an MCP server exposes a structured interface to an external service that any compatible agent can call. Write the server once (for GitHub, Slack, a database, a custom API) and every MCP-supporting agent gets access. That's the interoperability win. But there's a more forward-thinking idea emerging: some frameworks are beginning to expose the entire agent framework itself as an MCP server. That means another AI can invoke your agent as a tool. Your agent becomes a capability inside a larger AI orchestration. Agents calling agents through a protocol. It's inception, and it's where multi-agent architecture is headed.
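To make "your agent as a tool" concrete, here's the shape of the idea: a JSON-RPC-style handler that lists tools and dispatches calls. This is an illustrative sketch of the pattern, not the official MCP SDK, and `scan_target` is a made-up tool:

```python
import json

# Hypothetical tool registry: one entry per capability your agent exposes.
TOOLS = {
    "scan_target": lambda args: f"scanned {args['host']}",
}

def handle_request(raw: str) -> str:
    """Dispatch a JSON-RPC-style request: list available tools, or call one by name."""
    req = json.loads(raw)
    if req["method"] == "tools/list":
        result = {"tools": list(TOOLS)}
    elif req["method"] == "tools/call":
        params = req["params"]
        result = {"content": TOOLS[params["name"]](params["arguments"])}
    else:
        result = {"error": "unknown method"}
    return json.dumps({"id": req["id"], "result": result})
```

Once your agent answers requests in a shared shape like this, any other MCP-speaking agent can discover and invoke it: agents calling agents through a protocol.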

/ 07

Memory & Persistence

The biggest practical problem agents face is running out of context, and most frameworks don't have a real answer for it. Even with the large context windows we're seeing now, real project work fills them up. A complex security analysis, a large codebase refactor, an extended research engagement: these push past even 200k-token windows faster than you'd expect. When the agent framework compacts, you lose fidelity. Work gets summarized. Decisions get forgotten. The agent starts the next session with no memory of what it learned. Without a dedicated memory system, every session is effectively starting from scratch. A real memory architecture (one that persists decisions, extracts learnings, and survives context resets) isn't a nice-to-have. On any project that matters, it's mandatory.
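The simplest version of "survives context resets" is a key-value store on disk. A minimal sketch; the filename is a hypothetical choice, and real systems layer search and summarization on top:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical location for the memory store

def save_learning(key: str, value: str) -> None:
    """Persist a decision or learning so it outlives the current context window."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[key] = value
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def recall() -> dict:
    """Load everything the agent learned in previous sessions."""
    return json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
```

Pair this with a session-end hook (feature 05) that extracts learnings automatically, and the next session starts where the last one left off instead of from scratch.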

/ 08

Cron & Autonomous Scheduling

This is the most slept-on feature in agent infrastructure, and the one that changes the most about how you actually work with AI. The established mental model everyone has been optimizing around is pull: you go to the AI, you ask something, you get a response. Cron scheduling flips that. Your agent reaches out to you. It does things on your behalf while you're asleep, in meetings, or focused on something else: sends you a morning briefing before you're awake, runs a scan on a new target while you're on a call, monitors your inbox and surfaces only what's urgent, updates your memory at the end of the day. All without being asked. Don't fall into the trap of only thinking about the established AI use cases everyone else is building. Cron is the feature that shifts AI from a tool you use to a co-worker that works alongside you.
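The scheduling logic itself is trivial, which is part of why it's so slept-on. A sketch with a hypothetical daily schedule; a supervising loop would call `due_jobs` once a minute and dispatch whatever comes back:

```python
import datetime

# Hypothetical schedule: job name -> (hour, minute) to fire daily.
SCHEDULE = {
    "morning_briefing": (6, 30),
    "inbox_triage": (12, 0),
    "memory_update": (22, 0),
}

def due_jobs(now: datetime.datetime) -> list:
    """Return the jobs that should fire at this minute."""
    return [name for name, (hour, minute) in SCHEDULE.items()
            if now.hour == hour and now.minute == minute]
```

The hard part isn't the cron logic; it's giving the fired jobs real capabilities (shell, sub-agents, memory) and a channel to deliver results (feature 11).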

/ 09

Autorun

Closely related to cron, but distinct: autorun is the ability to execute multi-step tasks end-to-end without human checkpoints. The agent takes a goal, plans the steps, executes them, and delivers the result, all without asking for approval at each stage. The key variable is trust level. Most serious frameworks give you a spectrum: require approval for everything, auto-approve safe operations while flagging sensitive ones, or run fully unsupervised. Knowing where your agent sits on that spectrum, and configuring it intentionally, is the difference between a system that's useful and one that's dangerous. The best implementations let you tune this per-tool or per-context rather than as a single global setting.
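That spectrum maps cleanly to a per-tool policy check. A sketch with hypothetical tool names and trust levels; the defaults are the important design decision:

```python
# Hypothetical per-tool policy: "auto" runs unattended, "ask" pauses for a human.
POLICY = {
    "read_file": "auto",
    "web_search": "auto",
    "write_file": "ask",
    "send_email": "ask",
}

def needs_approval(tool: str, trust_level: str) -> bool:
    """Decide whether a tool call pauses for a human, given the configured trust level."""
    if trust_level == "supervised":
        return True           # require approval for everything
    if trust_level == "unsupervised":
        return False          # run fully autonomous
    # "balanced": auto-approve safe tools; unknown tools default to asking
    return POLICY.get(tool, "ask") == "ask"
```

Note the fail-closed default: a tool with no policy entry requires approval. That's what makes "balanced" safe to leave running.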

/ 10

Iterative Research Loops

One-shot answers are rarely the best answers. Iterative research loops are what happens when you replace a single query with a structured process: broad initial coverage, gap identification, targeted deep-dives, then integration into a comprehensive output. Each loop builds on the last. A fresh evaluator at each iteration catches what the previous pass missed. The pattern is what Karpathy's autoresearch demonstrates at the ML experiment level: 333 experiments don't come from one query. They come from a loop that runs, evaluates, and runs again. The same principle applies to security research, competitive intelligence, and any domain where the first answer is rarely the complete answer.
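The loop structure is the whole trick: search, evaluate for gaps, search again. A sketch where `search` and `find_gaps` are stand-ins for a web-searching sub-agent and a fresh evaluator:

```python
def research_loop(query, search, find_gaps, max_iterations=3):
    """Broad first pass, then targeted deep-dives on whatever the evaluator flags as missing."""
    findings = [search(query)]          # broad initial coverage
    for _ in range(max_iterations):
        gaps = find_gaps(findings)      # fresh evaluation of what's still missing
        if not gaps:
            break                       # nothing missing: integrate and stop
        findings.extend(search(gap) for gap in gaps)
    return findings
```

The `max_iterations` cap is what keeps an autonomous loop from running forever; Karpathy-scale runs just set it very high and let evaluation do the stopping.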

/ 11

Multi-Channel Delivery

An agent that only lives in a terminal isn't fully autonomous: it's a CLI wrapper you have to babysit. Multi-channel delivery means the agent reaches you where you are: Telegram, Discord, Slack, iMessage, WhatsApp, Signal. Same agent, same memory, same capabilities: delivered to whatever surface you're actually on. This becomes critical when paired with cron and autorun: an agent that does work while you sleep needs a way to surface results when you wake up. Without channel delivery, autonomous operation is only half the picture.

/ 12

Context Engineering

The same model weights produce completely different results depending on what's in context. Context engineering is how you program agent behavior without writing code: defining identity, operating procedures, environmental knowledge, and curated memory in human-readable files that the agent loads at the right moments. As Karpathy put it: "The hottest new programming language is English." A well-engineered context stack means the agent knows your preferences, understands your workflow, and applies the right frame to every interaction: without being explicitly told each session. This is the meta-skill that makes everything else work better.
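In practice a context stack is an ordered list of files loaded most-general-first. A sketch assuming a hypothetical layout of four layers:

```python
from pathlib import Path

# Hypothetical layer names: identity first, curated memory last.
CONTEXT_STACK = ["identity.md", "procedures.md", "environment.md", "memory.md"]

def build_context(root: str) -> str:
    """Assemble the context stack in order; missing layers are skipped, not fatal."""
    parts = []
    for name in CONTEXT_STACK:
        path = Path(root) / name
        if path.exists():
            parts.append(path.read_text())
    return "\n\n".join(parts)
```

Ordering matters: later layers (memory, environment) can refine or override earlier ones (identity), which is how the agent applies the right frame without being told each session.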

/ 13

Unified IDE Experience

For development-focused workflows, the IDE is where everything converges: code, terminal, agent, file browser, and preview in a single view. Visual diffs across multiple files, codebase-wide indexing with automatic context retrieval, dual-mode operation between tab completion and full agent mode. The tradeoff is real: IDE-integrated agents often sacrifice raw agent power (sub-agents, cron, hooks, channels) for developer experience polish. Both have their place depending on your primary use case. But for teams where coding is the primary workflow, the unified experience changes how fluidly you can move between writing and letting the agent write.

/ 14

Code-First Problem Solving

LLMs are non-deterministic by nature. Ask "how many files are in this directory?" and the model guesses. Have it run a command and you get the exact answer every time. The principle: when you can solve something with code or a tool call, do that instead of asking the model to reason about it. The result is a measurable improvement in accuracy and reliability, not from a smarter model, but from using the model more precisely. Code-first problem solving means the agent reaches for the terminal first, asks the model to interpret the result, and reserves pure LLM reasoning for tasks that genuinely require it.
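The file-counting example from above, done the code-first way: a deterministic function instead of a probabilistic guess.

```python
import os

def count_files(directory: str) -> int:
    """Deterministic answer to 'how many files are in this directory?': run code, don't guess."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())
```

The agent runs this, gets an exact number, and spends its reasoning budget on what the number means rather than on estimating it.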

/ 15

Program of Thought

The deepest feature on this list. Rather than just telling the agent what to do, Program of Thought teaches it how to think: a general problem-solving framework embedded in the agent's context. Define the current state. Define the ideal state with testable, binary success criteria. Hill-climb toward it. Verify each criterion. Iterate. Paired with structured thinking tools (adversarial self-review before significant actions, multi-perspective analysis, attacking your own proposals before executing them), this approach changes the quality of decisions the agent makes, not just the speed. It's the difference between an agent that does tasks and one that solves problems.
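The state/criteria/hill-climb loop can be sketched directly; `criteria` is a dict of named binary checks, and `improve` is whatever the agent does to close the gap:

```python
def hill_climb(state, criteria, improve, max_steps=10):
    """Iterate toward the ideal state, verifying each testable, binary criterion."""
    for _ in range(max_steps):
        failing = [name for name, check in criteria.items() if not check(state)]
        if not failing:
            return state, True           # every success criterion verified
        state = improve(state, failing)  # targeted fix for what's still failing
    return state, False                  # ran out of budget; report honestly
```

The discipline is in the criteria: if a success condition isn't binary and testable, the loop can't verify it, and "done" becomes a vibe instead of a fact.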

/ 16

Sandboxing & Security

Giving an agent real capabilities means the blast radius of a mistake (or a compromise) is real. Sandboxing is how you bound that risk. Defense in depth: control what tools the agent can call, validate commands before they execute, monitor output for sensitive data, tag content from external sources as untrusted, and escalate actions above a certain risk threshold to human approval. Network access off by default. Audit logs on everything. The goal isn't to neuter the agent: it's to make autonomous operation trustworthy. An agent you can always constrain and audit is one you can actually give real capabilities to.
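"Validate commands before they execute" is the layer that pairs naturally with the shell tool from feature 01. A sketch with a hypothetical allowlist and blocklist; a real policy would be far richer, but the fail-closed shape is the point:

```python
import shlex

# Hypothetical policy: only these binaries may run; everything else is denied.
ALLOWED_BINARIES = {"ls", "cat", "grep", "nmap"}
BLOCKED_PATTERNS = ("rm ", "curl ", "| sh", "&&")

def validate_command(command: str) -> bool:
    """Check a command against policy before it ever reaches the shell."""
    if any(pattern in command for pattern in BLOCKED_PATTERNS):
        return False
    binary = shlex.split(command)[0]
    return binary in ALLOWED_BINARIES  # fail closed: unknown binaries are denied
```

Gate every `run_shell` call behind a check like this, log the verdicts, and escalate denials to a human instead of silently dropping them: that's constraint plus auditability.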

/ Putting It Together

None of these features are magic in isolation. The power is in composition. Command line access + sub-agents + cron + multi-channel delivery = a system that operates around the clock and keeps you informed. Hooks + memory + context engineering + program of thought = an agent that gets meaningfully smarter about your specific work over time. Code-first + iterative research loops + autorun = research that runs while you sleep and surfaces what actually matters.

Use this list as a scorecard. When you're evaluating a framework, or building on top of one, these are the features worth demanding.

Worth your time: