In the past two years, businesses have been trying to fit large language models (LLMs) into support, analytics, development, and internal automation like never before.
Along with the increasing adoption of AI technology, another trend is gaining momentum — cybercriminals are taking advantage of the disconnect between assumptions about LLMs and their actual characteristics.
In 2025 and 2026, several independent sources have highlighted the same trend: Prompt injection remains one of the most impactful and widely demonstrated attack vectors against LLM systems. The OWASP LLM Top 10 (2025) lists prompt injection as LLM01, identifying it as the most critical category of LLM‑specific vulnerabilities, for the second consecutive edition. OWASP’s ranking reflects the fact that LLMs still struggle to reliably separate instructions from data, making them susceptible to manipulation through crafted inputs.
CrowdStrike’s 2026 Global Threat Report — built on frontline intelligence across more than 280 tracked adversaries — documented that threat actors injected malicious prompts into legitimate generative AI tools at more than 90 organizations in 2025. They then used those injections to generate commands that stole credentials and cryptocurrency. The report stated it plainly: “Prompts are the new malware.” AI-enabled adversaries increased their overall attack volume by 89% year-over-year, with prompt injection working as both an entry point and a force multiplier.
Real‑world incidents illustrate the operational impact. In August 2024, researchers at PromptArmor disclosed a prompt injection vulnerability in Slack AI that allowed an attacker to exfiltrate data from private Slack channels they had no access to — including API keys shared in private developer channels — by placing a malicious instruction in a public channel or embedding it in an uploaded document.
In June 2025, researchers at Aim Security disclosed EchoLeak (CVE-2025-32711, CVSS 9.3), the first documented zero-click prompt injection exploit against a production AI system, targeting Microsoft 365 Copilot. By sending a single crafted email, no user interaction required, an attacker could cause Copilot to access internal files and transmit their contents to an attacker-controlled server.
Both vulnerabilities were patched. These incidents underscore the fact that prompt injection is not a theoretical weakness but a practical, repeatable threat organizations must address as they deploy AI systems at scale.
Prompt injection techniques have evolved substantially in recent years and now target multi-agent architectures, retrieval-augmented generation (RAG) pipelines, model routers, and long-term memory capabilities.
Businesses deploy LLMs to process instructions, summarize information, and trigger automated workflows, but it is difficult for LLMs to tell:
Instructions from data
Information from context
Context from metadata
User intent from metadata
This creates an opportunity for attackers to manipulate and influence the model’s behavior, either directly or indirectly.
Cross-model prompt injection
Enterprises commonly chain several LLMs together. Attackers corrupt the output of one model, knowing that other models will process that content, so the corruption propagates through the downstream AI systems.
RAG supply chain poisoning
Attackers publish malicious content, such as documentation, blog articles, and GitHub READMEs, then wait for it to be ingested into enterprises’ RAG pipelines, where it serves as an attack vector.
Agent hijacking
AI agents have evolved to the point where they can send emails, modify cloud infrastructure, execute code snippets, and interact with internal corporate systems. A single injected instruction can be enough to make an agent take harmful actions.
Context overflow attacks
With million-token context windows, attackers can bury malicious instructions deep inside a long document and hope that the LLM will encounter and follow them, overriding all previous instructions.
Memory poisoning
Because LLM systems now implement long-term memory, attackers can inject instructions that persist across sessions and durably alter the system’s behavior.
Model‑router manipulation
Enterprises increasingly use model routers to select between multiple LLMs. Attackers craft prompts that force routing to the weakest or least‑guarded model.
Prompt injection is not a theoretical problem. It directly affects:
Customer‑facing systems (chatbots, support agents)
Internal copilots (developer tools, security assistants)
Automation workflows (ticketing, cloud operations, HR processes)
Data governance (RAG pipelines, knowledge bases)
The risk is no longer limited to “the model said something it shouldn’t.”
In 2026, prompt injection can:
Trigger unauthorized actions
Leak sensitive data
Corrupt internal workflows
Manipulate analytics
Alter business logic
Compromise multi‑agent systems
The attack surface has expanded dramatically.
1. Constrain model permissions
Limit what the model can do, not just what it should do.
2. Segment untrusted content
Treat all external data — including RAG sources — as potentially hostile.
3. Monitor tool invocation
Require human approval for high‑impact actions.
4. Validate content provenance
Ensure RAG pipelines don’t ingest poisoned external content.
5. Harden model routers
Prevent attackers from forcing routing to weaker models.
6. Treat LLMs as untrusted components
This mindset shift is the foundation of modern AI security.
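As a rough illustration of points 1, 3, and 6 above, the sketch below passes every tool call a model proposes through an allowlist and a human-approval hook before anything runs. The tool names, the `HIGH_IMPACT` set, and the `approve` callback are hypothetical and not taken from any particular agent framework.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool names, for illustration only.
ALLOWED_TOOLS = {"search_kb", "create_ticket", "send_email"}
HIGH_IMPACT = {"send_email"}  # actions that need a human sign-off


@dataclass
class ToolCall:
    name: str
    args: dict


def authorize(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Decide what to do with a tool call proposed by the model.

    The model's output is treated as untrusted: anything outside the
    allowlist is denied, and high-impact tools run only if a human approves.
    """
    if call.name not in ALLOWED_TOOLS:
        return "deny"
    if call.name in HIGH_IMPACT and not approve(call):
        return "deny"
    return "allow"


if __name__ == "__main__":
    # An injected instruction asking for an unlisted tool is refused
    # regardless of what the approver says.
    print(authorize(ToolCall("delete_bucket", {}), approve=lambda c: True))
```

The point of the design is that the decision lives outside the model: a successful injection can change what the model asks for, but not what the surrounding system permits.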
Prompt injection remains the most effective way to compromise enterprise AI systems because it exploits the fundamental way LLMs interpret text. Until organizations treat LLMs as untrusted interpreters — not autonomous decision‑makers — prompt injection will continue to dominate the AI threat landscape.
Julie Brunias is an AI Security Architect.
Anthropic recently told its growth team to hire more product managers, not fewer. The reason, as reported in industry coverage, was that Claude Code had quietly turned its engineering org into a team that ships at roughly three times its actual headcount, and the bottleneck moved from the integrated development environment (IDE) to the people deciding what to build.
That detail is easy to miss in the noise of every AI productivity claim. It is also the structural shift the rest of the industry is now living through. The bottleneck in software is no longer typing. It is deciding what to type. And the engineers who treat that as someone else’s problem are about to plateau.
For most of the last decade, that decision sat with someone else. Software engineering was a craft you absorbed slowly, then practiced in a long, predictable sequence: Dive deep on the technology, write the code, ask Stack Overflow when stuck, escalate to a senior engineer when Stack Overflow failed, ship the ticket. The product manager owned the funnel. The engineer owned the build. Both sides treated this division as physics.
Then the funnel collapsed in five steps.
The Stack Overflow era (2014 to late 2022): The way engineers thought lived in one place. But new monthly questions on Stack Overflow are now down roughly 77% since November 2022, which was not coincidentally when ChatGPT launched. The drop is not a referendum on the site. It is a referendum on the workflow it represented.
The browser-tab era (late 2022 to 2024): The first ChatGPT generation sat outside the IDE. Engineers ran the same loop they had always run, just with a faster oracle: Write a prompt in a browser, paste the answer back into VS Code, repeat. The work was still single-threaded and engineer-driven. The leverage was real but local.
The IDE-native era (2024 to 2025): Cursor and Claude Code moved the model inside the editor and gave it access to the full repository. The senior-engineer escalation path largely dissolved. For years, the prevailing wisdom among veteran engineers was that Bash had the longest shelf life of any tool in the stack. By 2026, for a meaningful share of working developers, the first command typed in a fresh terminal is claude.
The spec-driven era (2025 to 2026): Larger context windows turned work that previously required tickets, design docs, and sprints into single-session work. Amazon’s Kiro IDE team reportedly compressed feature builds from two weeks to two days using the same spec-driven workflow they were shipping. An AWS engineering team described how an 18-month rearchitecture, originally scoped for 30 engineers, was completed by 6 people in 76 days. The bottleneck stopped being how long it takes to write the code. It started being how clearly the team can describe what correct looks like.
The routines era (2026): In April, Anthropic shipped Claude Code Routines: Scheduled, persistent agents that run on a cadence, on a webhook, or overnight while the laptop is closed. Cron came back. Hooks came back. The engineer’s job is now part orchestration: Spin up a swarm before bed, review a stack of pull requests in the morning. Third-party wrappers like OpenClaw, which was briefly suspended by Anthropic in April before partial reinstatement, made the same point from the open-source side.
Engineering output has roughly tripled. Product management capacity has not budged. The traditional 1:8 ratio of PMs to engineers, already strained, now plays out closer to an effective 1:20 because each engineer ships more per day. For instance, LinkedIn replaced its associate product manager track with a “Product Builder” program that trains generalists across product, design, and engineering. Anthropic is hiring more PMs, not fewer. The pattern is consistent across companies that have actually deployed agentic workflows in production: The system is producing built features faster than it is producing decisions about what should be built.
For engineers, this is the most important career signal of the decade, and the easiest one to miss while the productivity stories dominate the feed.
The instinct to declare fundamentals obsolete in the agent era gets the trend exactly wrong.
When a memory leak takes down production at 3 a.m., and the cause turns out to be a subtle ownership bug pushed 4 years ago, no agent currently in the wild closes that loop end-to-end. Operating systems, networks, concurrency, and query plans still decide who can resolve a real incident. They also decide who can spot the moments when an agent’s output looks correct on the surface and is quietly, expensively, wrong underneath. The agent that wrote 70% of the code in a modern repo cannot reliably tell anyone where its assumptions about thread safety, memory ownership, or transaction isolation diverged from the runtime. The engineer who can read the diff and catch that is the engineer the rest of the team needs in the room, and that engineer is built on fundamentals, not on prompting skill.
The corollary is that fundamentals are now a leverage skill, not a hygiene skill. In 2014, knowing how a TCP retransmit worked got a debug ticket closed faster. In 2026, the same knowledge keeps an entire agent-driven release pipeline from shipping a regression at scale. The blast radius of the engineer who knows what is happening underneath has gone up, not down.
Engineers in 2026 generate code at a rate that exceeds what any of them can read carefully. The team that ships fast and survives is the team whose engineers review AI-generated code with at least the same rigor they once reserved for writing it. The 2025 Stack Overflow developer survey put 84% of developers on AI tools, with 46% saying they do not trust the output, up sharply from 31% the year before. That gap, heavy use paired with low trust, is exactly where review skills now matter most. Coders who push lots and review little are accumulating a debt that will come due during the first real incident, and the engineer who can pay it back is the one who paired their volume with deep first-principles knowledge of the systems involved.
Both of those are necessary. Neither is sufficient. The engineer who matters in 2026 is the one who has stopped waiting for the funnel to arrive in the form of a Jira ticket.
That means doing things the role was historically allowed to skip.
Talk to customers. Watch how they actually use the product. Read the support queue. Sit in on the sales call. The signal a product team gets through three layers of summary, an engineer can now get firsthand in an afternoon.
Generate ideas, not just estimates. The product manager who used to source ideas for 8 engineers cannot source ideas for 20 at the same fidelity. The engineer who shows up with a validated, scoped opportunity is no longer doing the PM’s job. The engineer is doing the job the new ratio requires.
Work backwards from the customer. Amazon has been writing the press release first for two decades. The discipline travels well to teams of one and to swarms of agents. Both produce a great deal of working software in the wrong direction without a clear statement of what “customer wins” means before any code is written.
Stop hiding behind bandwidth. The honest answer to “Do you have capacity for this idea?” used to be “No.” With routines, hooks, and a cooperative agent stack, the honest answer is closer to “What is the idea worth?” That is a different conversation, and a much harder one to have without a real point of view on the customer.
The five-phase history above is not really a history of tools. It is a history of which part of the job a human had to do. The part that is still human, and that will remain human for the foreseeable future, has moved up the funnel: From typing, to reviewing, to deciding, to choosing the customer to serve and the problem to solve.
The 2026 version of a great engineer is not the one who writes the most code. It is the one who knows what to build, can prove it is worth building, and has the agent fleet plus the review discipline to ship it without the system collapsing under its own velocity.
Engineers who internalize this will spend the next decade doing the most interesting work software has ever produced. Engineers who wait for a ticket will spend it watching the ticket get written by the agent next to them.
Ishan Gupta is a software engineer at Amazon.
The history of distributed computing is one of protocol proliferation followed by consolidation.
Common Object Request Broker Architecture (CORBA), Distributed Component Object Model (DCOM), Java remote method invocation (RMI), and early simple object access protocol (SOAP) competed for the enterprise integration market in the late 1990s before representational state transfer (REST) quietly won by being simpler and HTTP-native.
Extensible Messaging and Presence Protocol (XMPP), Internet Relay Chat (IRC), and a dozen proprietary protocols fragmented real-time messaging before MQ Telemetry Transport (MQTT) and WebSockets carved out their respective niches. Every new computing paradigm generates a burst of competing standards, then slowly converges as implementations accumulate and interoperability becomes economically necessary.
The AI agent ecosystem is currently in the proliferation phase. Four significant protocols have been published in the past eighteen months: Model context protocol (MCP) from Anthropic in late 2024, agent communication protocol (ACP) from IBM Research in March 2025, Agent2Agent (A2A) from Google in April 2025, and agent network protocol (ANP) from an independent working group.
The W3C AI Agent Protocol Community Group has opened a standards track. The Internet Engineering Task Force (IETF) is receiving Internet-Drafts on agent transport. Conferences are running workshops on interoperability. Every week brings a new GitHub repository claiming to solve the agent communication problem.
Understanding where and how quickly this converges has real consequences for architecture decisions being made right now.
The proliferation looks more chaotic than it is, because most of these protocols address different layers of a stack rather than competing for the same slot. The confusion comes from marketing, which describes each as “the standard for AI agent communication” without specifying which aspect of communication.
MCP is a tool-calling interface. It defines how a model discovers what functions a server exposes, how to invoke them, and how to interpret the response. It is a typed remote procedure call (RPC) contract between a model client and a tool server, running over HTTP. The Linux Foundation confirmed more than 10,000 active public MCP servers and 164 million monthly Python SDK downloads by April 2026. MCP has already won the tool-calling layer. The standardization work is effectively done.
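To make the discover-then-invoke contract concrete, here is a minimal sketch of the two request shapes involved. MCP messages are JSON-RPC 2.0, with `tools/list` for discovery and `tools/call` for invocation. The `get_weather` tool and its arguments are hypothetical, and the sketch omits the initialization handshake and the transport, so it is an illustration of the message format rather than a working client.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request object."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


# Discovery: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: call one tool by name with structured arguments.
# "get_weather" and its arguments are hypothetical.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "get_weather", "arguments": {"city": "Boston"}},
)

print(json.dumps(call_tool, indent=2))
```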
A2A is a task coordination interface. Where MCP defines how an agent calls a tool, A2A defines how two agents delegate a task. It introduces Agent Cards (capability advertisements), task lifecycle states, and three interaction modes: Synchronous, streaming, and asynchronous. Google donated it to the Linux Foundation in June 2025, and enterprise AI teams have adopted it broadly because it fills a real gap that MCP leaves open.
ACP is a message envelope format. Lightweight, stateless, designed for agent-to-agent message exchange without A2A’s full coordination semantics. It is useful in systems where simple message passing suffices and A2A’s task lifecycle overhead is unnecessary.
ANP is a discovery and identity protocol. It uses Decentralized Identifiers (DIDs) for agent identity and JSON-LD graphs for capability descriptions, providing a foundation for decentralized agent marketplaces where no central registry is required.
The stack that is emerging: Capability discovery via ANP or simpler registries, task coordination via A2A, tool calls via MCP, and lightweight messaging via ACP for cases that do not require full task lifecycle management. These layers complement rather than compete.
Every protocol in this list runs over HTTP. This reflects where the protocols came from: Research teams, API providers, and enterprise software companies building systems where HTTP is an unquestioned assumption. HTTP is the protocol they know, the one their servers already speak, and the one that makes demos easy.
The production problem is that HTTP assumes a reachable server. Behind network address translation (NAT) — and 88% of networked devices sit behind NAT — there is no reachable server without a relay. For agent fleets that need to route tasks directly between peers across cloud boundaries, home networks, and edge deployments, this centralization forces every message through relay infrastructure. Relay infrastructure adds latency, cost, and a failure mode.
The application-layer protocols solve the semantics of what agents say to each other. They do not solve how agents find each other and establish direct connections. That is a session-layer problem, Layer 5 in the open systems interconnection (OSI) model, and none of MCP, A2A, ACP, or ANP addresses it.
The technologies for solving it exist. UDP hole-punching with session traversal utilities for NAT (STUN) provides NAT traversal for roughly 70% of network topologies. X25519 Diffie-Hellman and AES-256-GCM provide authenticated encryption at the tunnel level without a certificate authority. Quick UDP internet connections (QUIC) (RFC 9000) or custom sliding-window protocols over user datagram protocol (UDP) provide reliable delivery without TCP’s head-of-line blocking. These are the same primitives that WireGuard uses for VPN tunnels and that WebRTC uses for browser-to-browser media streams.
What differs in the agent context is capability-based routing. Agents need to find peers not by hostname but by what those peers can do. A research agent should be able to query “which peers have real-time foreign exchange data?” and receive a list of currently active specialist agents. This is closer to a service registry than to DNS, and it is a natural extension of ANP’s design philosophy applied to the transport layer.
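A toy version of capability-based lookup might look like the following. It is an in-memory sketch under simplifying assumptions: the capability strings, the peer ids, and the time-to-live liveness rule are all hypothetical, and a real system would distribute this state and authenticate the advertisements.

```python
import time
from collections import defaultdict


class CapabilityRegistry:
    """Toy in-memory registry mapping capabilities to recently seen peers."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        # capability -> {peer_id: timestamp of last advertisement}
        self._peers = defaultdict(dict)

    def advertise(self, peer_id, capabilities, now=None):
        """Record that a peer currently offers the given capabilities."""
        now = time.monotonic() if now is None else now
        for cap in capabilities:
            self._peers[cap][peer_id] = now

    def find(self, capability, now=None):
        """Return peers that advertised the capability within the TTL."""
        now = time.monotonic() if now is None else now
        peers = self._peers.get(capability, {})
        return sorted(p for p, seen in peers.items() if now - seen <= self.ttl)


if __name__ == "__main__":
    registry = CapabilityRegistry()
    registry.advertise("agent-a", ["fx.realtime"])
    registry.advertise("agent-b", ["fx.realtime", "news"])
    print(registry.find("fx.realtime"))
```

The query is by what a peer can do, not by where it lives, and stale advertisements simply age out of the results.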
A handful of projects are assembling these pieces. Pilot Protocol has the most complete published specification, with an IETF Internet-Draft covering addressing, tunnel establishment, and NAT traversal for agent networks. libp2p provides a battle-tested foundation with similar primitives. The IETF’s QUIC working group is developing NAT traversal extensions that will be relevant here.
The HTTP-based protocols (MCP, A2A) are already converging on stable versions. The next 12 months will bring production hardening (security improvements, stateless MCP servers for horizontal scaling, better A2A federation) rather than new fundamental designs. The tool-calling and task-coordination layers are largely solved.
The transport layer is 18 to 24 months behind. Expect a period of implementation diversity as teams experiment with different approaches to peer-to-peer (P2P) agent networking, followed by consolidation around a small number of implementations once empirical data on performance and reliability accumulates. The IETF and W3C standardization tracks will likely produce something in the 2027-2028 window, by which time one or two open-source implementations will have accrued enough production deployments to establish de facto standards ahead of the formal specification.
For engineering leaders making architecture decisions today, the practical implication is layered adoption. The application-layer protocols are stable enough to build on. MCP adoption now is low-risk. A2A adoption for multi-agent coordination is reasonable with the expectation that the protocol will evolve. The transport layer is where you either build something custom and plan to replace it, or you evaluate early implementations knowing the space is still moving.
The teams that will have the most leverage when the transport layer stabilizes are the ones that designed their agent systems with a clean separation between application semantics (MCP, A2A) and transport (whatever sits below). Clean separation is cheap to implement now and expensive to retrofit later, a lesson the microservices era taught anyone who tried to add observability or circuit breaking to systems that had none.
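One way to picture that separation is a narrow transport interface that the application layer depends on, with the concrete transport injected. The sketch below is illustrative only: `Transport`, `LoopbackTransport`, and `AgentClient` are hypothetical names, and the loopback implementation just echoes the payload so the example runs without a network.

```python
import json
from typing import Protocol


class Transport(Protocol):
    """Anything that can deliver bytes to a peer and return the reply."""

    def send(self, peer: str, payload: bytes) -> bytes: ...


class LoopbackTransport:
    """Stand-in transport that echoes the payload back, for demos and tests."""

    def send(self, peer: str, payload: bytes) -> bytes:
        return payload


class AgentClient:
    """Application-layer logic that knows nothing about how bytes move."""

    def __init__(self, transport: Transport):
        self.transport = transport

    def delegate(self, peer: str, task: dict) -> dict:
        raw = json.dumps(task).encode("utf-8")
        return json.loads(self.transport.send(peer, raw))


if __name__ == "__main__":
    client = AgentClient(LoopbackTransport())
    print(client.delegate("peer-1", {"task": "summarize"}))
```

Swapping an HTTP relay for a peer-to-peer tunnel later then means writing a new `Transport`, not touching `AgentClient`.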
Philip Stayetski is a co-founder of Vulture Labs.
Agentic AI is now a core part of the engineering process, driving massive execution leverage and helping us generate more code than ever before. Yet, a difficult question I’ve increasingly heard from business leaders is: if we’re shipping code faster than ever, why aren’t our products improving at the same rate?
The reason is that writing code was never the rate limiter. Defining the right requirements, integrating with complex systems, and maintaining software under real-world conditions have always been the hard part. And when agents flood an organization with lots of new code, the hard part only gets harder. Agents compress execution time. They do not compress ambiguity, accountability, or operational complexity.
As AI-generated code scales, human review is becoming a massive new bottleneck, and engineers are losing the context needed to catch agent mistakes. The companies that understand this will move forward deliberately and even create new roles because of AI. The ones that don’t will default to a simpler, far more destructive conclusion: Reduce headcount and increase AI spend.
Irreversible structural decisions demand caution, precisely because the technology is moving so fast. Enterprise engineering leaders need a deliberate playbook to navigate the chaos. Here’s how to start:
Protect the downside — secure the infrastructure and cap the financial bleeding.
Treat governance as a tier-one risk: The pressure to integrate AI is real, but giving teams the freedom to experiment without a centralized structure creates fragmented processes, duplicated work, and runaway costs. Organizations will need to establish shared standards while still allowing teams to adapt and explore within defined boundaries. This means treating agent configuration like production infrastructure — versioning, reviewing, and testing prompts and skills before rolling them out gradually.
Enforce least privilege for non-human actors: Never allow an agent to simply inherit the full permissions of its human operator. Human engineers are granted broad access because they possess contextual judgment and bear ultimate accountability. Deploying agents with human-level access without careful consideration introduces an accountability gap into your systems. Implement strict separation between read and write/execute access, and mandate human-in-the-loop approval gates for destructive or production-altering actions. As agents transition from suggesting code to autonomously executing tasks, they must be rigorously incorporated into your security model.
Watch your wallet: Protect your overall AI budget by enforcing quotas and rate limits for both engineering and production. Cautionary tales are increasingly common: Uber capped its AI spend after burning its 2026 budget by April, and, according to Axios, an unnamed company incurred a staggering $500 million Anthropic bill in a single month due to runaway agentic loops.
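A quota can be as simple as a guard that every agent step must pass through. The sketch below is a minimal illustration, assuming a known per-step cost estimate; `SpendGuard`, its parameters, and the dollar figures are hypothetical. It enforces both a spend cap and a per-task step limit, since a runaway loop can hit either one first.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a spend cap or step limit would be exceeded."""


class SpendGuard:
    """Tracks estimated spend and refuses work past a hard cap."""

    def __init__(self, cap_usd: float, max_steps_per_task: int):
        self.cap_usd = cap_usd
        self.max_steps = max_steps_per_task
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a cost, or raise if it would push spend over the cap."""
        if self.spent_usd + cost_usd > self.cap_usd:
            raise BudgetExceeded("spend cap reached")
        self.spent_usd += cost_usd

    def run_task(self, step, cost_per_step_usd: float) -> int:
        """Call step() until it returns True; return the number of steps used."""
        for n in range(1, self.max_steps + 1):
            self.charge(cost_per_step_usd)  # pay before each model call
            if step():
                return n
        raise BudgetExceeded("step limit reached")


if __name__ == "__main__":
    guard = SpendGuard(cap_usd=1.00, max_steps_per_task=5)
    try:
        guard.run_task(lambda: False, cost_per_step_usd=0.25)  # never finishes
    except BudgetExceeded as exc:
        print("stopped:", exc)
```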
Build the engine: Choose the right models and measure their success.
Go multi-model and multi-vendor: No single model excels at every task. It’s important to precisely characterize the behavior and performance boundaries across models to understand where each excels, routing specific tasks to the systems best equipped to handle them. Standardizing on a single vendor or model sacrifices capabilities and introduces a critical single point of failure. No organization should absorb that level of concentration risk in its core engineering function.
Pay for the frontier: Treat AI as engineering leverage, not just another SaaS expense. Pay for premium frontier models that deliver the highest quality output and reduce costly rework. Ultimately, the cheapest model isn’t the one with the lowest token price — it’s the one that maximizes efficiency while minimizing your downstream risk.
Measure what actually matters: Deployments, lines of code, and pull requests were never good metrics for productivity, and with AI, they are actively misleading. Instead, aim for metrics that are attached to business outcomes (feature adoption, retention) and engineering durability (change failure rate, escaped defects, code survival over time). For AI efficiency, measure task success per dollar and rework time. Token counts are convenient for leaderboards but they cannot tell you if the tokens were well spent.
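As a small worked example of the efficiency metrics named above, the sketch below computes task success per dollar and mean rework time from a list of task records. The `TaskRecord` fields and the sample numbers are hypothetical; how an organization defines "success" and attributes cost is the hard part and is not addressed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    succeeded: bool        # did the agent's output ship without being redone?
    cost_usd: float        # model spend attributed to the task
    rework_minutes: float  # human time spent fixing the output


def success_per_dollar(records: List[TaskRecord]) -> float:
    """Successful tasks divided by total spend (0.0 if nothing was spent)."""
    total_cost = sum(r.cost_usd for r in records)
    wins = sum(1 for r in records if r.succeeded)
    return wins / total_cost if total_cost else 0.0


def mean_rework_minutes(records: List[TaskRecord]) -> float:
    """Average human rework time per task (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r.rework_minutes for r in records) / len(records)


if __name__ == "__main__":
    sample = [
        TaskRecord(True, 2.0, 10),
        TaskRecord(False, 1.0, 30),
        TaskRecord(True, 1.0, 20),
    ]
    print(success_per_dollar(sample), mean_rework_minutes(sample))
```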
Realign your human capital to manage the new bottleneck.
Shift engineers from syntax to systems: As agents handle the bulk of code generation, human review and architectural alignment are the new bottlenecks. Organizations must deliberately upskill their workforce to transition from syntax-writers to systems-thinkers and agent-managers. Engineers need the training and mandate to guide agentic processes, manage complex cross-system integrations, and hold the overarching architectural vision that agents can struggle to maintain.
Redefine performance and incentives: When an individual engineer can generate the output of a former squad, traditional metrics like story points or sprint velocity can become ineffective overhead. Consider realigning your evaluation frameworks to better reward expanded business impact, cross-system reliability, and effective agent orchestration. If you want systems-thinkers who cover more strategic surface area, are willing to explore and take risks, and build products in a durable way, you must reward them for higher-level impact, not sheer volume of output.
Don’t cut headcount before your strategy adapts: If you haven’t integrated agentic workflows, measured augmented output in production, and reworked your roadmap around faster execution, you do not actually know whether your needs and capabilities align. Cutting headcount before establishing that baseline isn’t discipline — it’s blindness. The goal is not simply smaller teams, but teams capable of covering more strategic surface area.
AI is not a replacement for engineering judgment; it is a force multiplier for it. In well-structured systems, it safely accelerates delivery. In poorly understood systems, it accelerates failure. We are already seeing the fallout: Outages, rising technical debt, and unexpected cost spikes driven by poorly governed adoption. These are operational failures, not theoretical risks.
The mistake organizations are now making isn’t adopting AI too slowly — it’s adopting it without understanding where it breaks.
For the C-suite, understanding this dynamic is no longer optional — it is the determining factor in how a business navigates this era. The challenge is that execution velocity is outpacing the industry’s ability to manage the consequences. We have handed engineering teams the ultimate power tool. The old adage demands that you measure twice and cut once. Instead, too many firms are opting to just cut.
Joe Bertolami is CTO and co-founder of Clifton AI.
Our system did one thing, and it did it well: It turned natural-language questions into API calls.
The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:
```json
{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}
```
The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered it via email, as a Drive document, or as a chart rendered in the browser.
By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.
The contract between the LLM and the rest of the system was a structured JSON object as described in the above example.
We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.
Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.
First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned sales volume for all time or all regions or returned a 500 error.
Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.
We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.
Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.
LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.
This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.
The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.
Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being “helpful” in its formatting choices, decided that asking for clarification or including the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. However, it violated the assumptions under which our system was built.
The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.
Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren’t using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn’t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
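The semantic rules a schema cannot express can still be written down as ordinary checks. The following is a minimal sketch, not our production code: the function name `semantic_violations`, the `REQUIRED_FILTERS` tuple, and the question-mark heuristic are all illustrative assumptions.

```python
# Assumption: these are the filters the downstream API must always receive.
REQUIRED_FILTERS = ("start_date", "end_date")


def semantic_violations(response: dict) -> list[str]:
    """Return human-readable problems that a JSON schema alone would not catch."""
    problems = []

    # A schema-valid but empty post_body would silently widen the query.
    body = response.get("post_body") or {}
    missing = [key for key in REQUIRED_FILTERS if not body.get(key)]
    if missing:
        problems.append(f"missing filters: {', '.join(missing)}")

    # Crude heuristic (assumption): a question mark in the description suggests
    # the model asked for clarification instead of describing an API call.
    if "?" in str(response.get("description", "")):
        problems.append("description looks like a clarifying question")

    return problems
```

Checks like these can run as evals before a model change ships, and the same functions can serve as a runtime guard that refuses to issue an API call when the list is non-empty.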
The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.
In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:
```python
def test_description_contains_no_serialized_payload(response):
    # The description must be plain natural language, never a serialized request.
    desc = response["description"].lower()
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"
```
A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
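A gate of this kind might be sketched as follows, assuming each eval is stored as the triple of input, property, and scoring function. The names `Eval`, `run_gate`, `filters_present`, and `min_pass_rate` are hypothetical, and `call_model` stands in for whatever function invokes the model and parses its response. Because outputs are probabilistic, the sketch samples each input several times and compares a pass rate to a threshold instead of trusting a single run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Eval:
    name: str
    prompt: str                            # the input
    prop: Callable[[dict], bool]           # property the output must satisfy
    score: Callable[[list[bool]], float]   # scoring function over repeated samples


def pass_rate(results: list[bool]) -> float:
    """Fraction of samples that satisfied the property."""
    return sum(results) / len(results) if results else 0.0


def filters_present(response: dict) -> bool:
    """Example property: the request payload must not come back empty."""
    return bool(response.get("post_body"))


def run_gate(evals, call_model, samples=5, min_pass_rate=1.0):
    """Run every eval `samples` times; return (ok, {eval name: failing score})."""
    failures = {}
    for ev in evals:
        results = [ev.prop(call_model(ev.prompt)) for _ in range(samples)]
        rate = ev.score(results)
        if rate < min_pass_rate:
            failures[ev.name] = rate
    return (not failures, failures)
```

A model upgrade or prompt change would then merge only when `run_gate` returns `ok` as true. Where to set `min_pass_rate` for each property is a judgment call that depends on how costly a violation is.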
Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said “the description field should not contain a curl command,” because nobody had thought the model would put one there.
Evals are not a silver bullet. They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.
The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next several years.
The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.
Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.
Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.
In 2024, researchers from the University of Illinois found that GPT-4, when provided with a common vulnerabilities and exposures (CVE) description, could autonomously exploit 87% of a curated 15-vulnerability one-day dataset. Without the description, it could only exploit 7%. This provided a “margin of safety” for the industry because while AI could exploit known vulnerabilities, it could not discover them.
However, on April 7, Anthropic announced that Claude Mythos Preview had closed that margin, with the model autonomously discovering thousands of zero-day vulnerabilities across major operating systems and browsers. Separately, Mythos scored 83.1% on the CyberGym vulnerability reproduction benchmark. In one campaign targeting OpenBSD across 1,000 scaffold runs, the total compute cost was less than $20,000.
Exploitation timelines are collapsing. Langflow’s CVE-2026-33017 (CVSS 9.8) was exploited 20 hours after disclosure with no public proof-of-concept. Marimo’s CVE-2026-39987 (CVSS 9.3) was hit in 9 hours and 41 minutes.
The defensive infrastructure most organizations rely on wasn’t designed for this. Rapid7’s 2026 threat landscape report states that the median time from CVE publication to CISA’s known exploited vulnerabilities (KEV) listing is five days. Google’s M-Trends 2026 report found that exploitation is happening before a patch is even released. When the Langflow advisory was published, the first exploit arrived in 20 hours. When the Marimo advisory was published, it took under 10 hours.
The assumption that your patch window is safe because exploitation takes time is no longer true. Here are your building blocks.
Most vulnerability management programs still prioritize by CVSS score alone. CVSS quantifies a vulnerability’s “theoretical” severity without considering whether a vulnerability is being exploited in the wild or how quickly someone could weaponize it. A CVSS 8.8 vulnerability with a history of active exploitation (like Docker’s CVE-2026-34040) gets lower priority than a CVSS 9.8 vulnerability that may never be exploited in the wild.
A recent study validated against 28,377 real-world vulnerabilities offers a concrete replacement: A three-layer decision tree that combines CISA KEV status, Exploit Prediction Scoring System (EPSS) scores, and CVSS into a single prioritization filter.
| Layer | Data source | Threshold | Action | SLA |
| --- | --- | --- | --- | --- |
| 1. Active exploitation | CISA KEV catalog | Listed | Immediate patching | Hours |
| 2. Predicted exploitation | EPSS via FIRST.org | Score ≥ 0.088 | Escalate to Tier 0 pipeline | 24 hours |
| 3. Severity baseline | CVSS via NVD | Score ≥ 7.0 | Typical remediation | Per policy |
Validated result: 18x efficiency gain, 85.6% coverage of exploited vulnerabilities, ~95% reduction in urgent remediation workload. All three data sources are open and free.
The described integration is entirely automatable. It’s possible to build a script to query the CISA KEV API, the EPSS API from FIRST.org, and the NVD, and have that script run against your asset inventory for every published CVE. The human in this process should remain in the loop as an approver, but not as the trigger.
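The decision logic itself is small. The sketch below assumes the three signals have already been fetched for a CVE and only encodes the thresholds from the table above. The `CveSignals` type, the `prioritize` function, and the final "Backlog" fallback for CVEs that clear no layer are illustrative assumptions, not part of the cited study.

```python
from dataclasses import dataclass

EPSS_THRESHOLD = 0.088  # Layer 2 threshold from the table
CVSS_THRESHOLD = 7.0    # Layer 3 threshold from the table


@dataclass
class CveSignals:
    cve_id: str
    in_kev: bool   # listed in the CISA KEV catalog
    epss: float    # EPSS score, 0.0 to 1.0
    cvss: float    # CVSS base score, 0.0 to 10.0


def prioritize(signals: CveSignals) -> tuple[str, str]:
    """Return (action, SLA), evaluating the layers in order."""
    if signals.in_kev:
        return ("Immediate patching", "Hours")
    if signals.epss >= EPSS_THRESHOLD:
        return ("Escalate to Tier 0 pipeline", "24 hours")
    if signals.cvss >= CVSS_THRESHOLD:
        return ("Typical remediation", "Per policy")
    # Assumption: anything below all three thresholds goes to the backlog.
    return ("Backlog", "Per policy")
```

The order of the checks matters: a KEV-listed vulnerability with a modest CVSS score is handled ahead of a high-CVSS vulnerability with no exploitation signal.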
Rapid exploit creation changes not only how patches are prioritized, but also how controls must be configured for the agent-driven systems that now hold privileged credentials. Your authorization policies have not been assessed against the behavior of AI agents, and that is now a measurable risk. CVE-2026-34040 showed that Docker’s authorization plugin architecture silently bypasses every plugin when the request body exceeds 1MB. Common AuthZ plugins (OPA, Casbin, Prisma Cloud) are unaware of this type of bypass, which occurs in Docker’s middleware before the request reaches the plugin.
When Cyera demonstrated this vulnerability, they showed that an AI agent debugging infrastructure could infer the bypass path while completing a legitimate task, without any instruction to exploit anything.
The Internet Engineering Task Force (IETF) is working on authorization models for agents. The document draft-klrc-aiagent-auth-01, published in March by participants from AWS, Zscaler, Ping Identity, and OpenAI, proposes using the existing Secure Production Identity Framework for Everyone (SPIFFE) and OAuth 2.0 so that AI agents can obtain dynamically provisioned, short-lived credentials.
Separately, the IETF Agent Identity Protocol draft (draft-prakash-aip-00) reports that out of about 2,000 surveyed Model Context Protocol (MCP) servers, none had authentication.
But these standards are months to years away from implementation. For now, security teams must proactively incorporate agent-level test scenarios for all authorization boundaries, such as oversized requests, burst frequency, and multi-step escalation of privileged requests.
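An oversized-request scenario could be sketched as below. The helper pads a request that policy must deny to sizes just under, at, and just over a limit, then records any size at which the request was allowed. All names here are hypothetical, and `is_denied` is a placeholder for whatever client code sends the body to the system under test and reports whether it was rejected. Treating 1MB as 1024 × 1024 bytes is also an assumption.

```python
import json

MB = 1024 * 1024  # assumption: the limit is measured in binary megabytes


def padded_json_body(target_bytes: int, payload: dict) -> bytes:
    """Return a JSON body of exactly target_bytes by adding a '_pad' field."""
    base = dict(payload, _pad="")
    overhead = len(json.dumps(base).encode("utf-8"))
    if target_bytes < overhead:
        raise ValueError("target is smaller than the unpadded payload")
    base["_pad"] = "x" * (target_bytes - overhead)
    return json.dumps(base).encode("utf-8")


def boundary_sizes(limits_mb=(1, 5, 10)) -> list[int]:
    """Sizes one byte below, at, and one byte above each limit."""
    return [mb * MB + delta for mb in limits_mb for delta in (-1, 0, 1)]


def find_bypasses(is_denied, payload: dict, sizes) -> list[int]:
    """Return the body sizes at which a must-deny request was allowed."""
    return [n for n in sizes if not is_denied(padded_json_body(n, payload))]
```

An empty result from `find_bypasses` means the denial held at every tested size. Any size in the list marks a boundary where enforcement depends on the request body length.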
In a survey conducted by CSA/Zenity and published on April 16, 53% of organizations said they had already seen cases where AI agents exceeded their intended permissions, and 47% experienced a security incident involving an agent.
When AI builder tools such as Flowise (CVE-2025-59528, CVSS 10.0), Langflow, or n8n become compromised, the blast radius extends far beyond the host. These tools contain API keys to frontier models, database credentials, vector store tokens, and OAuth tokens to business systems. A compromised AI builder host is not just a single-system breach. It is a credential harvest that unlocks authenticated access to every connected service.
Without credential dependency maps for each AI tool host, incident response for agent compromise is guesswork. For every instance, document each credential, the extent of its access, and the relevant credential rotation process. Also begin migrating static API keys to short-lived tokens where downstream services allow.
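A credential dependency map does not need special tooling to start. The sketch below shows one possible shape for it, with hypothetical type and field names, and answers the two questions an incident responder asks first: what a compromised host unlocks, and which of its credentials are long-lived.

```python
from dataclasses import dataclass, field


@dataclass
class Credential:
    name: str
    grants_access_to: list[str]   # downstream services this credential unlocks
    lifespan: str                 # "static" or "short-lived"
    rotation_runbook: str         # where the rotation procedure is documented


@dataclass
class BuilderHost:
    hostname: str
    tool: str                     # e.g. "Langflow", "Flowise", "n8n"
    credentials: list[Credential] = field(default_factory=list)


def blast_radius(host: BuilderHost) -> list[str]:
    """Every service reachable with credentials stored on this host."""
    return sorted({svc for cred in host.credentials for svc in cred.grants_access_to})


def static_keys(host: BuilderHost) -> list[str]:
    """Credentials that are candidates for migration to short-lived tokens."""
    return [cred.name for cred in host.credentials if cred.lifespan == "static"]
```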
1. Deploy the three-layer KEV-EPSS-CVSS filter.
Replace CVSS-only prioritization with the three-layer filter in the table above. Automate data collection from all three APIs with a scheduled script that runs against your asset inventory. Expected outcome, per the validated result above: an 18x efficiency gain, 85.6% coverage of exploited vulnerabilities, and a roughly 95% reduction in urgent remediation workload.
2. Implement event-driven patching for Tier 0 services.
Determine which services fall under the critical exposure tier: Services exposed directly to internet users, AI builder hosts, and container orchestration control plane. Trigger event-driven patching on a CVE publication instead of waiting for the next maintenance window for this tier.
Goal: deploy patch to canary within four hours of a CVE being declared critical. Use the CISA KEV and EPSS feeds to trigger event-driven patching. In situations where it is impossible to meet the goal of four-hour patching because of legacy dependencies, change-freeze windows, or rollback risk, immediately apply compensating controls such as removing internet exposure to the vulnerable service, rotating credentials for the vulnerable service, disabling affected functionality of the service (if applicable), and identifying an exception owner for the exposure until a patch can be deployed.
It is not acceptable to allow unbounded exposures for extended periods while awaiting a maintenance window.
3. Test authorization boundaries at agent scale.
Create test cases against the AuthZ policies of every API that AI agents may communicate with. Include requests with body sizes exceeding 1MB, 5MB, and 10MB, burst rates above 100 requests per second, and unusual parameter combinations (privileged flags, host mounts, capability additions). Additionally, patch to Docker Engine 29.3.1 to fix CVE-2026-34040.
4. Credential blast radius mapping for all AI builder hosts.
Document each credential for each Langflow, Flowise, n8n, and custom AI pipeline instance. Classify each credential by its lifespan (static key vs. short-lived token). Identify what each credential can access. Set up alerts for any credential access from an anomalous IP address or identity.
5. Shadow AI discovery scan for this week.
According to CSA data, there is a greater than 50% chance that your agents have exceeded their expected boundaries. Check your security information and event management (SIEM) and network monitoring tools for traffic to the default ports of common AI builder tools: Langflow 7860, Flowise 3000, and n8n 5678. Any unauthorized instance is an unmonitored attack surface.
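The discovery scan in step 5 can start as a filter over exported connection records. This sketch assumes each record is a dict with `dst_host` and `dst_port` keys, which is a hypothetical export format, not the schema of any particular SIEM. The function name and the allowlist parameter are illustrative too.

```python
# Default ports named in step 5.
DEFAULT_PORTS = {7860: "Langflow", 3000: "Flowise", 5678: "n8n"}


def find_unsanctioned(flows, sanctioned_hosts):
    """Map (host, port) to the suspected tool for hosts not on the allowlist."""
    hits = {}
    for flow in flows:
        tool = DEFAULT_PORTS.get(flow["dst_port"])
        if tool and flow["dst_host"] not in sanctioned_hosts:
            hits.setdefault((flow["dst_host"], flow["dst_port"]), tool)
    return hits
```

Treat the results as leads, not findings. Port 3000 in particular is used by many unrelated web applications, so each hit needs confirmation before it is counted as a shadow deployment.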
AI agents are emerging, and the standards bodies are responding. The IETF has multiple drafts related to agent authentication and authorization. The Coalition for Secure AI has published its MCP Security taxonomy and Secure-by-Design principles.
But these standards move at standards-body speed, and the exploit window is now measured in hours. Organizations that implement the three-layer filter and event-driven patching this quarter will have a measurable reduction in exposure. Those that wait will be running calendar-based patch cycles against an adversary that operates in less than 20 hours.
Nik Kale is a principal engineer specializing in enterprise AI platforms and security.
Over the past two decades, technical debt meant outdated architecture, messy code, and poorly maintained documentation. That definition is no longer sufficient in the AI era, where failure modes are more subtle and often non-linear. AI systems are introducing new layers of technical debt that live across prompts, models, and data dependencies — making these layers less visible, harder to measure, and often more dangerous than traditional debt.
The complexities of AI systems and their associated failures have been well documented. A 2025 MIT study found that 95% of AI projects fail to reach production or deliver value. A similar study by S&P Global Market Intelligence found that 42% of businesses scrapped multiple AI initiatives in 2025 — a sharp increase from 17% the previous year. Various reasons are cited for these failures, but most of them point to poorly designed and implemented systems that are complex to manage and have multiple hard-to-monitor failure points, leading to a rapid accumulation of AI debt.
Traditional technical debt was localized to the codebase, and bugs were usually easily reproducible. Consequently, bugs could be easily identified during tests and fixed through rearchitecting the codebase. However, AI debt is much more distributed, manifesting across prompts, models, data pipelines, and all associated infrastructure. It is also more intermittent: Due to the probabilistic nature of AI, systems do not always respond the same way, leading to intermittent failures. This makes it much more challenging to identify risks during testing, and also creates a need for more continuous monitoring even post-deployment to prevent gradual drift and worsening performance.
AI debt typically manifests across four new forms, each of which comes with its own set of risks.
Prompt debt is the most visible of these. A modern version of ‘spaghetti code,’ this can include undocumented prompt tweaks, accumulated ‘quick-fix’ prompts that lead to inconsistencies, neglected version control of prompts, and ‘prompt stuffing’ (the cramming of extraneous data or context directly into AI prompts). All these combine to make prompts a form of untyped, untested code without any version control, leading to increased brittleness and vulnerabilities.
Model dependency debt is another increasingly common form of AI debt. Most enterprises now depend on a mixture of external models developed by leading foundation model providers; applications and agents are built on top of API calls to these models. Consequently, application logic now depends on models that are external to the core system, and that cannot be clearly controlled. As models update, performance varies and reproducibility is lost — prompts tuned for one model may fail or perform poorly when switched to another model, whether an update from the same provider or from another provider.
Most enterprise AI deployments today use retrieval-augmented generation (RAG), which pulls in additional context from enterprise data repositories. Retrieval debt is a consequence of these repositories having messy data, duplicated documents, and outdated information. This causes AI to return technically correct answers that are outdated and no longer relevant, causing downstream failures. Unlike hallucinations, these are harder to detect because they were correct, perhaps even until recently, and hence look correct to any tester.
Evaluation debt reflects the lack of standardization in testing and monitoring for AI models and applications. While AI benchmarks exist, they tend to focus on narrow tests and reflect point-in-time results. Most enterprises lack consistent testing standards, ground truth datasets, and real-time monitoring of deployments; there is no equivalent yet of continuous integration/continuous delivery (CI/CD) for prompts. As a consequence, CIOs and CTOs do not have clear visibility into model performance and cannot track whether models are improving or degrading.
All of these are in addition to traditional forms of technical debt, which still manifest across the tools and systems that AI applications and agents interact with, read from, or write to. A rapid increase in the adoption of AI-generated code (often deployed without adequate testing) is further aggravating the inconsistencies and poor maintainability of traditional codebases.
The new forms of AI debt combine with these earlier forms of technical debt to compound rapidly and create large-scale risks that can cause catastrophic failure of entire enterprise deployments. Solving for these risks is made even more challenging by the distributed nature of AI ownership – most systems span engineering, product, data, and business teams, leading to unclear accountability when an error is identified.
As a result, these risks manifest in the form of escalating compute costs, inaccuracies in AI outputs, and increasing exceptions that need to be handled by humans — leading to projects often stalling and failing due to unclear return-on-investment stories and a lack of trust from users.
AI debt will not be solved by ‘better’ models — failure rates remain high despite models already having high accuracy. The solution to AI debt requires better system design, integration, controls, and changes in organizational culture.
First, prompts need to be treated as code. This involves careful version control, documentation, and rigorous testing both pre- and post-deployment for all possible prompt configurations. Best practices from the traditional world of coding — such as the use of smaller prompt blocks instead of large prompt-stuffed walls, or reducing the use of hard-coded parameters — can also help mitigate AI debt.
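As one illustration of treating prompts as code, the sketch below composes a prompt from small named blocks, passes values in as parameters instead of hard-coding them, and derives a version identifier from the block contents so that any edit is visible in logs and test results. The block names, the sample texts, and the 12-character hash length are assumptions made for the example.

```python
import hashlib
from string import Template

# Small, individually reviewable prompt blocks (illustrative content).
BLOCKS = {
    "role": "You are a reporting assistant.",
    "format": "Return a JSON object with the fields: $fields.",
}


def _join(block_names) -> str:
    return "\n\n".join(BLOCKS[name] for name in block_names)


def render_prompt(block_names, **params) -> str:
    """Fill in parameters; a missing parameter raises KeyError instead of shipping."""
    return Template(_join(block_names)).substitute(**params)


def prompt_version(block_names) -> str:
    """Short content hash that changes whenever any block's text changes."""
    return hashlib.sha256(_join(block_names).encode("utf-8")).hexdigest()[:12]
```

Recording `prompt_version` alongside every model response makes it possible to tie a regression back to the exact prompt text that produced it.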
Second, evaluation needs to be built into the entire AI infrastructure stack. Continuous evaluation pipelines need to be established and must track a wide variety of metrics, both technical and business-aligned. In addition, AI observability systems should be integrated to monitor output quality, failure rates, model drift, and data drift.
Third, explainability should be included by default in all AI results to make up for limited reproducibility. Data lineage, models used, and the steps followed should be clearly traceable so as to allow auditability of results and correction in case of any systemic errors.
This requires explicit AI debt reduction programs and associated budgets, similar to earlier waves of investment in security or in cloud modernization. These need to be driven at a CXO level by key leaders to prevent costly rework later.
Enterprise AI deployments are not just static code; they are living systems that interact with the entire enterprise stack. As a result, the defining challenge in an agentic enterprise will not be building or deploying intelligent systems; it will be maintaining these systems to ensure continued reliability during real-world operation.
Enterprises that seek to proactively identify and mitigate AI debt from the design phase itself are the likeliest to build sustainable AI platforms that deliver significant long-term productivity boosts across the organization.
Vikram is a principal at Cota Capital, where he invests in early-stage enterprise tech and deep tech companies.
There is a category of production incident that engineering teams are not tracking yet — because it doesn’t fit any existing postmortem template.
The agent initiated an action. The action was technically correct given the agent’s context. The context was incomplete. The infrastructure cascaded. And, by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.
The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.
What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.
I’ve spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).
During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.
To understand why this matters, you need to understand what’s actually broken in how enterprises govern chaos today, before you add agents to the picture.
Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It’s imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.
When you introduce an autonomous remediation agent, one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies, that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.
Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster, a reasonable action given its training data and its narrow view of the incident. What the agent doesn’t know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.
What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.
Nobody’s chaos engineering program had tested for that specific combination. Nobody’s blast radius calculation had included the agent as an actor. Because we don’t think of agents as chaos injectors. We should.
According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.
The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don’t manage it at all.
Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners across organizations including Intuit and GPTZero, I’ve been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.
A resilience budget draws on four live signal classes.
SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is burning its monthly error budget at five times the expected rate, the resilience budget is near zero regardless of what CPU utilization looks like.
P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.
Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it’s sitting at 87% will produce failure modes that nobody designed for.
Application behavioral signals (session completion rates, API call pattern shifts, conversion degradation) surface system stress earlier than infrastructure metrics do, because users feel the degradation before Prometheus reports it.
What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.
Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.
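To make the idea of a shared, consumable budget concrete, here is a deliberately simple sketch. It is not the model itself: the normalization constants, the weakest-signal rule in `headroom`, and every name in the code are assumptions chosen for illustration. The one anchor taken from the text is that a burn rate of five times the expected rate leaves no budget.

```python
from dataclasses import dataclass, field


@dataclass
class Signals:
    slo_burn_rate: float          # 1.0 = burning error budget at the expected rate
    p99_trend: float              # fractional rise in P99 over the window (0.2 = +20%)
    dependency_saturation: float  # 0..1, e.g. 0.87 for a pool at 87%
    behavioral_health: float      # 0..1, e.g. session completion vs. baseline


def headroom(s: Signals, max_burn: float = 5.0, max_trend: float = 0.5) -> float:
    """Absorb capacity in 0..1, limited by the weakest of the four signals."""
    parts = [
        1.0 - min(s.slo_burn_rate / max_burn, 1.0),
        1.0 - min(max(s.p99_trend, 0.0) / max_trend, 1.0),
        1.0 - min(max(s.dependency_saturation, 0.0), 1.0),
        min(max(s.behavioral_health, 0.0), 1.0),
    ]
    return min(parts)


@dataclass
class Ledger:
    entries: list = field(default_factory=list)  # (actor, cost) pairs

    def available(self, signals: Signals) -> float:
        spent = sum(cost for _, cost in self.entries)
        return max(headroom(signals) - spent, 0.0)

    def reserve(self, actor: str, cost: float, signals: Signals) -> bool:
        """Record the draw only if the shared budget can cover it."""
        if cost > self.available(signals):
            return False
        self.entries.append((actor, cost))
        return True
```

Because experiments and agents draw from the same `Ledger`, a second actor sees the capacity the first one already consumed, which is the property that separate per-team thresholds cannot provide.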
Several engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes, particularly when working from rich postmortem history.
The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn’t reflect last month’s service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake; it’s that the model doesn’t know it’s making one. It will be confidently incorrect about a system boundary that no longer exists, and in chaos engineering, confident incorrectness in production means an unplanned outage.
Stanford’s Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct: A model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.
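One partial mitigation is to check a hypothesis against a record of what changed after the graph was captured. The sketch below assumes a change log of `(service, changed_at)` pairs is available from deployment tooling. Both that format and the function names are hypothetical.

```python
from datetime import datetime


def stale_services(graph_snapshot_at: datetime, change_log) -> set[str]:
    """Services that changed after the dependency graph snapshot was taken."""
    return {svc for svc, changed_at in change_log if changed_at > graph_snapshot_at}


def hypothesis_is_trustworthy(hypothesis_services, graph_snapshot_at, change_log) -> bool:
    """False when the hypothesis touches any service the graph may misdescribe."""
    stale = stale_services(graph_snapshot_at, change_log)
    return not (set(hypothesis_services) & stale)
```

This only catches changes that were recorded. A dependency added without a matching change-log entry still slips through, so a passing check lowers the risk without removing it.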
When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality. This is the tractable near-term AI application in this space, and it is genuinely useful for organizations with mature incident documentation practices.
What AI cannot do, and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.
A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent, and building an agent architecture that ignores it is building one that will eventually make a consequential decision with incomplete information — and no human in the loop to catch it.
The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, and dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.
Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn’t only whether the restart completed successfully. It’s whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies. That is chaos engineering data. It belongs in the budget model, feeding the next decision the agent or the team needs to make.
And when signals are genuinely ambiguous, when the budget score is unclear, when a recent deployment has changed the topology in ways the agent’s context window doesn’t capture, when dependency states are in flux, the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.
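The three ambiguity conditions above can be made explicit in code. This is a sketch under stated assumptions: the `Context` fields, the 60-minute settle window, and the 0.4 floor are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Context:
    budget_score: float               # point estimate in [0, 1]
    budget_uncertainty: float         # half-width of the estimate's error band
    minutes_since_last_deploy: float
    dependencies_in_flux: int         # dependencies currently changing state


def is_ambiguous(ctx: Context, floor: float = 0.4,
                 deploy_settle_minutes: float = 60.0) -> bool:
    # The score is unclear if its error band straddles the floor.
    straddles_floor = abs(ctx.budget_score - floor) <= ctx.budget_uncertainty
    # A recent deployment may have changed topology the agent cannot see.
    topology_unsettled = ctx.minutes_since_last_deploy < deploy_settle_minutes
    return straddles_floor or topology_unsettled or ctx.dependencies_in_flux > 0


def route(ctx: Context, floor: float = 0.4) -> str:
    """Return who makes the execution decision: 'human', 'agent', or 'wait'."""
    if is_ambiguous(ctx, floor):
        return "human"
    return "agent" if ctx.budget_score >= floor else "wait"
```

Note that the ambiguity check runs before the floor comparison, so a high budget score never overrides an unsettled topology.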
A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production. Intent-based verification formalizes exactly this: Defining what correct agent behavior looks like before deployment, then continuously probing whether those boundaries hold under live system conditions.
The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.
The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.
Most organizations running agents at scale today have several. Find them before production does.
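As a starting point, the audit can be as simple as an inventory check. The sketch below assumes a hand-maintained inventory; the `AgentRecord` fields and the example agent names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRecord:
    name: str
    actions: list[str]             # infrastructure actions the agent can take
    gating_signal: Optional[str]   # SLO burn-rate signal it checks, if any
    floor: Optional[float]         # floor below which it must wait or escalate


def ungoverned(agents: list[AgentRecord]) -> list[str]:
    """Names of agents that touch infrastructure without a signal and a floor."""
    return [
        a.name
        for a in agents
        if a.actions and (a.gating_signal is None or a.floor is None)
    ]


inventory = [
    AgentRecord("autoscaler-bot", ["scale_up", "scale_down"], "checkout_burn_1h", 0.4),
    AgentRecord("restart-bot", ["restart_service"], None, None),
    AgentRecord("report-bot", [], None, None),  # read-only, out of scope
]
```

Running `ungoverned(inventory)` on this example flags only `restart-bot`, the one agent that can change infrastructure with no gate in front of it.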
Sayali Patil has spent 6-plus years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.
Retrieval-augmented generation (RAG) has become the de facto standard for grounding large language models (LLMs) in private data. The standard architecture — chunking documents, embedding them into a vector database, and retrieving top-k results via cosine similarity — is effective for unstructured semantic search.
However, for enterprise domains characterized by highly interconnected data (supply chain, financial compliance, fraud detection), vector-only RAG often fails. It captures similarity but misses structure. It struggles with multi-hop reasoning questions like, “How will the delay in Component X impact our Q3 deliverable for Client Y?” because the vector store doesn’t “know” that Component X is part of Client Y’s deliverable.
This article explores the graph-enhanced RAG pattern. Drawing on my experience building high-throughput logging systems at Meta and private data infrastructure at Cognee, I will walk through a reference architecture that combines the semantic flexibility of vector search with the structural determinism of graph databases.
Vector databases excel at capturing meaning but discard topology. When a document is chunked and embedded, explicit relationships (hierarchy, dependency, ownership) are often flattened or lost entirely.
Consider a supply chain risk scenario. While this is a hypothetical example, it represents the exact class of structural problems we see constantly in enterprise data architectures:
Structured data: A SQL database defining that Supplier A provides Component X to Factory Y.
Unstructured data: A news report stating, “Flooding in Thailand has halted production at Supplier A’s facility.”
A standard vector search for “production risks” will retrieve the news report. However, the retrieved chunk likely lacks the context to link that report to Factory Y’s output. The LLM receives the news but cannot answer the critical business question: “Which downstream factories are at risk?”
In production, this manifests as either a hallucination or a non-answer. The LLM attempts to bridge the gap between the news report and the factory but lacks the explicit link, so it either guesses at relationships or returns an “I don’t know” response despite the data being present in the system.
To solve this, we move from a “Flat RAG” to a “Graph RAG” architecture. This involves a three-layer stack:
Ingestion (The “Meta” Lesson): At Meta, working on the Shops logging infrastructure, we learned that structure must be enforced at ingestion. You cannot guarantee reliable analytics if you try to reconstruct structure from messy logs later. Similarly, in RAG, we must extract entities (nodes) and relationships (edges) during ingestion. We can use an LLM or named entity recognition (NER) model to extract entities from text chunks and link them to existing records in the graph.
Storage: We use a graph database (like Neo4j) to store the structural graph. Vector embeddings are stored as properties on specific nodes (e.g., a RiskEvent node).
Retrieval: We execute a hybrid query:
Vector scan: Find entry points in the graph based on semantic similarity.
Graph traversal: Traverse relationships from those entry points to gather context.
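Before bringing in a database, the two retrieval steps can be illustrated with a toy in-memory graph. Everything here is made up for the illustration: the three-dimensional embeddings, the node names, and the `AFFECTS` and `SUPPLIES` relationship types.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data: only event nodes carry embeddings; edges are directed.
EMBEDDINGS = {
    "event:flood": [0.9, 0.1, 0.0],
    "event:strike": [0.1, 0.9, 0.0],
}
EDGES = {
    "event:flood": [("AFFECTS", "supplier:A")],
    "supplier:A": [("SUPPLIES", "factory:Y")],
    "event:strike": [("AFFECTS", "supplier:B")],
}


def vector_scan(query_vec, k=1):
    """Step 1: pick entry points by semantic similarity."""
    ranked = sorted(EMBEDDINGS, key=lambda n: cosine(query_vec, EMBEDDINGS[n]),
                    reverse=True)
    return ranked[:k]


def traverse(start, max_hops=2):
    """Step 2: walk outgoing edges from an entry point, collecting paths."""
    paths, frontier = [], [(start, [start])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for rel, dst in EDGES.get(node, []):
                new_path = path + [rel, dst]
                paths.append(new_path)
                next_frontier.append((dst, new_path))
        frontier = next_frontier
    return paths


def hybrid_retrieve(query_vec, k=1, max_hops=2):
    return [p for entry in vector_scan(query_vec, k) for p in traverse(entry, max_hops)]
```

With a query vector close to the flood event, `hybrid_retrieve([1.0, 0.0, 0.0])` returns a one-hop path to `supplier:A` and a two-hop path that ends at `factory:Y`. The second path is the context a vector-only lookup never sees.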
Let’s build a simplified implementation of this supply chain risk analyzer using Python, Neo4j, and OpenAI.
We need a schema that connects our unstructured “risk events” to our structured “supply chain” entities.
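The original schema listing is not reproduced here, so what follows is a reconstruction under assumptions rather than the author's code. It assumes Neo4j 5.13 or later (for the `CREATE VECTOR INDEX` syntax), 1536-dimensional embeddings, and hypothetical names: the `Supplier` and `Factory` labels, the `SUPPLIES` and `AFFECTS` relationship types, and the `risk_event_embedding` index. Only the `RiskEvent` label comes from the text above.

```python
# Assumed model:
#   (:Supplier {name})-[:SUPPLIES]->(:Factory {name})
#   (:RiskEvent {description, embedding})-[:AFFECTS]->(:Supplier)
SCHEMA_STATEMENTS = [
    # Uniqueness constraints so ingestion can link to existing records by name.
    "CREATE CONSTRAINT supplier_name IF NOT EXISTS "
    "FOR (s:Supplier) REQUIRE s.name IS UNIQUE",
    "CREATE CONSTRAINT factory_name IF NOT EXISTS "
    "FOR (f:Factory) REQUIRE f.name IS UNIQUE",
    # Vector index over the embedding property stored on RiskEvent nodes.
    "CREATE VECTOR INDEX risk_event_embedding IF NOT EXISTS "
    "FOR (e:RiskEvent) ON (e.embedding) "
    "OPTIONS {indexConfig: {`vector.dimensions`: 1536, "
    "`vector.similarity_function`: 'cosine'}}",
]


def apply_schema(session):
    """Run each statement through a Neo4j driver session (or any object
    exposing a compatible run() method). Returns the number applied."""
    for statement in SCHEMA_STATEMENTS:
        session.run(statement)
    return len(SCHEMA_STATEMENTS)
```

The dimension must match whatever embedding model is used at ingestion and query time; 1536 is only a placeholder for that choice.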
In this step, we assume the structural graph (suppliers -> factories) already exists. We ingest a new unstructured “risk event” and link it to the graph.
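A sketch of this ingestion step is shown below; it is an illustration, not the original listing. It assumes the supplier name has already been extracted from the text (by an LLM or NER model), that `Supplier` nodes are keyed by `name`, and that an `AFFECTS` relationship type is used. The `embed` parameter is a hypothetical callable that turns text into a vector, for example a thin wrapper around an embeddings API.

```python
INGEST_QUERY = """
MATCH (s:Supplier {name: $supplier_name})
CREATE (e:RiskEvent {description: $description, embedding: $embedding})
CREATE (e)-[:AFFECTS]->(s)
RETURN elementId(e) AS event_id
"""


def ingest_risk_event(session, description, supplier_name, embed):
    """Create a RiskEvent node and link it to an existing Supplier.

    session: a Neo4j driver session (anything exposing run(query, params)).
    embed:   hypothetical callable, text -> sequence of floats.
    """
    params = {
        "description": description,
        "supplier_name": supplier_name,
        # Stored as a node property so a vector index can serve it later.
        "embedding": list(embed(description)),
    }
    return session.run(INGEST_QUERY, params)
```

Because the query starts with `MATCH`, an event that names an unknown supplier creates nothing. That is a design choice in this sketch: unlinked events are rejected at ingestion instead of being stored as orphans.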
This is the core differentiator. Instead of just returning the top-k chunks, we use Cypher to perform a vector search to find the event, and then traverse to find the downstream impact.
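The query below sketches that pattern; it is a reconstruction, not the author's listing. It assumes Neo4j 5.11 or later (for `db.index.vector.queryNodes`), an existing vector index named `risk_event_embedding` on `RiskEvent.embedding`, and the hypothetical `AFFECTS` and `SUPPLIES` relationship types.

```python
HYBRID_QUERY = """
CALL db.index.vector.queryNodes('risk_event_embedding', $k, $query_embedding)
YIELD node AS event, score
MATCH (event)-[:AFFECTS]->(s:Supplier)-[:SUPPLIES]->(f:Factory)
RETURN event.description AS issue,
       s.name AS impacted_supplier,
       f.name AS risk_to_factory
ORDER BY score DESC
"""


def find_downstream_risk(session, query_embedding, k=3):
    """Vector search for entry points, then traverse to affected factories.

    session: a Neo4j driver session (anything whose run() returns an
             iterable of records exposing .data()).
    """
    result = session.run(HYBRID_QUERY, {"k": k, "query_embedding": list(query_embedding)})
    # Each record becomes a plain dict the LLM prompt can consume directly.
    return [record.data() for record in result]
```

The first clause is the vector scan and the `MATCH` is the traversal. Events with no path to a factory drop out of the result, so the payload contains only relationships that actually exist in the graph.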
The output: Instead of a generic text chunk, the LLM receives a structured payload:
[{'issue': 'Severe flooding…', 'impacted_supplier': 'TechChip Inc', 'risk_to_factory': 'Assembly Plant Alpha'}]
This allows the LLM to generate a precise answer: “The flooding at TechChip Inc puts Assembly Plant Alpha at risk.”
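How the payload reaches the model is left open above. One plausible approach, sketched here with a hypothetical `build_prompt` helper and made-up instruction wording, is to render each row as an explicit fact line so the model answers from stated relationships.

```python
def build_prompt(question, rows):
    """Render structured graph results as explicit facts for the LLM."""
    facts = [
        f"- Issue: {r['issue']} | Supplier: {r['impacted_supplier']} "
        f"| Factory at risk: {r['risk_to_factory']}"
        for r in rows
    ]
    return "\n".join([
        "Answer using only the facts below. If they are insufficient, say so.",
        "Facts:",
        *facts,
        f"Question: {question}",
    ])


rows = [{
    "issue": "Severe flooding…",
    "impacted_supplier": "TechChip Inc",
    "risk_to_factory": "Assembly Plant Alpha",
}]
prompt = build_prompt("Which downstream factories are at risk?", rows)
```

The resulting string can be sent as the user message to whichever chat model is in use.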
Moving this architecture from a notebook to production requires handling trade-offs.
Graph traversals are more expensive than simple vector lookups. In my work on product image experimentation at Meta, we dealt with strict latency budgets where every millisecond impacted user experience. While the domain was different, the architectural lesson applies directly to Graph RAG: You cannot afford to compute everything on the fly.
Vector-only RAG: ~50-100ms retrieval time.
Graph-enhanced RAG: ~200-500ms retrieval time (depending on hop depth).
Mitigation: We use semantic caching. If a user asks a question similar (cosine similarity > 0.85) to a previous query, we serve the cached graph result. This reduces the “graph tax” for common queries.
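A minimal semantic cache might look like the sketch below. The linear scan and the `SemanticCache` class are assumptions made for brevity; a production cache would more likely use an approximate nearest-neighbor index and an eviction policy.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Serve a cached graph result when a new query is close to an old one."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self._entries = []  # list of (query_embedding, graph_result)

    def get(self, query_embedding):
        best_score, best_result = 0.0, None
        for cached_embedding, result in self._entries:
            score = cosine(query_embedding, cached_embedding)
            if score > best_score:
                best_score, best_result = score, result
        # Hit only when similarity strictly exceeds the threshold.
        return best_result if best_score > self.threshold else None

    def put(self, query_embedding, result):
        self._entries.append((list(query_embedding), result))


cache = SemanticCache()
cache.put([1.0, 0.0], [{"risk_to_factory": "Assembly Plant Alpha"}])
hit = cache.get([0.99, 0.05])   # near-duplicate query
miss = cache.get([0.0, 1.0])    # unrelated query
```

On a miss, the caller runs the full graph traversal and stores the result with `put`. Cached entries inherit the staleness risk of the graph itself, so they need an expiry policy as well.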
In a vector database, each chunk is independent, so a stale chunk affects only its own retrieval. In a graph, answers depend on the edges between records. If Supplier A stops supplying Factory Y but the edge remains in the graph, the RAG system will confidently report a relationship that no longer exists.
Mitigation: Graph relationships must have Time-To-Live (TTL) or be synced via Change Data Capture (CDC) pipelines from the source of truth (the ERP system).
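Both mitigations can be sketched over plain dictionaries. The edge shape (`src`, `rel`, `dst`, `last_synced`) and the CDC event format with an `op` field are assumptions for this example, not the format of any particular CDC tool.

```python
from datetime import datetime, timedelta, timezone


def live_edges(edges, now, ttl):
    """TTL: keep only edges confirmed by the source of truth within `ttl`."""
    return [e for e in edges if now - e["last_synced"] <= ttl]


def apply_cdc_event(edges, event, now):
    """CDC: apply one change event ('upsert' or 'delete') to the edge list."""
    key = (event["src"], event["rel"], event["dst"])
    remaining = [e for e in edges if (e["src"], e["rel"], e["dst"]) != key]
    if event["op"] == "upsert":
        remaining.append({"src": event["src"], "rel": event["rel"],
                          "dst": event["dst"], "last_synced": now})
    return remaining


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
edges = [
    {"src": "Supplier A", "rel": "SUPPLIES", "dst": "Factory Y",
     "last_synced": now - timedelta(days=30)},
    {"src": "Supplier B", "rel": "SUPPLIES", "dst": "Factory Z",
     "last_synced": now - timedelta(hours=2)},
]

fresh = live_edges(edges, now, ttl=timedelta(days=7))
after_delete = apply_cdc_event(
    edges, {"op": "delete", "src": "Supplier A", "rel": "SUPPLIES", "dst": "Factory Y"}, now
)
```

TTL is the blunter tool: it drops the Supplier A edge only because nobody reconfirmed it. CDC removes the edge as soon as the ERP system records the change, at the cost of running a pipeline.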
Should you adopt Graph RAG? Here is the framework we use at Cognee:
Use vector-only RAG if:
The corpus is flat (e.g., a chaotic Wiki or Slack dump).
Questions are broad (“How do I reset my VPN?”).
Latency < 200ms is a hard requirement.
Use graph-enhanced RAG if:
The domain is regulated (finance, healthcare).
“Explainability” is required (you need to show the traversal path).
The answer depends on multi-hop relationships (“Which indirect subsidiaries are affected?”).
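The checklist can be encoded as a small helper. The function name and the "conflict" outcome are assumptions: the framework above does not say what to do when a multi-hop domain also has a hard sub-200ms latency requirement, so this sketch surfaces that case instead of picking a side.

```python
def recommend_rag(multi_hop, needs_explainability, regulated, latency_budget_ms=None):
    """Map the decision criteria to a retrieval pattern.

    Returns 'graph-enhanced', 'vector-only', or 'conflict' when graph
    criteria apply but the latency budget is under 200 ms (an assumption
    of this sketch, not part of the original framework).
    """
    graph_signals = multi_hop or needs_explainability or regulated
    hard_latency = latency_budget_ms is not None and latency_budget_ms < 200
    if graph_signals and hard_latency:
        return "conflict"
    return "graph-enhanced" if graph_signals else "vector-only"
```

A conflict result is a prompt to revisit the mitigations, such as caching frequent traversals, before committing to either architecture.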
Graph-enhanced RAG is not a replacement for vector search, but a necessary evolution for complex domains. By treating your infrastructure as a knowledge graph, you provide the LLM with the one thing it cannot hallucinate: The structural truth of your business.
Daulet Amirkhanov is a software engineer at UseBead.