AI agents are entering their rebuild era as enterprises confront the reliability problem

As enterprise AI agents move into production, organizations are confronting a growing reliability problem. Many teams are discovering that LLM performance alone does not determine whether agents succeed in production. Long-running AI workflows must survive crashes, preserve state, recover from failures, manage inference costs, and coordinate across APIs, tools, and enterprise systems.

After a first wave focused on rapid deployment, organizations now need to revisit those first-generation implementations, and redesign early agent architectures around workflow orchestration, observability, governance, and recovery, said Preeti Somal, Senior VP Engineering at Temporal Technologies, during the latest AI Impact Series event in New York.

“We do have a lot of customers that come to us where they’re building version 2.0 of the same agent,” Somal said. “They had to move really fast, but they didn’t take care of the plumbing. Things crash and burn, and then they’re back to rebuilding with the reliable foundation.”

For workflow orchestration company Temporal, whose infrastructure predates the current wave of agentic AI, the shift reflects a broader enterprise realization: production AI systems require durable execution, state management, visibility into workflows, and mechanisms to recover when models or downstream systems fail.

Agentic AI has supercharged familiar engineering problems

“These patterns aren’t necessarily new,” Somal said. ” AI just supercharges them.”

Agentic systems introduce additional complexity because they often involve long-running, multi-step processes spanning multiple services, models, APIs, and tools. A single workflow might call several large language models, access retrieval systems, trigger external applications, and manage state over hours or days. The engineering questions, Somal said, often emerge only after deployment.

“People will write agents but haven’t thought about what happens if the agent crashes,” she said. “Am I going to need to run the entire agent flow again?”

For enterprises operating under cost constraints, the answer matters. Restarting workflows after failures can multiply inference expenses, increase latency, and create poor customer experiences.

Somal compared the current moment to an earlier period in enterprise cloud adoption when organizations went straight to migrating workloads before considering that they needed to redesign underlying architectures if they wanted these workloads to weather the long-term.

“This rush to do AI in a world where you haven’t even modernized your application reminds me a little bit of that lift-and-shift that happened in the cloud,” she said. “Everybody realized you’re spending more money on cloud and we haven’t gotten value there.”

Why long-running agents force a new architecture

Enterprise workflows increasingly involve agents executing over long windows, sometimes spanning many hours while interacting with tools and systems. Reliability challenges compound when workflows persist over time, and it impacts both state and memory, two ideas that are often treated interchangeably in AI conversations.

State concerns workflow execution. It includes where an agent is in a process, which actions have already completed, and where recovery should resume after failure. Memory or context captures information an agent carries forward across interactions or tasks.

“The state of the agent is around what step and what actions have been performed, and if something crashes, where do you want to recover from, versus the context and memory piece,” Somal explained.

That distinction becomes increasingly important when enterprises begin moving beyond simple chatbot interactions toward longer-running business processes. Somal pointed to a healthcare example involving customer Abridge, where workflows process physician visits through multiple stages, including audio processing, summarization, model calls, and after-visit generation.

“There’s not just one piece to that flow,” Somal said. “Taking videos and slicing that, taking summaries, calling the LLMs, generating the after-visit summary, all of that is being orchestrated.”

The implication for enterprises is that successful agents increasingly depend on systems that can survive interruptions, coordinate across services, and maintain continuity over time.

The rise of the deterministic spine

A useful framework for enterprise AI design is the deterministic spine, Somal said, which is how they think about Temporal’s role.

“It is denoting the path you want to take,” she said. “It is calling the brain, but if the brain doesn’t respond, it will call it again. If the brain responds but the next step is going to fail, it will pick up from where that failure happened.”

In this framing, the language model acts as a probabilistic system producing variable outputs, while orchestration software maintains execution reliability around it. And the concept matters because enterprise systems increasingly require consistency even when models remain non-deterministic. A procurement workflow, healthcare summary, customer support escalation, or compliance process cannot simply fail silently because a model call timed out or an external dependency crashed.

“What you care most about is making sure that you can recover and that you’re not paying the token tax if something goes wrong,” Somal said.

Reliability, visibility, and the economics of token spend

As enterprise leaders evaluate AI ROI, cost visibility has become a growing concern. Long-running agents frequently make multiple model calls across complex workflows, which can create opaque spending patterns. Somal described one operational advantage of orchestration as visibility into where costs accumulate. Because workflows are observable step-by-step, teams can see where tokens are being consumed across an agent process.

“You’ve got visibility into that entire flow in a single pane of glass,” she said. “You can now see where you’re spending the tokens in an agent that is multiple steps and calling multiple different systems.”

Workflow recovery also shapes cost efficiency. Without durable orchestration, a late-stage failure can force organizations to rerun an entire process from the beginning, including all prior model calls. Somal said systems designed around recovery can resume execution from the point of interruption.

“You pick up from where the crash happened,” she said. “We save you the cost of running the agent from step one again.”

Enterprises need to build paved paths and enlist partner expertise

Governance concerns are another emerging pattern as agentic AI takes hold. Rather than adopting fully managed agent systems wholesale, Somal said enterprises increasingly want standardized internal frameworks that provide guardrails while preserving flexibility, and implementing necessary features like governance controls, model selection policies, identity systems, cost management, and observability.

“The enterprises are looking at building these paved paths,” she said. “Taking something off the shelf is maybe not going to work because there are all of these other requirements.”

As organizations revisit first-generation deployments, challenges like this increasingly look less like a model problem and more like a systems engineering problem, and Temporal is positioned to help enterprises take this next step in part because for many organizations, it already existed as part of broader modernization programs before AI became a strategic priority.

“Temporal is already in the enterprise,” Somal said. “Taking that and extending that to AI and agent platforms feels very natural.”

Researchers automated LLM reasoning strategy design and cut token usage by 69.5%

Test-time scaling (TTS) has emerged as a proven method to improve the performance of large language models in real-world applications by giving them extra compute cycles at inference time. However, TTS strategies have historically been handcrafted, relying heavily on human intuition to dictate the rules of the model’s reasoning. 

To address this bottleneck, researchers from Meta, Google, and several universities have introduced AutoTTS, a framework that automatically discovers optimal TTS strategies. This automated approach allows enterprise organizations to dynamically optimize compute allocation without manually tuning heuristics. 

By implementing the optimal strategies discovered by AutoTTS, organizations can directly reduce the token usage and operational costs of deploying advanced reasoning models in production environments. In experimental trials, AutoTTS managed inference budgets efficiently, successfully reducing token consumption by up to 69.5% without sacrificing accuracy.

The manual bottleneck in test-time scaling

Test-time scaling enhances LLMs by granting them extra compute when generating answers. This extra compute allows the model to generate multiple reasoning paths or evaluate its intermediate steps before arriving at a final response. 

The primary challenge for designing TTS strategies is determining how to allocate this extra computation optimally. Historically, researchers have designed these strategies manually, relying on guesswork to build rigid heuristics. Engineers must hypothesize the rules and thresholds for when a model should branch out into new reasoning paths, probe deeper into an existing path, prune an unpromising branch, or stop reasoning altogether. 

Because this manual tuning process is constrained by human intuition, a vast amount of possible approaches remain unexplored. This often results in suboptimal trade-offs between model accuracy and computing costs.

Current TTS algorithms can be mapped to a width-depth control space — “width” being the number of reasoning branches explored, “depth” being how far each develops. Self-consistency (SC) samples a fixed number of trajectories and majority-votes the answer. Adaptive-consistency (ASC) saves compute by stopping early once a confidence threshold is hit. Parallel-probe takes a more granular approach, pruning unpromising branches while deepening the rest. All three are hand-crafted, and that’s the constraint AutoTTS is designed to break.

While some more advanced methods employ richer structures like tree search or external verifiers, they all share one key characteristic: they are meticulously hand-crafted. This manual approach restricts the scope of strategy discovery, leaving a massive portion of the potential resource-allocation space untouched.

Automating strategy discovery with AutoTTS

AutoTTS reframes the way test-time scaling is optimized. Instead of treating strategy design as a human task, AutoTTS approaches it as an algorithmic search problem within a controlled environment. 

This framework redefines the roles of both the human engineer and the AI model. Rather than hand-crafting specific rules for when an LLM should branch, prune, or stop reasoning, the engineer’s role shifts to constructing the discovery environment. The human defines the boundaries, including the control space of states and actions, optimization objectives balancing accuracy versus cost, and the specific feedback mechanisms. 

An explorer LLM, such as Claude Code, designs the strategy. This explorer acts as an autonomous agent that iteratively proposes TTS “controllers.” These controllers are code-defined policies or algorithms that dictate how an AI model allocates its computational budget during inference. The explorer tests and refines these controllers based on feedback until it discovers an optimal resource-allocation policy. 

To make this automated search computationally affordable, AutoTTS relies on an “offline replay environment.” If the explorer LLM had to invoke a base reasoning model to generate new tokens every time it tested a new strategy, the compute costs would be astronomical. Instead, it relies on thousands of reasoning trajectories pre-collected from the base LLM. These trajectories include “probe signals,” which are intermediate answers that help the controller evaluate progress across different reasoning branches. 

During the discovery loop, the explorer agent proposes a controller and evaluates it against this offline data. The agent observes the execution traces of the proposed controller that show it allocated compute over time. By analyzing these traces, the agent can diagnose specific failure modes, such as noting if a controller pruned branches too aggressively in a specific scenario. This provides an advantage over just viewing a final result. The agent then iteratively rewrites its code to improve the accuracy-cost tradeoff. 

Inside the AI-designed controller

Because the explorer agent is not constrained by human intuition, it can discover highly coordinated, complex rules that a human engineer would likely never hand-code. One optimal controller discovered by AutoTTS, named the Confidence Momentum Controller, leverages several non-obvious mechanisms to manage compute:

  • Trend-based stopping: Hand-crafted strategies often instruct the model to stop reasoning once it hits a certain instantaneous confidence threshold. The AutoTTS agent discovered that instantaneous confidence can be misleading due to temporary spikes. Instead, the controller tracks an exponential moving average (EMA) of confidence and only stops if the overall confidence level is high and the trend is not actively declining.

  • Coupled width-depth control: Manually designed algorithms usually treat the “widening” of new reasoning paths and the “deepening” of current paths as separate decisions. AutoTTS discovered a closed feedback loop where the two actions are linked. If the confidence of the current branches stalls or regresses, the controller automatically triggers the spawning of new branches.

  • Alignment-aware depth allocation: Instead of giving all active reasoning branches an equal computation budget, the controller dynamically identifies which branches agree with the current leading answer. It then gives those branches priority “bursts” of extra computation. This concentrates the computational budget on the emerging consensus to quickly verify if it is correct.

Cost savings and accuracy gains in real-world benchmarks

To test whether an AI could autonomously discover a better test-time scaling strategy, researchers set up a rigorous evaluation framework. The core experiments were conducted on Qwen3 models ranging from 0.6B to 8B parameters. The researchers also tested the system’s ability to generalize on a distilled 8B version of the DeepSeek-R1 model. 

The explorer AI agent was initially tasked with discovering an optimal strategy using the AIME24 mathematical reasoning benchmark. This discovered strategy was then tested on two held-out math benchmarks, AIME25 and HMMT25, as well as the graduate-level general reasoning benchmark GPQA-Diamond. 

The AutoTTS discovered controller was pitted against four manually designed test-time scaling algorithms in the industry. These baselines included Self-Consistency with 64 parallel reasoning paths (SC@64), Adaptive-Consistency (ASC), Parallel-Probe, and Early-Stopping Self-Consistency (ESC). ESC is a hybrid approach that generates trajectories in parallel and stops early when an answer seems stable.

When set to a balanced, cost-conscious mode, the AutoTTS-discovered controller reduced total token consumption by approximately 69.5% compared to SC@64. At the same time, the controller maintained the same average accuracy across the four Qwen models. When the inference budget was turned up, AutoTTS pushed peak accuracy beyond all handcrafted baselines in five out of eight test cases.

This efficiency translated to other tasks. On the GPQA-Diamond benchmark, the balanced AutoTTS variant slashed the inference token cost from 510K tokens down to just 151K tokens, while slightly improving overall accuracy. On the DeepSeek model, AutoTTS achieved the highest overall accuracy on the HMMT25 benchmark while cutting the token spend nearly in half.

For practitioners building enterprise AI applications, these experiments highlight two major operational benefits:

  • Raising peak performance: AutoTTS doesn’t just save money on token consumption. It actively raises the peak attainable performance of the base model. The AI-designed controller is remarkably good at detecting noisy or unproductive reasoning branches on the fly and continuously redirecting its compute budget toward the branches generating the most useful reasoning signals.

  • Cost-effective custom development: Because the framework relies on an offline replay environment, the entire discovery process cost only $39.90 and took 160 minutes. For enterprise teams, that means optimized reasoning strategies tailored to proprietary models and internal tasks are now within reach — without a dedicated research budget.

Both the AutoTTS framework and the Confidence Momentum Controller are available on GitHub; the CMC can be used as a drop-in replacement for other TTS controllers.

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

There is a category of production incident that engineering teams are not tracking yet — because it doesn’t fit any existing postmortem template.

The agent initiated an action. The action was technically correct given the agent’s context. The context was incomplete. The infrastructure cascaded. And, by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure,  because the frameworks for thinking about these two things have never been connected.

The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.

What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.

I’ve spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).

During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.

The judgment call that agents skip

To understand why this matters, you need to understand what’s actually broken in how enterprises govern chaos today,  before you add agents to the picture.

Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It’s imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.

When you introduce an autonomous remediation agent,  one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies,  that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.

Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster; a reasonable action given its training data and its narrow view of the incident. What the agent doesn’t know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.

What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.

Nobody’s chaos engineering program had tested for that specific combination. Nobody’s blast radius calculation had included the agent as an actor. Because we don’t think of agents as chaos injectors. We should.

According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.

Absorb capacity is a resource; most systems don’t treat it that way

The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don’t manage it at all.

Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners across organizations including Intuit and GPTZero, I’ve been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.

A resilience budget draws on four live signal classes.

  • SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is burning its monthly error budget at five times the expected rate, the resilience budget is near zero regardless of what CPU utilization looks like.

  • P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.

  • Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it’s sitting at 87% will produce failure modes that nobody designed for.

  • Application behavioral signals,  session completion rates, API call pattern shifts, conversion degradation, and surface system stress earlier than infrastructure metrics do, because users feel the degradation before Prometheus reports it.

What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.

Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.

Where language models help,  and exactly where they fail

Several engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes, particularly when working from rich postmortem history.

The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn’t reflect last month’s service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake, it’s that the model doesn’t know it’s making one. It will be confidently incorrect about a system boundary that no longer exists, and in chaos engineering, confident incorrectness in production means an unplanned outage.

Stanford’s Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct, a model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.

When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality. This is the tractable near-term AI application in this space, and it is genuinely useful for organizations with mature incident documentation practices.

What AI cannot do,  and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.

A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent, and building an agent architecture that ignores it is building one that will eventually make a consequential decision with incomplete information — and no human in the loop to catch it.

What this means for how enterprises govern agents in production

The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.

Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn’t only whether the restart completed successfully. It’s whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies. That is chaos engineering data. It belongs in the budget model, feeding the next decision the agent or the team needs to make.

And when signals are genuinely ambiguous, when the budget score is unclear, when a recent deployment has changed the topology in ways the agent’s context window doesn’t capture, when dependency states are in flux,  the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.

A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production. Intent-based verification formalizes exactly this: Defining what correct agent behavior looks like before deployment, then continuously probing whether those boundaries hold under live system conditions.

The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.

The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.

Most organizations running agents at scale today have several. Find them before production does.

Sayali Patil has spent 6-plus years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.

Your AI agents need a terminal, not just a vector database

When agentic workflows fail, developers often assume the problem lies in the underlying model’s reasoning abilities. In reality, the limited information provided by the retrieval interface is often the primary limiting factor.

Researchers at multiple universities propose a technique called direct corpus interaction (DCI) that lets agents bypass embedding models entirely, searching raw corpora directly using standard command-line tools.

The limits of classic retrieval

In classic retrieval systems such as RAG, documents are chunked, converted into vector representations (or embeddings), and indexed offline in a vector database. When an AI system processes a query, a retriever filters the entire database to return a ranked “top-k” list of document snippets that match the query. All evidence must pass through this scoring mechanism before any downstream reasoning occurs.

But modern agentic applications demand much more. “Dense retrieval is very useful for broad semantic recall, but when an agent has to solve a multi-step task, it often needs to search for exact strings, numbers, versions, error codes, file paths, or sparse combinations of clues,” the authors of the DCI paper said in comments provided to VentureBeat. “These long-tail details are precisely where semantic similarity can be brittle.”

Unlike static search, agents must also revise their search plans dynamically after observing partial or localized evidence. Exact lexical constraints and multi-step hypothesis refinement are difficult to execute with semantic retrievers. Because the retriever compresses access into a single step, any critical evidence filtered out by the similarity search cannot be recovered later, no matter how advanced the agent’s downstream reasoning capabilities are. As the authors explain, current retrieval pipelines can become a bottleneck because “they decide too early what the agent is allowed to see.”

Direct corpus interaction

This direct access addresses a core problem in enterprise environments: data staleness. Embedding indexes are always a snapshot of a specific moment in time, taking considerable compute and time to build and maintain.

“In many enterprise settings, the data is not a stable document collection. It is daily financial reports, live logs, tickets, code commits, configuration files, incident timelines, and internal documents that keep changing,” the authors said. DCI lets the agent reason over the current state of the workspace rather than yesterday’s vector index.

The agent operates in a terminal-like environment where its observations are raw tool outputs such as file paths, matched text spans, and surrounding lines. The core tools provided by DCI are few but highly expressive. Agents use commands like “find” and “glob” to navigate directory structures and locate files. For exact matching, they use “grep” and “rg” to locate specific keywords, regex patterns, and exact strings. When local inspection is needed, tools like “head,” “tail,” “sed,” “cat,” and lightweight Python scripts allow the agent to peek at the context surrounding a match or read specific file sections.

The agent can combine these tools via shell pipelines to execute complex search logic in a single step. An agent can pipe commands to enforce strict lexical constraints, such as searching a file for one term and piping the output to search for a second term. It can combine multiple weak clues across a corpus by finding a specific file type, searching for a keyword like “report,” and filtering for a year like “2024.” It can also immediately verify a hypothesis by inspecting the exact lines around a keyword match.

DCI delegates semantic interpretation directly to the agent instead of relying on embedding-based similarity search. The agent can formulate hypotheses, test exact lexical patterns, and extract detailed information that a traditional semantic retriever might miss.

The researchers propose two versions of this system. DCI-Agent-Lite is designed as a lightweight, low-cost setup built on the GPT-5.4 nano model and restricted purely to raw terminal interactions like bash commands and basic file reads. Because reading raw files can quickly fill up a smaller model’s memory, this version relies on lightweight runtime context-management strategies to sustain long-horizon exploration.

DCI-Agent-CC is the higher-performance version, designed for teams with more compute budget. It runs on Claude Code powered by Claude Sonnet 4.6. Claude Code provides stronger prompting, more robust tool orchestration, and superior built-in context handling, which improves the agent’s stability during complex, multi-step searches across heterogeneous datasets.

DCI in action

The researchers tested both versions of DCI across agentic search benchmarks like BrowseComp-Plus, knowledge-intensive QA with single-hop and multi-hop reasoning, and information retrieval ranking in tasks requiring domain-specific reasoning and scientific fact-checking.

They tested DCI against three baselines. The first included open-weight retrieval agents such as Search-R1 and proprietary agents powered by frontier models like GPT-5 and Claude Sonnet 4.6, paired with standard retrievers. The second baseline included classical sparse retrievers like BM25 and dense retrievers like OpenAI’s text-embedding-3-large and Qwen3-Embedding-8B. The third baseline consisted of high-performing reasoning-oriented re-rankers like ReasonRank-32B and Rank-R1.

DCI systematically outperformed the baselines, according to the researchers. On the complex BrowseComp-Plus benchmark, swapping a traditional Qwen3 semantic retriever for DCI on a Claude Sonnet 4.6 backbone improved accuracy from 69.0% to 80.0% while reducing the API cost from $1,440 to $1,016. The return on investment for lightweight agents was also noticeable. DCI-Agent-Lite with GPT-5.4 nano competed with the OpenAI o3 model using traditional retrieval while cutting costs by more than $600.

On multi-hop QA benchmarks, DCI-Agent-CC reached an 83.0% average accuracy, improving on the strongest open-weight retrieval baseline by 30.7 points, according to the researchers.

The data shows that DCI has lower overall document recall than dense embedding models, but once it finds a relevant document, it extracts substantially more value from it.

“If an enterprise AI lead asked where DCI is most clearly useful, I would point to tasks that require exact evidence localization in a dynamic workspace: debugging production incidents, searching large codebases, analyzing logs, compliance investigation, audit trails, or multi-document root-cause analysis,” the researchers note.

In one complex deep-research task, the agent had to identify a specific soccer match based on 12 interlocking clues, including exact attendance, yellow cards, and player birth dates. A traditional retriever would fail by surfacing short, disconnected snippets. Instead, the DCI agent explored the file directory, read specific lines of a 1990 England versus Belgium match report to verify the exact number of substitutions, pulled a specific quote from an interview file, and verified the exact birth dates of two players by peeking into their Wikipedia text files. By chaining these simple commands, DCI ensures that no evidence is permanently lost behind a flawed semantic search algorithm.

Limits and practical implementation of DCI

DCI has a clear operating envelope where it scales excellently in search depth but struggles with search breadth. When the experimental corpus was expanded from 100,000 to 400,000 documents, the system’s accuracy dropped significantly and the average number of tool calls rose. While DCI is powerful once a promising document is found, the cost of locating that initial useful anchor document grows sharply as the size of the candidate space increases.

DCI also has lower broad document recall compared to dense embedding models. It trades exhaustive recall for high-resolution, local precision. If an enterprise workflow strictly requires finding every single relevant document across a massive dataset, DCI may not be the right tool.

Granting an agent expressive tools like an unrestricted bash shell increases latency and compute costs due to the high volume of iterative tool calls required to complete a search. It also creates significant context-management and security challenges for IT departments.

“Tool calls can return large outputs; long trajectories can fill the context window; and raw terminal access requires sandboxing, permission control, and careful engineering,” the authors said. To manage the context window, the researchers found that moderate truncation and compaction help the agent sustain longer searches, whereas overly aggressive summarization tends to discard useful evidence.

Because of these operational realities, DCI is not meant to be a mandatory replacement for existing vector infrastructure. Instead, it serves as a complementary one.

“For orchestration engineers and data architects, our view is that the most practical near-term deployment pattern is hybrid,” the authors said. Semantic retrieval can still provide high-recall candidate discovery when a user’s intent is broad or underspecified. “DCI can then operate as a precision and verification layer: the agent can search within the retrieved documents, expand from them into neighboring files, check exact constraints, and combine weak signals across documents.”

The researchers have released the code for DCI under the permissive MIT license.

“Longer term, DCI changes how we think about enterprise data. Data will not only need to be stored for humans or indexed for search engines; it will need to be organized for agents that can inspect, compare, grep, trace, and verify,” the authors conclude. “File names, timestamps, stable identifiers, metadata, version history, and machine-readable structure become part of the retrieval interface.”

A 0.12% parameter add-on gives AI agents the working memory RAG can’t

AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent re-ingests the same context it already processed, the team pays in latency, token costs, and brittle workflows. The fix most teams reach for — expanding the context window or adding more RAG — is increasingly expensive and still doesn’t reliably work.

To address this, researchers from Mind Lab and several universities proposed delta-mem, an efficient technique that compresses the model’s historical information into a dynamically updated matrix without changing the model itself. The resulting module adds just 0.12% of the backbone model’s parameters — compared to 76.40% for one leading alternative — while outperforming it on memory-heavy benchmarks. Delta-mem allows models to continuously accumulate and reuse historical data, reducing the reliance on massive context windows or complex external retrieval modules for behavioral continuity.

The long memory challenge

The conventional solution is to simply dump all the information into the model’s context window.

But as Jingdi Lei, co-author of the paper, told VentureBeat, current systems treat memory merely as a context-management problem. “Either we keep expanding the context window, or we retrieve more documents through RAG,” Lei explained. “These approaches are useful and will remain important, but they become increasingly expensive and brittle when agents need to operate over long-running, multi-step interactions, and they don’t really [work] like human memory since they are more like looking up documents.”

In enterprise settings, the bottleneck is not just whether the model can access history, but whether it can reuse that history efficiently, continuously, and with low latency. Standard attention mechanisms incur a quadratic computational cost as the sequence length increases. Furthermore, expanding the context window does not guarantee the model will actually recall the information effectively. Models often suffer from context degradation or context rot as they become overwhelmed with more (and often conflicting) information, even if they support one million tokens in theory.

The researchers argue for advanced memory mechanisms that can represent historical information compactly and maintain it dynamically across interactions. Existing solutions come with heavy trade-offs and generally fall into three paradigms:

  • Textual memory: stores history as text injected into context — constrained by window limits and prone to information loss under compression.

  • Outside-channel (RAG): encodes and retrieves from external modules — adds latency, integration complexity, and potential misalignment with the backbone.

  • Parametric: encodes memory into model weights via adapters — static after training, can’t adapt to new information during live interactions.

Inside delta-mem

To achieve a compact and dynamically updated memory, delta-mem compresses an agent’s past interactions into an “online state of associative memory” (OSAM). This state is maintained as a fixed-size matrix that preserves historical information while the underlying language model remains frozen.

For enterprise workflows, this translates directly to resolving operational bottlenecks. Lei noted that a persistent coding assistant, for example, “may need to remember project conventions, recent debugging steps, user preferences, or intermediate decisions across a workflow.” Similarly, a data analysis agent might “need to maintain task state, assumptions, and prior observations while iterating over multiple tool calls.” 

Rather than repeatedly retrieving and re-inserting all relevant history for these tasks, the delta-mem matrix provides a low-overhead way to carry forward useful interaction states inside the model’s forward computation.

During generation, the system does not retrieve raw text segments to add to the prompt. Instead, the backbone LLM’s current hidden state is projected into the matrix to retrieve old memory. This operation extracts context-relevant associative memory signals from delta-mem. These signals are then transformed into numerical corrections that are applied to the computations of the model. This steers the model’s reasoning at inference time without altering its internal parameters.

Following each interaction, delta-mem updates the online state using “delta-rule learning.” When new information arrives, the previous state makes a prediction about the resulting attention values. It then compares this prediction to the actual value and corrects the memory matrix based on the discrepancy.

This update mechanism relies on a “gated delta-rule.” Basically, the memory module has different knobs that control how much previous memory is kept and how much of the new memory is applied. This error correction with controlled forgetting allows the matrix to evolve over time, holding onto stable historical associations without being derailed by short-term noise.

The researchers explored three strategies for determining when and how the matrix updates:

  • Token-state write captures fine-grained changes but is vulnerable to short-term noise.

  • Sequence-state write averages tokens within a message segment, smoothing updates at the cost of some localized detail.

  • Multi-state write decomposes memory into sub-states for different information types like facts or task progress.

Delta-mem in action

The researchers evaluated delta-mem across three LLM backbones: Qwen3-8B, Qwen3-4B-Instruct, and SmolLM3-3B. They configured the framework with a compact 8×8 matrix. The system was tested on general capability benchmarks, including HotpotQA, GPQA-Diamond, and IFEval. It was also evaluated on memory-heavy tasks such as LoCoMo, which tests long-term conversational memory, and Memory Agent Bench, which assesses retention, retrieval, selective forgetting, and test-time learning over extended interactions.

The framework was compared against representative models from the three existing memory paradigms: textual memory baselines (e.g., BM25 RAG, LLMLingua-2, and MemoryBank), parametric systems (Context2LoRA and MemGen), and the outside-channel approach MLP Memory.

Across the board, delta-mem outperformed the baselines, according to the researchers. On the Qwen3-4B-Instruct backbone, the token-state write variant achieved an average score of 51.66%, easily surpassing the frozen vanilla backbone at 46.79% and the strongest baseline, Context2LoRA, at 44.90%. On the memory-heavy Memory Agent Bench, the average score jumped from 29.54% to 38.85%. Performance on the specific test-time learning subtask nearly doubled from 26.14 to 50.50.

However, the most compelling takeaways are the system’s operational efficiency. The researchers tested the framework in a no-context setting where the historical text was entirely removed from the context. Even without explicit text replay, delta-mem successfully recovered context-relevant evidence in multi-hop tasks. The researchers argue that the model remembers past interactions without needing to ingest massive amounts of prompt tokens.

The framework also adds only 4.87 million trainable parameters, representing just 0.12% of the Qwen3-4B-Instruct backbone. By comparison, the MLP Memory baseline required 3 billion parameters, scaling up to 76.40% of the backbone’s size while delivering inferior results. When prompt lengths scaled up to 32,000 tokens during inference tests, the framework maintained almost the exact same GPU memory footprint as a standard, unmodified model. It sidesteps the heavy memory bloat that affects other advanced memory systems like MemGen and MLP Memory.

Different update strategies proved beneficial depending on the underlying model capacity. The sequence-state write strategy was the most effective for stronger backbones like Qwen3-8B. These more capable models use the segment-level writing to smooth out updates and mitigate token-level noise. Conversely, the multi-state write strategy drove massive performance leaps for smaller backbones like SmolLM3-3B. For these lower-capacity models, separating memory into multiple states proved critical to minimizing information interference.

Implementing delta-mem in the enterprise stack

The researchers have released the code for delta-mem on GitHub and the weights for their trained adapters on Hugging Face. For AI engineering teams looking to integrate this framework into their existing inference stack, the process requires minimal computing resources.

“In practice, an engineering team would start from an existing instruction-tuned backbone, attach the Delta-Mem adapter modules to selected attention layers, train only the adapter parameters on domain-relevant multi-turn or long-context data… and then run inference with the memory state updated online during interaction,” Lei said. Crucially, teams do not need a massive pretraining corpus. The training data only needs to reflect the target memory behavior, such as multi-turn dialogues, agent traces, or domain workflows where earlier information must influence later decisions.

While compressing interaction history into a fixed-size mathematical matrix creates immense efficiency, it does come with trade-offs. Delta-mem is not a lossless replacement for explicit text logs or document retrieval. Because different pieces of information compete inside the same limited state, there is a risk of memory blending.

“Delta-Mem is useful when the system needs fast, online, continuously updated behavioral state,” Lei said. “RAG is better when the system needs exact factual recall, citation, compliance, auditability, or access to a large external knowledge base.” Remembering a user’s working style or a multi-step reasoning trajectory is a perfect fit for delta-mem, while retrieving a legal contract or a medical guideline should remain in a vector database.

This means the most realistic enterprise architecture moving forward is a hybrid approach. Delta-mem acts as a lightweight internal working memory, reducing the need to retrieve or replay everything all the time, while RAG serves as the explicit, high-capacity memory layer.

“Looking ahead, I do not think vector databases will become obsolete,” Lei said. “Instead, I expect enterprise AI stacks to become more layered. We will likely see short-term working memory inside the model, longer-term explicit memory in retrieval systems, and policy or audit layers that decide what should be stored, retrieved, forgotten, or exposed to the user.”

Google’s Managed Agents API promises one-call deployment at the cost of execution layer control

At Google I/O, the company unveiled Managed Agents in its Gemini API — a service that promises to collapse weeks of agent deployment work into a single API call. It’s also a sign that Google believes its ecosystem, including the newly launched Antigravity CLI, is ready to own the execution layer end-to-end.

Before a single agent is written, teams are already spending days on the unglamorous work: standing up execution environments, managing sandboxes, wiring tool call infrastructure. Model providers like Anthropic have launched platforms to handle much of that work — but Google’s approach is different.

Google said in a blog post that Managed Agents in the Gemini API abstracts “away the complexity so that you can focus on your product experience and agent behavior.” The service is available in preview via new custom templates in Google AI Studio.

The growth has introduced a real architectural question: should agent management live at the execution layer — embedded in the model or its harness — or at the infrastructure layer, as a separate runtime?

Comparing Google’s approach

Until recently, agent orchestration relied on frameworks that sat above the model, directing agents and letting teams control routing and execution separately. That layer is now being absorbed by the platforms themselves.

Recent platforms like Claude Managed Agents embed orchestration at the model layer rather than on a separate runtime platform. The idea is that the model owns the reasoning and orchestration layers, and enterprises have control over execution. 

AWS, through new capabilities on Bedrock AgentCore, adds managed harnesses that stitch together the upfront tasks for deploying agents. Google’s approach goes further, optimizing the model, harness, and sandbox together and running everything in secure Google-managed environments.

René Sultan of Ramp, cited in Google’s announcement, said the shift is concrete: “The real shift with Gemini Managed Agents is that the agent runtime moves into the platform. With the sandbox, infrastructure and execution loop managed for you, developers can focus on productizing the agent’s domain-specific behavior and iterating at a completely different pace.”

The new orchestration reality 

Enterprises starting fresh with agents could find the platform offerings from Anthropic and Google strong, especially since they remove much of the difficulty of deploying agents while still maintaining some control. Google, however, is pushing for a more vertically integrated system, while Anthropic is betting on the model layer as an orchestration plane, and AWS focuses on authorization. 

But this also brings some risks, according to XYO founder and chief executive Arie Trouw.

“An additional risk is that developers will switch out what previously were deterministic services for what will now be probabilistic services, which can introduce unpredictable outcomes for the users at best, or data corruption at worst,” Trouw told VentureBeat in an email. “This is the classic example of having an amazing hammer and everything starting to look like nails. I’ve seen this pattern repeatedly as a developer and business founder myself in the past few decades.”

Enterprise AI agents keep failing because they forget what they learned

RAG architectures are good at one thing: surfacing semantically relevant documents. That’s also where they stop.

A framework called a decision context graph addresses that gap by giving agents structured memory, time-aware reasoning, and explicit decision logic. Rippletide, a startup in the Neo4j ecosystem, has built one. The key capability: agents that are non-regressive, able to freeze validated sequences of actions and compound on them over time.

“The key point you want is non-regressivity: How do you make sure that, when the agent will generate something new, you can compound on the previous discoveries?” said Yann Bilien, Rippletid’s co-founder and chief scientific officer. 

Why RAG doesn’t go far enough

Enterprise context is sprawled across ERP tools, logs, databases, vector stores, and policy documents. Generative AI tools can retrieve from all of it — through keyword search, SQL queries, or full RAG pipelines — but retrieval has a ceiling.

Notably, data retrieved may not be relevant to the decision at hand (thus causing hallucinations); and, even if agents do pull the right data, they often lack guidance to make decisions backed by a strong rationale.

That is, RAG retrieves documents, not decision context. “Everyone starts with RAG: Pull relevant docs, stuff them in the prompt, let the model figure it out,” said Wyatt Mayham of Northwest AI Consulting

While that works fine for chatbots, it “breaks immediately” for agents that need to make decisions and take actions, he pointed out. “The biggest thing builders struggle with is the gap between retrieval and applicability.” 

A retrieved document doesn’t tell the agent whether it still applies, whether it’s been superseded, or whether there’s a conflicting rule that takes priority, Mayham said. “Agents need decision context, not just information.”

In construction (the human world), that might mean knowing that a pricing exception expired, that a safety policy only applies in certain jurisdictions, or that a standard operating procedure was updated a month prior. “Miss any of that, and the agent confidently does the wrong thing,” Mayham said. 

Without structured decision context, agents combine incompatible rules, invent constraints to fill gaps, and rely on what Bilien calls “probabilistic guesses over unbounded data.” Errors are difficult to reproduce because builders can’t trace why the agent made a given choice.

The compounding error problem is real, too, Mayham said: A small miss rate per step becomes “catastrophic” across a multi-step workflow. “That’s the main reason most enterprise agents never leave the pilot phase.” 

How decision context graphs get to the relevant answer 

A decision context graph solves this by encoding a structured map of what is applicable, what the rules are, and when they apply.

The framework is optimized for one question: “Given this situation, which context applies right now?” Time is treated as a first-class dimension; every rule, decision, and exception is scoped to when it is valid.

“The goal is to explicitly address missing, incoherent, or contradictory data when building the graph to avoid probabilistic [errors] once the agent is running,” Bilien said. 

The system is built around three principles:

  • Applicability: Logic is explicitly encoded so the agent knows what rules to remember and apply in a given situation. Context is returned only when it is relevant to the situation. 

  • Time‑aware memory: Every rule, decision, and exception is time-scoped. This allows agents to reason about “What was true then versus what is  true now,” then reproduce or explain its decisions.

  • Decision paths: The system can explain how it got from A to B and the “why” behind its rationale (for instance, why one piece of context was included and another was not). Agents are given “decision path” examples of how similar cases were handled before. 

At setup, unstructured data is ingested and structured into an ontology: what entities exist, what rules apply, what counts as an exception. Neuro-symbolic AI handles the pattern recognition and encodes formal, machine-readable logic. Over time, the system refines its knowledge base as new decisions are made.

“Neuro-symbolic brings two parts: A neuronal part giving a large autonomy to agents and a symbolic part to reduce the number of data needed and bring control,” Bilien said. 

The agent is tested at build time (pre-production) to validate its behaviors or pinpoint improvements. This reduces risks as well as computation needs during inferencing, he noted. 

Agents learning, rather than regressing 

When it comes to non-regression, the key piece is compounding both on intelligence (models) and on knowledge (shared between agents), Bilien said. It’s important that agents can explore; when they don’t know how to accomplish a task, they can attempt different possibilities, typically in a controlled environment or simulation (like a support bot trying multiple response patterns). 

Then, “once a solution is evaluated as satisfactory, the graph freezes that sequence of actions,” Bilien said. Future exploration then starts from this “stable base of validated behaviors” to prevent newly-acquired skills from overwriting previously learned good behavior. 

Before an agent acts or affects a customer, it checks against the graph: Is it violating a rule? Hallucinating? Staying within constraints? Can it generalize the solution across similar cases?

At a macro level, the system assesses outcomes: Did the behavior improve long-term performance? Did it generalize across similar contexts? Did it preserve previous capabilities?

“This determinism is key for agents to run reliability at scale,” Bilien said. It leads to behavior that is more consistent, predictable, explainable, and allowing for stronger control and auditability. 

“You want your agents to be able to learn by themselves when they face something they don’t know,” he said. “You want them to be able to explore and find new solutions.”

Getting beyond “episodic” memory

While the team initially assumed it would deploy RL everywhere, “that actually proved very difficult in an enterprise setting,” Bilien said. “Data are scarce for some specific use cases and messy for others.”

Typically, using raw data for reliable predictions has been a manual and time-consuming challenge, but “now with agents we entered a new era where building ontologies is possible automatically,” Bilien said. 

Classic supervised fine-tuning methods can lead to oscillations, when models forget the last skill they learned while learning the next tone. Overall, learning is not compounded, compression is “dramatic,” and models improve “episodically” rather than continuously, leading them to continually fail on new or unseen tasks. 

As Bilien noted: “You will never have a fully self-learning model if you are regressing every time.” 

In enterprise use cases — like banking where millions of transactions are processed a day — a high level of reliability is critical, he noted. “One question I ask all customers: Is 95% enough? In a lot of use cases, it’s not. You need 99.999%. 1% off is way too much.” 

Decision context graphs can close that gap, he contends: When the same customer support question is asked repeatedly, the agent will return a “satisfactory” answer predictably and without regression, all while retaining autonomy. 

Encoding applicability and temporal validity into a structured graph — rather than relying on an LLM to infer it — is a “sound approach” to a real limitation in existing retrieval frameworks, Mayham said. The open question is whether the automatic ontology generation holds up against the messy, diverse data that enterprises actually have. “That’s always the hard part,” he said.

NanoClaw’s creators are turning the secure, open source AI agent harness into an enterprise ‘second brain’

The creators of NanoClaw — the hit open source, enterprise-friendly variant of autonomous AI agent harness OpenClaw — are moving towards commercializing their technology for enterprises at scale, aiming to provide them with secure AI agents, and an ever-updating library of workplace context, for each human employee the enterprise has approved.

The duo, including former Wix.com engineer Gavriel Cohen and his brother Lazer Cohen, also founder of tech public relations firm Concrete Media, shared with VentureBeat that their new startup, NanoCo AI, has received a $12 million oversubscribed seed round was led by Valley Capital Partners.

The round features a roster of strategic backers that reads like an enterprise infrastructure all-star team, including Docker, Vercel, monday.com, Factorial Capital, and Hugging Face CEO and founder Clem Delangue.

Buoyed by the seed round, NanoCo AI wants to move beyond basic automation to offer every enterprise worker a secure “professional assistant.” Yet they are still committed to building out and maintaining NanoClaw as an MIT Licensed, enterprise-friendly, open source standard — just offering specialized commercial managed services integration atop it.

The new killer use case: an informed, ever-updating personal assistant for each human worker

Gavriel, now CEO of NanoCo AI, sees this personalized approach as the ultimate unlock for the modern worker.

“The killer use case is the the one to one we’re calling it professional assistant,” Cohen explained in a recent exclusive interview with VentureBeat. “If you can give someone an agent and make them twice, three times as effective, then you probably want more people as well, right?”

He noted that as users forward emails, documents, and call notes to the agent, it systematically builds an “LLM wiki” — similar to the “LLM Knowledge Base” concept articulated by influential AI researcher Andrej Karpathy — effectively creating a dynamic knowledge graph of the user’s specific job and projects.

This persistent memory allows the agent to shift from simply answering questions to actively transforming information and executing first drafts that rival human output.

Cohen emphasized that NanoClaw acts as a massive productivity multiplier rather than a headcount replacement.

One-to-one secure ‘lobster’ AI

NanoCo’s core offering is a one-to-one professional AI assistant designed to shadow employees, draft contracts, review code, and manage accounts directly within tools like Slack and Microsoft Teams.

Rather than a generic chatbot, the assistant learns the employee’s role and adapts to their specific working style through ordinary conversation.

How does NanoCo prevent this highly capable assistant from going rogue? By moving security away from fragile prompt engineering and embedding it directly into the infrastructure.

Unlike its predecessor and inspiration, the even popular open source AI assistant OpenClaw — which grew to a massive 400,000 lines of code — NanoClaw’s core logic was intentionally minimized to roughly 500 lines of TypeScript. This minimalism ensures the entire system can be audited by a human security team in about eight minutes.

Furthermore, every NanoClaw agent operates within a strictly isolated environment. Leveraging a strategic partnership with Docker announced in March, NanoCo AI runs these agents inside MicroVM-based Docker Sandboxes.

“In NanoClaw, the ‘blast radius’ of a potential prompt injection is strictly confined to the container and its specific communication channel,” Cohen previously explained.

To prevent unauthorized actions, raw API credentials never reach the agent itself. Instead, outbound requests pass through a secure OneCLI Rust Gateway that enforces company-defined policies. If an agent attempts a sensitive “write” action—like modifying a cloud environment or deleting an email—the gateway intercepts the request and pings the human user via a rich interactive card on Slack, Teams, or WhatsApp.

Only when the user explicitly taps “Approve” does the system inject the credential. It is the architectural equivalent of a highly capable junior employee drafting an important corporate communication, but being physically unable to click “send” without the manager turning a literal launch key.

Continued commitment to open source, MIT License

Despite its new enterprise push, NanoCo AI is maintaining its commitment to its open-source foundation. The core NanoClaw framework remains available under the permissive MIT License, meaning independent developers and companies can continue to fork, modify, and run the system locally.

In plain terms, the MIT License allows anyone to use the software commercially without paying NanoCo AI, provided they include the original copyright notice.

NanoCo AI’s monetization strategy instead focuses on the vast majority of enterprises that lack the specialized engineering resources to build, maintain, and scale internal agent platforms.

While highly technical teams can choose to build their own infrastructure on top of the open-source code, NanoCo will sell managed, organization-wide deployments, taking on the burden of health checks, integrations, and ongoing security maintenance.

Widespread global adoption

The open-source adoption of NanoClaw has been staggering, crossing 250,000 downloads and nearing 29,000 GitHub stars since its debut. This ground-up momentum is entirely responsible for the surging enterprise demand.

“Countless enterprise executives have told us the same thing,” Cohen stated in the press release. “They’re running NanoClaw personally, getting two and three times more done, and asking how to roll it out to their teams.”.

Perhaps the most high-profile validation came during the founders’ recent trip to Singapore. The country’s Foreign Minister, Dr. Vivian Balakrishnan, invited the NanoCo team to his office after publicly posting about his personal use of NanoClaw. Balakrishnan described the agent as “getting smarter over time,” referred to it as his “second brain,” and stated he wouldn’t “dare switch it off”.

Cohen put the platform’s security claims to the ultimate test during a live conference demonstration in Singapore. He invited a crowd of 300 people to chat simultaneously with his personal agent, which was actively connected to his real email and calendar.

Thanks to NanoClaw’s zero-trust gateway architecture, the agent safely rejected malicious attempts to access his inbox or delete existing events, while successfully allowing 12 attendees to book legitimate coffee chats.

As AI shifts from a novelty tool that answers questions into a digital workforce that autonomously executes tasks, NanoCo AI is betting that verifiable security will be the defining metric of success. By combining a transparent open-source core with strict, infrastructure-level sandboxing, they aren’t just selling an assistant; they are selling the peace of mind required for enterprises to actually use one.

Claude agents can finally connect to enterprise APIs without leaking credentials

The reason enterprises have been slow to connect AI agents to internal APIs and databases isn’t the models — it’s the credentials. In most production deployments, the agent carries authentication tokens with it as it executes tool calls, which means a compromised or misbehaving agent takes the keys with it.

Anthropic is addressing that problem with two new capabilities for Claude Managed Agents: self-hosted sandboxes, which let teams run tool execution inside their own infrastructure perimeter, and MCP tunnels, which connect agents to private MCP servers without exposing credentials in the agent’s context. Together they move credential control to the network boundary rather than leaving it inside the agent.

Right now, self-hosted sandboxes are available to Claude Managed Agent users in public beta, while MCP tunnels are currently in research preview.  

Anthropic isn’t the only model provider making this bet. OpenAI added local execution to its Agents SDK in April in response to similar demand. The architectural distinction Anthropic draws is a split: the agent loop runs on Anthropic’s infrastructure, while tool execution runs on the enterprise’s own system — a separation that existing sandbox approaches, including OpenAI’s, don’t make.

The architecture problem in sandboxes and agents

MCP moved to enterprise production faster than the security architecture around it matured. In most deployments, credentials travel through the agent itself as it executes tool calls against internal systems — meaning a compromised or misbehaving agent has everything it needs to cause damage.

Self-hosted sandboxes, such as those offered on Claude Managed Agents, help keep files and packages within an enterprise’s infrastructure. The agentic loop—orchestration, context management and error recovery—moves to the platform, and ideally, enterprises control compute resources. 

This allows the agent to complete tool calls without holding the keys that unlock it. 

Private network connectivity works similarly — a lightweight outbound-only gateway inside the organization’s network, with no credentials passing through the agent.

Orchestration teams get some control

For orchestration teams, the capabilities represent more than just a security update; they help agents run better. But the first thing they need to understand is how this split architecture can affect their deployment. 

Since sandboxes determine tool execution locations and the resources agents access, and MCP tunnels tell agents how to reach internal systems, these are separate concerns—splitting them up enables enterprises to map agents’ workflows more effectively.

For teams already on Claude Managed Agents, the practical starting point is sandboxes — move tool execution onto your own infrastructure and test the boundary before touching MCP tunnels, which are still in research preview. Teams evaluating the platform for the first time should treat the sandbox architecture as the primary technical differentiator: it’s the piece that changes the threat model, not just the deployment model.

LangSmith Engine closes the agent debugging loop automatically — but multi-model enterprises still need a neutral layer

Enterprises building and deploying agents have a problem: it’s taking their engineers too long to find out that an agent made a mistake, and the loop has continued to perpetuate, especially without a human at every step. 

LangSmith, the monitoring and evaluation platform from LangChain, launched a new capability in public beta that could make that issue more manageable. LangSmith Engine automates the entire chain by detecting production failures, diagnosing root causes against the live codebase, drafting a fix and preventing regression. It does this in a single automated pass. 

LangSmith Engine gives AI engineers a faster path to triage, but it launches into a crowded field: Anthropic, OpenAI and Google are all pulling observability and evaluation into their own platforms.

LangSmith Engine looks at failures

LangChain said in a blog post that the typical agent development cycle starts by tracing the agent to understand what it’s doing, followed by identifying gaps, making changes to the prompts and tools, and creating ground-truth datasets. Developers then run experiments and check for regressions before shipping the agent. 

The problem is that customers often run into issues when the trace review doesn’t surface faulty patterns, error repetition gets difficult to see, and there’s no targeted evaluator to catch the same problem when it repeats in production.

LangSmith Engine works by monitoring production traces for several signal types, “explicit errors, online evaluator failures, trace anomalies, negative user feedback and unusual behaviors like user asking questions the agent wasn’t built to answer,” according to the blog post.

Engine will then read the live codebase, find the culprit and draft a pull request before proposing a custom evaluator for that specific failure pattern. The human comes in at the approval step. 

It’s built on top of LangSmith’s existing tracing and evaluation infrastructure and also works with an enterprise’s evaluator results. 

Unlike observability tools such as Weights & Biases, Arize Phoenix and Honeyhive, LangSmith Engine takes the entire chain automatically — detecting the failure, diagnosing root cause, drafting a fix — and brings the human in only at the approval step.

Model providers bringing evaluators in platform

While LangSmith identified this evaluation loop as a need for many enterprises, Engine comes at a time where the larger providers are beginning to offer observability tools within their platform. This means enterprises may choose to use an end-to-end platform rather than add LangSmith Engine onto their existing workflows. 

Anthropic’s Claude Managed Agents brings together agentic deployment, evaluation and orchestration into a single suite. OpenAI’s Frontier offers a similar end-to-end platform for building, governing and evaluating enterprise agents — though both have faced questions from enterprises wary of committing to a single vendor.

However, practitioners point out that not everyone wants to bring evaluations and observability fully into one platform.

Leigh Coney, founder and principal consultant at Workwise Solutions, told VentureBeat that third-party observability is the default for many enterprises. 

“One fund I work with runs Claude for analysis and GPT for a separate workflow. If observability lives inside each provider’s tooling, you now have two systems that can’t talk to each other. Your compliance team can’t produce a unified audit trail,” he said. “So third-party observability is surviving because multi-model is already the default in enterprise, and somebody has to sit across providers.”

Jessica Arredondo Murphy, CEO and co-founder of True Fit, said independent platforms like LangSmith have to prove to enterprises that they can “answer the long-term question of whether they become the cross-model operating layer for quality and reliability.”

“Enterprises are not consolidating onto the first-party model provider tooling as quickly as the model providers would prefer. What I see is a pragmatic split: teams will use first-party tooling for fast onboarding and early-stage debugging, but as soon as they care about production reliability, governance, and long-term flexibility, they tend to introduce a more neutral layer for observability and evaluation,” she said. 

LangSmith Engine is available now in public beta. Teams can connect a tracing project, optionally connect their repo, and Engine will begin surfacing issues from production traces automatically.