
Fine-tuning forgets. RAG leaks context. Hypernetworks build the model your agent needs on demand.

Enterprise teams keep watching the same thing happen. An AI agent demos beautifully, goes to production, and stalls: it runs for a short stretch, then needs a human to top up its context and check its output, and the promised efficiency drains into supervision. The agent did the work; you did the watching. It’s one reason so many agent pilots never turn into production systems.

The pitch on the other side of that wall is the one every team wants to believe: an agent that runs a long job on its own, overnight if it has to, and leaves a person to validate only the last 10%. Whether that is achievable turns on a problem the orchestration conversation mostly skips. When AI firm Chroma tested 18 leading models, every one lost accuracy as its input grew, a property of how attention works, not a gap a stronger model closes. An agent fed more and more of your business as it runs does not get steadier. It gets shakier.

This is the layer beneath the orchestration race. Routing, durable execution and observability all assume each agent is already competent enough to coordinate in the first place. The deeper question is how long an agent can run before a human has to step in, and that comes down to where your company’s knowledge lives relative to the model. Both standard fixes leave a human in the loop.

Why teaching a model your business keeps you in the loop

Frontier models keep getting more capable, and the gap does not close, because it is not a capability problem. It is about where your knowledge sits relative to the model, and enterprises have had two ways to place it there.

The first is fine-tuning, which bakes knowledge into the weights. It remains subject to catastrophic forgetting, a problem identified in the 1980s and still unresolved in 2026: teaching a model something new tends to erode what it already knew. Teams work around it by isolating each task in its own fine-tuned model or adapter, which produces a sprawling estate of models that raises cost and governance overhead. And a fine-tuned model is a snapshot, stale the day a policy changes, when the expensive, slow retraining cycle starts over.

The second is in-context learning, which skips retraining by placing the relevant policies in the prompt at run time. This is where context rot bites: the accuracy loss on growing inputs that Chroma measured. Retrieval narrows what goes into the prompt, but an answer built on a retrieval miss reads just as confident as one built on the right passage, and both cost and latency climb with every token added.

The two failures rhyme. With fine-tuning, the model can be confidently working from last quarter’s policy. With in-context learning, it can be confidently working from a detail it lost in the middle of a long prompt. Either way the output looks equally assured, so you cannot tell which parts are wrong without checking all of them. That is why the human never gets to leave. Teams often run both at once, fine-tuning the stable knowledge and retrieving the rest. That softens each failure but removes neither: on any given output you still cannot be sure the model is both current and working from the right context, so you still check it.

A third path: generate the specialist model on demand

A third approach is moving from research into early product. Instead of retraining one model or stuffing its prompt, a generator builds a small, task-specific model on demand from your policies, at inference time. The generator is a hypernetwork: a network whose output is the weights of another network.
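To make "a network whose output is the weights of another network" concrete, here is a minimal sketch in plain Python. It assumes the generated weights take the form of a low-rank adapter (two small matrices A and B added to a frozen base weight as W + BA), and that the task arrives as a short embedding vector. The class name, the single linear layer and every dimension are illustrative assumptions, not the architecture of any system named in this article.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyHypernetwork:
    """Toy generator: maps a task embedding to low-rank adapter weights (A, B)."""

    def __init__(self, embed_dim, model_dim, rank, seed=0):
        rng = random.Random(seed)
        self.model_dim, self.rank = model_dim, rank
        n_out = 2 * rank * model_dim  # entries of A (rank x dim) plus B (dim x rank)
        # The generator's own parameters: one linear layer, for illustration only.
        self.proj = [[rng.gauss(0, 0.1) for _ in range(embed_dim)]
                     for _ in range(n_out)]

    def generate(self, task_embedding):
        """Produce adapter weights in one forward pass, with no training step."""
        flat = matvec(self.proj, task_embedding)
        d, r = self.model_dim, self.rank
        a = [flat[i * d:(i + 1) * d] for i in range(r)]            # r rows of length d
        offset = r * d
        b = [flat[offset + i * r:offset + (i + 1) * r] for i in range(d)]  # d rows of length r
        return a, b


def adapted_forward(base_weight, adapter, x):
    """Compute (W + B A) x without building the full weight update."""
    a, b = adapter
    base = matvec(base_weight, x)
    delta = matvec(b, matvec(a, x))
    return [u + v for u, v in zip(base, delta)]


# Hypothetical usage: one frozen base layer, two tasks, two generated adapters.
hyper = TinyHypernetwork(embed_dim=3, model_dim=4, rank=2)
base = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
audit_adapter = hyper.generate([1.0, 0.0, 0.0])   # stand-in embedding of an audit policy
risk_adapter = hyper.generate([0.0, 1.0, 0.0])    # stand-in embedding of a risk policy
x = [0.5, -1.0, 2.0, 0.0]
audit_out = adapted_forward(base, audit_adapter, x)
risk_out = adapted_forward(base, risk_adapter, x)
```

The base weights never change; only the small adapter differs per task, so regenerating after a policy change costs one pass through the generator rather than a retraining run.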

The idea was named in 2016; applying it to produce specialist language models from text or documents is recent and active. Sakana AI’s Text-to-LoRA, presented at ICML 2025, generates a model adapter from a plain-language description in a single pass, and a 2026 system called SHINE calls hypernetwork adaptation a promising new frontier, precisely because it sidesteps both the retraining cost of fine-tuning and the context limits of prompting.

The point of generating adapters rather than training and storing them is to collapse a sprawling library of per-task LoRAs into one network that can produce them on demand, including for tasks it has not seen.

The elegant part is how this closes the loop on the problem above: the per-task adapter teams hand-build to dodge catastrophic forgetting is the same object a hypernetwork produces automatically. The model zoo stops being a governance headache and becomes a generated output.

The case for going small underneath all this was put most directly in a 2025 paper by Nvidia researchers: for the narrow, repetitive tasks that fill agent workflows, small models are capable enough and 10 to 30 times cheaper to run than frontier generalists. Nace.AI, a Palo Alto company that raised a $21.5 million seed round in May, is the clearest commercial instance. Its core technology, a generator it calls a MetaModel, produces parameter adaptations for a model at inference time from a company’s policies, pointed at regulated work: audit, compliance, risk assessment. The company says its agents handle the bulk of a workflow while human experts validate the result, a split it markets as 90/10.

How the three approaches compare

	Fine-tuning	In-context / RAG	Hypernetwork-generated model
Where business knowledge lives	In the model’s weights	In the prompt, re-supplied each run	In on-demand generated weights
Cost to update on a policy change	High: retrain	Low: edit the source	Low: regenerate
Staleness	High: a snapshot	Low	Low: regenerated from current policy
Per-call cost and latency	Low	High, grows with context	Low at run time
Dominant failure mode	Forgetting; model-zoo sprawl	Context rot; silent retrieval misses	Generator quality; calibration
Who owns the improving asset	Whoever trains the model	Whoever holds the data store	Depends where generator and feedback live

Why a hypernetwork-built model raises the autonomy ceiling

A model that is narrow, current and small has a smaller surface on which to be wrong. Fewer errors, confined to a known domain, mean fewer outputs an agent has to escalate to a person, which is the real basis for any high-autonomy claim. It is also where a number like 90/10 comes from: not a dial set in advance, but an outcome of how little the system needs to hand back. Reported autonomy shares are best read as measurements of an architecture, not as settings.

Two design choices decide whether that autonomy is trustworthy or merely fast. The first is grounding: tying every output to its source so a reviewer can verify rather than redo. Research models built for exactly this, such as HalluGuard, label each claim as supported or not and cite the passage they relied on. Nace ships its agents with grounding models and reasoning traces for the same reason. A 10% review only means something if the human can confirm provenance in seconds.
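The link between grounding and a measured autonomy share can be sketched in a few lines. The sketch assumes each output is a list of claims, that some grounding model has already labeled each claim as supported or not, and that an output goes to a human whenever any claim is unsupported or uncited. The `Claim` fields and the escalation rule are hypothetical; they are not taken from HalluGuard or from Nace's product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    supported: bool                  # label assumed to come from a grounding model
    source_passage: Optional[str]    # the passage the claim cites, if any


def needs_review(claims: List[Claim]) -> bool:
    """Escalate an output if any claim is unsupported or has no cited passage."""
    return any(not c.supported or c.source_passage is None for c in claims)


def autonomy_share(outputs: List[List[Claim]]) -> float:
    """Fraction of outputs the system did not have to hand back to a person."""
    if not outputs:
        return 0.0
    escalated = sum(1 for claims in outputs if needs_review(claims))
    return 1 - escalated / len(outputs)


# Hypothetical batch: three fully grounded outputs and one with an uncited claim.
grounded = [Claim("Invoice matches PO", True, "Policy 4.2, paragraph 1")]
uncited = [Claim("Vendor is pre-approved", True, None)]
batch = [grounded, grounded, grounded, uncited]
share = autonomy_share(batch)
```

Here the share is computed after the fact from what was escalated, which is the sense in which a figure like 90/10 is a measurement rather than a setting.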

The second is the feedback loop, and it forces a question every buyer should ask: when your experts validate the output, whose model improves, and where does it live? That decides whether the compounding asset belongs to the vendor or to you. Arrangements differ. Nace, for instance, uses an external network of certified experts for some engagements and, for direct enterprise deployments, the customer’s own staff, with the resulting model kept inside the customer’s cloud. Each choice routes the learning, and the ownership, somewhere different.

Where the third path breaks

The approach is still early, and a few questions will decide how far it goes. Calibration is the linchpin: the value rests on the model knowing when it is unsure. And it is genuinely unsettled: recent work on generating these adapters found they do not automatically improve calibration over ordinary fine-tuning, with gains appearing only under specific constraints.

The quality of the generated model also depends heavily on the policy data it is built from, which puts a premium on data curation. And scale is the open research frontier: the hypernetworks shown in published work so far have been small. This is where Nace’s own work gets interesting: in our interview, the company said it has scaled its generator well beyond those published sizes and derived a scaling law for how performance grows, results it has begun to share publicly and is now putting through peer review. If it holds up, it would help answer one of the central open questions in the field, and it is the paper worth watching.

Whichever approach wins, the work still ends at a human, and that handoff is its own design problem. When Deloitte Australia delivered a roughly A$440,000 government report, it shipped with fabricated citations and an invented court quote after passing senior review, because the reviewers checked the conclusions, which were sound, and not the provenance, which was not. Controlled research suggests the pattern is general: experts corrected an identical flawed recommendation less often when it was labeled AI-generated.

The EU AI Act’s Article 14 now names this automation bias. The lesson is not about any one vendor: a high autonomy share concentrates human attention into a thin, late slice of the work, so the value of that review depends entirely on whether the human can check provenance fast, which loops back to grounding.

What to build, and what to ask before you buy

The honest takeaway: what holds your agents back is usually not orchestration or model size, but whether the model knows your business well enough to be left alone, and the right fix depends on the job. To automate a long, repetitive, high-volume process end to end (say, running most of your internal audit overnight and having your own experts check the final slice), a hypernetwork-generated model is the approach most likely to do it cheaply and run long enough to matter. For a short task that finishes in a few steps and never needed to run unattended, the gap between this and a well-prompted frontier model shrinks to almost nothing, and the switch is not worth the integration cost.

When a vendor pitches autonomous or specialist agents, four questions cut through it.

Where does the business knowledge live: in the weights, the prompt, or generated on demand?
What does each output come with, so a reviewer can verify it instead of redoing it?
What decides which work gets escalated to a human?
And whose model improves from that feedback, and where does it run?

The answers, not the headline ratio, tell you what you are buying.

The hypernetwork approach is the most credible attempt yet at making a small model know a specific business without forgetting it and without re-explaining it on every run. It is also the least proven, and the parts that matter most, calibration and scale, are still in peer review. For the right job, pilot it now. For the wrong one, the integration cost buys you little that a well-prompted frontier model wouldn’t.

Orchestration

New AI optimization framework beats Claude Code and Codex by 2.5x on the same compute budget

Imagine your engineering team just deployed an AI agent to search through internal company documents and answer employee questions. It works perfectly in development, but in production, it consistently hallucinates or misses key constraints. Fixing this is rarely a simple patch. It requires a tedious, trial-and-error process of tweaking chunking strategies, retrieval methods, and system prompts simultaneously. Because these adjustments are entangled, it becomes nearly impossible to attribute which specific tweak actually solved the problem.

To address this challenge, researchers at Renmin University of China and Microsoft Research introduced Arbor, a framework that upgrades AI-driven research and optimization from a sequence of trial-and-error guesses into a cumulative learning process. Arbor organizes hypotheses, experiments, and insights into a tree that helps the system learn from prior failures to make smarter, verified improvements over time.

In practical tests, Arbor delivered more than 2.5 times the verifiable performance gains of standard AI coding agents across real-world engineering tasks while operating under the same resource budget.

For enterprise AI, this technique directly translates to automating the continuous improvement of complex, real-world engineering systems.

Understanding the bottleneck in autonomous optimization

As large language models and AI systems become more capable, they are expected to carry out more complex operations, including autonomous optimization (AO) of software systems such as agent harnesses or model training algorithms.

AO captures the fundamental loop of autonomous research. An AI agent starts with an initial mutable artifact, such as a machine learning codebase or data pipeline, and a specific objective. The agent’s goal is to iteratively improve this artifact through experimental feedback without step-by-step human supervision.

The main challenge of AO is often misunderstood. Many engineering teams find that simply giving a coding agent more time or compute to optimize a codebase doesn’t lead to better results. “Automation can keep an AI working for a very long time — but a loop is not the same as progress,” Jiajie Jin, co-author of the paper, told VentureBeat. “If the goal is vague, or the metric is easy to hack, long-running automation often just produces ‘improvements’ faster that nobody actually wants.”

Jin explains that complex tasks take many attempts to get right, and standard agent architectures are missing the critical data structure to maintain state. “How do you make sure the insight and experience from each attempt actually accumulate, instead of getting lost in a scrollback buffer?” he said. Without this structure, agents simply repeat the same mistakes.

Current agent systems can run experiments for many hours against well-specified goals: editing code, invoking tools, running tests autonomously. But they treat each attempt in isolation, missing the structural mechanisms that would let them accumulate and act on what they’ve learned.

They lack the capacity to simultaneously maintain and compare multiple competing research directions. Without this, they cannot interpret both successes and failures to reshape their future exploration, which is the core mechanism that makes human research cumulative.

General coding agents typically rely on conversation transcripts for their memory. Because AO tasks span hundreds of turns and easily exceed context window limits, these agents struggle to preserve and reuse factual evidence over long histories. As a result, they lose the overarching structure of the research process and are prone to stalling on early failures or chasing noisy evaluation swings. The system needs a structured, durable memory that records what directions have been tried, what factual evidence was produced, and how each result changes the space of future hypotheses.

Existing frameworks are also prone to reward hacking and overfitting to development metrics. This makes them create the illusion of progress without producing improvements that transfer to real-world performance.

Finally, general-purpose coding agents typically chain their tool calls on a single shared working tree. This architectural limitation prevents them from testing parallel hypotheses in isolated environments without corrupting the main codebase or obscuring which hypothesis caused a specific outcome.

The Arbor framework

Arbor solves the challenges of AO with a framework that automates the long-horizon loop of exploration, experimentation, and abstraction that characterizes human research. Arbor separates the strategic direction of research from the ground-level coding tasks with two key components:

The coordinator: A long-lived AI agent that acts like a principal investigator. It never directly edits the target codebase. Instead, it owns the general state of the optimization research, observes accumulated evidence, comes up with new hypotheses and directions to explore, and decides what to do with the results of experiments.

Executors: Short-lived, highly focused AI agents. When the coordinator wants to test an idea, it spins up an executor and places it in an isolated environment, essentially a fresh git worktree. Each executor is handed one hypothesis. It implements the assigned idea, runs evaluations, debugs errors, and reports back to the coordinator with the results and created artifacts.

These two components collaborate through a mechanism that the researchers call “Hypothesis Tree Refinement” (HTR). HTR represents the entire research process as a persistent, branching tree where every node binds together four things: a hypothesis, the executable artifact, the factual evidence produced, and a distilled insight. This means the coordinator can explore multiple competing directions at the same time without losing its place.

The coordinator builds the tree by placing broad ideas near the root, while concrete refinements branch out as leaves. This allows Arbor to safely explore multiple competing hypotheses simultaneously. If an executor’s experiment fails, the tree records why it failed as a negative constraint, ensuring the system doesn’t endlessly repeat the same mistake.
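A rough sketch of the data structure described here: each node binds a hypothesis, an artifact, evidence and an insight, and failed nodes are kept so their insights can be read back as negative constraints. The field names, the use of a branch name as the artifact, and the example scores are all assumptions made for illustration; the article does not give Arbor's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HypothesisNode:
    hypothesis: str
    artifact: Optional[str] = None     # assumed here to be a git branch name
    evidence: Optional[dict] = None    # factual results from the executor
    insight: Optional[str] = None      # distilled lesson from the attempt
    failed: bool = False
    children: List["HypothesisNode"] = field(default_factory=list)

    def refine(self, hypothesis: str) -> "HypothesisNode":
        """Add a more concrete hypothesis beneath this one."""
        child = HypothesisNode(hypothesis)
        self.children.append(child)
        return child

    def record(self, artifact, evidence, insight, failed=False):
        """Store what an executor reported for this node."""
        self.artifact, self.evidence = artifact, evidence
        self.insight, self.failed = insight, failed


def negative_constraints(node: HypothesisNode) -> List[str]:
    """Collect the insights of every failed node in the subtree."""
    found = [node.insight] if node.failed and node.insight else []
    for child in node.children:
        found.extend(negative_constraints(child))
    return found


# Hypothetical tree for a RAG pipeline; the score delta is a placeholder.
root = HypothesisNode("Improve answer accuracy of the RAG pipeline")
chunking = root.refine("Change the chunking strategy")
retrieval = root.refine("Change the retrieval method")
bfs = retrieval.refine("Use breadth-first search over linked documents")
bfs.record(
    artifact="hyp/bfs",
    evidence={"dev_delta": -1.0},
    insight="Breadth-first search hurt accuracy; do not retry it unchanged",
    failed=True,
)
avoid = negative_constraints(root)
```

Broad ideas sit near the root and concrete refinements at the leaves, so a coordinator can consult `avoid` before proposing the next branch.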

To understand why Arbor’s isolation matters, consider a common enterprise scenario: optimizing a Retrieval-Augmented Generation (RAG) pipeline for an internal AI assistant. “When you ask a single agent like Claude Code or Codex to ‘improve accuracy,’ it will typically change a bunch of things in one pass — chunking, the prompt, the retrieval method,” Jin said. This entangles the changes, making it impossible to attribute which one actually helped. It also directly mutates the repository without isolation.

Arbor solves this by treating each lever as a separate hypothesis. Chunking becomes one branch, retrieval another, and the prompt another, each implemented and evaluated in its own isolated git worktree. “So you get clean attribution: ‘constraint decomposition on the retrieval side gave +X; breadth-first search actually hurt,’” Jin said.
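One way to picture the isolation is as one `git worktree add` per hypothesis, each creating a new branch in its own directory. The helper below only builds the command lists and does not run them; the branch prefix, directory layout and function name are assumptions for this sketch, not Arbor's own tooling.

```python
def worktree_commands(hypotheses, base_ref="HEAD", root="../worktrees"):
    """Build one `git worktree add` command per hypothesis (nothing is executed)."""
    commands = []
    for index, name in enumerate(hypotheses):
        branch = f"hyp/{index}-{name}"     # hypothetical branch naming scheme
        path = f"{root}/{index}-{name}"    # hypothetical directory layout
        # git worktree add -b <new-branch> <path> <commit-ish>
        commands.append(["git", "worktree", "add", "-b", branch, path, base_ref])
    return commands


commands = worktree_commands(["chunking", "retrieval", "prompt"])
```

Each list can be passed to `subprocess.run(cmd, check=True)` inside a repository. Because every hypothesis gets its own branch and directory, a score change can be traced to exactly one lever.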

When an executor returns a report, the coordinator writes the evidence to the tree and backpropagates the insight upward to parent nodes. This means a local observation becomes a generalized constraint that shapes the coordinator’s future idea generation.
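The upward propagation can be sketched with parent pointers. This version simply copies an insight to every ancestor, so that a later branch elsewhere in the tree inherits it; how Arbor actually generalizes an insight on the way up is not described in this article, so treat the verbatim copy as a simplifying assumption.

```python
class Node:
    def __init__(self, hypothesis, parent=None):
        self.hypothesis = hypothesis
        self.parent = parent
        self.constraints = []   # insights recorded at or below this node


def report(node, insight):
    """Attach an insight to the node and to every ancestor up to the root."""
    current = node
    while current is not None:
        current.constraints.append(insight)
        current = current.parent


def constraints_for(node):
    """Constraints visible when proposing new work under `node` (own plus inherited)."""
    seen, current = [], node
    while current is not None:
        for constraint in current.constraints:
            if constraint not in seen:
                seen.append(constraint)
        current = current.parent
    return seen


# Hypothetical run: a failure deep in the retrieval branch reaches the prompt branch.
root = Node("Improve RAG accuracy")
retrieval = Node("Change the retrieval method", parent=root)
bfs = Node("Use breadth-first search", parent=retrieval)
report(bfs, "Breadth-first search hurt accuracy")
prompt = Node("Rewrite the system prompt", parent=root)
visible = constraints_for(prompt)
```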

To prevent reward hacking or overfitting to the development data, HTR enforces a strict “merge gate.” Even if an executor reports a fantastic development score, the coordinator will spin up an isolated worktree to test the candidate against a held-out test evaluator. The artifact is only merged into the current best trunk if it demonstrably improves the test score, verifying that the progress is real.
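The merge gate reduces to a simple rule: ignore the development score and promote a candidate only if it beats the current trunk on the held-out evaluator. A minimal sketch, with a dictionary lookup standing in for the held-out evaluation and every score invented for illustration:

```python
def merge_gate(trunk, candidate, evaluate_on_test):
    """Return the new trunk: the candidate if it improves the held-out score, else the old trunk."""
    test_score = evaluate_on_test(candidate["artifact"])
    if test_score > trunk["test_score"]:
        return {"artifact": candidate["artifact"], "test_score": test_score}
    return trunk


# Hypothetical held-out scores; in practice this would run a separate test evaluator.
held_out = {"baseline": 60.0, "hyp/a": 58.0, "hyp/b": 64.0}
trunk = {"artifact": "baseline", "test_score": held_out["baseline"]}

candidates = [
    {"artifact": "hyp/a", "dev_score": 75.0},   # best on dev, worse on held-out
    {"artifact": "hyp/b", "dev_score": 70.0},   # lower on dev, better on held-out
]
for candidate in candidates:
    trunk = merge_gate(trunk, candidate, lambda artifact: held_out[artifact])
```

In this toy run the candidate with the higher development score is rejected and the one that improves the held-out score is merged.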

Arbor generally falls under the concept of “loop engineering,” popularized by industry figures like OpenClaw creator Peter Steinberger and Claude Code lead Boris Cherny. The idea is to move beyond single prompts to design iterative cycles (observe, reason, act, verify) that drive autonomous agents. However, as Jin points out, “A loop can fill up with messy, untraceable attempts, and you end up with nothing to show and no way to reconstruct what changed.”

Arbor in action

The researchers evaluated Arbor on an autonomous optimization task suite built from real-world research settings and the MLE-Bench Lite machine learning engineering benchmark. The AO suite featured tasks from different areas of AI development, including model training, harness engineering, and data synthesis.

The researchers used different backbone models for the coordinator and executor agents, including Claude Opus 4.6, GPT-5.5, and Gemini-3-Flash. They tested Arbor against the strongest coding agents, Codex and Claude Code. Arbor and the baselines were given the same resources. For the MLE-Bench Lite tasks, Arbor was also compared against top-tier agentic research systems like AI-Scientist, ML-Master, and AIDE.

Arbor consistently outperformed the baselines. It achieved the best held-out test result on all tasks, attaining more than 2.5 times the average relative gain of Codex and Claude Code. On the BrowseComp task, which involves optimizing a search agent, Arbor improved the system’s held-out accuracy from a baseline of 45.33% to 67.67%. Meanwhile, Codex and Claude Code stalled at 50% and 53.33%, respectively. On MLE-Bench Lite, when equipped with GPT-5.5, Arbor achieved the strongest result among all benchmarked systems.

Arbor also proved resilient against overfitting. For example, in the Terminal-Bench 2.0 experiments, Claude Code achieved a high development score of 75, but its score dropped to 71 on the held-out data. Arbor had a lower development score of 72.22 but achieved the highest held-out score of 77.36, a sign that its gains transfer beyond the development set.
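The figures reported above can be turned into gains directly. The short calculation below uses only the numbers quoted in this article; the per-task differences it produces are not the same quantity as the paper's "2.5 times average relative gain," which is averaged over the whole task suite.

```python
# BrowseComp held-out accuracy (percent), as reported above.
browsecomp = {"baseline": 45.33, "Arbor": 67.67, "Codex": 50.0, "Claude Code": 53.33}
gains = {
    name: round(score - browsecomp["baseline"], 2)
    for name, score in browsecomp.items()
    if name != "baseline"
}

# Terminal-Bench 2.0: (development score, held-out score), as reported above.
terminal_bench = {"Claude Code": (75.0, 71.0), "Arbor": (72.22, 77.36)}
dev_to_held_out = {
    name: round(held_out - dev, 2)
    for name, (dev, held_out) in terminal_bench.items()
}

for name, gain in gains.items():
    print(f"BrowseComp gain over baseline, {name}: {gain:+.2f} points")
for name, shift in dev_to_held_out.items():
    print(f"Terminal-Bench dev to held-out, {name}: {shift:+.2f} points")
```

On BrowseComp, Arbor's gain over the baseline is 22.34 points against 4.67 for Codex and 8.00 for Claude Code. On Terminal-Bench 2.0, Claude Code loses 4 points moving from development to held-out data, while Arbor gains 5.14.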

Arbor also showed generalization in a cross-task transfer experiment. After Arbor finished optimizing the search harness for the BrowseComp task, researchers took the optimized codebase and tested it on two unrelated search-agent tasks, HLE and DeepSearchQA. Arbor’s optimized codebase significantly improved performance on those unseen tasks as well.

Deploying Arbor: Sweet spots and hidden costs

For engineering leads looking to drop Arbor into their existing tech stack, the framework is designed to sit on top of existing Git workflows rather than replacing them. “Its output is an ordinary git branch that your existing code review, CI, and human review can inspect directly,” Jin said. Only verified gains are merged into a per-run trunk, leaving the main repository untouched until a developer manually chooses to promote the code.

However, deploying Arbor comes with specific tradeoffs. Jin points out that the biggest catch is token cost, as maintaining a long-lived coordinator that continuously manages the tree and dispatches executors is the dominant expense. Running multiple isolated worktrees concurrently also requires genuine compute and disk resources to process real experiments.

So where is Arbor’s sweet spot? According to Jin, it excels at tasks with a clear, trustworthy metric, tolerance for a long time horizon, and a real search space with several plausible directions, such as pipeline optimization, data-synthesis quality, and model-training recipe tuning.

Conversely, teams should explicitly avoid using Arbor for real-time latency tasks, obvious one-line fixes, or when the underlying evaluation metric is flawed. The quality ceiling of the entire run is strictly bounded by the quality of the evaluator. “If the metric isn’t trustworthy, Arbor will just optimize toward an untrustworthy result faster,” Jin said.

Jin sees the next evolution going beyond single scalar metrics. “A natural evolution is to have each node’s artifact carry a vector — accuracy, latency, cost — instead of a single score,” Jin said. “Going from a single scalar to a multi-objective Pareto search is a very natural extension of the framework.”
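What a multi-objective version might look like can be sketched with a standard Pareto filter: a node stays on the front unless some other node is at least as good on every objective and strictly better on one. The three objectives follow Jin's example; the direction of each (higher accuracy, lower latency, lower cost) and all the values are assumptions for this illustration, and nothing here is part of the published framework.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and better on at least one."""
    no_worse = (a["accuracy"] >= b["accuracy"]
                and a["latency"] <= b["latency"]
                and a["cost"] <= b["cost"])
    better = (a["accuracy"] > b["accuracy"]
              or a["latency"] < b["latency"]
              or a["cost"] < b["cost"])
    return no_worse and better


def pareto_front(nodes):
    """Keep every node that no other node dominates."""
    return [n for n in nodes
            if not any(dominates(m, n) for m in nodes if m is not n)]


# Hypothetical node artifacts with made-up measurements.
nodes = [
    {"name": "hyp/a", "accuracy": 0.70, "latency": 1.2, "cost": 3.0},
    {"name": "hyp/b", "accuracy": 0.68, "latency": 0.8, "cost": 2.0},
    {"name": "hyp/c", "accuracy": 0.66, "latency": 1.5, "cost": 3.5},
]
front = [n["name"] for n in pareto_front(nodes)]
```

With a vector instead of a scalar, there is no single "best trunk": `hyp/a` is more accurate, `hyp/b` is faster and cheaper, and only `hyp/c` can be discarded outright.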

Orchestration

Adobe embeds agentic AI workflows across Creative Cloud, shifting from media generation to production orchestration

Adobe has announced a major expansion of its “creative agent” across its flagship Creative Cloud suite and upgraded Firefly AI studio.

Available in public beta starting today across Premiere Pro, Photoshop, Illustrator, InDesign, and Frame.io, the agent is designed to serve everyone from individual creators to enterprise marketing teams.

Unlike first-generation generative AI tools that simply output flat media from a chat interface, Adobe’s embedded assistant acts as an orchestration layer.

It interprets natural language prompts and directly accesses the underlying software’s APIs to execute complex, multi-step production workflows—from batch-renaming video sequences to dynamically updating brand assets across print layouts—while leaving the final aesthetic decisions entirely in the hands of the human designer.

Technology: Contextual Memory and DOM Manipulation

At the core of this release is a significant technical upgrade to how Adobe’s AI handles persistent memory and context window management. In its upgraded Firefly creative AI studio—currently in private beta—Adobe has introduced two foundational architectural components: “Elements” and “Projects”.

Elements functions as a visual variables library, allowing users to save and reuse specific characters, locations, and objects across multiple generations to ensure strict visual consistency as campaigns scale.
Projects acts as the contextual memory layer, storing assets, generations, and session history in a unified space so users can pick up where they left off without rebuilding their prompt context.

Beyond pixel generation, the system’s most critical technological leap is its ability to operate seamlessly within the complex document structures of desktop applications. “Our Adobe Creative Agent can leverage the decades of powerful features, workflows, APIs that we’ve brought into our application and exposed through tooling that can now be invoked through a creative agent,” an Adobe representative explained.

Product: Automating the Tedious, Expanding the Canvas

The practical application of this technology fundamentally alters standard production workflows. Adobe is positioning the human user as a “creative director” capable of delegating repetitive, labor-intensive tasks to the AI. The rollout introduces highly specific specialist agents tailored to the logic of each application:

Premiere Pro: The agent handles tedious project setup, analyzing and sorting source media into bins, batch renaming clips, identifying interview questions, and assembling a rough working starting point.
Illustrator: The assistant automates mathematical and multi-step design tasks, such as generating 50 versioned files from a spreadsheet or running pre-flight checks to flag color mode errors before printing. It can even programmatically duplicate a vector shape 100 times, randomize its position, and change its size based on its z-depth and transparency.
Photoshop & InDesign: The agent executes batch background removals, dynamic layer organization, and applies brand updates across multi-page layouts.

Furthermore, Adobe is actively integrating its creative agent into major third-party enterprise platforms, including OpenAI’s ChatGPT, Anthropic’s Claude, Microsoft 365 Copilot, and soon, Google Gemini and Slack.

Licensing: Commercial SaaS and Enterprise Implications

Unlike open-source orchestration frameworks or models released under MIT or Apache licenses, Adobe’s creative agent operates strictly within a proprietary, commercial SaaS ecosystem. For enterprise decision-makers, this carries specific implications. Because the agent relies on Adobe’s proprietary APIs to manipulate project files, it requires an active Creative Cloud commercial license. Additionally, with the “Adobe for creativity connector” coming to platforms like Slack and Microsoft Copilot, enterprise IT and systems architects must consider how internal chat tools will interface with Adobe’s cloud processing environments to support enterprise creative and marketing teams securely.

The Enterprise Unknowns: APIs, Governance, and Architecture

While Adobe’s announcements highlight a powerful user interface and deep integration within its own flagship applications, several critical questions remain for enterprise technical decision-makers tasked with building bespoke AI systems. VentureBeat has reached out to Adobe for clarification on these infrastructure-level details and will update this coverage as we learn more.

For AI system architects, the value of a creative agent lies not just in a native application UI, but in its extensibility. It remains unclear if Adobe plans to expose these new agentic capabilities via API, or if the company will support the Model Context Protocol (MCP). Without MCP support or direct API access, enterprise teams will face friction integrating Adobe’s tools into their own custom task-routing frameworks and internal LLM pipelines.

Adobe’s new “Elements” feature promises to solve the generative AI consistency problem by anchoring characters and objects across generations.

However, the backend architecture driving this persistent memory is not yet detailed. Whether Adobe is leveraging on-the-fly Low-Rank Adaptation (LoRA) based on user uploads or utilizing a form of visual Retrieval-Augmented Generation (RAG) is a critical distinction for technology leaders managing compute costs, model evaluations, and enterprise-grade inference pipelines.

As organizations build out “Projects” and define brand-specific “Elements”, security and data decision-makers require strict guarantees regarding data provenance and storage. It is currently unknown exactly where this contextual workflow and vector data lives—specifically, whether it remains strictly sandboxed within the customer’s enterprise Creative Cloud instance on Adobe servers, and how role-based permissions apply to these new agentic workflows.

Finally, as lightning-fast, developer-first, multi-model AI creative platforms like fal.ai gain significant traction among enterprises and developers, Adobe’s position in the broader developer ecosystem remains a point of interest.

Whether Adobe views these infrastructure-level API providers as direct competitors to its Firefly AI studio or as potential integration points for bespoke enterprise environments has yet to be seen.

Community Reactions: The Tension Between Automation and Craft

The integration of agentic AI touches on the tension between eliminating drudgery and surrendering creative control. According to Adobe’s recent Creators’ Toolkit Report, which surveyed over 16,000 creators globally, the market is highly receptive to AI as an operational assistant rather than an autonomous creator.

75 percent of surveyed creators describe creative AI as integrated or essential to their current workflows.
85 percent emphasized that the final creative decision must always remain in human hands.

This sentiment is central to Adobe’s messaging. By focusing the agent’s capabilities on file organization, layer management, and brand compliance, Adobe aims to automate what a spokesperson called the “tedious parts of their workflow”. The goal, according to Adobe executive David Wadhwani, is to let creatives focus on the craft so they can “apply their taste and make the calls that only they can”.

Orchestration

Stanford’s DeLM cuts multi-agent task costs 50% — without a central orchestrator

One of the assumptions behind today’s AI frameworks is that agents require a “boss” at the center; this orchestrator runs the show, routes requests, and makes sure the whole system doesn’t descend into chaos.

That assumption may be wrong, and the cost of carrying it could be measured in inference dollars and coordination latency. A new Stanford framework called a decentralized language model, or DeLM, is built on the premise that agents can coordinate directly, without routing every update through a central controller.

DeLM’s shared knowledge base serves as a “common communication substrate” so that agents can build upon one another’s verified progress without having to route every interaction through a main agent to “merge, filter, and rebroadcast,” Yuzhen Mao and Azalia Mirhoseini, co-developers of the framework, explain in a research paper.

It’s a system that’s not only possible, but desirable in certain instances. “Agents can build on prior findings, avoid repeated failures, preserve constraints, and recover detailed evidence only when needed,” the researchers write.

The challenges of traditional multi-agent systems

In a typical centralized multi-agent system, a main agent breaks tasks into subtasks, assigns them to multiple sub-agents in parallel, waits for responses, merges and summarizes intermediate progress, then launches the next wave of subtasks based on the collected context.

While this is a natural way to scale large language model (LLM) reasoning, the Stanford researchers argue that it scales poorly. Every useful finding, partial finding, and failure must be reported back to the main agent, which then determines what information to merge and rebroadcast to the agents below it.

“As the number of subtasks grows, this controller becomes a communication and integration bottleneck,” Mao and Mirhoseini write. Further, the main orchestrator may “dilute, omit, or distort” useful information, leading to lost progress.

This bottleneck also occurs in long-context reasoning scenarios. Once it receives reports back from subagents, a main agent will typically group related concepts, data points, and other materials together in an unsupervised learning loop. It may then pre-assign these ‘evidence clusters’ to sub-agents before knowing what surfaced material is actually relevant or whether it’s combined correctly.

When a subagent receives this insufficient context, it will essentially get confused and return to the main agent, kicking off another retrieval or delegation round. “This back-and-forth makes coordination slower, more iterative, and increasingly constrained by a single overloaded main agent,” the researchers write.

What DeLM addresses and how it works

DeLM, by contrast, is built around parallel agents, a shared context, and a task queue.

Shared context is essentially a curated store of “gists,” or information summaries that other agents might find useful. These include verified and evidence-based findings alongside partial findings and documented failures; they also point to detailed evidence that agents can pull from based on their specific task.

A task queue is then a set of subsequent pending subtasks that agents can claim independently.

“Agents write compact, verified updates into a shared context that later agents can read directly,” the researchers write. Useful findings, failures, and constraints accumulate as a “shared problem state,” rather than passing through a central controller.

The pipeline looks like this:

Initialization: Inputs are broken into different work units and added to a queue.
Parallel execution: Agents work independently and in tandem, pulling tasks and reading shared context as they progress.
Compression and verification: Results are compressed into reusable “gists” that are checked against supporting evidence. Only gists that are fully verified are shared with the group.
Additional work (if needed): When the queue is emptied, the last agent to return an answer inspects all the shared context to determine whether further work is required.
Final step: The last agent determines that no more steps are required and returns the final answer.
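The steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the Stanford implementation: `SharedContext`, `draft_gist`, `verified`, and the toy work units are hypothetical names, the "agent" is a stand-in for an LLM call, and verification is reduced to a substring check. The follow-up step, in which the last agent decides whether more work is needed, is collapsed into a final read of the shared state.

```python
import queue
import threading


class SharedContext:
    """Thread-safe store of verified gists that any worker can read."""

    def __init__(self):
        self._gists = []
        self._lock = threading.Lock()

    def publish(self, gist):
        with self._lock:
            if gist not in self._gists:      # drop duplicate findings
                self._gists.append(gist)

    def snapshot(self):
        with self._lock:
            return list(self._gists)


def draft_gist(unit, known):
    # Stand-in for an LLM call (assumption): a real agent would read
    # `known` and the unit's evidence, then write its own summary.
    return unit["draft"]


def verified(gist, evidence):
    # Toy check (assumption): the gist must appear verbatim in the evidence.
    return gist in evidence


def worker(tasks, ctx):
    while True:
        try:
            unit = tasks.get_nowait()            # claim a pending work unit
        except queue.Empty:
            return                               # queue drained, worker exits
        gist = draft_gist(unit, ctx.snapshot())  # read shared progress directly
        if verified(gist, unit["evidence"]):
            ctx.publish(gist)                    # only verified gists are shared


def run(units, n_workers=3):
    tasks = queue.Queue()
    for unit in units:                           # initialization
        tasks.put(unit)
    ctx = SharedContext()
    threads = [threading.Thread(target=worker, args=(tasks, ctx))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(ctx.snapshot())                # final read of shared state


units = [
    {"evidence": "test_auth fails on empty token", "draft": "test_auth fails"},
    {"evidence": "cache is cleared on restart", "draft": "cache persists"},
    {"evidence": "retry limit is 3 in config", "draft": "retry limit is 3"},
]
shared = run(units)
```

In this toy run the unsupported claim ("cache persists") never reaches the shared context, so no other worker can build on it. No worker reports to a controller; the queue and the shared store are the only coordination points.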

Agents “exchange progress through shared state, asynchronously claim ready tasks, and scale more adaptively as the number of subtasks grows,” the researchers explain.

How DeLM performs in the wild

With DeLM, agents can avoid redundant exploration; reuse and build on each other’s discoveries and failures; and focus on unresolved issues.

The framework can be particularly useful in software engineering test-time scaling, in which models are given time to “think” to improve their reasoning and problem-solving capabilities. Different agents can explore their own hypotheses or pursue reasoning paths in parallel, while still sharing intermediate progress. One example is concurrent debugging.

DeLM is also suitable for long-context reasoning and multi-document question-answering; agents can examine their own evidence clusters (collections of papers, code, or other materials) in parallel while maintaining a “global compact view” of accumulated evidence.

The researchers contend that it makes agentic tasks more accurate and significantly cheaper. This is backed by its performance on real-world benchmarks: On SWE-bench Verified — which evaluates how well AI models and agents solve real-world software engineering problems — it performed 10.5% better than the strongest baseline and reduced cost per task by roughly 50%.

But it can go beyond coding: On LongBench‑v2 Multi‑Doc QA — which assesses LLMs’ ability to handle long-context, real-world problems — DeLM had the highest accuracy across four model families, including GPT‑5.4, Claude Sonnet, Gemini Flash, and DeepSeek‑V4‑Pro.

DeLM outperforms the baselines on SWE-bench for a number of reasons, as Mao detailed on X.

First, agents share failures. In ordinary parallel runs, when one agent follows the wrong path, that failure stays private, and subsequent agents may waste time (and money) pursuing the same dead end. But with DeLM, failed hypotheses are written into shared context.

“Later agents can read them as constraints, avoid repeated exploration, and redirect their search toward more promising fixes,” Mao said.

Additionally, constraints, once verified, are immediately added to agents’ shared context. This means they become a binding shared state. “Later agents inherit them, build around them, and avoid repeating globally invalid simplifications,” Mao said.

Crucially, DeLM keeps shared progress compact enough to reuse. It is unfoldable, meaning agents see short gists by default, but can choose to unfold them into more detailed summaries and raw evidence.

As the researchers note, providing all raw documents and traces gives agents the maximum amount of information, but that can overwhelm their context windows and ultimately increase costs.

“If agents shared full traces, each worker would need to read long command histories, file dumps, failed edits, and intermediate reasoning, turning coordination itself into another long-context bottleneck,” Mao said.

On the other hand, while sharing compact summaries is cheaper, important details and evidence can be lost, resulting in less reliable reasoning.

Unfolding, therefore, provides “coarse-to-fine” opt-in access. This can improve accuracy and cost.
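One way to picture unfolding is as a record with three levels of detail, where an agent pays for the longer views only when it asks for them. The sketch below is an assumption about how such a structure could look, not DeLM's actual data model; `Gist`, `unfold`, `context_chars`, the sample strings, and the package name `parserlib` are all hypothetical, and character count stands in for token cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gist:
    summary: str    # short form every agent sees by default
    detail: str     # longer summary, read only on request
    evidence: str   # raw trace or document excerpt

    def unfold(self, level=0):
        """Level 0 is the gist, 1 the detailed summary, 2 the raw evidence."""
        views = (self.summary, self.detail, self.evidence)
        return views[max(0, min(level, 2))]


def context_chars(gists, levels):
    """Characters an agent reads when gist i is unfolded to levels.get(i, 0)."""
    return sum(len(g.unfold(levels.get(i, 0))) for i, g in enumerate(gists))


gists = [
    Gist("parser bug is in tokenize()",
         "tokenize() drops the last token when input lacks a trailing newline",
         "pytest log: test_tokenize_eof FAILED ... expected 5 tokens, got 4 ..."),
    Gist("upgrading the dependency does not help",
         "pinning parserlib to the newer release still reproduces the failure",
         "pip log and rerun trace: same assertion error after the upgrade ..."),
]
all_compact = context_chars(gists, {})            # default view: gists only
one_unfolded = context_chars(gists, {0: 2})       # raw evidence for gist 0 only
all_raw = context_chars(gists, {0: 2, 1: 2})      # everything unfolded
```

An agent working on the tokenizer would unfold only the first gist, which keeps its context between the two extremes the researchers describe.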

Ultimately, with a framework like DeLM, agents can be more efficient because they are prevented from repeatedly reading the same documents or rerunning the same failed analysis; more effective because useful findings are propagated across parallel threads; and more robust because they only share verified claims.

For enterprise builders, DeLM challenges a core assumption: that every multi-agent workflow needs a central controller. The SWE-bench and LongBench-v2 results suggest the decentralized model isn’t just theoretically cleaner — it’s faster, more accurate, and roughly half the cost.

Orchestration

Vibe coding can build your pipeline. It can’t explain it six months later

AI coding agents are rapidly accelerating data engineering by generating transformations, pipelines, orchestration workflows, validation tests, and infrastructure configurations from prompts. However, enterprise data platforms have long operated across…

DataDecisionMakers, Orchestration

MCP solved tool calling. A2A solved coordination. What solves transport?

The history of distributed computing is one of protocol proliferation followed by consolidation.

Common Object Request Broker Architecture (CORBA), Distributed Component Object Model (DCOM), Java remote method invocation (RMI), and early simple object access protocol (SOAP) competed for the enterprise integration market in the late 1990s before representational state transfer (REST) quietly won by being simpler and HTTP-native.

Extensible Messaging and Presence Protocol (XMPP), Internet Relay Chat (IRC), and a dozen proprietary protocols fragmented real-time messaging before MQ telemetry transport (MQTT) and WebSockets carved out their respective niches. Every new computing paradigm generates a burst of competing standards, then slowly converges as implementations accumulate and interoperability becomes economically necessary.

The AI agent ecosystem is currently in the proliferation phase. Four significant protocols have been published in the past eighteen months: Model context protocol (MCP) from Anthropic in late 2024, agent communication protocol (ACP) from IBM Research in March 2025, Agent2Agent (A2A) from Google in April 2025, and agent network protocol (ANP) from an independent working group.

The W3C AI Agent Protocol Community Group has opened a standards track. The Internet Engineering Task Force (IETF) is receiving Internet-Drafts on agent transport. Conferences are running workshops on interoperability. Every week brings a new GitHub repository claiming to solve the agent communication problem.

Understanding where and how quickly this converges has real consequences for architecture decisions being made right now.

What the protocols actually solve

The proliferation looks more chaotic than it is, because most of these protocols address different layers of a stack rather than competing for the same slot. The confusion comes from marketing, which describes each as “the standard for AI agent communication” without specifying which aspect of communication.

MCP is a tool-calling interface. It defines how a model discovers what functions a server exposes, how to invoke them, and how to interpret the response. It is a typed remote procedure call (RPC) contract between a model client and a tool server, running over HTTP. The Linux Foundation confirmed more than 10,000 active public MCP servers and 164 million monthly Python SDK downloads by April 2026. MCP has already won the tool-calling layer. The standardization work is effectively done.

A2A is a task coordination interface. Where MCP defines how an agent calls a tool, A2A defines how two agents delegate a task. It introduces Agent Cards (capability advertisements), task lifecycle states, and three interaction modes: Synchronous, streaming, and asynchronous. Google donated it to the Linux Foundation in June 2025, and enterprise AI teams have adopted it broadly because it fills a real gap that MCP leaves open.

ACP is a message envelope format. It is lightweight, stateless, and designed for agent-to-agent message exchange without A2A’s full coordination semantics. It is useful in systems where simple message passing suffices and A2A’s task lifecycle overhead is unnecessary.

ANP is a discovery and identity protocol. It uses Decentralized Identifiers (DIDs) for agent identity and JSON-LD graphs for capability descriptions, providing a foundation for decentralized agent marketplaces where no central registry is required.

The stack that is emerging: Capability discovery via ANP or simpler registries, task coordination via A2A, tool calls via MCP, and lightweight messaging via ACP for cases that do not require full task lifecycle management. These layers complement rather than compete.

The transport problem that remains

Every protocol in this list runs over HTTP. This reflects where the protocols came from: Research teams, API providers, and enterprise software companies building systems where HTTP is an unquestioned assumption. HTTP is the protocol they know, the one their servers already speak, and the one that makes demos easy.

The production problem is that HTTP assumes a reachable server. Behind network address translation (NAT) — and 88% of networked devices sit behind NAT — there is no reachable server without a relay. For agent fleets that need to route tasks directly between peers across cloud boundaries, home networks, and edge deployments, this dependence on a reachable server forces every message through relay infrastructure, which adds latency, cost, and a failure mode.

The application-layer protocols solve the semantics of what agents say to each other. They do not solve how agents find each other and establish direct connections. That is a session-layer problem, Layer 5 in the open systems interconnection (OSI) model, and none of MCP, A2A, ACP, or ANP addresses it.

The technologies for solving it exist. User datagram protocol (UDP) hole-punching with session traversal utilities for NAT (STUN) provides NAT traversal for roughly 70% of network topologies. X25519 Diffie-Hellman key exchange and AES-256-GCM provide authenticated encryption at the tunnel level without a certificate authority. QUIC (RFC 9000) or custom sliding-window protocols over UDP provide reliable delivery without TCP’s head-of-line blocking. These are the same kinds of primitives that WireGuard uses for VPN tunnels and that WebRTC uses for browser-to-browser media streams.

What differs in the agent context is capability-based routing. Agents need to find peers not by hostname but by what those peers can do. A research agent should be able to query “which peers have real-time foreign exchange data?” and receive a list of currently active specialist agents. This is closer to a service registry than to DNS, and it is a natural extension of ANP’s design philosophy applied to the transport layer.
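A capability lookup of this kind can be sketched as a small registry keyed on what peers advertise, with stale entries aging out. This is an illustrative in-process toy, not part of ANP or any published transport specification: `CapabilityRegistry`, `announce`, `find`, the peer names, and the 30-second time-to-live are all assumptions.

```python
import time


class CapabilityRegistry:
    """Maps peers to advertised capabilities and forgets silent peers."""

    def __init__(self, ttl_seconds=30.0):
        self._ttl = ttl_seconds
        self._peers = {}   # peer_id -> (capabilities, last_seen)

    def announce(self, peer_id, capabilities, now=None):
        # Peers re-announce periodically; `now` is injectable for testing.
        seen = time.monotonic() if now is None else now
        self._peers[peer_id] = (frozenset(capabilities), seen)

    def find(self, capability, now=None):
        # Return peers that advertise the capability and were seen recently.
        now = time.monotonic() if now is None else now
        return sorted(
            peer_id
            for peer_id, (caps, seen) in self._peers.items()
            if capability in caps and now - seen <= self._ttl
        )


registry = CapabilityRegistry(ttl_seconds=30.0)
registry.announce("agent-fx-1", {"fx-rates", "news"}, now=0.0)
registry.announce("agent-fx-2", {"fx-rates"}, now=20.0)
registry.announce("agent-docs", {"pdf-extract"}, now=20.0)
live_at_25 = registry.find("fx-rates", now=25.0)
live_at_40 = registry.find("fx-rates", now=40.0)
```

The query returns peers by capability rather than by hostname, and a peer that stops announcing simply drops out of the results. A decentralized version would have to replace the single dictionary with something peers can maintain among themselves.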

A handful of projects are assembling these pieces. Pilot Protocol has the most complete published specification, with an IETF Internet-Draft covering addressing, tunnel establishment, and NAT traversal for agent networks. libp2p provides a battle-tested foundation with similar primitives. The IETF’s QUIC working group is developing NAT traversal extensions that will be relevant here.

What convergence will look like

The HTTP-based protocols (MCP, A2A) are already converging on stable versions. The next 12 months will bring production hardening (security improvements, stateless MCP servers for horizontal scaling, better A2A federation) rather than new fundamental designs. The tool-calling and task-coordination layers are largely solved.

The transport layer is 18 to 24 months behind. Expect a period of implementation diversity as teams experiment with different approaches to peer-to-peer (P2P) agent networking, followed by consolidation around a small number of implementations once empirical data on performance and reliability accumulates. The IETF and W3C standardization tracks will likely produce something in the 2027-2028 window, by which time one or two open-source implementations will have accrued enough production deployments to establish de facto standards ahead of the formal specification.

For engineering leaders making architecture decisions today, the practical implication is layered adoption. The application-layer protocols are stable enough to build on. MCP adoption now is low-risk. A2A adoption for multi-agent coordination is reasonable with the expectation that the protocol will evolve. The transport layer is where you either build something custom and plan to replace it, or you evaluate early implementations knowing the space is still moving.

The teams that will have the most leverage when the transport layer stabilizes are the ones that designed their agent systems with a clean separation between application semantics (MCP, A2A) and transport (whatever sits below). Clean separation is cheap to implement now and expensive to retrofit later, a lesson the microservices era taught anyone who tried to add observability or circuit breaking to systems that had none.
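In code, that separation can be as small as one interface. The sketch below assumes a hypothetical `Transport` protocol with a single `send` method, and the JSON task is an invented payload, not the A2A wire format. The point is only that the task-level code never names HTTP, a relay, or a peer-to-peer tunnel.

```python
import json
from typing import Callable, Dict, Protocol


class Transport(Protocol):
    """Anything that can deliver bytes to a named peer (hypothetical interface)."""

    def send(self, peer: str, payload: bytes) -> None: ...


class InMemoryTransport:
    """Test double; an HTTP or P2P transport would expose the same method."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[bytes], None]] = {}

    def register(self, peer: str, handler: Callable[[bytes], None]) -> None:
        self._handlers[peer] = handler

    def send(self, peer: str, payload: bytes) -> None:
        self._handlers[peer](payload)


class TaskClient:
    """Application semantics only: builds the task, knows nothing about delivery."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def delegate(self, peer: str, task: dict) -> None:
        self._transport.send(peer, json.dumps(task).encode("utf-8"))


inbox = []
transport = InMemoryTransport()
transport.register("summarizer", inbox.append)
TaskClient(transport).delegate("summarizer", {"task": "summarize", "doc_id": 7})
received = json.loads(inbox[0])
```

Swapping the transport later means writing one new class with a `send` method; `TaskClient` and everything above it stay unchanged.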

Philip Stayetski is a co-founder of Vulture Labs.

DataDecisionMakers, Orchestration

Google researchers introduce ‘faithful uncertainty,’ allowing LLMs to offer best guesses instead of hallucinations

Large language models continue to struggle with hallucinations, presenting a major roadblock for real-world enterprise applications. Reducing these errors is a messy business, forcing model developers to navigate a strict tradeoff where eliminating factual errors often suppresses valid answers.

In a new paper, Google researchers introduce the concept of “faithful uncertainty,” a metacognitive technique that aligns a model’s response with its internal confidence. This alignment allows the model to offer appropriately hedged hypotheses, such as “My best guess is,” instead of defaulting to an unhelpful “answer-or-abstain” binary.

In real-world agentic AI applications, this metacognitive awareness acts as an essential control layer. It empowers autonomous systems to accurately determine when their internal knowledge is sufficient and when they must dynamically trigger external tools or search APIs to resolve deficits.

The utility tax of current mitigation strategies

Understanding why LLMs hallucinate hinges on separating two capabilities: a model knowing facts versus a model knowing which facts it knows. Historically, most factuality gains in AI have come from expanding the knowledge boundary, meaning developers simply pack more facts into the model’s parameters through larger scale and more training data.

However, expanding a model’s knowledge does not automatically improve its boundary awareness, which is its ability to distinguish the known from the unknown and recognize its own limitations.

“There are broadly two ways to improve LLM factuality,” Gal Yona, Research Scientist at Google and co-author of the paper, told VentureBeat. The first is continuing to teach the model more facts. But, Yona notes, “model capacity is finite, and the long tail of knowledge is effectively infinite.”

Once models hit this limit, the hope is they know what they don’t know and simply abstain from answering. However, this is inherently difficult for LLMs.

“This is why most practical attempts to reduce hallucinations through various interventions don’t actually make it to deployment,” Yona explains. “They do reduce hallucinations, but they also hurt utility, because the model ends up refusing to answer questions it actually does know.”

This inability to distinguish between knowns and unknowns creates what the paper’s authors call the “utility tax.” Enforcing a zero-hallucination standard requires the model to abstain whenever it is even slightly uncertain, discarding massive volumes of completely valid information. For example, the authors demonstrate that reducing an underlying 25% error rate down to a strict 5% target forces developers to discard 52% of the model’s correct answers.
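The mechanics of the tax are easy to reproduce on toy data. The sketch below assumes the model abstains below a confidence threshold and that the threshold is set as permissively as the error target allows. The eight predictions are synthetic and do not reproduce the paper's 52% figure; `utility_tax` is a hypothetical helper name, and the confidences are assumed distinct so that a threshold can separate them.

```python
def utility_tax(predictions, target_error):
    """Fraction of correct answers discarded to meet an error-rate target.

    predictions: list of (confidence, is_correct) pairs.
    The model answers its k most confident questions and abstains on the rest;
    k is the largest value whose error rate stays at or below the target.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    total_correct = sum(1 for _, ok in ranked if ok)
    if total_correct == 0:
        return 0.0
    kept_correct = 0
    correct = errors = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        if ok:
            correct += 1
        else:
            errors += 1
        if errors / k <= target_error:
            kept_correct = correct      # largest qualifying k wins
    return 1 - kept_correct / total_correct


# Synthetic example: 8 answers, 2 wrong, so the underlying error rate is 25%.
predictions = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False),
    (0.75, True), (0.70, True), (0.60, False), (0.55, True),
]
tax_at_5_percent = utility_tax(predictions, 0.05)
tax_at_25_percent = utility_tax(predictions, 0.25)
```

In this toy set, one confident error ranked fourth forces the model to abstain on everything below it, throwing away half of its correct answers to reach the 5% target. The size of the tax depends on how well confidence separates right answers from wrong ones, which is the boundary-awareness problem the paper describes.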

Treating all errors as hallucinations forces enterprise systems to choose between trustworthiness and helpfulness. Application developers are generally unwilling to pay this massive utility tax and render their models unhelpful.

Consequently, they optimize systems to prioritize coverage, forcing models to operate in a state where they continue to generate confident hallucinations.

Reframing hallucinations as confident errors

To move past the utility tax, the researchers propose to stop treating any factual error as a hallucination. Instead, they reframe hallucinations as “confident errors”: incorrect information delivered authoritatively without appropriate qualification.

This subtle reframing dissolves the strict “answer-or-abstain” dichotomy and allows the model to express its uncertainty.

In this new framework, if a model makes a factual mistake but appropriately hedges its response (e.g., by stating, “I am not completely sure, but I think…”), it isn’t a hallucination. It is simply a hypothesis offered to the user for consideration. By expressing uncertainty, the AI preserves its utility—sharing whatever partial or likely knowledge it has—without violating the user’s trust.

However, if an AI assistant hedges all its responses with a disclaimer, the user is forced to double-check everything, defeating the purpose of the tool entirely.

The solution the researchers propose is “faithful uncertainty.” This approach requires aligning a model’s linguistic uncertainty, or the words it uses to express doubt, with its intrinsic uncertainty, which is its actual, internal statistical confidence in that specific answer. This ensures the model only hedges when its internal state genuinely reflects conflicting or low-probability information.
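A rough way to see the alignment is to estimate confidence from how often repeated samples agree, then pick the wording from that estimate. This is a common proxy and an assumption here, not the paper's method; `phrase_answer`, the 0.9 and 0.5 cut-offs, and the sample answers are hypothetical.

```python
from collections import Counter


def phrase_answer(samples):
    """Pick wording from agreement among repeated samples of the same question."""
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)    # proxy for intrinsic uncertainty
    if confidence >= 0.9:
        text = answer                                  # state it plainly
    elif confidence >= 0.5:
        text = f"My best guess is {answer}"            # hedged hypothesis
    else:
        text = f"I am not sure, but it might be {answer}"
    return text, confidence


confident = phrase_answer(["2006"] * 10)
leaning = phrase_answer(["2006"] * 6 + ["2005"] * 4)
unsure = phrase_answer(["2006", "2006", "2005", "2004", "2003", "2002"])
```

The hedge appears only when the samples disagree, so a reader can treat an unhedged answer as a stronger signal than a hedged one. That is the property a blanket disclaimer destroys.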

Faithful uncertainty forms a core component of “metacognition,” the AI’s ability to be aware of its own uncertainty and act on it. To understand this practically, consider the intuitive example of consulting a doctor. We do not trust doctors because they are all-knowing. We trust them because they reliably distinguish between a confident diagnosis (“You have a fracture”) and an educated hypothesis (“It might be a sprain, but let’s run some tests”).

Practical implications for enterprise AI

Under the new framing, errors where a model is genuinely confident but factually incorrect are categorized as “honest mistakes.” This casts knowledge expansion (training the model on more data) and faithful uncertainty as completely complementary efforts. Knowledge expansion pushes the absolute knowledge boundary outward to minimize honest mistakes, while faithful uncertainty honestly communicates wherever that boundary currently lies.

This new framing has important implications for agentic applications. The shift to agentic AI might make it seem like knowing what the model doesn’t know is redundant, since models can just search external databases. However, access to external tools actually amplifies the need for faithful uncertainty. In agentic systems, metacognition becomes the central control layer that governs the entire system.

External tools solve the storage problem because the model no longer needs to encode every fact into its parameters. However, this introduces a new control problem: managing when to retrieve information, verify facts, and orchestrate these external tools. Without faithful uncertainty, an agent is essentially flying blind and must rely on external, static heuristics or over-engineered scaffolds.

“The model might search for something it already knows confidently—wasting latency and cost for no gain. Or the opposite: it confidently answers from memory when it should have searched, producing a plausible but wrong output,” Yona said. Today’s agent harnesses try to solve this externally with query classifiers or always-search rules, but Yona notes that these are “static and brittle.” By using its intrinsic uncertainty to regulate its own behavior, the agent dynamically optimizes its tool use, choosing to invoke a search tool only when its internal confidence is genuinely low.
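The control decision Yona describes reduces to a gate on the agent's own confidence. The sketch below assumes the agent can report a confidence score alongside its recalled answer, which is exactly the capability faithful uncertainty is meant to supply. `answer`, `recall`, `search`, the 0.7 threshold, and the toy memory are hypothetical.

```python
def answer(question, recall, search, threshold=0.7):
    """Answer from memory when confident, otherwise call the search tool."""
    guess, confidence = recall(question)
    if confidence >= threshold:
        return guess, "memory"           # no tool call, no added latency
    return search(question), "search"    # low confidence triggers retrieval


search_calls = []


def recall(question):
    # Toy stand-in for the model's internal knowledge and confidence.
    memory = {
        "capital of France": ("Paris", 0.98),
        "Q3 revenue": ("4.1M", 0.35),
    }
    return memory[question]


def search(question):
    search_calls.append(question)        # record each paid tool call
    return "value from search"


confident = answer("capital of France", recall, search)
uncertain = answer("Q3 revenue", recall, search)
```

Compared with an always-search rule, the gate skips the tool call for the question the model already knows. Compared with never searching, it avoids answering the low-confidence question from memory. Both outcomes depend on the confidence score being trustworthy.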

Beyond deciding when to search, faithful uncertainty is critical for evaluating the results of a search. If a tool returns low-quality or unexpected information, a metacognitive agent does not blindly accept whatever appears in its context window. Instead, it uses its uncertainty awareness to weigh the retrieved external signals against its own internal priors. This prevents sycophantic behavior where the system might otherwise trust external sources that conflict with what it actually knows.

The bootstrapping paradox: The catch to teaching uncertainty

For enterprise builders, achieving this faithful uncertainty is trickier than it sounds. It requires teaching models the syntax of uncertainty through supervised fine-tuning (SFT). Because pre-trained models are mostly fed authoritative text, they must be explicitly taught to say things like, “I’m not entirely sure, but I think VentureBeat was founded in…”

But SFT introduces a “bootstrapping paradox.” Unlike standard training datasets where the “right answer” is the same regardless of the model, the ground truth for uncertainty is the model’s own dynamic knowledge base.

“Here’s the catch: the ‘correct’ expression of uncertainty is inherently dynamic, because it depends on what this particular model knows or doesn’t know at this particular point in training,” Yona said. “If you train on a label that says ‘I don’t know X’ but the model actually does know X, you’ve taught it to hallucinate uncertainty… The training data is static, but the target is a moving one, and that’s the fundamental tension teams need to grapple with.”

The road to self-aware AI

For enterprises looking to implement these capabilities without expensive retraining, prompting serves as the most accessible entry point. “Prompt engineering is already something most engineers do today, this provides the lowest-friction path to improving metacognitive behavior today,” Yona said. Enterprise developers can explore frameworks like MetaFaith, an open-source project previously co-authored by Yona, to begin applying metacognitive prompting to off-the-shelf models.

However, Yona cautions that “there is still substantial headroom that prompting alone doesn’t solve,” meaning the industry will eventually need to rely on advanced reinforcement learning (RL) to bake metacognition deeply into model training.

Ultimately, as enterprises transition from isolated chat applications to complex, multi-agent workflows, self-awareness will become a defining prerequisite for reliable autonomy. But evaluating whether a model truly possesses this awareness remains a profound technical challenge.

“How do you actually evaluate whether a model can sense its internal states?” Yona asks. “Even in humans, it’s hard to define or separate ‘true’ self-monitoring abilities from a capable reliance on proxies. We face exactly the same challenges with LLMs: a model might learn to mimic the style of uncertainty without truly sensing its internal state. Developing evaluation frameworks that can tell the difference is one of the most important open problems in this space.”

Orchestration

Microsoft’s open-source SkillOpt automatically upgrades AI agent skills without touching model weights

Agent skills have become an important part of real-world AI applications, providing a mechanism — a set of instructions saved in a folder of text-based markdown (.md) files, usually — for models to adapt to specific enterprise use cases and complex workflows.

However, optimizing these skills is a slow and error-prone process, as they cannot be trained in the same way as the parameters of the underlying AI model. Instead, users typically must update them manually by retyping the instructions in each file, playing a “guessing game” as to what changes might improve agentic AI performance and reduce errors.

SkillOpt, a new open-source (MIT-licensed) framework developed by Microsoft, does one better: it introduces an optimizer designed for agent skills, turning the agent’s skill .md document into a trainable object that evolves based on performance feedback.

It uses deep-learning-style optimization to make it possible for the AI to systematically explore modifications to the document and find the best combination of instructions. Most importantly, it accomplishes this procedural adaptation without making changes to the underlying model’s weights.

On various industry benchmarks, SkillOpt outperforms existing baselines, significantly boosting accuracy for models like GPT-5.5 and Qwen. The result is a set of compact, transferable skill artifacts that allow AI agents to adapt to new domains effortlessly.

The challenge of optimizing agent skills

Agent skills package procedural knowledge into natural-language specifications, including domain heuristics, tool-use policies, output constraints, and known failure modes. These skills provide an external interface for agents to adapt to complex enterprise workflows. In practice, agent skills are stored as text documents and inserted into the agent’s context before execution.

One of the key benefits of skills is that they customize the behavior of the underlying model without changing its weights. However, the skill document itself needs to be tweaked and optimized to get the best performance out of the agent.

While deep learning relies on strict mathematical controls for stability, human prompt engineering often relies on trial and error. When attempting to automatically update a skill document based on feedback, the lack of mathematical discipline makes text highly volatile.

Yifan Yang, Senior Research SDE at Microsoft Research Asia, told VentureBeat that the problem is not making changes, but ensuring those changes are mathematically sound.

“The breaking point isn’t whether a team can change a skill, it’s that they can’t guarantee the change is an improvement,” Yang said. “Three failure modes recur: no step-size control, so skills drift; no validation, so a fix that reads as reasonable gets written in and can quietly regress performance; and no negative memory, so the same failed edit keeps coming back.”

To illustrate how easily performance can drop when edits aren’t mathematically validated, Yang noted that “an ungated rewrite pushed GPT-5.5 on SpreadsheetBench from 41.8 down to 41.1.”

According to Yang, these failure modes are amplified in multi-step workflows “because that’s where frontier models are weakest zero-shot. Not on reasoning, but on procedural discipline: format, self-verification, tool policy.”

Before SkillOpt, agent skills were primarily hand-crafted, generated in a single shot, or evolved through loosely controlled self-revision pipelines that could not reliably improve under feedback.

Prompt optimization methods like TextGrad and GEPA treat language artifacts as optimizable objects and use trajectory feedback to evolve prompts, but they focus on single-prompt configurations rather than generating persistent, reusable skill artifacts.

Meanwhile, skill evolution and discovery methods like EvoSkill and Trace2Skill convert agent execution experiences into trajectory lessons to refine skill folders, build domain-specific libraries, or perform evolutionary search.

None of them apply deep-learning-style controls, such as learning rates, validation gates, and momentum, which are necessary to continuously train a single, compact skill document.

Importing mathematical discipline to text

SkillOpt optimizes a text document through an iterative propose-and-test loop that separates the model executing the tasks from the model optimizing the skill. The process unfolds in several steps:

SkillOpt starts with an initial skill document and a frozen target model (or harness), where the target model runs a batch of tasks to generate execution trajectories that act as the evidence for the current step.
An offline optimizer model analyzes these trajectories, separating successes from failures into minibatches. Looking at a minibatch helps the model identify systematic procedural errors rather than one-off anomalies. Based on these patterns, the optimizer proposes structural add, delete, or replace edits to the skill document.
The proposed edits are reviewed to filter out duplicates or contradictions, and the optimizer then ranks these candidate edits by their expected utility.
Rather than applying all proposed changes, SkillOpt clips the list to a maximum edit budget for that step, generating a candidate skill.
The candidate skill is evaluated on a held-out validation set using the target model. If the candidate improves the validation score, it is accepted and becomes the new current skill. If it fails, the edits are rejected and sent to a rejected-edit buffer, providing negative feedback so the optimizer knows not to repeat that mistake.
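The loop above can be sketched with the optimizer and the evaluation replaced by toy functions. This is not SkillOpt's code or API: `optimize_skill`, `apply_edits`, `propose`, `score`, and the example instructions are hypothetical, the skill is modeled as a list of lines, and edits are limited to add and delete. The sketch covers the edit budget, the validation gate, and the rejected-edit buffer only.

```python
def apply_edits(skill, edits):
    """Return a new skill (list of instruction lines) with the edits applied."""
    lines = list(skill)
    for op, line in edits:
        if op == "add" and line not in lines:
            lines.append(line)
        elif op == "delete" and line in lines:
            lines.remove(line)
    return lines


def optimize_skill(skill, propose, score, steps=4, edit_budget=1):
    rejected = set()                  # negative memory of failed edits
    best = score(skill)               # held-out validation score
    for _ in range(steps):
        ranked = [e for e in propose(skill) if e not in rejected]
        edits = ranked[:edit_budget]  # clip to the per-step edit budget
        if not edits:
            break
        candidate = apply_edits(skill, edits)
        candidate_score = score(candidate)
        if candidate_score > best:    # validation gate
            skill, best = candidate, candidate_score
        else:
            rejected.update(edits)    # do not propose these again
    return skill, best, rejected


# Toy stand-ins (assumptions): a real optimizer would read execution
# trajectories, and a real score would run the frozen target model.
POOL = [("add", "Check output format"),
        ("add", "Always guess"),
        ("add", "Verify totals")]
HELPFUL = {"Check output format", "Verify totals"}


def propose(skill):
    return [edit for edit in POOL if edit[1] not in skill]


def score(skill):
    return sum(1 if line in HELPFUL else -1 for line in skill)


final_skill, final_score, rejected_edits = optimize_skill([], propose, score)
```

In this toy run, the plausible-sounding but harmful instruction is tried once, fails the validation gate, and lands in the rejected set so it is never applied again. The two helpful instructions are accepted one step at a time.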

SkillOpt directly addresses the problem of treating text as a trainable object by importing mathematical concepts from deep learning. The creators note that “the deep-learning analogy is operational rather than decorative,” helping the framework avoid the instability issues associated with other optimization techniques.

The edit budget acts as a learning rate. By limiting how many edits can be applied at once, the skill version is prevented from moving too far from its previous state, preserving continuity while allowing new procedures to be acquired.

Just like checking validation loss in deep learning, the strict held-out examples ensure that plausible-sounding text edits are only kept if they mathematically improve the agent’s actual performance on the validation split.

At the end of an epoch, SkillOpt performs a slow update by comparing tasks under the previous and current epoch’s skills. This acts like a momentum term, carrying durable, long-horizon procedural lessons forward while isolating them from the fast, step-level edits.

SkillOpt in action

To evaluate the technique in practice, researchers tested SkillOpt across different models, ranging from large-scale frontier models like GPT-5.5 to smaller closed and open models including GPT-5.4-mini and Qwen3.5-4B. They also deployed the skills within different execution harnesses, using plain chat as well as complex coding harnesses like the Codex CLI and Claude Code.

The evaluation spanned diverse industry benchmarks including single-round question-answering, multi-round code generation involving tool use, and multimodal document reasoning. SkillOpt was measured against multiple baselines ranging from a default no-skill setting to human-written skills and one-shot LLM-generated skills. It was also compared against advanced prompt-optimization and skill-evolution methods, specifically Trace2Skill, TextGrad, GEPA, and EvoSkill.

SkillOpt dominated across the board, proving highly effective on all 52 evaluated combinations of model, benchmark, and harness. It was particularly effective with frontier models, delivering an average absolute improvement of +23.5 points against the no-skill baseline on GPT-5.5. Furthermore, SkillOpt outperformed a hypothetical oracle baseline that cherry-picks the best competing method for every problem.

Small target models saw immense relative gains, proving that a compact text file can supply procedural knowledge that small models lack in their weights. For example, GPT-5.4-nano nearly doubled its score on multimodal document QA and tripled its score on embodied interaction and sequential decision-making.

These academic benchmarks map to critical enterprise pain points. Zero-shot models often hallucinate formatting or fail to use tools properly in multi-step scenarios. Yang explained that the biggest performance leaps occurred in operations that enterprises historically struggle to automate reliably.

“Document data extraction… exact figures out of contracts, invoices, and forms — AP automation, claims, compliance,” Yang said. “What improves is reliability: precise formatting, self-verification, auditable outputs. And the gains come from learning procedure, not memorizing answers.”

For enterprise practitioners, the true value of SkillOpt lies in its portability, efficiency, and compatibility with existing infrastructure. Experiments confirm that the framework is harness-agnostic. In addition to basic chat, the same optimization loop was successfully integrated into tool-backed execution environments like the Codex CLI and Claude Code with significant gains on industry benchmarks.

Developers can train a skill using one execution loop and deploy it in another. For example, a spreadsheet skill trained entirely inside the Codex loop was moved directly into Claude Code and drove a +59.7 point gain over Claude Code’s native baseline without any further changes.

SkillOpt artifacts also transfer cleanly across model scales. A skill optimized for GPT-5.4 was deployed onto the smaller GPT-5.4-mini and GPT-5.4-nano models with positive gains, proving that the learned procedures encode reusable workflows rather than just exploiting quirks of a specific model’s architecture.

Finally, the framework is highly efficient regarding token usage and context window real estate. Across all benchmarks, the final deployed skills never exceeded 2,000 tokens, with a median length of roughly 920 tokens. This results in highly readable, auditable artifacts that a human practitioner can review and manage in minutes.

Implementation strategies and the enterprise ‘catch’

For enterprise tech leaders, adopting a new framework requires understanding the overhead and limitations. While the research paper notes that training tokens can reach up to 210 million for academic benchmarks, the reality for day-to-day enterprise use cases is much lighter. The high token counts in testing were largely due to re-scoring massive held-out test sets.

“The real upfront work is the verifier and a representative held-out split. The optimizer is light; the evaluation harness is where the engineering goes,” Yang said. He added that for everyday use, “in community frameworks like GBrain, where SkillOpt updates run on Claude Sonnet, training a skill for a single task averages just $1–5.” This optimization cost is paid once per skill and is spread across every subsequent run of the deployed skill.

However, the framework requires specific conditions to work effectively, namely a few dozen representative examples and a scorable feedback signal. Teams should avoid applying SkillOpt to open-ended or subjective tasks. “With no clean automatic scorer you have to design a human- or model-based evaluator and watch its stability,” Yang said.
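
To make the two prerequisites concrete, here is a minimal sketch of a scorable feedback signal and a held-out split for a document-extraction task. It is an illustration under stated assumptions rather than anything taken from SkillOpt: the function names, the exact-match scoring rule, and the 25% held-out fraction are all hypothetical choices.

```python
import random
from typing import Dict, List, Tuple

# One example pairs a document's text with the fields a correct answer must contain.
Example = Tuple[str, Dict[str, str]]


def split_examples(
    examples: List[Example], held_out_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Deterministically split examples into (training, held-out) sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable
    n_held = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]


def field_accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of expected fields reproduced exactly (whitespace-trimmed)."""
    if not expected:
        return 1.0
    hits = sum(
        1 for key, value in expected.items()
        if predicted.get(key, "").strip() == value.strip()
    )
    return hits / len(expected)
```

An exact-match scorer like this suits figures pulled from invoices or forms. A subjective task has no equivalent rule, which is where the human- or model-based evaluator Yang describes would come in.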

SkillOpt also integrates smoothly with existing orchestration stacks, removing a major adoption hurdle. For instance, developers already using pipeline compilers can run both systems harmoniously. “DSPy is a different, complementary layer,” Yang said. “It compiles declarative LM pipelines and optimizes program structure; SkillOpt optimizes the external skill state a frozen agent loads. You can run them together.”

Looking ahead, open-source developers are already scheduling SkillOpt to run periodically over their agents’ past trajectories, creating a small ecosystem of self-optimizing code-agent plugins. This continuous feedback loop represents a significant shift in how AI systems adapt.

“The valuable version of self-improvement is an agent autonomously discovering knowledge to improve its own behavior and the user experience, under verification and audit,” Yang said. “Skills are the fastest, cheapest, most reversible first step, and the same mindset points toward agents eventually optimizing themselves, all the way down to their own weights.”


What AI benchmarks miss about real-world performance

Presented by F5

Enterprise AI teams have spent years solving for compute: securing GPU allocations, negotiating cloud capacity, and benchmarking training throughput. The assumption embedded in that work is that the path between storage and compute will keep up. In production, that assumption increasingly does not hold. Real traffic introduces latency spikes, network jitter, and node degradation that controlled benchmarks fail to capture, so pipelines that perform well in the lab stall in deployment. A growing response is AI data delivery: deploying an application delivery controller (ADC) or application delivery and security platform (ADSP) in front of storage as a resilient and secure control point.

“Provisioning solves for capacity but not for delivery, and that is where the constraint now hides,” says Hunter Smit, senior manager of product marketing at F5. “Enterprises buy enough GPUs and enough storage, then assume the path between them will keep up, but AI traffic is bursty, highly concurrent, and random in its reads in ways ordinary storage networking was never built to absorb.”

The production gap benchmarks don’t show

Standard benchmark methodology compounds the problem, says Paul Pindell, principal solutions architect for technology alliances at F5.

“Benchmark testing is usually built to produce the best possible performance or security result, not the most realistic one,” he says. “With S3, latency is a known factor in degrading performance, so meaningful testing has to introduce consistent latency into the path.”

Most benchmark environments never do that, which means the performance numbers enterprises rely on for infrastructure decisions are drawn from conditions that production systems will never replicate. To test this assumption, F5 and MinIO conducted throughput testing under degraded network conditions.

“What stood out was how quickly S3 throughput falls off once you introduce latency,” Pindell says. “Even modest latency takes a real bite out of it, and as latency climbs toward long-haul distances, the degradation gets severe.”

The testing also showed latency mattered far more than jitter as a driver of throughput loss, which inverted what the team had expected going in. The upshot for enterprise architects is that S3 object storage deployments cannot be designed around clean-room assumptions; they have to be engineered for the degraded network conditions they will actually face.
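
A back-of-the-envelope model shows why latency alone can dominate. The sketch below is a simplification for illustration, not the F5 and MinIO test methodology or its results: it assumes a single client issuing object reads one after another, with each read paying one round trip before the transfer starts, and it ignores concurrency, TCP windowing, and jitter. The function name and parameters are hypothetical.

```python
def effective_throughput_mbps(object_mb: float, link_mbps: float, rtt_ms: float) -> float:
    """Throughput of serial object reads when each read pays one round trip.

    object_mb: object size in megabytes
    link_mbps: link bandwidth in megabits per second
    rtt_ms:    round-trip latency in milliseconds
    """
    object_megabits = object_mb * 8
    transfer_seconds = object_megabits / link_mbps   # time on the wire
    total_seconds = rtt_ms / 1000 + transfer_seconds  # wait, then transfer
    return object_megabits / total_seconds


if __name__ == "__main__":
    # Same object and link, increasing round-trip latency.
    for rtt in (0, 1, 10, 50):
        rate = effective_throughput_mbps(object_mb=1, link_mbps=10_000, rtt_ms=rtt)
        print(f"rtt={rtt:>3} ms  throughput={rate:,.0f} Mbps")
```

Under these assumptions, once the round trip exceeds the time the object spends on the wire, extra bandwidth barely helps, because most of each request is spent waiting.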

The cost of fragile data paths

“In AI infrastructure, people naturally focus on GPUs because they’re the most visible and expensive resource,” says Tanu Mutreja, senior director of product management at F5. “But in production environments, GPUs generate only as much value as the data path that feeds them.”

That path runs through storage, networking, databases, security, and orchestration layers, often stitched together from multiple vendors. Customers experience none of those seams; they experience the output of the whole system.

When the data path degrades, the effects compound. GPU underutilization is the most immediate and visible symptom, but Mutreja pointed to a wider set of consequences: degraded inference performance, poor-quality AI outputs, higher egress costs from unnecessary data replication, and growing operational complexity.

“At scale, data-path efficiency becomes a strategic business lever rather than technical optimization,” she says. “When the data path is engineered well, GPUs remain productive, AI applications stay responsive and trustworthy, operations scale efficiently, and organizations maximize the return on their AI investments.”

AI workloads are structurally more exposed to these failures than traditional enterprise applications. Databases, ERP systems, and web services absorb transient storage delays through caching and buffering. AI workloads running across massively parallel GPU clusters have no equivalent protection. As Mutreja noted, even minor latency spikes or bandwidth bottlenecks can cascade across large GPU clusters, simultaneously hitting utilization, training efficiency, and the customer experience.

Treating the storage edge as a control point

For decades, storage and intelligence operated as sequential concerns in enterprise architecture: data was stored first, then analyzed downstream. Mutreja argued that this model no longer fits the demands of AI.

“Competitive advantage is determined not only by the volume of data, but also by relevance, lineage, security, and performant delivery of data,” she says. “Across the industry, from NVIDIA and AWS to enterprise storage providers, the movement is toward embedding intelligence directly into data infrastructure rather than stacking it on top.”

F5’s integration with MinIO instantiates this approach at the layer where storage and compute actually interact. As part of the F5 ADSP, BIG-IP sits in the data path, continuously monitoring the health of MinIO’s distributed storage nodes and directing requests only to those that remain available.

The operational impact of that capability becomes clear when nodes degrade, which is expected in distributed storage clusters. Without intelligent routing, clients that land on an unhealthy node must retry and may land on another degraded node, dragging down overall performance.

“F5 makes sure traffic only goes to healthy nodes, or even the least busy ones, so S3 client traffic is always processed in the most efficient way,” Pindell says.
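
The routing behavior Pindell describes can be illustrated with a generic selection rule. This is a sketch of health-aware, least-busy routing in general, not BIG-IP configuration or any F5 or MinIO API: the `Node` fields and the `pick_node` function are hypothetical, and a real deployment would populate health and load from live monitors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    healthy: bool          # result of the most recent health check (assumed input)
    active_requests: int   # requests currently in flight on this node (assumed input)


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Return the least busy healthy node, or None if no node is healthy."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.active_requests)
```

Because unhealthy nodes are filtered out before selection, a client never lands on one and has nothing to retry, which is the failure mode the unrouted case suffers from.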

Governance across distributed environments

The challenge grows at scale, when AI pipelines stretch across multiple locations, clouds, or edge environments.

“Once an AI pipeline crosses regions and clouds, the question stops being about performance and becomes about control,” Smit says. “You are operating under different rules in every jurisdiction, and digital sovereignty is now a design constraint. Where your data is allowed to live, who is permitted to touch it, and which borders it cannot cross now shapes the architecture before anyone talks about speed.”

That pressure is driving a visible trend of enterprises repatriating AI workloads from public cloud onto infrastructure they own and govern directly. The architecture Smit described resolves this by decoupling applications from any single storage location and placing a unified control point between them that enforces consistent policy across all of them.

“Sovereignty, resilience, and cost stop being trade-offs you manage one region at a time,” he explains. “They become a capability you run as a system.”

Storage-to-compute path as a managed control point

To solve for these issues, enterprise teams need to stop treating the storage-to-compute path as a direct connection and start treating it as a managed control point, Smit says. SecureIQLab’s independent validation of F5 BIG-IP in storage deployments has confirmed the approach delivers resilience without surrendering throughput.

“Insert a full-proxy ADC between the two, and the path becomes observable, programmable, and failure-aware, with health-based routing, quality of service, and security enforced inline,” he explains. “That single move converts data delivery from an assumption into an engineered discipline, which is what keeps GPUs fed when conditions degrade.”

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
