AI agent orchestration platforms are popping up like weeds these days, but London-based AI transformation startup Mindstone’s Rebel might be among the most promising I’ve come across.
That’s because the system, which officially launched this week, is a local-first, agentic AI operating system distributed under a “Fair Source” license, allowing teams of under 100 users to freely adopt and customize it to suit their needs, while those organizations with more users will require paying for an enterprise license.
The marquee features are its simplicity and extensive customizability to fit any given team, no matter how unique or specific the workflows, all based around the common, open source standard file format markdown, and, as a result, an organizational memory layer that ensures agents reliably use the enterprise’s preferred AI models for each given task or even subtasks — dynamically switching between local and cloud ones in a predictable, visible way to save costs and maintain data privacy and security as needed.
“Shared memory is the most empowering thing you could possibly do with a knowledge-worker AI,” said Greg Detre, chief technology officer (CTO) of Mindstone, in a recent video call interview with VentureBeat. “You get this feeling of being a super-organism as a company that just gets smarter and smarter.”
Rebel is available now for macOS on Intel and Apple Silicon machines, as well as Windows, with Linux support in development.
Mindstone has raised $5 million from private investors including Pearson Ventures, Moonfire Ventures and Zanichelli Venture.
What makes Rebel distinctive is its local-first architecture.
Instead of the approach found in developer-heavy agent frameworks such as as LangGraph, CrewAI and AutoGPT, which require teams to wire together databases, cloud infrastructure and state-management logic, Rebel’s core agent memory and instructions live across local markdown (.md) text files — arguably the simplest, easiest, and most popular way to steer AI agents, one that has been widely adopted by AI developers and power users around the globe.
Mindstone says Rebel stores its state, prompts, task instructions and memory hierarchy in these files, allowing users and companies to easily inspect, move or modify them as needed. A primary configuration file, agents.md, acts as the agent’s core instruction layer and runtime boundary.
That architectural choice is partly about cost. Mindstone argues that common office formats such as Word documents and PDFs often carry formatting and metadata overhead that consumes model token context and raises API costs. Markdown keeps the information closer to raw text, allowing more of the model’s context window to be spent on the actual task rather than document structure.
The company also positions the approach as a hedge against vendor lock-in. If a company’s agent instructions, automations and memory are stored locally as text files, they are not trapped inside one SaaS provider’s interface or database. That matters more as enterprises begin giving AI systems broader access to email, calendars, documents and internal workflows.
Rebel also lets users create repeatable AI workflows. “Skills” are saved multi-step procedures an agent can reuse. “Operators” adjust how the agent behaves for a given task, such as reviewing a pitch deck from an investor’s perspective or evaluating work through a security lens. “Automations” can run scheduled background tasks, such as scanning messages or files, finding relevant updates, drafting responses, or preparing work before an employee opens the app.
Another important feature is multi-model orchestration. Rebel can break a task into parts and route different steps to different models, including splitting between local and cloud-based ones depending on the sensitivity of the information or as guided by enterprise policies.
A more powerful model can handle planning or complex reasoning; a cheaper model can handle routine work; a local model can handle sensitive steps or approval checks. This matters for enterprises that want flexibility or are seeking cost controls: not every task need be sent to the same expensive cloud model, and some enterprise workflows prohibit sensitive corporate data leaving local infrastructure.
“I want to be able to say, ‘Help me with this,’ and it knows what’s personal, what’s sensitive, and what can be shared with the whole company,” Detre explained.
That model-agnostic setup gives companies more control over cost and security. Data-heavy work can run on lower-cost models such as Llama or DeepSeek. Higher-level reasoning can be reserved for more expensive models. Sensitive work can be routed through a local model running on the user’s machine, keeping that information from leaving the device.
This approach also gives enterprise teams a way to mix cloud and local inference without treating the choice as all-or-nothing.
By shifting away from centralized, monolithic cloud interfaces toward a local file-driven architecture, Mindstone is introducing a model for how enterprise technical decision-makers orchestrate autonomous workflows without forfeiting data sovereignty or predictability
Mindstone CTO Greg Detre designed Rebel’s memory system to avoid a common problem in enterprise AI: dumping large amounts of company information into a database and hoping search will retrieve the right context later.
Instead, Rebel uses a tiered memory structure. When an interaction happens, the system estimates how likely that information is to be useful again.
Information with a high expected value is written into a local readme.md file tied to a specific project space. Information with a moderate expected value becomes a reference link back to deeper historical records.
Lower-priority material is stored in an indexed memory directory, where it remains available but dormant until a relevant task calls it back.
For larger organizations, Mindstone Pro adds an Impact Dashboard designed to show where Rebel is saving time and money across business units.
Mindstone says the dashboard uses a separate, closed LLM to evaluate telemetry and calculate business impact. The company says the system is calibrated conservatively, using the lower end of estimated performance gains to avoid inflated productivity claims.
That feature speaks to a practical problem for enterprise AI buyers: proving value without over-surveilling employees. Mindstone says the dashboard is isolated from individual workspaces, allowing IT and business leaders to evaluate adoption and return on investment without reading employees’ private agent activity.
Mindstone is releasing Rebel under a Fair Source license, a model meant to sit between fully closed SaaS and permissive open source.
Under the license, Rebel’s code is viewable, auditable, modifiable and deployable. Individuals and organizations with up to 100 concurrent users can run it for free. Once an organization exceeds that threshold, it needs a commercial Mindstone Pro license.
The license also includes a two-year sunset clause. Twenty-four months after a given version is released, that version automatically converts to the MIT open-source license.
For enterprise buyers, the practical pitch is that Rebel reduces the risk of being trapped. If every automation, memory file and agent instruction is stored locally in markdown, a company can move its data and workflows elsewhere if needed. The product may be commercial, but the underlying work is designed to remain inspectable and portable.
Rebel’s debut on the open access tech product sharing platform Product Hunt this week prompted technical questions about how a local-first agent should handle permissions, safety checks and shared memory.
One developer, Nikita Pokryschko, asked whether approval checks for sensitive actions could run entirely on a local model, or whether the gating logic still required a cloud call.
Detre responded by explaining Rebel’s separation between planning, execution and background safety logic. Wöhle added that companies can configure Rebel to rely entirely on a local model for gating decisions.
That distinction matters for corporate security teams. Autonomous agents often need broad permissions to read files, draft emails or interact with internal systems. If the final approval layer depends on an external cloud model, some companies may see that as a compliance risk. Mindstone is arguing that Rebel can keep those approval boundaries local.
A second discussion focused on how Rebel decides what memory can be shared. Product developer Clement Morel asked whether shareability is determined by content, user settings or learned behavior, and what happens if the system gets it wrong.
Detre said Rebel uses the user’s local “Chief-of-staff README” and defined spaces to separate private, team and company-wide information. When the agent encounters ambiguous context, the system pauses and asks the user for approval before proceeding.
That emphasis on visibility is part of Mindstone’s broader argument against opaque agent systems. As CEO Joshua Wöhle put it in a post on his LinkedIn account: “If an agent is going to sit inside your workspace, remember your context, and ask permission before changing the world, you should be able to see how it works. Not because everyone will read the code, but because someone can.”
Mindstone says Rebel has already been deployed across the 250-person workforce of customer Epignosis, covering sales, engineering, product, finance and customer success teams.
“The entire organization is operating on Rebel today,” Wöhle told VentureBeat.
Over a 12-week deployment, Mindstone says Epignosis recaptured the equivalent capacity of eight full-time roles. The company says adoption spread organically after employees saw colleagues automate time-consuming work, a pattern employees reportedly called the “potatoes effect.”
The Epignosis case is central to Mindstone’s argument that enterprise AI should not be treated as a set of isolated personal tools. Rebel’s shared-memory design is meant to let workflows move across teams and improve as more employees use them.
“The border between learning and doing is fading out – and that changes everything about how you scale,” Epignosis CEO Dimitris Tsingos said in a statement provided to VentureBeat by Mindstone.
Mindstone Learning Limited, headquartered in London, launched in 2020 under the direction of CEO Joshua Wöhle, previously a co-founder of the digital child safety firm SuperAwesome. Originally positioned in the consumer education technology market, the company built a digital curation tool likened to a “Spotify for learning” that utilized compound learning methodologies.
However, following the widespread commercialization of generative artificial intelligence platforms between 2022 and 2024, Mindstone moved into business-to-business enterprise enablement. Leadership identified a critical “last-mile” barrier: while AI tools promised substantial productivity gains, traditional corporate training failed to equip the workforce to practically integrate them into daily operations.
Today, Mindstone functions as a comprehensive enterprise software and training ecosystem designed to maximize corporate return on investment for existing AI licenses. The product architecture systematically addresses different organizational tiers through highly contextualized, “live-fire” software applications rather than abstract slide presentations.
Financially, Mindstone utilizes a hybrid capitalization strategy that interweaves institutional venture capital from entities like Moonfire Ventures and Pearson Ventures with community-based equity crowdfunding on platforms such as Seedrs and Crowdcube.
Mindstone has successfully penetrated the enterprise market, securing commercial contracts with blue-chip corporations including The Home Depot, Hyatt Hotels Corporation, Pearson, and Ernst & Young.
Ultimately, Mindstone positions itself as the crucial antidote to corporate inertia, ensuring organizations establish the internal competency required to execute successful AI transformations.
Rebel arrives as companies are trying to move from AI experimentation to AI operations. The first wave of enterprise adoption centered on access: giving employees chatbots, copilots and model subscriptions. Mindstone is betting the next wave will center on coordination.
That means shared memory, reusable workflows, local control, flexible model routing and measurable business impact. It also means giving enterprises a way to inspect the systems they are being asked to trust.
The company’s challenge now is execution. Local-first software can be harder to manage than cloud SaaS. Shared memory raises governance questions. Multi-model routing adds complexity. And enterprises will still need proof that agentic workflows can deliver reliable productivity gains without creating security or compliance headaches.
But Mindstone is making a clear argument: buying AI seats is not the same as building AI infrastructure. Rebel is its attempt to turn scattered employee experiments into an operating layer for work.
As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment.
Currently, harnesses are largely static and hand-crafted. Improving them is largely manual and they do not automatically improve based on the execution data they collect from their environment.
To address this engineering bottleneck, researchers at Xiaomi introduced HarnessX, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code.
In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed HarnessX delivering substantial performance gains across domains like software engineering and web interaction.
The results demonstrate that scaling the foundation model is not the only path to more capable AI — and for smaller models, it may not even be the best one. HarnessX’s harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen3.5-9B, gains reached +44% on embodied planning tasks.
In AI applications, a foundation model’s capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action.
As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges.
First, harnesses are static and hand-engineered. Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences.
Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition.
Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent’s operational data.
HarnessX solves the engineering bottlenecks of manual harness development with what the researchers call a “unified harness foundry.”
The core innovation of HarnessX is treating the harness as a “first-class object”. In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model.
HarnessX breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a “processor” that plugs into precise lifecycle hooks of the harness. This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.
To automate the optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness.
Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:
Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.
Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.
Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.
To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline:
Digester: Compresses execution traces into structured summaries to identify where the agent failed.
Planner: Analyzes these summaries to enable the system to explore structural changes rather than just local prompt tweaks.
Evolver: Generates code-level harness edits and tests to ensure they run correctly before deployment.
Critic and gate: A Critic assesses the edits to detect reward hacking, while a deterministic gate rejects any update that regresses a previously solved task to prevent catastrophic forgetting.
HarnessX enters a growing field of self-improving harness research — but what separates it is harness-model co-evolution.
The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools. Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.
HarnessX interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development.
HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization). GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1.
When fine-tuning the model, cross-harness GRPO pools an agent’s execution trajectories for the same task across entirely different versions of the application’s harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations.
To validate the practical utility of HarnessX, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning.
They separated the AI into two roles. The “meta-agent,” powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The “task agents” ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B.
HarnessX was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today, using hand-crafted, frozen setups with benchmark-specific prompts and tools. The second was the Claude Code SDK, a baseline representing a single-agent evolver to test if the complex, four-stage AEGIS pipeline outperformed asking a single language model to iterate on the code.
Dynamically evolving the harness yields significant gains on the same base model. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%.
The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and an +18.2% jump on SWE-bench Verified for software engineering.
Co-evolution also proved highly effective. When the researchers trained the foundation model using the data generated while evolving the harness, they saw an additional +4.7% average performance boost. Improving the harness and the model simultaneously yields the highest ceiling. The co-evolution gain applies only to open-weight models.
Anecdotal evidence from the experiments shows how HarnessX solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site’s JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks.
During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking “next page” and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions. It injected a warning into the context to force a decision, curing the looping behavior and raising performance.
One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested.
Another limitation worth considering is the intrinsic capabilities of the used models. If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, HarnessX will not be able to improve the agent’s overall abilities (the researchers observed this with the Qwen3.5-9B model on the SWE-bench coding tests).
Despite these limitations, HarnessX makes a concrete case that harness engineering — not just model scaling — is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.
Shopify built an LLM proxy that gives every engineer access to multiple AI providers — with automatic failover when any one of them goes down, changes, or disappears. When Claude Fable 5 shut down, Shopify’s engineers didn’t go into panic mode. The proxy shifted them to Claude Opus or GPT 5.5 automatically, without interrupting their workflows.
“Fable looks amazing; we used it of course,” Farhan Thawar, Shopify’s head of engineering, says in a new VentureBeat Beyond the Pilot podcast. “When a model comes and then it goes, or it could be as innocuous as an update, the proxy allows us to spray across the different providers,” Thawar says.
Shopify buys tokens in bulk and all users connect to models through its proxy, Thawar says. This gives his team access to reporting and failover; when there’s an availability issue with one provider, users can be “automatically, seamlessly” transferred to another.
Enterprises can learn from this example and consider how a disruption might affect their business, Thawar says. At the very least, they should establish a solid backup plan. It’s important to have a system that allows for movement across models so enterprises are not “super tied” to a specific provider.
Distillation is another important strategy.
With distillation, a student model learns from a teacher model and typically becomes specialized in a narrower task. These small language models (SLMs) can be more beneficial than generalized, off-the-shelf models in some circumstances. For instance, Shopify’s flagship AI assistant, Sidekick, which performs numerous specialized subtasks for merchants so they can “remove toil” from their day-to-day.
Using smaller distilled models can be faster and cheaper than more generalized models, Thawar says. In some cases they have proven to be 2x cheaper and faster; in more extreme cases 30x cheaper and faster, he says.
But “it isn’t just about cost and latency, which are big; it’s about accuracy,” Thawar says.
Engineers feed the UDP their teacher model, training data, evals, and a target model — say, Opus 4.8 distilling down to Qwen 3.5. The pipeline runs for about a day, then returns an evaluation showing what the fine-tuned model actually achieved on speed, cost, and accuracy for that subtask. If the tradeoff looks good, the engineer deploys it — no approval process required. Shopify’s internal platform, Tangle, lets anyone visualize the pipeline as it runs.
Thawar says his “dream” is to eventually not give the distillation pipeline a target model at all. Instead, users could provide the teacher model with data and evals and the directive: ‘Based on your learnings over time, I want you to look at a different class of model, different sizes, different types, and you tell me what the right distillation target is.’
“Maybe we’ll get surprised. Maybe it’ll be such a small model it could run on a phone,” Thawar says. “Other times, maybe it comes back and says, ‘There isn’t a way to distill this down to anything better than what we have at the frontier.’”
Shopify users can apply whatever harness they want: Claude Code, Codex, Cursor, GitHub Copilot for VS Code. “We expose everyone to the different harnesses so they can get a feel for what may or may not work in their workflow.”
But the company also implemented a usage dashboard; this allows Thawar’s team to ask interesting questions around not just token spend, but: Who’s using the most expensive tokens? Who’s spending more time on reasoning? What types of models are being used, and what disciplines and levels?
Regarding the “tokenmaxxing” question, Shopify does have “circuit breakers” in place. If a user has a model running for a long time (say, 10 hours) and it’s consuming a lot of tokens, they will get pinged, “Did you mean to spend this?”
As Thawar explains, sometimes the reply is “Oh, absolutely.” Other times it’s: ‘Whoa, I didn’t know that was running in the background. I totally forgot about it. I’d rather stop it now.’
The ultimate goal, as Thawar describes it, is to move from “AI reflexivity” to “AI leverage,” and get people to really think deeply about where they can benefit most from AI in their workflows.
Listen to the full podcast to hear more about:
Shopify’s philosophy of building infrastructure before features. As Thawar puts it: “We’ve always built more infra. We will continue to always build more infra.”
How Shopify’s internal AI agent, River, creates a “substrate of information” across the company.
How Thawar’s OpenClaw agent figured out he was traveling from his calendar — and what that moment told him about where agents are actually headed.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
AI agents are increasingly proficient at executing business tasks autonomously, but IT leaders are cautious about granting permissions to access enterprise systems. Part of the challenge lies in how AI reliability is measured. Industry standards often …
Customer expectations have shifted from simple, fast conversational interactions to complex agentic AI-powered tasks that legacy IT architectures simply can’t handle.
To address this, Intuit made the bold decision to overhaul its technical infrastructure for its business platform. The company moved away from its multi-agent setup, which prioritized broad capabilities, to a granular, skill-and-tool-based architecture while embedding human experts directly into the workflow alongside AI. This shift involved decomposing its massive agents into specialized components, separating the brain from the hands, essentially.
“We went from a multi-agent system where we had large agents that did a lot to fully incorporating workflows, skills and tools down to the base level,” said Nhung Ho, VP of AI at Intuit. “We changed the orchestrator, we changed the planner, we changed the brain, and we also changed what everybody had to build across the whole company.”
At VB Transform 2026 on July 14 and 15, Ho will share details about the technology decisions behind building an abstraction layer behind Intuit’s system of intelligence. She’ll also share how the new architecture has allowed the company to decouple its orchestration from specific model providers, allowing Intuit to remain agile and use the best tools for the job, whether from large model providers or their own home-grown tools.
Other VB Transform sessions focused on agentic orchestration include:
From signals to shelves: How Target is engineering Agentic AI for the right product, right place, right time with speaker Siobhan McFeeney, SVP Technology, Target;
The engineer’s multiplier: How Instacart uses agentic AI to eliminate toil, elevate teams and slash costs with speaker Anirban Kundu, CTO, Instacart;
MCP connection isn’t orchestration: Building the agent execution layer with Arnab Bose, chief product officer, Asana;
Building the agentic workforce: A blueprint for scaling AI operations without the sprawl with Romit Jadhwani, Sr. Director, Enterprise AI, Data & Productivity, Rivian and Craig Wiley, VP of AI, Databricks; and
Inside Atlassian’s Living Lab: Deploying context-aware agents at scale with Dr. Molly Sands, head of the Teamwork Lab at Atlassian
Interested in attending VB Transform 2026? Register here. A select number of complimentary passes are also available to senior technology leaders. Contact us to get yours.
Presented by F5
When enterprises move AI workloads from pilot to production, data delivery often becomes the factor that determines whether those systems can scale reliably. Point-to-point architectures connecting storage directly to compute hold up under demonstration conditions, but they often break down under sustained, concurrent production traffic. The result is stalled inference pipelines, delayed RAG systems, underutilized GPUs, and SLA violations, all of which carry direct business consequences.
“Organizations successfully operationalize AI when their infrastructure is built to handle real-world failures, not just controlled conditions,” says Hunter Smit, senior manager of product marketing at F5.
In a pilot, a stalled transfer is an inconvenience, while in production, that same stall is an outage someone now owns. The underlying architecture is often identical in both cases: when a client is wired directly to storage, the system becomes increasingly fragile under sustained, concurrent production traffic because that direct connection has no answer when a node fails or traffic spikes. From there, retries and timeouts cascade, and the entire pipeline backs up right at the moment the business is depending on the output.
“Point-to-point architectures, where the S3 client connects directly to S3 storage, are not resilient,” says Paul Pindell, principal solutions architect for technology alliances at F5. “If a single storage node fails, all traffic to that cluster degrades, and in some cases the cluster can fail entirely.”
The problem is that AI workflows, including RAG-based inference and agentic AI, increasingly treat S3 storage as a first-class citizen in the AI cluster. However, the network connectivity between that storage and the cluster was never designed for the high-throughput, uninterrupted data movement that’s needed to keep GPUs running optimally.
“Enterprise leaders tend to frame AI infrastructure around GPU utilization, but what makes AI different from traditional deterministic workloads is that infrastructure continuously influences those outcomes at every interaction,” says Tanu Mutreja, senior director of product management at F5. “In AI environments, infrastructure is no longer just a back-end concern. It shapes customer experience, quality, resilience, and cost with every transaction.”
There can be significant business consequences. For instance, when inference pipelines stall, it becomes an SLA and customer experience issue. When RAG systems are delayed, models lose access to timely, relevant context, which results in inaccurate, outdated, or hallucinated responses, all of which create operational, compliance, and reputational risks. At the same time, the infrastructure issues that create those problems can also drive up costs by leaving expensive GPU resources idle or underutilized.
“When GPUs are underutilized, it signals infrastructure inefficiencies that inflate costs while limiting scalability and responsiveness,” Mutreja says. “The leadership question is whether the end-to-end AI infrastructure consistently delivers reliable, secure, high-quality, and governed AI experiences at sustainable unit economics.”
F5 treats data delivery as a first-class infrastructure layer rather than assuming the network path will simply work. Where application delivery optimized the flow of requests between users and applications, data delivery optimizes the flow of data between storage, networks, and compute, including AI compute.
Making data delivery a first-class layer means building three properties into it:
Observability provides real-time visibility into latency, throughput, and flow health.
Programmability enables policy-driven control over how data moves, through dynamic routing, traffic optimization, rate management, and automated failover.
Failure-awareness builds resilience for degraded networks, storage throttling, and service disruptions.
In the architecture F5 has developed for Dell ObjectScale, F5 BIG-IP sits between ObjectScale and AI compute as a programmable control point at the storage edge.
“We have seen cases where a misconfiguration in the AI compute layer effectively DDoS’d the S3 storage infrastructure, ” Pindell says. “Not in a malicious way, more of an ‘Oh no, what did I do?’ moment, but it still took storage down for the entire organization.”
Placing BIG-IP as the application delivery controller between the storage and compute layers protects storage with QoS, rate limits, and connection limits, keeping it resilient and operational under that kind of load. SecureIQLab-validated testing confirmed that this protection does not come at the cost of throughput, which matters architecturally, Pindell says.
“Preserving, and even improving, throughput is a must-have,” he explains. “It’s what lets you layer on the higher-level functionality, resilience and enhanced security, without giving up performance to get there.”
AI deployments in hybrid multicloud environments have an even greater data delivery challenge because of the heterogeneity involved. In other words, data traversing these environments must contend with inconsistent policies, security controls, identity systems, governance requirements, fragmented visibility, and distinct failure boundaries.
Programmable traffic management and observability address this complexity together. Observability provides a unified view of application, network, and infrastructure health across otherwise disconnected environments. Programmable traffic management uses those insights to intelligently route, balance, and fail over traffic in real time. Together, they create a closed-loop feedback system that enforces consistent policies, improves resilience across failure domains, and ensures reliable, high-performance AI data delivery regardless of where applications, data, or users reside.
The organizations that move beyond perpetual pilots share a specific engineering discipline, Smit says.
“They’re the ones that reach for production design with failure as the normal state, not the exception,” he explains. “They will assume latency, congestion, and partial outages will happen. And they build a data path observable and failure-aware enough to absorb them, with explicit mitigation for every degraded condition rather than a hope that the network will hold.”
Organizations stuck in perpetual pilots are still optimizing for the perfect lab result and discovering the real-world gap only when a workload goes live. The issue is not model quality or GPU count, but whether the data delivery layer was engineered with the same rigor as the compute.
“Teams need to understand that a real-world network behaves very differently from an optimized lab network,” Pindell says. “They need a mitigation plan for the failure states and performance bottlenecks they will hit in production.”
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.
Last night, the increasingly enterprise-focused AI startup Sakana launched Fugu, a multi-agent orchestration system that delivers frontier-level AI performance through a single, OpenAI-compatible API.
Designed for developers, enterprises, and nations seeking resilience against vendor lock-in and geopolitical export controls, Fugu (Japanese for “pufferfish”), bypasses the traditional monolithic model structure by dynamically routing queries to a swappable pool of specialized AI agents.
Sakana CEO and co-founder David Ha, formerly of Google Brain, positioned Fugu as a more reliable option for enterprise workflows than any single AI model provider in the wake of Anthropic’s move on June 12 to revoke public access to its most powerful models, Claude Mythos 5 and Claude Fable 5, in the wake of a U.S. government export control order. As Ha wrote in a post today on X:
“Fugu dynamically orchestrates the world’s best models to tackle complex tasks. We are proving that a well-orchestrated pool of swappable agents can match restricted frontier models like Fable and Mythos.
But Fugu is about more than just performance. I believe that Orchestration Models are the next frontier, beyond bigger models.
Relying on a single company’s model for national infrastructure is a massive risk. As recent export controls have shown, access to top models can disappear overnight.
Collective intelligence is the practical hedge against this concentration of power. Fugu simply routes around vendor restrictions by relying on an entirely swappable agent pool.”
Sakana AI explicitly states that the specific models Fugu selects and how it coordinates them are proprietary, meaning this routing information is hidden from the user by design. The documentation only refers generally to a “diverse pool of powerful models,” “multiple LLMs,” or “specialized models” without providing a specific count.
By acting as a sophisticated coordinator rather than a standalone foundation model, Fugu matches the output quality of top-tier models like Fable and Mythos on third-party benchmarks of agentic tasks, while fundamentally altering how developers deploy critical AI infrastructure.
At its core, Sakana Fugu operates like a master general contractor. When presented with a complex request, Fugu does not attempt to execute every step itself.
Instead, it breaks the problem down, delegates sub-tasks to a pool of expert foundation models, verifies their work, and synthesizes the final output.
“Fugu is itself an LLM, trained to call various LLMs in an agent pool, including instances of itself recursively,” the Sakana AI team noted in their technical release.
Grounded in two of Sakana’s 2026 research papers, TRINITY and the Conductor, the system autonomously manages the entire lifecycle of model selection and verification using learned coordination strategies rather than hand-designed workflows. To the end user, this multi-agent swarm is entirely abstracted behind a standard API endpoint.
Sakana AI is offering two variants of the system to cater to different operational workloads:
Fugu: A high-speed, low-latency model optimized for everyday tasks. It is designed to act as the default engine for interactive chatbots and integrates directly into coding environments like Codex.
Fugu Ultra: The flagship tier engineered for complex, high-stakes tasks such as AI research, cybersecurity analysis, and multi-step patent investigations. According to Sakana, Fugu Ultra coordinates a deeper pool of experts and matches industry-leading monolithic models across rigorous scientific and reasoning benchmarks.
Additionally, on the pay-as-you-go plan, standard Fugu charges a dynamic rate based on the specific underlying models activated, whereas Fugu Ultra utilizes a fixed pricing structure starting at $5 per million input tokens and $30 per million output tokens.
As indicated by benchmark charts shared by Sakana, Fugu actually exceeds the performance of Anthropic’s Claude Fable 5 on LiveCodeBench, an open source benchmark testing coding performance on regularly refreshed, software problem-solving tasks (Fugu Ultra: 93.2, Fugu: 92.9, Fable: 89.8), and beats the prior Claude Mythos Preview model on GPQA-D (Diamond) , a test of 198 graduate-level multiple-choice questions in biology, physics, and chemistry (Fugu Ultra: 95.5, Fugu: 95.5, Mythos Preview: 94.6).
By orchestrating multiple models from different providers, Fugu essentially builds native redundancy into the AI stack. If one provider suffers an outage or faces sudden regulatory restrictions, Fugu routes around the disruption to maintain uptime.
Fugu is offered as a commercial, proprietary API service, not an open-source framework.
Because Sakana’s core intellectual property lies in its non-obvious collaboration patterns, the specific routing information—meaning exactly which underlying models Fugu selects for a given query—remains proprietary and is intentionally hidden from the user.
However, Sakana offers critical controls for enterprise data compliance. Developers can explicitly opt specific models or providers out of their Fugu routing pool to maintain strict corporate privacy standards.
Additionally, users can opt out of having their prompts used for future training data. Geographically, Fugu is restricted from operating within the European Union (EU) and European Economic Area (EEA) while Sakana works to align its black-box data routing architecture with GDPR regulations.
Fugu is available immediately in most regions—with the temporary exception of the EU and EEA—at subscription tiers and pay-as-you-go pricing.
Teams can opt for monthly subscription allowances designed for individual or hands-on use: a Standard tier at $20/month for lightweight workflows, a Pro tier at $100/month providing 10x standard usage, and a Max tier at $200/month offering 20x usage for continuous, long-running tasks. I wasn’t able to find the actual amount of tokens covered under these plans, but I’ve reached out to Ha on X for more information.
As part of the initial rollout, Sakana is offering a free second month for users who subscribe to any tier by July 31, 2026.
For enterprise scaling and production deployments, Sakana offers an elastic pay-as-you-go plan. Crucially for high-stakes environments, requests made under this consumption-based model are served at a higher priority than those from monthly subscription plans.
Under this framework, the standard Fugu engine charges the single rate of the highest-tier underlying model involved in a query, without ever stacking multi-agent fees. The flagship Fugu Ultra tier (fugu-ultra-20260615) utilizes a fixed pricing structure per one million tokens: $5 for input, $30 for output, and $0.50 for cached input. These rates increase to $10, $45, and $1.00 respectively for extreme workloads utilizing context windows above 272K tokens. That puts it among the more expensive options compared to single AI models via provider APIs:
|
Model |
Input |
Output |
Total Cost |
Source |
|
MiMo-V2.5 Flash |
$0.10 |
$0.30 |
$0.40 |
Xiaomi MiMo |
|
deepseek-v4-flash |
$0.14 |
$0.28 |
$0.42 |
DeepSeek |
|
deepseek-v4-pro |
$0.435 |
$0.87 |
$1.305 |
DeepSeek |
|
MiniMax-M3 |
$0.30 |
$1.20 |
$1.50 |
MiniMax |
|
Gemini 3.1 Flash-Lite |
$0.25 |
$1.50 |
$1.75 |
|
|
Qwen3.7-Plus |
$0.40 |
$1.60 |
$2.00 |
Alibaba Cloud |
|
MiMo-V2.5 |
$0.40 |
$2.00 |
$2.40 |
Xiaomi MiMo |
|
Grok 4.3 (low context) |
$1.25 |
$2.50 |
$3.75 |
xAI |
|
MiMo-V2.5 Pro (≤256K) |
$1.00 |
$3.00 |
$4.00 |
Xiaomi MiMo |
|
Kimi-K2.6 |
$0.95 |
$4.00 |
$4.95 |
Moonshot |
|
GLM-5.2 |
$1.40 |
$4.40 |
$5.80 |
Z.ai |
|
Grok 4.3 (high context) |
$2.50 |
$5.00 |
$7.50 |
xAI |
|
MiMo-V2.5 Pro (>256K) |
$2.00 |
$6.00 |
$8.00 |
Xiaomi MiMo |
|
Qwen3.7-Max |
$2.50 |
$7.50 |
$10.00 |
Alibaba Cloud |
|
Gemini 3.5 Flash |
$1.50 |
$9.00 |
$10.50 |
|
|
Gemini 3.1 Pro Preview (≤200K) |
$2.00 |
$12.00 |
$14.00 |
|
|
GPT-5.4 |
$2.50 |
$15.00 |
$17.50 |
OpenAI |
|
Gemini 3.1 Pro Preview (>200K) |
$4.00 |
$18.00 |
$22.00 |
|
|
Claude Opus 4.8 |
$5.00 |
$25.00 |
$30.00 |
Anthropic |
|
GPT-5.5 |
$5.00 |
$30.00 |
$35.00 |
OpenAI |
|
Sakana Fugu Ultra |
$5.00 |
$30.00 |
$35.00 |
Sakana AI |
|
Claude Fable 5 / Claude Mythos 5 |
$10.00 |
$50.00 |
$60.00 |
Anthropic |
Developers modeling operational costs should also note a significant architectural caveat in how Fugu bills for its multi-agent capabilities. According to the developer documentation, Fugu Ultra’s API responses include detailed usage fields that separate user-visible token generation from internal orchestration work. The background tokens consumed and generated when Fugu delegates sub-tasks, verifies code, or routes between underlying agents are not absorbed by the provider; they represent real token usage and are counted toward the final price of the request at standard rates.
To understand Fugu’s position in the mid-2026 AI ecosystem, it is critical to distinguish between model routing and multi-agent orchestration.
Over the past year, enterprise adoption of standard routing platforms—such as Not Diamond, Martian, and the open-source RouteLLM framework—has skyrocketed. These systems act as intelligent air traffic controllers; using semantic classifiers or meta-models, they analyze an incoming prompt and predict which single foundation model will yield the highest quality or most cost-effective response, dispatching the query accordingly.
Fugu operates on a fundamentally different paradigm. Rather than making a one-shot routing decision, Fugu aligns more closely with complex multi-round systems like Router-R1 (a framework introduced at NeurIPS 2025). It breaks a query down, interleaves reasoning with delegation, and dynamically assigns sub-tasks to multiple models in parallel or sequence before synthesizing a final output.
While frameworks like LangGraph, CrewAI, and Microsoft AutoGen offer developers the tools to build similar multi-agent systems, they require immense manual configuration—defining roles, setting up conditional edges, and managing state across long-running loops.
Fugu abstracts this operational overhead entirely. It is essentially a LangGraph-style workflow packaged as a single, black-box API endpoint.
An orchestration system is ultimately bounded by the raw capabilities of the underlying models in its pool, a reality reflected in Sakana’s own benchmark testing against standalone frontier models.
On rigorous coding and agentic tasks, collective intelligence shows a distinct advantage over standard models. Fugu Ultra posted a 73.7 on SWE-Bench Pro, significantly outperforming Anthropic’s Claude Opus 4.8 (69.2) and OpenAI’s GPT-5.5 (58.6).
However, Fugu is not a silver bullet, and its performance is not a clean sweep across the board. When compared to highly specialized or restricted-access monolithic models, Fugu occasionally trails:
SWE-Bench Pro: While Fugu Ultra (73.7) beat most accessible models, it was comfortably eclipsed by Anthropic’s limited-access Fable 5 (80.0), which is currently absent from Fugu’s swappable pool due to the U.S. government’s export control order and Anthropic’s subsequent response to remove the model entirely from global usage.
Humanity’s Last Exam: Fugu Ultra (50.0) narrowly edged out Opus 4.8 (49.8), but again fell short of Fable 5 (53.3).
Long-Context and Security: On the MRCRv2 long-context-recall test, OpenAI’s GPT-5.5 maintained the lead (94.8 vs Fugu Ultra’s 93.6), and Opus 4.8 remained the top performer on the CTI-REALM cybersecurity benchmark (69.6 vs Fugu Ultra’s 69.4).
The quantitative data points to a clear conclusion: Fugu is highly effective at boosting performance on messy, multi-step tasks (like writing a complex HTML5 game from scratch) by leaning on the combined strengths of multiple mid-tier and high-tier models.
However, for sheer brute-force reasoning within a single, highly constrained domain, the industry’s largest standalone models still hold the edge—provided an enterprise can maintain uninterrupted access to them.
Sakana AI was formed in Tokyo in 2023 by Llion Jones, a co-author of Google’s foundational 2017 “Attention Is All You Need” paper, and David Ha, the former head of research at Stability AI.
Disillusioned by large tech company bureaucracy and the industry’s hyper-fixation on scaling single, massive foundational models, the founders built Sakana around principles of biomimicry and evolutionary computing.
The company’s name, derived from the Japanese word for fish, reflects its core technical thesis: utilizing collective “swarm” intelligence rather than brute-force compute. Following a $2.6 billion Series B valuation in late 2025 and the recent June 2026 launch of Marlin—an autonomous, eight-hour research agent for the B2B sector—Fugu represents the commercialization of Sakana’s multi-agent routing technology for everyday developers.
The developer community has responded to Fugu by rigorously testing its practical tradeoffs, weighing its routing efficiencies against the sheer power of monolithic foundation models.
AI observer, developer and influencer Chris (@ChrissGPT on X) highlighted the specific utility of Fugu over raw foundational AI.
“For a single clean prompt, you probably would [use Fable 5, Mythos, or GPT-5.5 directly],” he noted, but argued that Fugu’s true value emerges in messy, multi-step environments. “…whether it involves delegation, verification, synthesis, code review, research loops, security analysis… the more it would make sense to use this,” he wrote.
Chris also pointed out the strategic geopolitical advantage of Fugu’s architecture, noting that if frontier AI access is abruptly revoked due to regulation or export controls, an orchestrator can dynamically swap models to prevent a total system failure.
Creative agency owner Mark Santos (@markksantos) of Mark Studios provided a direct, real-world comparison by tasking both Fugu Ultra and Claude Opus 4.8 with building a “Crossy Road” game clone using Three.js. The results underscored the operational differences between an orchestrator and a monolithic giant:
Sakana Fugu Ultra: Completed the task in 22 minutes using ~89,000 tokens for roughly $7.32. However, the final game suffered from minor logic errors, such as inverted directional turns and wonky camera angles.
Claude Opus 4.8: Took 79 minutes, burned ~940,000 tokens for nearly $37.85, and got stuck in a retry loop requiring human intervention. Despite the inefficiency, it ultimately produced superior application design and functionality.
Santos concluded the experiment by stating, “In terms of application functionality, quality, and design, Opus won. In terms of model speed and performance, Fugu… won”.
Elie Bakouch, a research engineer at cloud-based, open AI infrastructure and systems provider Prime Intellect, pointed out on X that “to be clear, this is a closed source orchestrator on top of closed source models. if before you didn’t control the models, now you don’t even control which ones are used or how much. this is not ‘AI sovereignty’…”
These early tests and reactions mirror the sentiment summarized by Reddit user GreedyWorking1499 in initial platform discussions: “Until proven otherwise, this is just a highly advanced router/wrapper, not a fundamental not a fundamental leap in intelligence like Mythos/Fable was.“
Yet, as enterprises increasingly demand fail-safes against single-vendor reliance, Sakana is proving that packaging collective intelligence into a single API endpoint is a highly viable commercial path.
Presented by SplunkEvery day, organizations learn things their AI systems never get to use.A security analyst corrects an AI-generated investigation. A network engineer identifies the root cause of a recurring outage. An observability team discovers th…
Not every company can or should build their own frontier AI language model. However, the harness controlling the model is something that most enterprises can and should customize for their specific purposes.
Of course, this is easier said than done. Agent harnesses are still largely tuned through manual, ad hoc debugging — a process that relies heavily on intuition rather than systematic feedback loops, making it difficult to keep pace with rapidly evolving LLMs.
To solve this challenge, researchers at the Shanghai Artificial Intelligence Laboratory have introduced “Self-Harness,” a new paradigm in which an LLM-based agent systematically improves its own operating rules. By examining its own execution traces to apply edits, the system trades manual guesswork for empirical evidence.
Self-improving harnesses can enable development teams to deploy robust custom agents that continually adapt their own execution protocols to overcome model-specific weaknesses.
An LLM-based agent’s performance is not determined solely by its underlying base model, but also by its harness: the surrounding system that provides context and enables the model to interact with the environment. A harness includes components like system prompts, tools, memory, verification rules, runtime policies, orchestration logic, and failure-recovery procedures.
This layer is crucial because many common agent failures stem from the harness rather than the model. For example, an agent may report success without checking the model’s response (e.g., running the code to see if it passes the tests), or it might retry a failed action repeatedly. The harness is also responsible for preventing context rot or overload when the agent’s interaction history grows very large. Examples of popular harnesses include SWE-agent, Claude Code, Codex, and OpenHands.
Harness engineering remains a significant challenge, but the bottleneck isn’t necessarily that humans are too slow or incapable.
In fact, Hangfan Zhang, lead author of the Self-Harness paper, told VentureBeat that “in many cases, an experienced engineer with deep domain knowledge can still propose better changes than an LLM can today.”
Instead, the true bottleneck of manual engineering is that it relies heavily on ad hoc debugging rather than a verifiable, empirical feedback loop. “The deeper issue is that the current harness-engineering paradigm often lacks a systematic feedback loop,” Zhang explained. “Many edits are made based on intuition, a few observed failures, or ad hoc debugging.”
With new models being released at a rapid pace, depending on human intuition to manually tune model-specific harnesses becomes increasingly costly and untenable. While some approaches use stronger models to improve the harnesses of weaker target agents, this dependence on external guidance has its own challenges, as these models may be costly, unavailable for frontier models, or mismatched to the target model’s failure modes.
The Self-Harness paradigm enables an LLM-based agent to improve its own harness without relying on human engineers or stronger external models.
This continuous self-evolution is driven by a three-stage iterative loop that turns behavioral evidence into harness updates:
Weakness mining: Starting from an initial harness, the agent runs a set of tasks, producing execution traces with verifiable outcomes. The agent categorizes failed traces and tries to detect model-specific failure patterns.
Harness proposal: Based on these failure patterns, the agent uses a “proposer” role to generate a set of diverse yet minimal harness modifications, each tied to a specific failure mechanism to avoid overly general corrections.
Proposal validation: The system evaluates candidate modifications through regression tests. An edit is promoted only if it improves performance without causing measurable degradation on held-out tasks. If multiple candidate modifications pass the regression tests, they are merged into the next version of the harness, which then serves as the starting point for the next iteration.
To visualize why an enterprise would need this, imagine an automated issue-fixing agent that reads internal documentation, writes patches, and opens pull requests. If the company updates its documentation style, the agent might suddenly fail, pulling the wrong context or writing bad patches.
On the surface, the agent simply looks broken. But Self-Harness turns this ambiguous failure into a solvable problem. “The failure traces expose where the agent is misusing the new documentation format; the proposer can generate a targeted harness edit… and the evaluator can decide whether that edit improves the failing cases without regressing other cases,” Zhang said.
The researchers evaluated Self-Harness on Terminal-Bench-2.0, a benchmark that tests general tool-based execution, including artifact management, command use, verification behavior, and recovery from execution errors. They applied Self-Harness with MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5.
To isolate the impact of the self-evolving harness, they started with a minimal harness built upon the DeepAgent SDK, containing only the benchmark-facing system prompt, and the default filesystem and shell tools. The model backend, tool set, benchmark environment, and evaluator were kept unchanged while only the harness was allowed to vary.
The quantitative results show that agents improved their performance through automated harness edits. On held-out tasks, performance jumped significantly across the board, ranging from 33 to 60 percent relative improvements for different models.
Importantly, an explicit acceptance rule promotes only those edits that improve performance without introducing unacceptable regressions. What makes Self-Harness powerful for enterprise applications is that it doesn’t simply make the prompt longer or add generic instructions. Instead, it introduces targeted changes that reflect the recurring problems each model encounters during execution.
For example, under the baseline harness, MiniMax M2.5 would get stuck endlessly exploring dataset configurations until the execution environment timed out, failing to produce any deliverables. Through Self-Harness, the system identified this specific flaw and wrote a “loop breaker” into its runtime policy, forcing the agent to stop and redirect its approach after 50 tool calls. It also added a rule to create an initial version of required artifacts as early as possible.
On the other hand, Qwen-3.5 had a habit of hitting a file overwrite error and then blindly retrying the same command repeatedly, eventually deleting necessary files out of confusion before stopping. The self-harness fixed this by introducing a strict command-retry discipline (forbidding exact duplicate commands) and a mechanism that forced the agent to immediately recreate any missing artifacts if a file error occurred.
GLM-5 struggled to preserve environment changes across different commands, and would often waste time on massive downloads or finalize tasks even when sanity checks were failing. Its self-generated harness introduced rules instructing the agent to persist PATH variables across shell sessions, limit external compute, and repair any failed sanity checks before concluding its run.
While Self-Harness automates the tedious work of tracking down idiosyncratic model failures, decision-makers must be realistic about the trade-offs. Replacing human engineering with automated trial-and-error requires significant computational overhead.
“Self-Harness replaces part of the human engineering burden with repeated proposal generation, parallel candidate evaluation, and regression testing,” Zhang said. “That can mean more API tokens, more latency during optimization, and more infrastructure for running evaluation tasks.”
Also, this system relies on the accuracy of its evaluation pipeline. During their experiments on Terminal-Bench-2.0, the researchers relied on strict, deterministic verifiers to ensure the agent’s edits were actually helpful. Without this rigorous ground truth, an automated system risks promoting bad updates. “[The] evaluation system is not an optional component; it is what lets us trade human intuition for empirical evidence,” Zhang said.
This reliance on strict verifiers also dictates where Self-Harness should be deployed. “The best deployment targets today are environments where failures can be measured and where trial-and-error is relatively safe,” Zhang said, pointing to coding, internal workflow automation, and DevOps data pipelines as ideal use cases.
Conversely, enterprises should avoid fully automating harnesses in high-stakes or subjective fields. “The clearest red flags are domains where evaluation is subjective, delayed, non-deterministic, or costly to get wrong, such as medical decision-making, safety-critical infrastructure, or legal decisions.”
The introduction of self-improving agents does not mean coding or enterprise workflows will suddenly become human-free. The quality of collaboration between the human engineer and the AI is still paramount and difficult to capture with automated benchmarks.
Instead, the engineering profession is moving up the abstraction layer. “The role of enterprise engineers will shift from manually patching individual prompts or tool calls toward designing the feedback systems that make agent improvement possible,” Zhang predicted. Moving forward, “the engineer becomes less of a prompt tweaker and more of a feedback architect.”
As foundational models grow more capable, they will naturally absorb many capabilities that currently require manual harness engineering. “But once that happens, the harness will not disappear; its scope will move outward to connect the model to richer external environments,” Zhang said. “Until that boundary moves beyond what humans can evaluate, humans will remain critical providers of feedback.”
Presented by Solidigm
As inference workloads evolve from discrete question-and-answer exchanges into persistent, multi-step agentic systems, GPU availability is no longer the most critical AI bottleneck. Instead, the bottleneck has migrated from compute to context, says Jeff Harthorn, AI applied research lead at Solidigm.
“Why context management has become a primary bottleneck, more than GPU availability or compute efficiency, is the question of 2026,” says Harthorn. “GPUs have gotten dramatically cheaper per FLOP. Model architectures and inference serving engines have all gotten much more efficient. But the thing that’s grown faster than both of those is context. The persistent state that has to live between sessions has grown even faster than context itself.”
It’s happening as context windows grow dramatically, making individual inputs far larger than before. Agentic AI systems chain dozens or hundreds of model calls together, each generating state that must be tracked, and enterprises are requiring that inference state persist across sessions for audit, governance, and reuse. These trends compound each other, pushing context volumes beyond what any existing memory tier was designed to handle.
“Those three things are all happening at the same time, all of which are pushing context data and context memory into the stratosphere much more quickly than we’re used to seeing,” adds Ace Stryker, director of AI and ecosystem marketing at Solidigm.
The solution is a dedicated context tier emerging between GPU memory and bulk network storage: a layer of high-performance, high-density flash designed specifically to hold and serve Key-value (KV) cache, the inference data that allows models to retain and reuse context, and retrieval data at inference speed. Nvidia has formalized this architecture under the term CMX. Storage companies including Solidigm are building SSD products optimized for this workload.
“Storage has not been the first thing folks have thought about when they’ve been planning their enterprise infrastructure buildout,” Stryker says. “In a lot of ways, it was a relatively small cost compared to compute, and it was a commodity. You just shopped around for the lowest dollar per gigabyte and called it good. But now, if your storage is not up to snuff, your ROI suffers, and it directly impacts your bottom line.”
The storage architecture that AI systems rely on today was largely inherited from training workflows. Training is sequential and write-dominated, with data moving in large blocks to and from bulk object storage. The tier structure, with high-bandwidth memory on the GPU, fast NVMe in the server, and bulk storage over the network, serves that use case reasonably well.
However, inference is a different animal. Its I/O signature is fine-grained, latency-sensitive, and increasingly stateful. KV cache data and retrieval data each have distinct access patterns, but both need to be served quickly and reused across interactions. Neither fits cleanly within GPU high-bandwidth memory, which is expensive and physically constrained, nor within traditional bulk storage, which was never designed for active inference workloads.
“The architectural gap that’s interesting to me right now isn’t at the top of the stack or the bottom, it’s right in the middle,” Harthon says. “A lot of what sits below the GPU HBM is being asked to do things it wasn’t really designed for, which is where the most interesting systems work today is happening.”
One of the most visible symptoms of this gap is recomputation. In inference, the pre-fill stage processes all of the context relevant to a given session before token generation can begin. When KV cache state isn’t available in a fast, accessible tier, the system recomputes it — burning GPU cycles that produce no new value.
“A meaningful share of GPU cycles end up going to re-pre-filling,” Harthon explains. “During all of that calculated context, that’s potentially compute that’s being spent reproducing state, rather than doing new work. When you start looking at the problem that way, GPU utilization starts looking like it’s partly a storage problem.”
This reframing is driving renewed interest in a metric borrowed from networking: goodput, or useful tokens per dollar, rather than raw tokens per dollar.
The industry’s response is taking structural form. A new tier is emerging between GPU memory and traditional network storage, designed specifically to hold and serve inference context, a layer distinct from drives inside GPU servers (G3) and storage servers over the network (G4), engineered to serve context data back to accelerators as rapidly as possible.
“If you’re building a data center starting in the second half of this year, or the beginning of next year, you can’t think about storage only living in two places,” Stryker says. “Storage has to live in at least three places to handle the context memory tier, and that’s likely to be a permanent fixture in how the infrastructure gets built going forward.”
It’s analogous to the emergence of object storage as a category, which didn’t exist until enough workloads needed it. And once it did, it developed its own primitives, SLAs, cost models, and an ecosystem of vendors.
“The context tier looks like it might be on a similar arc,” Harthorn says. “That volumetric pressure is causing the category to form, rather than any one vendor’s road map.”
For infrastructure leaders, this means actively planning for the new tier rather than treating it as optional. Deploying additional NAND at this layer reduces dependency on DRAM, which is orders of magnitude more expensive per gigabyte and constrained in both availability and thermal headroom.
“In terms of your investment effectiveness, you’re laying out less cash to do it if you rely on the SSD layer in the way that Nvidia is now recommending and prescribing for a lot of use cases,” Stryker adds.
Participating meaningfully in the inference stack places new demands on SSD technology. Tail latency, the worst-case performance of a drive, must be predictable, not just fast on average. An orchestration system that allocates GPU resources based on expected storage response times cannot tolerate unexpected multi-second delays. Consistent, observable performance matters more here than peak throughput.
Beyond latency, density becomes a critical concern, especially at hyperscale. In data centers where power, not cost, is the binding constraint, watts per petabyte becomes the operative metric. Floating gate NAND, the manufacturing approach at the core of Solidigm’s products, is suited to that calculation. Network integration via NVMe over Fabrics, RDMA, and eventual CXL support is also essential, given the tight latency budgets of active inference pipelines.
“The drives have to have reliable performance characteristics, beyond the throughput side and being able to transfer as much data as possible as fast as possible, the way that training needed,” Harthon says. “Now it’s about being able to do it very consistently, in a way that’s very observable to the people operating and orchestrating these systems.”
The standards, software primitives, and best practices being established now will define how AI inference infrastructure operates for years to come. Solidigm is engaged in that process through standards bodies, partner lab collaborations, and published research, which is critical precisely because the category is still forming.
“The interesting question for the next couple of years isn’t whether AI infrastructure needs more compute,” Harthorn says. “It’s whether it can use what it has more efficiently. A lot of that answer runs through this tier that is being built today.”
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.