OpenAI unveils Workspace Agents, a successor to custom GPTs for enterprises that can plug directly into Slack, Salesforce and more

OpenAI introduced a new paradigm and product today that is likely to have huge implications for enterprises seeking to adopt and control fleets of AI agent workers.

Called “Workspace Agents,” OpenAI’s new offering essentially allows users on its ChatGPT Business ($20 per user per month) and variably priced Enterprise, Edu and Teachers subscription plans to design or select from pre-existing agent templates that can take on work tasks across third-party apps and data sources including Slack, Google Drive, Microsoft apps, Salesforce, Notion, Atlassian Rovo, and other popular enterprise applications.

Put simply: these agents can be created and accessed from ChatGPT, but users can also add them to third-party apps like Slack, communicate with them across disparate channels, and ask them to use information from the channel they’re in and from other third-party tools and apps. The agents will then go off and do work like drafting emails to the entire team or selected members, or pulling data and making presentations.

Human users can trust that the agent will manage all this complexity and complete the task as requested, even if the user who requested it leaves.

It’s the end of “babysitting” agents and the start of letting them go off and get shit done for your business — according to your defined business processes and permissions, of course.

The product experience appears centered on the Agents tab in the ChatGPT sidebar, where teams can discover and manage shared agents.

This functions as a kind of team directory: a place where agents built by coworkers can be reused across a workspace. The broader idea is that AI becomes less of an individual productivity trick and more of a shared organizational resource.

In this sense, OpenAI is targeting one of office work’s oldest pain points: the handoff between people, systems, and steps in a process.

OpenAI says workspace agents will be free for the next two weeks, until May 6, 2026, after which credit-based pricing will begin. The company also says more capabilities are on the way, including new triggers to start work automatically, better dashboards, more ways for agents to take action across business tools, and support for workspace agents in its AI code generation app, Codex.

For more information on how to get started building and using them, OpenAI recommends its online academy page on workspace agents and its help desk documentation.

The Codex backbone

The most significant shift in this announcement is the move away from purely session-based interaction. Workspace agents are powered by Codex — the cloud-based, partially open-source AI coding harness that OpenAI has been aggressively expanding in 2026 — which gives them access to a workspace for files, code, tools, and memory.

OpenAI says the agents can do far more than answer a prompt. They can write or run code, use connected apps, remember what they have learned, and continue work across multiple steps.

That description lines up closely with the capabilities OpenAI shipped into Codex just six days ago, including background computer use, more than 90 new plugins spanning tools like Atlassian Rovo, CircleCI, GitLab, Microsoft Suite, Neon by Databricks, and Render, plus image generation, persistent memory, and the ability to schedule future work and wake up on its own to continue across days or weeks.

Workspace agents inherit that plumbing. When one pulls a Friday metrics report, it is effectively spinning up a Codex cloud session with the right tools attached, running code to fetch and transform data, rendering charts, writing the narrative, and persisting what it learned for next week.

When that same agent is deployed to a Slack channel, it is a Codex instance listening for mentions and threading its work back in.

This is the technical decision enterprise buyers should focus on. Building an agent on a code-execution substrate rather than a pure LLM-call-and-response loop is what gives workspace agents the ability to do real work — transforming a CSV, reconciling two systems of record, generating a chart that is actually correct — rather than describing what the work would look like.

Persistence and scheduling

In earlier AI assistant models, progress paused when the user stopped interacting. Workspace agents change that by running in the cloud and supporting long-running workflows. Teams can also set them to run on a schedule.

That means a recurring reporting agent can pull data on a set cadence, generate charts and summaries, and share the results with a team without anyone manually kicking off the process.

Here at VentureBeat, we analyze story traffic and user return rate on a weekly basis — exactly the kind of recurring, multi-step, multi-source task that could theoretically be automated with a single workspace agent. Any enterprise with a weekly reporting rhythm pulling from dynamic data sources is likely to find a use for these agents.

Agents also retain memory across runs. OpenAI says they can be guided and corrected in conversation, so they improve the more a team uses them.

Over time they start to reflect how a team actually works — its processes, its standards, its preferred ways of handling recurring jobs — which is a meaningfully different proposition from the static instruction-set GPTs that preceded them.

The integrated ecosystem

OpenAI’s claim is that agents should gather information and take action where work already happens, rather than forcing teams into a separate interface. That point becomes clearest in the Slack examples. OpenAI’s launch materials show a product-feedback agent operating inside a channel named #user-insights, answering a question about recent mobile-app feedback with a themed summary pulled from multiple sources.

The company’s demo lineup walks through a sample team directory of agents: Spark for lead qualification and follow-up, Slate for software-request review, Tally for metrics reporting, Scout for product feedback routing, Trove for third-party vendor risk, and Angle for marketing and web content.

OpenAI also shared more functional examples its own teams use internally — a Software Reviewer that checks employee requests against approved-tools policy and files IT tickets; an accounting agent that prepares parts of month-end close including journal entries, balance-sheet reconciliations, and variance analysis, with workpapers containing underlying inputs and control totals for review; and a Slack agent used by the product team that answers employee questions, links relevant documentation, and files tickets when it surfaces a new issue.

In a sense, it is a continuation of the philosophy OpenAI espoused for individuals with last week’s Codex desktop release: the agent joins the workflow where work is already happening, draws in context from the surrounding apps, takes action where permitted, and keeps moving.

From GPTs to a broader agent push

Workspace agents are not a standalone launch. They sit inside a roughly 12-month arc in which OpenAI has been systematically rebuilding ChatGPT, the API, and the developer platform around agents.

Workspace agents are explicitly positioned by OpenAI as an evolution of its custom GPTs, introduced in late 2023, which gave users a way to create customized versions of ChatGPT for particular roles and use cases.

However, OpenAI now says it will deprecate the custom GPT standard for organizations at a yet-to-be-determined future date, and will require Business, Enterprise, Edu and Teachers users to update their GPTs to new workspace agents.

Individuals who have made custom GPTs can continue using them for the foreseeable future, according to our sources at the company.

In October 2025, OpenAI introduced AgentKit, a developer-focused suite that includes Agent Builder, a Connector Registry, and ChatKit for building, deploying, and optimizing agents.

In February 2026, it introduced Frontier, an enterprise platform focused on helping organizations manage AI coworkers with shared business context, execution environments, evaluation, and permissions.

Workspace agents arrive as the no-code, in-product entry point that sits on top of that stack — even if OpenAI does not explicitly describe the architectural relationship in its materials.

The subtext across all three launches is the same: OpenAI has decided that the future of ChatGPT-for-work is fleets of permissioned agents, not single chat windows — and that GPTs, its first attempt at letting businesses customize ChatGPT, were not enough.

Governance and enterprise safeguards

Because workspace agents can act across business systems, OpenAI puts heavy emphasis on governance. Admins can control who is allowed to build, run, and publish agents, and which tools, apps, and actions those agents can reach.

The role-based controls are more granular than the ones most custom-GPT rollouts ever had: admins can toggle, per role, whether members can browse and run agents, whether they can build them, whether they can publish to the workspace directory, and — separately — whether they can publish agents that authenticate using personal credentials.

That last setting is the risky case, and OpenAI explicitly recommends keeping it narrowly scoped.

Authentication itself comes in two flavors, and the choice has real consequences. In end-user account mode, each person who runs the agent authenticates with their own credentials, so the agent only ever sees what that individual is allowed to see.

In agent-owned account mode, the agent uses a single shared connection so users don’t have to authenticate at run time. OpenAI’s documentation strongly recommends service accounts rather than personal accounts for the shared case, and flags the data-exfiltration risk of publishing an agent that authenticates as its creator.

Write actions (sending email, editing a spreadsheet, posting a message, filing a ticket) default to “Always ask,” requiring human approval before the agent executes.

Builders can relax specific actions to “Never ask” or configure a custom approval policy, but the default posture is human-in-the-loop.
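The default described above can be pictured as a small gate in front of every write action. The sketch below is illustrative only: `ApprovalPolicy`, `run_action`, and the action names are hypothetical and are not OpenAI's API. It simply shows that an action with no explicit policy falls back to asking a human.

```python
from enum import Enum
from typing import Callable, Dict


class ApprovalPolicy(Enum):
    ALWAYS_ASK = "always_ask"  # assumed default for write actions
    NEVER_ASK = "never_ask"    # builder has explicitly relaxed this action


def run_action(
    name: str,
    action: Callable[[], str],
    policies: Dict[str, ApprovalPolicy],
    approve: Callable[[str], bool],
) -> str:
    """Run `action` only if its policy allows it or a human approves it."""
    # Any action without an explicit policy falls back to ALWAYS_ASK.
    policy = policies.get(name, ApprovalPolicy.ALWAYS_ASK)
    if policy is ApprovalPolicy.ALWAYS_ASK and not approve(name):
        return f"{name}: blocked (approval denied)"
    return action()


# Hypothetical configuration: only posting a message has been relaxed.
policies = {"post_message": ApprovalPolicy.NEVER_ASK}


def deny_all(name: str) -> bool:
    """Stand-in for a human reviewer who rejects every request."""
    return False


print(run_action("post_message", lambda: "posted", policies, deny_all))
print(run_action("send_email", lambda: "sent", policies, deny_all))
```

In this sketch the relaxed action runs without review, while the email is blocked because nobody approved it.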

OpenAI also claims built-in safeguards against prompt-injection attacks, where malicious content in a document or web page tries to hijack an agent. The claim is welcome but not yet proven in the wild.

For organizations that want deeper visibility, OpenAI says its Compliance API surfaces every agent’s configuration, updates, and run history.

Admins can suspend agents on the fly, and OpenAI says an admin-console view of every agent built across the organization, with usage patterns and connected data sources, is coming soon.

Two caveats worth flagging for security-sensitive buyers: workspace agents are off by default at launch for ChatGPT Enterprise workspaces pending admin enablement, and they are not available at all to Enterprise customers using Enterprise Key Management (EKM).

Analytics and early customer signal

OpenAI also ships an analytics dashboard aimed at helping teams understand how their agents are being used. Screenshots in the launch materials show measures like total runs, unique users, and an activity feed of recent runs, including one by a user named Ethan Rowe completing a run in a #b2b-sales channel.

The mockup detail supports OpenAI’s broader point: the company wants organizations to measure not just whether agents exist, but whether they are being used.

The clearest early-adopter signal in the launch itself comes from Rippling. Ankur Bhatt, who leads AI Engineering at the HR platform, says workspace agents shortened the traditional development cycle enough that a sales consultant was able to build a sales agent without an engineering team. “It researches accounts, summarizes Gong calls, and posts deal briefs directly into the team’s Slack room,” Bhatt says. “What used to take reps 5–6 hours a week now runs automatically in the background on every deal.”

OpenAI’s announcement names SoftBank Corp., Better Mortgage, BBVA, and Hibob as additional early testers.

The era of the digital coworker

Workspace agents do not land in a vacuum. They land in the middle of a broader OpenAI push — through AgentKit, through Frontier, through the Codex overhaul — to make agents more persistent, more connected, and more useful inside real organizational workflows.

They also land in a deeply crowded field: Microsoft Copilot Studio is wired into the Microsoft 365 base, Google is pushing Agentspace, Salesforce has rebuilt itself as agent infrastructure with Agentforce, and Anthropic recently introduced Claude Managed Agents, all different flavors of similar ideas — agents that cut across your apps and tools, take actions on schedules repeatedly as desired, and retain some degree of memory, context, and permissions and policies.

But this launch matters because it turns OpenAI’s strategy into something concrete for the teams already paying for ChatGPT, and because it quietly retires the product those teams were most recently told to standardize on.

If workspace agents live up to the pitch — shared, reusable, scheduled, permissioned coworkers that follow approved processes and keep work moving when their human is offline — it would mark a meaningful change in what workplace software does. Less passive software waiting for input, more active systems helping teams coordinate, execute, and move faster together.

The era of the digital coworker has begun. And, on OpenAI’s plans at least, the era of the custom GPT is ending.

Orchestration

Google and AWS split the AI agent stack between control and execution

The era of enterprises stitching together prompt chains and shadow agents is nearing its end as more options for orchestrating complex multi-agent systems emerge. As organizations move AI agents into production, the question remains: "how will we …

Orchestration

Are you paying an AI ‘swarm tax’? Why single agents often beat complex systems

Enterprise teams building multi-agent AI systems may be paying a compute premium for gains that don’t hold up under equal-budget conditions. New Stanford University research finds that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when both are given the same thinking token budget.

However, multi-agent systems come with the added baggage of computational overhead. Because they typically use longer reasoning traces and multiple interactions, it is often unclear whether their reported gains stem from architectural advantages or simply from consuming more resources.

To isolate the true driver of performance, researchers at Stanford University compared single-agent systems against multi-agent architectures on complex multi-hop reasoning tasks under equal “thinking token” budgets.

Their experiments show that in most cases, single-agent systems match or outperform multi-agent systems when compute is equal. Multi-agent systems gain a competitive edge when a single agent’s context becomes too long or corrupted.

In practice, this means that a single-agent model with an adequate thinking budget can deliver more efficient, reliable, and cost-effective multi-hop reasoning. Engineering teams should reserve multi-agent systems for scenarios where single agents hit a performance ceiling.

Understanding the single versus multi-agent divide

Multi-agent frameworks, such as planner agents, role-playing systems, or debate swarms, break down a problem by having multiple models operate on partial contexts. These components communicate with each other by passing their answers around.

While multi-agent solutions show strong empirical performance, comparing them to single-agent baselines is often an imprecise measurement. Comparisons are heavily confounded by differences in test-time computation. Multi-agent setups require multiple agent interactions and generate longer reasoning traces, meaning they consume significantly more tokens.

Consequently, when a multi-agent system reports higher accuracy, it is difficult to determine if the gains stem from better architecture design or from spending extra compute.

Recent studies show that when the compute budget is fixed, elaborate multi-agent strategies frequently underperform compared to strong single-agent baselines. However, they are mostly very broad comparisons that don’t account for nuances such as different multi-agent architectures or the difference between prompt and reasoning tokens.

“A central point of our paper is that many comparisons between single-agent systems (SAS) and multi-agent systems (MAS) are not apples-to-apples,” paper authors Dat Tran and Douwe Kiela told VentureBeat. “MAS often get more effective test-time computation through extra calls, longer traces, or more coordination steps.”

Revisiting the multi-agent challenge under strict budgets

To create a fair comparison, the Stanford researchers set a strict “thinking token” budget. This metric controls the total number of tokens used exclusively for intermediate reasoning, excluding the initial prompt and the final output.
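One way to picture such a budget is a shared counter that every agent draws from, charged only for intermediate reasoning. The sketch below is an assumption-laden illustration, not the researchers' harness: `ThinkingBudget` is a hypothetical class, and `count_tokens` is a crude whitespace stand-in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class ThinkingBudget:
    limit: int
    spent: int = 0
    log: List[Tuple[str, int]] = field(default_factory=list)

    def charge(self, agent: str, reasoning_text: str) -> bool:
        """Charge intermediate reasoning to the shared budget; refuse if it would overspend."""
        cost = count_tokens(reasoning_text)
        if self.spent + cost > self.limit:
            return False
        self.spent += cost
        self.log.append((agent, cost))
        return True

    @property
    def remaining(self) -> int:
        return self.limit - self.spent


# Prompts and final answers are never passed to charge(), so they stay outside the budget.
budget = ThinkingBudget(limit=10)
budget.charge("planner", "split the question into two hops")
budget.charge("solver", "hop one gives Paris")
print(budget.spent, budget.remaining, budget.log)
```

Running a single agent and a multi-agent pipeline against the same `limit` is what makes the comparison equal-budget in this sketch.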

The study evaluated single- and multi-agent systems on multi-hop reasoning tasks, meaning questions that require connecting multiple pieces of disparate information to reach an answer.

During their experiments, the researchers noticed that single-agent setups sometimes stop their internal reasoning prematurely, leaving available compute budget unspent. To counter this, they introduced a technique called SAS-L (single-agent system with longer thinking).

Rather than jumping to multi-agent orchestration when a model gives up early, the researchers suggest a simple prompt-and-budgeting change.

“The engineering idea is simple,” Tran and Kiela said. “First, restructure the single-agent prompt so the model is explicitly encouraged to spend its available reasoning budget on pre-answer analysis.”

By instructing the model to explicitly identify ambiguities, list candidate interpretations, and test alternatives before committing to a final answer, developers can recover the benefits of collaboration inside a single-agent setup.
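A prompt wrapper along these lines might look like the sketch below. The wording of the instructions and the function name `build_longer_thinking_prompt` are assumptions made for illustration. The exact SAS-L prompt used in the paper is not reproduced here.

```python
def build_longer_thinking_prompt(question: str, thinking_budget: int) -> str:
    """Wrap a question with instructions to spend the budget on pre-answer analysis."""
    # The three steps mirror the behaviors described in the text above.
    steps = [
        "Identify any ambiguities in the question.",
        "List the candidate interpretations.",
        "Test alternatives against the evidence before committing.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"You may use up to {thinking_budget} reasoning tokens before answering.\n"
        f"Before giving a final answer:\n{numbered}\n"
        f"Question: {question}\n"
        "Final answer:"
    )


prompt = build_longer_thinking_prompt(
    "Which river flows through the capital of the country where the author was born?",
    thinking_budget=2000,
)
print(prompt)
```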

The results of their experiments confirm that a single agent is the strongest default architecture for multi-hop reasoning tasks. It produces the most accurate answers while consuming fewer reasoning tokens. When paired with specific models like Google’s Gemini 2.5, the longer-thinking variant produces even better aggregate performance.

The researchers rely on a concept called “Data Processing Inequality” to explain why a single agent outperforms a swarm. Multi-agent frameworks introduce inherent communication bottlenecks. Every time information is summarized and handed off between different agents, there is a risk of data loss.

In contrast, a single agent reasoning within one continuous context avoids this fragmentation. It retains access to the richest available representation of the task and is thus more information-efficient under a fixed budget.
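For readers who want the formal statement, the standard textbook form of the inequality is given below. Mapping its symbols onto agents is an interpretation made here, not a formula quoted from the paper: take $X$ as the original task context, $Y$ as the summary one agent hands off, and $Z$ as anything a downstream agent derives from that summary alone.

```latex
X \to Y \to Z \quad \Longrightarrow \quad I(X; Z) \le I(X; Y)
```

Here $I(\cdot\,;\cdot)$ is mutual information. No processing of $Y$ can recover information about $X$ that the handoff already dropped.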

The authors also note that enterprises often overlook the secondary costs of multi-agent systems.

“What enterprises often underestimate is that orchestration is not free,” they said. “Every additional agent introduces communication overhead, more intermediate text, more opportunities for lossy summarization, and more places for errors to compound.”

On the other hand, they discovered that multi-agent orchestration is superior when a single agent’s environment gets messy. If an enterprise application must handle highly degraded contexts, such as noisy data, long inputs filled with distractors, or corrupted information, a single agent struggles. In these scenarios, the structured filtering, decomposition, and verification of a multi-agent system can recover relevant information more reliably.

The study also warns about hidden evaluation traps that falsely inflate multi-agent performance. Relying purely on API-reported token counts heavily distorts how much computation an architecture is actually spending. The researchers found these accounting artifacts when testing models like Gemini 2.5, proving this is an active issue for enterprise applications today.

“For API models, the situation is trickier because budget accounting can be opaque,” the authors said. To evaluate architectures reliably, they advise developers to “log everything, measure the visible reasoning traces where available, use provider-reported reasoning-token counts when exposed, and treat those numbers cautiously.”

What it means for developers

If a single-agent system matches the performance of multiple agents under equal reasoning budgets, it wins on total cost of ownership by offering fewer model calls, lower latency, and simpler debugging. Tran and Kiela warn that without this baseline, “some enterprises may be paying a large ‘swarm tax’ for architectures whose apparent advantage is really coming from spending more computation rather than reasoning more effectively.”

Another way to look at the decision boundary is not how complex the overall task is, but rather where the exact bottleneck lies.

“If it is mainly reasoning depth, SAS is often enough. If it is context fragmentation or degradation, MAS becomes more defensible,” Tran said.

Engineering teams should stay with a single agent when a task can be handled within one coherent context window. Multi-agent systems become necessary when an application handles highly degraded contexts.

Looking ahead, multi-agent frameworks will not disappear, but their role will evolve as frontier models improve their internal reasoning capabilities.

“The main takeaway from our paper is that multi-agent structure should be treated as a targeted engineering choice for specific bottlenecks, not as a default assumption that more agents automatically means better intelligence,” Tran said.

Orchestration

Google doesn’t pay the Nvidia tax. Its new TPUs explain why.

Every frontier AI lab right now is rationing two things: electricity and compute. Most of them buy their compute for model training from the same supplier, at the steep gross margins that have turned Nvidia into one of the most valuable companies in the world. Google does not.

On Tuesday night, inside a private gathering at F1 Plaza in Las Vegas, Google previewed its eighth-generation Tensor Processing Units. The pitch: two custom silicon designs shipping later this year, each purpose-built for a different half of the modern AI workload. TPU 8t targets training for frontier models, and TPU 8i targets the low-latency, memory-hungry world of agentic inference and real-time sampling.

Amin Vahdat, Google’s SVP and chief technologist for AI and infrastructure, used his time onstage to make a point that matters more to enterprise buyers than any individual spec: Google designs every layer of its AI stack end-to-end, and that vertical integration is starting to show up in cost-per-token economics that Google says its rivals cannot match.

“One chip a year wasn’t enough”: Inside Google’s 2024 bet on a two-chip roadmap

The more interesting story behind TPU 8t and TPU 8i is when the decision to split the roadmap was made. The call came in 2024, according to Vahdat, a year before the industry at large pivoted to reasoning models, agents and reinforcement learning as the dominant frontier workload.

At the time, it was a contrarian read. “We realized two years ago that one chip a year wouldn’t be enough,” Vahdat said during the fireside. “This is our first shot at actually going with two super high-powered specialized chips.”

For enterprise buyers, the implication is concrete. Customers running fine-tuning or large-scale training on Google Cloud and customers serving production agents on Vertex AI have been renting the same accelerators and eating the inefficiency. V8 is the first generation where the silicon itself treats those as different problems with two sets of chips.

TPU 8t: A training fabric that scales to a million chips

On paper, TPU 8t is an aggressive generational step. According to Google, 8t delivers 2.8x the FP4 EFlops per pod (121 vs 42.5) against Ironwood, the seventh-generation TPU that shipped in 2025, doubles bidirectional scale-up bandwidth to 19.2 Tb/s per chip, and quadruples scale-out networking to 400 Gb/s per chip. Pod size grows modestly from 9,216 to 9,600 chips, held together by Google’s 3D Torus topology.

The number that matters most to IT leaders evaluating where to run frontier-scale training: 8t clusters (Superpods) can scale beyond 1 million TPU chips in a single training job via a new interconnect Google is calling Virgo networking.

8t also introduces TPU Direct Storage, which moves data from Google’s managed storage tier directly into HBM without the usual CPU-mediated hops. For long training runs where wall-clock time is the cost driver, collapsing that data path reduces the number of pod-hours needed to finish each epoch.

TPU 8i and Boardfly: Re-engineering the network for agents

If 8t is an evolutionary step, TPU 8i is the more architecturally interesting chip. It is also where the story for IT buyers gets most compelling.

The year-over-year spec jumps are, as Vahdat put it, “stunning.” According to Google, 8i delivers 9.8x the FP8 EFlops per pod (11.6 vs 1.2), 6.8x the HBM capacity per pod (331.8 TB vs 49.2), and a pod size that grows 4.5x from 256 to 1,152 chips.
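The generational multiples can be sanity-checked from the per-pod figures quoted in this article. The snippet below is only arithmetic on those rounded numbers. The dictionary labels are descriptive names chosen here, not Google terminology.

```python
# Figures as quoted from Google in the text: (new generation, previous generation).
specs = {
    "8t FP4 EFlops per pod": (121, 42.5),
    "8t chips per pod": (9600, 9216),
    "8i FP8 EFlops per pod": (11.6, 1.2),
    "8i HBM TB per pod": (331.8, 49.2),
    "8i chips per pod": (1152, 256),
}

# Ratio of new to previous generation for each quoted figure.
ratios = {name: new / old for name, (new, old) in specs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The rounded inputs give roughly 2.85x for 8t FP4 compute and exactly 4.5x for 8i pod size, matching the quoted multiples. For 8i they give about 9.67x FP8 compute and 6.74x HBM, slightly below the quoted 9.8x and 6.8x. The likely explanation, though it is an assumption, is that Google computed its multiples from unrounded values.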

What drove those numbers is a rethink of the network itself. Vahdat explained the insight directly: Google’s default way of connecting chips together favored bandwidth over latency. It was good for moving large amounts of data through, but not built to minimize the time it takes a response to get back. That profile works for training. For agents, it does not. In partnership with Google DeepMind, the TPU team built what Google calls Boardfly topology specifically to reduce the network diameter, shrinking the number of hops between any two chips in a pod. Paired with a Collective Acceleration Engine and what Google describes as very large on-chip SRAM, 8i delivers a claimed 5x improvement in latency for real-time LLM sampling and reinforcement learning.

The vertical-integration moat: Why Google doesn’t pay the “Nvidia tax”

The subtext across Vahdat’s presentation was a six-layer diagram Google calls its AI stack: energy at the foundation, then data center land and enclosures, AI infrastructure hardware, AI infrastructure software, models (Gemini 3), and services on top. Vahdat noted that designing each layer in isolation forces you to the least common denominator for each layer. Google designs them together.

This is where the competitive story for IT buyers and analysts crystallizes. OpenAI, Anthropic, xAI and Meta all depend heavily on Nvidia silicon to train their frontier models. Every H200 and Blackwell GPU they buy carries Nvidia’s data-center gross margin — the informal “Nvidia tax” that industry analysts have flagged for two years running as a structural cost disadvantage for anyone renting rather than designing. Google pays fab, packaging and engineering costs on its TPUs. It does not pay that margin.

What v8 means for the compute race: A new evaluation checklist for IT leaders

For procurement and infrastructure teams, TPUv8 reframes the 2026–2027 cloud evaluation in concrete ways.

Teams training large proprietary models should look at 8t availability windows, Virgo networking access, and goodput SLAs — not just headline EFlops. Teams serving agents or reasoning workloads should evaluate 8i availability on Vertex AI, independent latency benchmarks as they emerge, and whether HBM-per-pod sizing fits their context windows. Teams consuming Gemini through Gemini Enterprise should inherit the 8i lift and should expect the ceiling on what they can deploy in production to rise meaningfully through 2026.

The caveats are real. General availability is still “later in 2026.” The v8 is a roadmap signal, not a procurement decision today. Google’s benchmarks are self-reported; undoubtedly independent numbers will come from early cloud customers and third-party evaluators over the next two quarters. And portability between JAX/XLA and the CUDA/PyTorch ecosystem remains a friction cost worth thinking about when negotiating any multi-year commitment.

Looking further out, Vahdat made two predictions worth noting. First, general-purpose CPUs will see a resurgence inside AI systems, not as accelerators but as orchestration compute for agent sandboxes, virtual machines and tool execution. Second, framed explicitly as an industry prediction rather than a Google roadmap preview, specialization also keeps going strong. As general-purpose CPU gains plateau at a few percent a year, workloads that matter will demand purpose-built silicon. “Two chips might become more,” Vahdat said, without specifying whether the “more” would mean future TPU variants or other classes of specialized accelerators.

The frontier compute race used to be a question of who could buy the most H100s. It is now a question of who controls the stack. The shortlist of companies that genuinely do is, for the moment, two: Google and Nvidia.

Orchestration

Salesforce’s Agentforce Vibes 2.0 targets a hidden failure: context overload in AI agents

When startup fundraising platform VentureCrowd began deploying AI coding agents, they saw the same gains as other enterprises: they cut the front-end development cycle by 90% in some projects.

However, it didn’t come easy or without a lot of trial and error.

VentureCrowd’s first challenge revolved around data and context quality. Diego Mogollon, chief product officer at VentureCrowd, told VentureBeat that “agents reason against whatever data they can access at runtime” and would then be confidently “wrong” because they base their knowledge only on the context given to them.

Their other roadblock, shared by many other companies, was messy data and unclear processes. As with poor context, Mogollon said, coding agents would amplify bad data, so the company had to build a well-structured codebase first.

“The challenges are rarely about the coding agents themselves; they are about everything around them,” said Mogollon. “It’s a context problem disguised as an AI problem, and it is the number one failure mode I see across agentic implementations.”

Mogollon said VentureCrowd encountered several roadblocks in overhauling its software development.

VentureCrowd’s experience illustrates a broader issue in AI agent development. The models are not failing the agents; rather, the agents become overwhelmed by too much context and too many tools at once.

Too much context

This comes from a phenomenon called context bloat: as workflows become more complex, AI systems accumulate more and more data, tools or instructions.

The problem arises because agents need context to work well, but too much of it creates noise. The more context an agent has to sift through, the more tokens it uses, the slower the work and the higher the costs.

One way to curb context bloat is through context engineering. Context engineering helps agents understand code changes or pull requests and align them with their tasks.

However, context engineering often becomes an external task rather than built into the coding platforms enterprises use to build their agents.
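One simple form of context engineering is to cap what an agent sees at a fixed token budget and keep only the most relevant snippets. The sketch below illustrates that idea under stated assumptions: `select_context` is a hypothetical helper, the relevance scores are supplied by hand, and `count_tokens` is a crude whitespace stand-in for a real tokenizer. It does not describe how any particular platform works.

```python
from typing import List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def select_context(items: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Greedily keep the highest-scoring snippets that still fit in the budget."""
    kept: List[str] = []
    used = 0
    # Visit snippets from most to least relevant.
    for score, text in sorted(items, key=lambda item: item[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= token_budget:
            kept.append(text)
            used += cost
    return kept


# Hypothetical snippets as (relevance score, text) pairs.
items = [
    (0.9, "schema for the Investor object"),
    (0.2, "full changelog from the last three years"),
    (0.7, "open pull request touching the signup flow"),
]
print(select_context(items, token_budget=12))
```

With a 12-token budget, the two most relevant snippets fit and the low-scoring changelog is left out.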

How coding agent providers respond

VentureCrowd relied on one solution in particular to help it overcome the issues with context bloat plaguing its enterprise AI agent deployment: Salesforce’s Agentforce Vibes, a coding platform that lives within Salesforce and is available for all plans starting with the free one.

Salesforce recently updated Agentforce Vibes to version 2.0, expanding support for third-party frameworks like ReAct. Most important for companies like VentureCrowd, Agentforce Vibes added Abilities and Skills, which they can use to direct agent behavior.

“For context, our entire platform, frontend and backend, runs on the Salesforce ecosystem. So when Agentforce Vibes launched, it slotted naturally into an environment we already knew well,” Mogollon said.

Salesforce’s approach doesn’t minimize the context agents use; rather, it helps enterprises ensure that context stays within their data models or codebases. Agentforce Vibes adds a further layer of execution control through the new Abilities and Skills feature. Abilities define what agents want to accomplish, and Skills are the tools they will use to get there.

Other coding agent platforms manage context differently. For example, Claude Code and OpenAI’s Codex focus on autonomous execution, continuously reading files, running commands and expanding context as tasks evolve. Claude Code has a context indicator that compacts context when it becomes too large.
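The general idea of compaction can be sketched in a few lines. This is not how Claude Code or Codex implement it; the `summarize` placeholder, the token estimate and the `keep_recent` parameter are all assumptions chosen to show the shape of the technique, in which older messages are collapsed into a summary once the history crosses a threshold.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about four characters per token.
    return max(1, len(text) // 4)


def summarize(messages: list[str]) -> str:
    # Placeholder: a real system would ask a model to write the summary.
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[str], max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Collapse older messages into one summary once the history is too large."""
    total = sum(estimate_tokens(message) for message in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = ["a" * 400, "b" * 400, "c" * 400, "d" * 40]  # about 310 tokens in total
compacted = compact(history, max_tokens=200)
```

Note that compaction of this kind manages a growing context rather than capping what goes into it in the first place.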

Across these different approaches, the consistent pattern is that most systems are built to manage growing context for agents, not necessarily to limit it. Context keeps growing, especially as workflows become more complex, making it more difficult for enterprises to control costs, latency and reliability.

Mogollon said his company chose Agentforce Vibes not only because a large portion of their data already lives on Salesforce, making it easier to integrate, but also because it would allow them to control more of the context they feed their agents.

What builders should know

There’s no single way to address context bloat, but the pattern is now clear: more context doesn’t always mean better results.

Along with investing in context engineering, enterprises have to experiment with the context constraint approach they are most comfortable with. For enterprises, that means the challenge isn’t just giving agents more information—it’s deciding what to leave out.

Orchestration

Google’s new Deep Research and Deep Research Max agents can search the web and your private data

Google on Monday unveiled the most significant upgrade to its autonomous research agent capabilities since the product’s debut, launching two new agents — Deep Research and Deep Research Max — that for the first time allow developers to fuse open web data with proprietary enterprise information through a single API call, produce native charts and infographics inside research reports, and connect to arbitrary third-party data sources through the Model Context Protocol (MCP).

The release, built on Google’s Gemini 3.1 Pro model, marks an inflection point in the rapidly intensifying race to build AI systems that can autonomously conduct the kind of exhaustive, multi-source research that has traditionally consumed hours or days of human analyst time. It also represents Google’s clearest bid yet to position its AI infrastructure as the backbone for enterprise research workflows in finance, life sciences, and market intelligence — industries where the stakes of getting information wrong are extraordinarily high.

“We are launching two powerful updates to Deep Research in the Gemini API, now with better quality, MCP support, and native chart/infographics generation,” Google CEO Sundar Pichai wrote on X. “Use Deep Research when you want speed and efficiency, and use Max when you want the highest quality context gathering & synthesis using extended test-time compute — achieving 93.3% on DeepSearchQA and 54.6% on HLE.”

Both agents are available starting today in public preview via paid tiers of the Gemini API, accessible through the Interactions API that Google first introduced in December 2025.

Why Google built two research agents instead of one

The launch introduces a tiered architecture that reflects a fundamental tension in AI agent design: the tradeoff between speed and thoroughness.

Deep Research, the standard tier, replaces the preview agent Google released in December and is optimized for low-latency, interactive use cases. It delivers what Google describes as significantly reduced latency and cost at higher quality levels compared to its predecessor. The company positions it as ideal for applications where a developer wants to embed research capabilities directly into a user-facing interface — think a financial dashboard that can answer complex analytical questions in near-real time.

Deep Research Max occupies the opposite end of the spectrum. It leverages extended test-time compute — a technique where the model spends more computational cycles iteratively reasoning, searching, and refining its output before delivering a final report. Google designed it for asynchronous, background workflows: the kind of task where an analyst team kicks off a batch of due diligence reports before leaving the office and expects exhaustive, fully sourced analyses waiting for them the next morning.

The Google DeepMind team framed the distinction on X: “Deep Research: Optimized for speed and efficiency. Perfect for interactive apps needing quicker responses. Deep Research Max: It uses extra time to search and reason. Ideal for exhaustive context gathering and tasks happening in the background.”

“Deep Research was our first hosted agent in the API and has gained a ton of traction over the last 3 months, very excited for folks to test out the new agents and all the improvements, this is just the start of our agents journey,” Logan Kilpatrick, who leads developer relations for Google’s AI efforts, wrote on X.

MCP support lets the agents tap into private enterprise data for the first time

Perhaps the most consequential feature in today’s release is the addition of Model Context Protocol support, which transforms Deep Research from a sophisticated web research tool into something more closely resembling a universal data analyst.

MCP, an emerging open standard for connecting AI models to external data sources, allows Deep Research to securely query private databases, internal document repositories, and specialized third-party data services, all without requiring sensitive information to leave its source environment. In practical terms, this means a hedge fund could point Deep Research at its internal deal-flow database and a financial data terminal simultaneously, then ask the agent to synthesize insights from both alongside publicly available information from the web.

Google disclosed that it is actively collaborating with FactSet, S&P, and PitchBook on their MCP server designs, a signal that the company is pursuing deep integration with the data providers that Wall Street and the broader financial services industry already rely on daily. The goal, according to the blog post authored by Google DeepMind product managers Lukas Haas and Srinivas Tadepalli, is to “let shared customers integrate financial data offerings into workflows powered by Deep Research, and to enable them to realize a leap in productivity by gathering context using their exhaustive data universes at lightning speed.”

This addresses one of the most persistent pain points in enterprise AI adoption: the gap between what a model can find on the open internet and what an organization actually needs to make decisions. Until now, bridging that gap required significant custom engineering. MCP support, combined with Deep Research’s autonomous browsing and reasoning capabilities, collapses much of that complexity into a configuration step. Developers can now run Deep Research with Google Search, remote MCP servers, URL Context, Code Execution, and File Search simultaneously — or turn off web access entirely to search exclusively over custom data. The system also accepts multimodal inputs including PDFs, CSVs, images, audio, and video as grounding context.

Native charts and infographics turn AI reports into stakeholder-ready deliverables

The second headline feature — native chart and infographic generation — may sound incremental, but it addresses a practical limitation that has constrained the usefulness of AI-generated research outputs in professional settings.

Previous versions of Deep Research produced text-only reports. Users who needed visualizations had to export the data and build charts themselves, a friction point that undermined the promise of end-to-end automation. The new agents generate high-quality charts and infographics inline within their reports, rendered in HTML or generated with Google’s Nano Banana image model, dynamically visualizing complex datasets as part of the analytical narrative.

“The agent generates HTML charts and infographics inline with the report. Not screenshots. Not suggestions to ‘visualize this data.’ Actual rendered charts inside the markdown output,” noted AI commentator Shruti Mishra on X, capturing the practical significance of the change.

For enterprise users — particularly those in finance and consulting who need to produce stakeholder-ready deliverables — this transforms Deep Research from a tool that accelerates the research phase into one that can potentially produce near-final analytical products. Combined with a new collaborative planning feature that lets users review, guide, and refine the agent’s research plan before execution, and real-time streaming of intermediate reasoning steps, the system gives developers granular control over the investigation’s scope while maintaining the transparency that regulated industries demand.

How Deep Research evolved from a consumer chatbot feature to enterprise platform infrastructure

Today’s release crystallizes a strategic narrative Google has been building for months: Deep Research is not merely a consumer feature but a piece of infrastructure that powers multiple Google products and is now being offered to external developers as a platform.

The blog post explicitly notes that when developers build with the Deep Research agent, they tap into “the same autonomous research infrastructure that powers research capabilities within some of Google’s most popular products like Gemini App, NotebookLM, Google Search and Google Finance.” This suggests that the agent available through the API is not a stripped-down version of what Google uses internally but the same system, offered at platform scale.

The journey to this point has been remarkably rapid. Google first introduced Deep Research as a consumer feature in the Gemini app in December 2024, initially powered by Gemini 1.5 Pro. At the time, the company described it as a personal AI research assistant that could save users hours by synthesizing web information in minutes. By March 2025, Google upgraded Deep Research with Gemini 2.0 Flash Thinking Experimental and made it available for anyone to try. Then came the upgrade to Gemini 2.5 Pro Experimental, where Google reported that raters preferred its reports over competing deep research providers by more than a 2-to-1 margin. The December 2025 release was the pivot to developer access, when Google launched the Interactions API and made Deep Research available programmatically for the first time, powered by Gemini 3 Pro and accompanied by the open-source DeepSearchQA benchmark.

The underlying model driving today’s improvements is Gemini 3.1 Pro, which Google released on February 19, 2026. That model represented a significant leap in core reasoning: on ARC-AGI-2, a benchmark evaluating a model’s ability to solve novel logic patterns, 3.1 Pro scored 77.1% — more than double the performance of Gemini 3 Pro. Deep Research Max inherits that reasoning foundation and layers autonomous research behaviors on top of it, achieving 93.3% on DeepSearchQA (up from 66.1% in December) and 54.6% on Humanity’s Last Exam (up from 46.4%).
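Benchmark gains like these can be read two ways, as percentage points or as relative improvement, and the two give quite different impressions. The short calculation below works out both from the figures quoted above; the `gains` helper is just an illustrative name.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to one decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


deepsearchqa = gains(66.1, 93.3)  # December score vs. Deep Research Max
hle = gains(46.4, 54.6)           # Humanity's Last Exam, same comparison
```

On these numbers, DeepSearchQA rose 27.2 points (about 41% in relative terms) and Humanity's Last Exam rose 8.2 points (about 18% in relative terms).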

Google faces a crowded field of competitors building autonomous research agents

Google is not operating in a vacuum. The launch arrives amid intensifying competition in the autonomous research agent space. OpenAI has been developing its own agent capabilities within ChatGPT under the codename Hermes, which includes an agent builder, templates, scheduling, and Slack integration, according to reports circulating on social media. Perplexity has built its business around AI-powered research. And a growing ecosystem of startups is attacking various slices of the automated research workflow.

What distinguishes Google’s approach is the combination of its search infrastructure — which gives Deep Research access to the broadest and most current index of web information available — with the MCP-based connectivity to enterprise data sources. No other company currently offers a research agent that can simultaneously query the open web at Google Search’s scale and navigate proprietary data repositories through a standardized protocol. The pricing structure also signals Google’s intent to drive adoption: according to Sim.ai, which tracks model pricing, the Deep Research agent in the December preview was priced at $2 per million input tokens and $2 per million output tokens with a 1 million token context window — positioning it as cost-competitive for the volume of research output it generates.
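For a sense of scale, the December preview rates cited above translate into per-run costs as follows. This is a back-of-the-envelope sketch: the token counts in the example are hypothetical, and pricing for the new agents may differ from the preview figures.

```python
INPUT_PRICE_PER_MILLION = 2.00   # USD, December preview rate cited above
OUTPUT_PRICE_PER_MILLION = 2.00  # USD, December preview rate cited above


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one research run at the preview rates."""
    cost = (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    )
    return round(cost, 4)


# Hypothetical run: 800,000 tokens read while researching, 50,000 tokens of report.
example = run_cost(input_tokens=800_000, output_tokens=50_000)
```

At those rates, the hypothetical run above comes to $1.70.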

Not everyone greeted the announcement with unalloyed enthusiasm, however. Several users on X noted that the new agents are available only through the API, not in the Gemini consumer app. “Not on Gemini app,” observed TestingCatalog News, while another user wrote, “Google keeps punishing Gemini App Pro subscribers for some reason.” Others raised concerns about the presentation of benchmark results, with one user arguing that Google’s charts could be “misleading” in how they represent percentage improvements. These complaints point to a broader tension in Google’s AI strategy: the company is increasingly directing its most advanced capabilities toward developers and enterprise customers who access them through APIs, while consumer-facing products sometimes lag behind.

What Deep Research Max means for finance, biotech, and the future of knowledge work

The practical implications of today’s launch are most immediately felt in industries that depend on exhaustive, multi-source research as a core business function. In financial services, where analysts routinely spend hours assembling due diligence reports from scattered sources — SEC filings, earnings transcripts, market data terminals, internal deal memos — Deep Research Max offers the possibility of automating the initial research phase entirely. The FactSet, S&P, and PitchBook partnerships suggest Google is serious about making this work with the data infrastructure that financial professionals already use.

In life sciences, the blog post notes that Google has collaborated with Axiom Bio, which builds AI systems to predict drug toxicity, and found that Deep Research unlocked new levels of initial research depth across biomedical literature. In market research and consulting, the ability to produce stakeholder-ready reports with embedded visualizations and granular citations could compress project timelines from days to hours.

The key question is whether the quality and reliability of these automated outputs will meet the standards that professionals in these fields demand. Google’s benchmark numbers are impressive, but benchmarks measure performance on standardized tasks — real-world research is messier, more ambiguous, and often requires the kind of judgment that remains difficult to automate. Deep Research and Deep Research Max are available now in public preview via paid tiers of the Gemini API, with availability on Google Cloud for startups and enterprises coming soon.

Eighteen months ago, Deep Research was a feature that helped grad students avoid drowning in browser tabs. Today, Google is betting it can replace the first shift at an investment bank. The distance between those two ambitions — and whether the technology can actually close it — will define whether autonomous research agents become a transformative category of enterprise software or just another AI demo that dazzles on benchmarks and disappoints in the conference room.

Orchestration, technology

The AI governance mirage: Why 72% of enterprises don’t have the control and security they think they do

Decision makers at 72% of organizations claim to have two or more AI platforms that they identify as their “primary” layer, according to a survey of 40 enterprise companies conducted by VentureBeat last month, revealing real gaps in security and control.

For enterprise management and technical leaders, and especially security leaders, these multiple AI platforms extend the attack surfaces of most enterprises at a time when AI-driven attacks have become increasingly potent.

The multiple platforms (which include offerings from hyperscalers or AI labs like Microsoft Azure, Google, OpenAI or Anthropic, or big application companies like Epic, Workday or ServiceNow) reflect a state of sprawl that has emerged as these big software providers rush to offer their own AI to their enterprise customers.

Those customers, in their own rush to scale AI, are finding they aren’t building a singular strategy — in fact they may be building a collection of contradictions.

The strategic paradox: why leading enterprises are building around their vendors

For example, take the strategic paradox faced by the Mass General Brigham (MGB) hospital system, which has 90,000 employees and is the largest employer in Massachusetts. The hospital system last year had to shut down an uncontrolled number of internal proofs of concept that had sprouted up as employees got carried away with AI projects, said CTO Nallan “Sri” Sriraman at the VentureBeat AI Impact event in Boston on March 26, which focused on the challenges of scaling AI.

Instead, the company decided it was better to wait for the software giants it already uses to deliver on their AI roadmaps. Since these companies have so many resources, and were making AI a top priority themselves, it made no sense for MGB to try to build its own AI layer that would be duplicative, he said. “Why are we building it ourselves?” he asked. “Leverage it.”

Yet even then, Sriraman’s team has been forced to build workarounds where those companies haven’t done enough.

For example, MGB has just completed a “full-scaled” custom build around Microsoft’s Copilot that delivers essentially everything offered by that tool, putting a “skin” around Copilot to handle the safety and data privacy concerns the major model providers haven’t yet mastered. Specifically, MGB needed a way for employees to prompt the AI and not have their protected health information (PHI) leaked back to the Copilot LLM provider, OpenAI. The new secure platform, which can support up to 30,000 users, is really the ultimate contradiction: Even though the company has a mandate to leverage the AI provided by the bigger companies, it needs to build around their failures.

The contradiction goes even further. These software vendors used by MGB — which also include Epic, Workday and ServiceNow — are all now building agents for their AI, all operating differently. So MGB has to invest in building a “control plane that coordinates and orchestrates all of these agents,” Sriraman said. “That’s where our investment is going to be.”

He noted that companies like his are “discovering and experimenting as the landscape keeps shifting.” The marketplace is “still nascent,” he said, which makes decisions difficult.

The “six blind men” problem

Sriraman explained the current vendor landscape with an analogy. “When you ask six blind men to touch an elephant and say, what does this elephant look like?” he said. “You’re gonna get six different answers.”

What emerges from the research VentureBeat conducted in the first quarter, along with conversations like the one in Boston, is a situation that we at VentureBeat are calling a “governance mirage.” While many enterprises say they have adequate governance, in reality they haven’t created clear accountability or specific guardrails, evaluations or security processes to ensure that governance.

The data of disconnect: confidence vs. systematic oversight

The research comes from surveys across January, February and March by VentureBeat of enterprise companies with 100 or more employees, with 40 to 70 qualified respondents per topic area — covering agentic orchestration, AI security, RAG and governance. The data lacks statistical significance in many areas and should be treated as directional.

The research on governance found that a majority, or 56%, of respondents said they are “very confident” that they’d detect a misbehaving AI model, suggesting that most decision-makers believe they have sufficient basic governance at their companies.

However, nearly a third of respondents have no systematic mechanism to detect AI misbehavior until it surfaces through users or audits. In a world where telemetry leakage accounts for 34% of GenAI incidents (Wiz), and the global average breach cost has hit $4.4M (IBM 2025 Cost of a Data Breach), finding out after the damage is done is the default for too many companies.

Moreover, 43% of respondents say a central team owns AI governance. That sounds reassuring — until you look at what’s happening everywhere else. Twenty-three percent say governance is unclear or actively contested between teams. Twenty percent say each platform team governs independently. Six percent say no one has formally addressed it. The rest said they were unsure who owned it.

More telling is the barrier data. When asked about the single biggest obstacle to governing AI across platforms, “no single owner or accountable team” ranked second at 29% — just behind vendor opacity. Accountability structure and lack of vendor transparency are the two dominant failure modes, and they compound each other: Without a central owner, no one has the mandate to demand transparency from the vendors.

The day-two bill: managing sprawl, creep, and lock-in

The scaling trap: Red Hat’s warning

Brian Gracely, Senior Director at Red Hat, who also spoke at the VentureBeat Boston event last month, addressed the infrastructure side of this sprawl, warning that many enterprises are falling into a trap of deceptive initial wins.

Gracely noted that the barrier to entry is almost nonexistent at the start, with nearly anyone able to spin up a project using a credit card and an API key. “Day zero is very, very easy,” Gracely said. “Day two is when the bill comes due.”

Red Hat is positioning its software layer (OpenShift AI) as the necessary buffer to prevent enterprises from getting buried in a single provider’s proprietary ecosystem. Gracely’s point is direct: If your control system is built entirely inside one cloud provider’s toolset, you are effectively “renting a cage.” The illusion of speed in the early pilot phase often hides a technical debt that becomes obvious the moment you try to move your AI work to a different platform.

Gracely illustrated this with a recent example. A senior leader from Red Hat’s centralized CTO office spent part of her vacation contributing to an open-source agent project called OpenClaw, which became widely popular in the first quarter. Within days of her name appearing as a project maintainer, Red Hat was fielding calls from major New York banks. Their problem was immediate: They realized they already had upwards of 10,000 employees bringing “claws” — agent-based tools — into their infrastructure with zero centralized oversight.

Breaches caused by employees working on these sorts of unapproved technologies are costly. These so-called “shadow AI” incidents cost on average $670K more than standard incidents, according to IBM.

Red Hat’s Gracely noted that while organizations can try to shut down these unapproved ports, they eventually have to figure out how to make them productive and secure — a task that requires a serious investment in an orchestration or platform layer.

The dynamic defensive: MassMutual’s refusal to bet

While some enterprise companies seek an “AI operating system” that oversees all of their AI technologies and apps, others are simply refusing to sign the check. Sears Merritt, CIO and head of enterprise technology at MassMutual, is managing the governance conundrum by intentionally staying in a state of high-velocity flexibility.

“Things are so dynamic, it’s hard to know which of the AI vendors will end up on top,” Merritt said at the Boston event. For that reason, MassMutual is refusing to enter any long-term contracts with AI vendors. Merritt’s strategy of “dynamic defensive” highlights a core finding of our research: Vendor popularity is changing radically month to month.

Anthropic, for example, went from 0% in January to nearly 6% in February among respondents reporting which agent orchestration technology they were using. Again, the sample size was small, at 70 respondents. Still, even if only directional, the dynamic landscape suggests picking a “primary” winner today is a fool’s errand.

The January figure likely reflects survey composition: Respondents represent the broader enterprise market, not the developer community where Anthropic has seen its strongest early traction.

Until recently, most organizations had signed up early with leaders like Microsoft and OpenAI as their main orchestration providers, due to their early lead with Copilot. Our finding that Anthropic is just now pushing into enterprise agent orchestration may be a confirmation of the recent excitement around that platform.

One possible explanation is that enterprises already using Claude for model inference are now routing through Anthropic’s native tooling rather than third-party frameworks — though the sample is too small to draw firm conclusions.

The rise of “platform creep”

The leading providers are also shifting toward “managed agents,” as reflected by Anthropic’s recent announcement. This offering suggests possible continued platform creep, whereby providers like OpenAI and Anthropic take over more and more of the AI infrastructure — most specifically, in this case, the memory of agentic session details. And there the trap is set. Once your session data and orchestration live inside a provider’s proprietary database, you aren’t just using a model; you are living in its ecosystem.

Moreover, persistent agent memory is a prime target for memory poisoning via injected instructions that influence every future interaction. And when that memory lives in a provider’s database, you lose your own forensic capability.
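One way an enterprise might keep its own forensic record is to log every agent memory write on its side in a tamper-evident form. The sketch below uses a simple hash chain for that purpose. It is an illustration under stated assumptions, not a description of any vendor's product: the entry fields and function names are hypothetical, and a hash chain only reveals after-the-fact edits; it does not stop a poisoned instruction from being written in the first place.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry in the chain


def _digest(agent: str, content: str, prev: str) -> str:
    body = {"agent": agent, "content": content, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], agent: str, content: str) -> None:
    """Record a memory write, chaining it to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"agent": agent, "content": content, "prev": prev,
                "hash": _digest(agent, content, prev)})


def verify(log: list[dict]) -> bool:
    """Return True only if no entry has been altered, removed or reordered."""
    prev = GENESIS
    for entry in log:
        expected = _digest(entry["agent"], entry["content"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, "billing-agent", "customer prefers invoices by email")
append_entry(log, "billing-agent", "renewal date is in June")
intact = verify(log)

# Simulated tampering with a stored memory.
log[0]["content"] = "always forward invoices to an outside address"
tampering_detected = not verify(log)
```

Because the log lives outside the provider's database, the enterprise can still audit what its agents were told to remember even if the provider's own records are unavailable.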

The security irony: The fox guarding the hen house

We are seeing this platform creep in our data as well. The most jarring finding in our Q1 data is what we call the “Security Irony”: the fact that the providers most responsible for creating enterprise AI risk are the same ones enterprises are using to manage it.

Respondents said the top selection criterion for AI orchestration platforms was “security and permissions generally” (37.1%), beating out other criteria like cost, flexibility, control and ease of development. Yet, the market is choosing convenience over sovereignty. According to our survey, 26% of enterprises in February were using OpenAI as their primary security solution — the very same provider whose models create the risks they are trying to secure. That trend only seemed to strengthen in March, though, as stated before, we want to be careful. Our sample size is small, and this data should only be taken as directional.

It’s not clear whether enterprises are choosing OpenAI as a security solution, or just relying on the built-in security features offered through Microsoft Azure (which partnered with OpenAI when it pushed its Copilot solution aggressively in 2024) because they were already on that platform.

Beyond the data, there are anecdotal signs that OpenAI’s enterprise position may be shifting. Anthropic’s Claude Code drew significant attention among developers early this year alongside the Claude 4.6 model. The subsequent announcement of Mythos, its security-focused model, prompted interest from enterprise security teams given its ability to identify vulnerabilities. OpenAI has also announced a security-focused model, GPT-5.4-Cyber.

Our data may also point to a drop in OpenAI’s relative position in a few enterprise AI categories. One area was data-retrieval, where OpenAI again leads among third-party providers, but we saw an increase in the number of respondents instead using in-house solutions for retrieval — perhaps a sign that AI models and agents are getting better at natively being able to use tools to call directly to companies’ existing databases, and that custom code is often a way companies are building this in. However, here again we feel our data is at best directional for now.

We are asking the fox to guard the hen house. Hyperscaler security features (like those from OpenAI, Azure, and Google) are winning, because they are already integrated into the platforms enterprises are using. But it creates a single-provider dependency. As agents gain the power to modify documents, call APIs and access databases, the “governance mirage” suggests we have control, while the data shows we are simply clicking “I agree” on whatever the hyperscalers offer. The resulting risks, however, include content injection, privilege escalation and data exfiltration.

The path forward: toward a unified control plane

The search for the “Dynatrace for AI”

So, what is the way out? Sriraman argued that the industry desperately needs a “central observability platform” — a “Dynatrace for AI” — that provides full end-to-end visibility, including model drift and safety prompting, agent behavior analytics, privilege escalation alerts, and forensic logging. He is currently working with a number of potential providers to deliver on this.

The “swivel chair” warning

Sriraman warned that without a unified control plane, enterprises are at risk of sliding back into a fragmented “swivel chair” world — reminiscent of the early, inefficient days of Robotic Process Automation (RPA) — where employees are forced to constantly jump between different siloed AI tools to finish a single workflow. “We don’t want to create a world where you have to switch to do something here and then go back to the platform to do something else,” he said.

But that desire for a single control plane conflicts with the desire to avoid lock-in. Our data shows the market has settled on the “hybrid control plane.” In other words, the most popular situation among our respondents (at 34.3%) was to use model provider-native solutions like Copilot Studio or OpenAI assistants for some workflows, while also running external options like LangGraph or custom orchestration for others. Smaller numbers of companies reported being more dogmatic here, whether that be deliberately removing the model provider from the orchestration layer entirely, relying only on custom orchestration tools, or relying only on the model provider’s technology.

Enterprises trust no single provider enough to give them full control, yet they lack the engineering capacity to build entirely from scratch.

The bottom line: The “big red button”

Visibility and integration are only half the battle. In a high-stakes industry like healthcare, Sriraman argues that any legitimate control plane must also offer a hard-stop capability. “We need a big red button,” he said. “Kill it. We should be able to have that … without that, don’t put anything in the operational setting.” In fact, such a kill switch was formally called for by the security community group OWASP as part of a recommended security framework.
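In code, the simplest form of a “big red button” is a shared stop flag that every agent checks before taking its next step. The sketch below is a minimal illustration of that idea, not a production design: the class, the step names and the monitor rule are hypothetical, and a real hard stop would likely also revoke credentials and cancel in-flight calls rather than rely on the agent cooperating.

```python
import threading
from typing import Callable, Optional


class KillSwitch:
    """A shared stop flag that an operator or monitor can set at any time."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def press(self) -> None:
        self._stopped.set()

    def is_pressed(self) -> bool:
        return self._stopped.is_set()


def run_agent(
    steps: list[str],
    switch: KillSwitch,
    monitor: Optional[Callable[[str, KillSwitch], None]] = None,
) -> list[str]:
    """Run steps in order, checking the kill switch before each one."""
    completed: list[str] = []
    for step in steps:
        if switch.is_pressed():
            break
        completed.append(step)  # placeholder for the real work of the step
        if monitor is not None:
            monitor(step, switch)
    return completed


def privilege_monitor(step: str, switch: KillSwitch) -> None:
    # Hypothetical rule: treat a request for admin access as an escalation alert.
    if step == "request admin access":
        switch.press()


switch = KillSwitch()
completed = run_agent(
    ["draft reply", "request admin access", "export patient records"],
    switch,
    monitor=privilege_monitor,
)
```

Here the monitor presses the button when the agent asks for admin access, so the export step never runs.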

The “governance mirage” is the belief that you can scale AI without deciding who owns the control and security plane.

If you are one of the 72% of organizations claiming multiple “primary” platforms, be careful because you may not have a strategy; you may have a conflict of interest. It suggests that the winner of the war between the AI behemoths — OpenAI, Anthropic, Google, Microsoft, etc. — won’t necessarily be the one with the best model, but the one that manages to sit above the models and help enterprises enforce a single version of the truth. That may be difficult to achieve, though, given that companies won’t want lock-in with a single player.

The data suggests enterprises are already resisting that outcome — and may need to formalize that resistance. Enterprises arguably need to own their control plane with independent security instrumentation, not wait for a vendor to win that role for them.

Orchestration

Kimi K2.6 runs agents for days — and exposes the limits of enterprise orchestration

Most orchestration frameworks were built for agents that run for seconds or minutes. Now that agents are running for hours — and in some cases days — those frameworks are starting to crack.

Several model providers, such as Anthropic with Claude Code and OpenAI with Codex, introduced early support for long-horizon agents through multi-session tasks, subagents and background execution. However, these systems sometimes assume agents are still operating within bounded-time workflows even when they run for extended periods.

Open-source model provider Moonshot AI wants to push beyond that with its new model, Kimi K2.6.

Moonshot says the model is designed for continuous execution, with internal use cases including agents that ran for hours and, in one case, five straight days, handling monitoring and incident response autonomously.

But the growing use of long-running agents is exposing a critical gap: most orchestration frameworks were not designed for this type of continuous, stateful execution. Open-source models, such as Kimi K2.6, that rely on agent swarms are making the case that their orchestration approach comes close to managing stateful agents.

The difficulties of orchestrating long-running agents

While it is true that some enterprises would rather bring their own orchestration frameworks to their agentic ecosystem, model providers and agent platforms recognize that offering agent management remains a competitive advantage.

Other model providers have begun exploring long-running agents, many through multi-session tasks and background execution. For example, Anthropic’s Claude Code orchestrates agents with a lead agent that directs other agents based on a set of user-instructed definitions. OpenAI’s Codex runs similarly.

Kimi K2.6 approaches orchestration with an improved version of its Agent Swarms, capable of managing up to 300 sub-agents “executing across 4,000 coordinated steps simultaneously,” Moonshot AI wrote in a blog post. Compared to both Claude Code and Codex, K2.6 relies on the model, rather than pre-defined roles, to determine orchestration.

Kimi K2.6 is now available on Hugging Face, through its API, Kimi Code and the Kimi app.

Practitioners experimenting with long-horizon agents say the brittleness runs deeper than prompting can fix.

As one practitioner, Maxim Saplin, put it in a blog post, “That does not mean subagents are useless. It means orchestration is still fragile. Right now, it feels more like a product and training problem than something you can solve by writing a sufficiently stern prompt.”

The problem long-running agents pose is that it’s difficult to maintain their state, especially as their environment continues to change while they’re doing their job. Such an agent constantly calls different tools and APIs or taps into different databases during its runtime. Most current agents, which may run for only one or two executions, also call different tools, but for a minute at most.
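One common way to make state survive a long run is to checkpoint it after every step so the agent can resume where it left off. The sketch below illustrates that general technique only; it is not how Kimi K2.6, Claude Code or Codex manage state. The file layout, the `run` function and the task names are all assumptions, and the sketch does not address the harder problem of an environment that changes between steps.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run(path, tasks, stop_after=None):
    """Resume from the checkpoint and persist progress after each task."""
    state = load_checkpoint(path)
    while state["next_step"] < len(tasks):
        if stop_after is not None and state["next_step"] >= stop_after:
            break  # simulate an interruption partway through the run
        task = tasks[state["next_step"]]
        state["results"].append(task.upper())  # placeholder for a real tool call
        state["next_step"] += 1
        save_checkpoint(path, state)
    return state


checkpoint = os.path.join(tempfile.mkdtemp(), "agent_state.json")
tasks = ["check logs", "open incident", "page on-call"]
first = run(checkpoint, tasks, stop_after=2)  # interrupted after two tasks
resumed = run(checkpoint, tasks)              # picks up at the third task
print(resumed["results"])
```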

Mark Lambert, chief product officer at ArmorCode, which builds an autonomous security platform for enterprises, told VentureBeat in an email that the governance gap is already outpacing deployment.

“These agentic systems can now generate code and system changes faster than most organizations can review, remediate, or govern them. This will require more than just additional scanning. Organizations will need stronger AI governance that provides the context, prioritization, and accountability teams need to manage Kimi and other AI-generated risk before they turn into accumulated exposure,” Lambert said.

Long-running agents could also risk failure without a clear rollback. Most importantly, these types of agents often lack a set of well-defined tasks and dynamically adjust their plans as they run.

Kunal Anand, chief product officer at F5, told VentureBeat in an email that long-horizon agents represent a much bigger architectural shift than most companies were prepared for.

“We went from scripts to services to containers to functions, and now to agents as persistent infrastructure. That creates categories we do not yet have good names for: agent runtime, agent gateway, agent identity provider, agent mesh. The API gateway pattern is morphing into something that has to understand goals and workflows, not just endpoints and verbs,” Anand said.

Running for 13 hours and even five days

Understanding how to orchestrate agents becomes important because model capabilities have begun to outpace orchestration innovations, even as enterprises start to look at long-horizon agents.

Moonshot AI says the model is built for tasks that reflect “real-world challenges that typically demand weeks or months of collective human effort.” In a separate technical document provided to VentureBeat, Moonshot claims K2.6 built a full SysY compiler from scratch in 10 hours — work it characterized as equivalent to a team of four engineers over two months — and passed all 140 functional tests without human intervention.

The team deployed K2.6 to complex engineering tasks, including overhauling an eight-year-old open source financial matching engine. Moonshot’s engineers described a 13-hour execution that “iterated through 12 optimization strategies, initiating over 1,000 tool calls to modify more than 4,000 lines of code precisely.”

Moonshot said one of its teams used K2.6 to build an agent that ran autonomously for five days. That agent managed monitoring, incident response and system operations.

Orchestration

What AI model should you use for revenue intelligence? Von says all the big ones, and it will automate mixing and matching for you

Looking at enterprise AI adoption, VentureBeat has anecdotally observed a fairly wide divergence when it comes to specific roles: For those who build (engineers and developers), the arrival of AI has been transformative, with tools like Claude Code and Cursor automating the heavy lifting of syntax and architecture.

Yet, for those who sell, the “revenue stack” has remained a fragmented collection of data silos, manual CRM entries, and anecdotal reporting.

Von, a new AI platform emerging from the team behind process automation startup Rattle, aims to bridge this gap. By positioning itself not as another “point solution” but as a foundational “intelligence layer,” Von seeks to do for Go-To-Market (GTM) teams what the modern IDE has done for the developer: provide a single, reasoning interface that understands the entire business context.

“AI has revolutionized the workflow for people who build things, but there is nothing that has revolutionized the workflow for people who sell those things,” Von CEO Sahil Aggarwal said in a recent video call interview with VentureBeat. “That is what we are trying to build with Von.”

Technology: The context graph and multi-model engine

At the core of Von’s capability is a departure from the traditional “search bar” approach to enterprise AI. While standard LLMs often struggle with the sprawling, unstructured nature of sales data, Von begins its deployment by building a “context graph” of a company’s entire business.

This process involves ingesting structured data from CRMs like Salesforce and HubSpot, alongside unstructured data from call recorders (Gong, Zoom, Chorus), email threads, and internal documentation.

“Once Von builds this context graph, it will understand your business better than anyone else in the company,” Aggarwal said.

This understanding is rooted in a company’s specific “ontology”—the unique language of its deal stages, territory definitions, and institutional knowledge.

“We train these foundational models on a company’s own business and ontology to make the model work for them,” the CEO added.

Instead of relying on a single large language model, Von utilizes a “mixture of models” strategy to optimize performance and cost. In this architecture, Anthropic’s Claude is deployed for high-level reasoning and “thinking,” ChatGPT handles bulk data processing, and Google’s Gemini is utilized for generating creative assets such as decks and reports.
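The division of labor described above amounts to a routing table from task type to model family. Von has not published how its routing works, so the following is purely an illustrative sketch: the task labels, the `ROUTES` table, the default and the `pick_model` function are all hypothetical.

```python
# Hypothetical routing table; task labels and model names are placeholders
# that mirror the split described in the article, not Von's actual config.
ROUTES = {
    "reasoning": "claude",
    "bulk_processing": "chatgpt",
    "creative_assets": "gemini",
}


def pick_model(task_type: str, default: str = "claude") -> str:
    """Return the model family assigned to a task type."""
    return ROUTES.get(task_type, default)


jobs = ["reasoning", "bulk_processing", "creative_assets", "unknown"]
assignments = {job: pick_model(job) for job in jobs}
print(assignments)
```

A static table like this is the simplest possible router; a real system would more likely choose per request based on cost, latency and measured quality.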

This technical approach allows Von to resolve a common frustration in Sales Operations: the gap between what is logged in a CRM and what actually happened in a meeting. By cross-referencing call transcripts with Salesforce records, the system can identify discrepancies in “lost reasons” or verify deal health based on sentiment rather than just a rep’s manual update.

From reporting queues to AI headcount

Von is designed to function as an “AI Data Scientist” or a “VP of RevOps” that lives on top of the enterprise’s existing revenue tracking tools.

During an initial product demonstration, Aggarwal showed how the platform could analyze 101 SMB accounts to identify churn risk in just over three minutes—a task he estimates would take a human analyst one to two weeks.

The platform’s primary interface resembles a chat environment, but the outputs are designed to be actionable revenue assets. Key functionalities include:

Deal Health Monitoring: Cross-referencing calls and emails to surface “risky” commits that might otherwise go unnoticed until the end of a quarter.
Automated Briefing: Generating pre-call context docs that draw from the entire history of an account, ensuring reps are briefed on every previous touchpoint.
Win/Loss Analysis: Clustered analysis of transcripts to find the “true” reasons for lost deals, often finding that the recorded reason in the CRM does not match the customer’s actual feedback.
Revenue Operations Automation: Handling “low-level” Salesforce admin tasks, such as creating flows, validation rules, or cleaning up account territories.

The goal is to shift Revenue Operations (RevOps) from a “reporting queue” that handles ad-hoc data requests into an infrastructure layer.

As Kieran Snaith, SVP of Revenue Operations at Qualified, noted in a Von testimonial blog post, the goal is to allow leaders to “run the business in chat,” asking complex questions about forecast confidence or pipeline risk and receiving data-backed answers instantly.

Pivoting into ‘the next Salesforce’

Von is operated by Rattle Software Inc., a company that previously found success with “Rattle,” a mid-seven-figure revenue business focused on Salesforce-Slack integrations. Aggarwal describes Von as a significant pivot toward a larger opportunity, aiming to build “the next Salesforce.”

The business has seen rapid early traction, reportedly crossing $500,000 in revenue within its first eight weeks of launch, with projections to reach $10 million in its first year.

The product is governed by a commercial, proprietary license typical of enterprise SaaS. Unlike open-source tools, Von’s “restricted” license means the underlying source code and the “context graph” technology are proprietary to Rattle Software Inc. Users are granted a non-transferable, non-exclusive right to use the software for internal business purposes, with the company maintaining all rights, title, and interest in the service.

This philosophy of deep integration extends to the broader SaaS ecosystem, where Aggarwal observes, “Point solutions in SaaS are essentially dead. They will have a very hard time surviving in this world, because point solutions can now be white-coded within a company.”

Pricing follows a hybrid model of per-seat subscriptions and consumption-based credits. This structure is designed to scale with the persona using the tool; for instance, a Chief Revenue Officer (CRO) seat may cost $1,000 per month for deep strategic analysis, while individual seller seats may be as low as $20 per month for basic research and follow-up tasks.

The company is currently backed by several tier-one venture capital firms, including Sequoia Capital, Lightspeed, Insight Partners, and GV (Google Ventures).

Early adopter reaction

The reaction from early adopters highlights a shift in how AI is being integrated into the sales org.

Taylor Kelly, Head of Revenue Operations at Tapcart, remarked that “Von handles the analysis and insights that would normally require hiring another full-time analyst,” specifically citing its ability to handle complex Salesforce configurations and deal risk assessments.

Similarly, Evan Briere, VP of Partnerships at DemandScience, noted that Von’s direct connection to data sources makes it “actually applicable” compared to more “theoretical” horizontal AI tools like ChatGPT.

Other community feedback from the platform’s early users includes:

CJ Oordt, Sales Director at Coalesce: Described it as a “research assistant who knows every conversation and note.”
Rob Janke, Director of Revenue Operations at QuickNode: Stated that Von “solved this gap before we could even start building it ourselves.”
Sydney, Head of Renewals at 15Five: Highlighted its impact on renewal intelligence, allowing her to analyze actual conversation signals across an entire book of business in minutes.

The prevailing sentiment among these users is that Von serves as “additional headcount” rather than just a tool. This mirrors the company’s internal metrics, which report that Von is already completing over 10,000 revenue tasks per week for its customer base.

An autonomous revenue org

The introduction of Von signals a maturing of AI in the enterprise. We are moving past the era of “AI as a feature,” where a chatbot is simply bolted onto an existing CRM, toward “AI as a persona.”

By training foundational models on a company’s specific business logic, Von is attempting to create a system that doesn’t just return data but offers “judgment calls.” As organizations look toward the rest of 2026, the challenge for RevOps leaders will be one of trust and infrastructure.

If Von can maintain its claimed 95% accuracy in predicting deal outcomes, the role of the human salesperson will inevitably shift toward higher-value relationship management, leaving the “data science” of sales to the agents.

For now, Von remains a high-growth experiment in whether the “intelligence layer” can finally bring the same level of revolutionary workflow to the people who sell as it has to the people who build.

Orchestration

Train-to-Test scaling explained: How to optimize your end-to-end AI compute budget for inference

The standard guidelines for building large language models (LLMs) optimize only for training costs and ignore inference costs. This poses a challenge for real-world applications that use inference-time scaling techniques to increase the accuracy of model responses, such as drawing multiple reasoning samples from a model at deployment.

To bridge this gap, researchers at University of Wisconsin-Madison and Stanford University have introduced Train-to-Test (T²) scaling laws, a framework that jointly optimizes a model’s parameter size, its training data volume, and the number of test-time inference samples.

In practice, their results show that it is compute-optimal to train substantially smaller models on vastly more data than traditional rules prescribe, and then use the saved compute to generate multiple repeated samples at inference.

For enterprise AI application developers who are training their own models, this research provides a proven blueprint for maximizing return on investment. It shows that AI reasoning does not necessarily require spending huge amounts on frontier models. Instead, smaller models can yield stronger performance on complex tasks while keeping per-query inference costs manageable within real-world deployment budgets.

Conflicting scaling laws

Scaling laws are an important part of developing large language models. Pretraining scaling laws dictate the best way to allocate compute during the model’s creation, while test-time scaling laws guide how to allocate compute during deployment, such as letting the model “think longer” or generating multiple reasoning samples to solve complex problems.

The problem is that these scaling laws have been developed completely independently of one another despite being fundamentally intertwined.

A model’s parameter size and training duration directly dictate both the quality and the per-query cost of its inference samples. Currently, the industry gold standard for pretraining is the Chinchilla rule, which suggests a compute-optimal ratio of roughly 20 training tokens for every model parameter.
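To make the 20-tokens-per-parameter rule concrete, the sketch below computes the token count it prescribes for a given model size, and splits a training budget between parameters and tokens. It assumes the common approximation that training costs about 6ND FLOPs for N parameters and D tokens; the function names are illustrative and the figures are rules of thumb, not exact prescriptions.

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb cited in the text


def chinchilla_tokens(n_params: float) -> float:
    """Training tokens the rule prescribes for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def chinchilla_split(train_flops: float):
    """Split a training budget C ~ 6*N*D under the constraint D = 20*N.

    Substituting gives C = 120 * N**2, so N = sqrt(C / 120).
    """
    n_params = math.sqrt(train_flops / (6 * TOKENS_PER_PARAM))
    return n_params, chinchilla_tokens(n_params)


tokens_for_1b = chinchilla_tokens(1e9)   # tokens for a 1-billion-parameter model
n, d = chinchilla_split(1.2e20)          # budget that corresponds to that model
print(tokens_for_1b, n, d)
```

Under these assumptions, a 1-billion-parameter model calls for about 20 billion training tokens, which is the baseline that overtrained models deliberately exceed.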

However, creators of modern AI model families, such as Llama, Gemma, and Qwen, regularly break this rule by intentionally overtraining their smaller models on massive amounts of data.

As Nicholas Roberts, co-author of the paper, told VentureBeat, the traditional approach falters when building complex agentic workflows: “In my view, the inference stack breaks down when each individual inference call is expensive. This is the case when the models are large and you need to do a lot of repeated sampling.” Instead of relying on massive models, developers can use overtrained compact models to run this repeated sampling at a fraction of the cost.

But because training and test-time scaling laws are examined in isolation, there is no rigorous framework to calculate how much a model should be overtrained based on how many reasoning samples it will need to generate during deployment.

Consequently, there has previously been no formula that jointly optimizes model size, training data volume, and test-time inference budgets.

The reason that this framework is hard to formulate is that pretraining and test-time scaling speak two different mathematical languages. During pretraining, a model’s performance is measured using “loss,” a smooth, continuous metric that tracks prediction errors as the model learns.

At test time, developers use real-world, downstream metrics to evaluate a model’s reasoning capabilities, such as pass@k, which measures the probability that a model will produce at least one correct answer across k independent, repeated attempts.
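For readers who want to compute pass@k themselves, here is a minimal sketch of two standard ways to do it. The first assumes each attempt succeeds independently with a known probability p. The second is the unbiased estimator widely used with code-generation benchmarks, which works from n generated samples of which c were correct. The function names are illustrative, and nothing here is taken from the T² paper's own code.

```python
from math import comb


def pass_at_k_from_p(p: float, k: int) -> float:
    """Chance of at least one success in k independent tries, each with probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c were correct (k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 minus the probability that a random size-k subset has no correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


two_tries = pass_at_k_from_p(0.5, 2)
one_of_ten = pass_at_k_estimate(n=10, c=1, k=1)
two_of_four = pass_at_k_estimate(n=4, c=2, k=2)
print(two_tries, one_of_ten, two_of_four)
```

Because pass@k rises with k even when p is small, a cheap model that can afford many samples may match a costlier model that can afford only a few, which is the trade-off the framework tries to quantify.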

Train-to-test scaling laws

To solve the disconnect between training and deployment, the researchers introduce Train-to-Test (T²) scaling laws. At a high level, this framework predicts a model’s reasoning performance by treating three variables as a single equation: the model’s size (N), the volume of training tokens it learns from (D), and the number of reasoning samples it generates during inference (k).

T² combines pretraining and inference budgets into one optimization formula that accounts for both the baseline cost to train the model (6ND) and the compounding cost to query it repeatedly at inference (2Nk). The researchers tried different modeling approaches: whether to model the pre-training loss or test-time performance (pass@k) as functions of N, D, and k.
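The two cost terms can be combined into simple budget arithmetic. The sketch below is a rough illustration rather than the researchers' formula: it uses the 6ND and 2Nk terms from the text, adds an assumed `inference_tokens` factor for the total number of tokens processed per sample across all queries, and uses made-up model sizes and budget. It only shows how many samples each model can afford, not which model is more accurate.

```python
def train_flops(n_params, n_tokens):
    """Training cost, using the 6*N*D approximation."""
    return 6 * n_params * n_tokens


def inference_flops(n_params, k_samples, inference_tokens=1):
    # 2*N*k as in the text; inference_tokens is an added assumption that
    # scales the term by the number of tokens processed per sample.
    return 2 * n_params * k_samples * inference_tokens


def affordable_samples(budget, n_params, n_tokens, inference_tokens):
    """Largest k such that training plus inference fits within the budget."""
    remaining = budget - train_flops(n_params, n_tokens)
    if remaining <= 0:
        return 0
    return int(remaining // inference_flops(n_params, 1, inference_tokens))


BUDGET = 2 * 10**18          # hypothetical end-to-end FLOP budget
INFERENCE_TOKENS = 10**8     # hypothetical tokens processed per sample overall

# 100M parameters at 20 tokens per parameter (Chinchilla-style)
k_large = affordable_samples(BUDGET, 100_000_000, 2_000_000_000, INFERENCE_TOKENS)
# 25M parameters at 320 tokens per parameter (heavily overtrained)
k_small = affordable_samples(BUDGET, 25_000_000, 8_000_000_000, INFERENCE_TOKENS)
print(k_large, k_small)
```

In this toy setup both models cost the same to train, but the model with a quarter of the parameters can afford four times as many samples from the leftover budget.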

The first approach takes the familiar mathematical equation used for Chinchilla scaling (which calculates a model’s prediction error, or loss) and directly modifies it by adding a new variable that accounts for the number of repeated test-time samples (k). This allows developers to see how increasing inference compute drives down the model’s overall error rate.

The second approach directly models the downstream pass@k accuracy. It tells developers the probability that their application will solve a problem given a specific compute budget.

But should enterprises use this framework for every application? Roberts clarifies that this approach is highly specialized. “I imagine that you would not see as much of a benefit for knowledge-heavy applications, such as chat models,” he said. Instead, “T² is tailored to reasoning-heavy applications such as coding, where typically you would use repeated sampling as your test-time scaling method.”

What it means for developers

To validate the T² scaling laws, the researchers built an extensive testbed of over 100 language models, ranging from 5 million to 901 million parameters. They trained 21 new, heavily overtrained checkpoints from scratch to test if their mathematical forecasts held up in reality. They then benchmarked the models across eight diverse tasks, which included real-world datasets like SciQ and OpenBookQA, alongside synthetic tasks designed to test arithmetic, spatial reasoning, and knowledge recall.

Both of their mathematical models indicated that the compute-optimal frontier shifts drastically away from standard Chinchilla scaling. To maximize performance under a fixed budget, the optimal choice is a model that is significantly smaller and trained on vastly more data than the traditional 20-tokens-per-parameter rule dictates.

In their experiments, the highly overtrained small models consistently outperformed the larger, Chinchilla-optimal models across all eight evaluation tasks when test-time sampling costs were accounted for.

For developers looking to deploy these findings, the technical barrier is surprisingly low.

“Nothing fancy is required to perform test-time scaling with our current models,” Roberts said. “At deployment, developers can absolutely integrate infrastructure that makes the sampling process more efficient (e.g. KV caching if you’re using a transformer).”

KV caching helps by storing previously processed context so the model doesn’t have to re-read the initial prompt from scratch for every new reasoning sample.
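The saving can be illustrated with a toy cache that processes a prompt once and reuses the result for every later sample. This is a deliberately simplified stand-in: real KV caches store per-layer key and value tensors inside the inference engine, and the `PrefixCache` class and `sample_k` function here are hypothetical.

```python
class PrefixCache:
    """Toy stand-in for a KV cache keyed by the prompt text."""

    def __init__(self):
        self._store = {}
        self.prefill_calls = 0  # how many times a prompt was actually processed

    def prefill(self, prompt):
        if prompt not in self._store:
            self.prefill_calls += 1
            # Stand-in for the key/value tensors a real model would compute.
            self._store[prompt] = tuple(prompt.split())
        return self._store[prompt]


def sample_k(prompt, k, cache):
    """Draw k placeholder samples that all reuse the cached prompt state."""
    outputs = []
    for i in range(k):
        state = cache.prefill(prompt)  # cache hit after the first call
        outputs.append(f"sample {i} over {len(state)} cached tokens")
    return outputs


cache = PrefixCache()
outputs = sample_k("Solve: 2 + 2 = ?", 8, cache)
print(cache.prefill_calls, len(outputs))
```

Eight samples trigger only one pass over the prompt, which is why repeated sampling benefits so directly from prefix reuse. The sketch ignores the memory the cache itself consumes.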

However, extreme overtraining comes with practical trade-offs. While overtrained models can be notoriously stubborn and harder to fine-tune, Roberts notes that when they applied supervised fine-tuning, “while this effect was present, it was not a strong enough effect to pull the optimal model back to Chinchilla.” The compute-optimal strategy remains definitively skewed toward compact models.

Yet, teams pushing this to the absolute limit must be wary of hitting physical data limits. “Another angle is that if you take our overtraining recommendations to the extreme, you may actually run out of training data,” Roberts said, referring to the looming “data wall” where high-quality internet data is exhausted.

These experiments confirm that if an application relies on generating multiple test-time reasoning samples, aggressively overtraining a compact model is practically and mathematically the most effective way to spend an end-to-end compute budget.

To help developers get started, the research team plans to open-source their checkpoints and code soon, allowing enterprises to plug in their own data and test the scaling behavior immediately. Ultimately, this framework serves as an equalizing force in the AI industry.

This is especially crucial as the high price of frontier models can become a barrier as you scale agentic applications that rely on reasoning models.

“T² fundamentally changes who gets to build strong reasoning models,” Roberts concludes. “You might not need massive compute budgets to get state-of-the-art reasoning. Instead, you need good data and smart allocation of your training and inference budget.”

Orchestration