Governance, not gatekeeping: How SAP brings enterprise‑grade safety to AI connectivity

Presented by SAP


The enterprise software industry has undergone a fundamental shift, and vendors are adapting their approaches to better protect the customers who rely on them. For years, every global platform vendor running multi-tenant cloud infrastructure has maintained documented rate limits, usage controls, and restrictions on the use of undocumented internal interfaces.

CRM platforms impose daily API call limits per organization, enforce platform-layer limits, and maintain a strict separation between bulk data APIs and transactional REST surfaces. Productivity and collaboration suites throttle their graph APIs and redirect bulk workloads to purpose-built data access channels designed for that load. HR and workforce management platforms enforce concurrent request limits and per-session data retrieval caps. IT service management platforms enforce per-user rate limits and instance-level throttling. Hyperscalers publish per-service quotas, enforce them at the infrastructure layer, and explicitly prohibit applications from calling non-SDK or non-published interfaces.

These are not controversial measures. They are baseline hygiene for enterprise-grade software platforms operating shared infrastructure at scale. For more than a decade these measures have been in place without serious objection.

As SAP has taken responsibility for securing customers’ mission-critical workloads in the cloud, a unified API policy with clarified usage controls is not a restriction but the expression of enterprise-grade stewardship. Some have read the policy as a new restriction. The policy does not introduce new restrictions. It names and unifies controls that have existed across individual SAP products for years.

SAP is not introducing API governance as a novel concept. SAP SuccessFactors, SAP Ariba, SAP LeanIX, and several other SAP solutions have enforced documented rate limits and usage controls. SAP Notes and SAP’s documentation have also in the past defined API usage.

What the recent policy does is unify that existing practice into a single cross-portfolio standard, a step made urgent by the arrival of autonomous agentic harnesses that SAP is fully committed to enabling, but which place a categorically different performance, stability, and security load on API surfaces that were never designed for autonomous orchestration and data extraction at scale.

Custom interfaces: What SAP’s API policy does and does not restrict

Custom APIs built by customers in their own namespace for their own extensibility, integration, and migration purposes are customer-developed interfaces. If you have spent years building custom data services, custom RFCs, and ABAP interfaces to connect your SAP system to the world around it, the policy’s restriction on non-published APIs might read, on first encounter, like a demolition order. It is not. The policy’s restriction targets SAP’s own internal unreleased objects. It does not reach into the Z namespace and condemn two decades of ABAP engineering.

SAP’s Private Cloud customers are in a distinctly privileged position compared with much of the enterprise world, because they have long been able to build in their own namespace and to shape an environment they were free to modify and extend, and that freedom is not being revoked.

The policy is focused on something narrower: SAP’s own internal interfaces that were never published, never documented for customer use, and never offered as a dependable foundation for integration. Most custom code never touches these internals and will continue untouched; where it does, the risk for customers has always been present, and the policy merely names it rather than inventing it.

However, within that set there is a smaller class of interfaces that is not a matter for debate but for prohibition. ODP-RFC belongs in that class: it sits in SAP’s namespace as an internal, non-released interface that SAP explicitly classifies as “unpermitted” for customer or third-party application use as documented in SAP Note 3255746.

These are precisely the kinds of interfaces SAP will flag as prohibited in notes and automated tooling so that such usage can be identified early through tooling and guidance, rather than discovered late in deployment or operational context. Clean Core is distinct from the API Policy but points in the same direction, and it bears noting that customers did not merely accept it but asked for it repeatedly, having lived through the upgrade costs of the alternative; in the agentic era, where SAP runs mission-critical ERP as a service, both the Clean Core Recommendations and API Policy are conditions of the enterprise-grade reliability that cloud operations make possible.

How AI agents change API usage patterns in SAP systems

While some commentators have argued this policy is primarily a commercial move, the technical evidence tells a different story.

AI has changed everything about our traditional view of transactional interfaces. The APIs that enterprises have used for decades to integrate SAP systems with third-party applications are request-response interfaces built for transactional workloads. They were designed to fetch a sales order, post a goods receipt, or trigger a payment run. They were designed to be mostly called by a human-authored integration flow, at a predictable frequency, for a defined business purpose. They were not designed to have an autonomous AI orchestration harness run thousands of sequential calls against them in pursuit of semantic context about the business model encoded within. That is not a clean core integration pattern.

Much of the debate misses a core architectural distinction. A traditional integration tool reads a sales order from SAP, converts it into the format a target schema needs, and moves it on. SAP’s data model plays no role beyond being a transient interpretation step.

An AI agent does something categorically different. It does not merely retrieve a value. It reads the sales order header data and learns that this structure represents a customer commitment to buy. It reads the line item data and learns how individual items relate to that order. It reads the net value and learns that this number is meaningful only when paired with the document currency. It traces the path that a sales order takes through delivery, billing, and finally into the accounting ledger, and internalizes how SAP reconciles operations and finance within its business object model.

The agent is not only consuming a customer’s transactional data. It is consuming the semantic ontology: the business object definitions, the relationships between entities, the conceptual architecture that SAP has built and refined over five decades of enterprise knowledge encoding.

SAP has long distinguished between enabling transactional access to customer data and the broader extraction or replication of the underlying ontology. The policy does not create this boundary, because it already existed. Autonomous agents must continue to respect that boundary, rather than redefine it.

Security risks in third-party MCP implementations

Then there is a security angle, and it is not abstract. The same week this policy was published, a supply chain attack named the Mini Shai-Hulud – a variant of the npm worm, quietly compromised hundreds of software packages. SAP-ecosystem npm packages were compromised and we addressed this with this security note for customers. This is not a theoretical threat model. This is the active threat environment in which community-built MCP servers are being connected to productive SAP systems running mission-critical business processes.

The OWASP MCP Top 10 documents the vulnerability classes systematically: tool poisoning, prompt injection, privilege escalation via scope creep, token mismanagement, and supply chain compromise. Recent research across thousands of analyzed MCP implementations shows that a majority operate with static long-lived credentials or carry identifiable security findings, and a single compromised package in the MCP ecosystem can cascade into hundreds of thousands of exposed development environments. VentureBeat just last week reported a serious com.mand execution flaw that made up to 200,000 MCP servers vulnerable.

Consider what that means in practice. An AI agent that has just internalized the semantic structure of your SAP data model and is operating through a community MCP server, moves beyond a productivity tool and into an elevated risk category, one that combines broad system access with an attack surface that is still evolving.

Why MCP alone cannot run SAP business processes

The MCP debate has also obscured a technical reality that enterprise architects need to confront directly. The Model Context Protocol is plumbing. It specifies how an AI model calls a tool. It says nothing about whether the model understands what the tool does in a business context, in what sequence tools must be called, what side effects a given API invocation will trigger, or what the consequences of an incorrect parameter will be. A naive MCP implementation connecting to SAP OData services can call a tool. It cannot run a business process.

The token consumption data from production agentic deployments is instructive. For illustration, a query asking for an employee’s manager and traversing through the list of peers in an SAP SuccessFactors system consumed 565,000 tokens under a standard MCP implementation. The same query under a context-aware implementation consumed 80,000 tokens. That is the difference between a query costing $1.70 and a query costing $.24, for example, on a single operation, repeated across thousands of daily transactions. The standard MCP implementation is not automation. It is an expensive approximation of automation that fails on complex queries while loading the API surface with traffic it was not designed to carry.

SAP’s architecture for open third-party AI integration via A2A

SAP’s response to these challenges is not to close the ecosystem but to build the right infrastructure for an open one. That distinction is worth dwelling on.

The API Policy anchors compliance in documented, co-engineered architectures. The agentic interoperability reference architectures jointly developed with major technology partners are published and available on the SAP Architecture Center, prioritized by customer demand and updated as new patterns are validated.

The bi-directional integration of SAP Joule and Microsoft 365 Copilot is the most visible example of what co-engineered agentic integration looks like in production: two AI systems, from two different vendors, working across each other’s application surfaces without either party bypassing the other’s security model. The endorsed path for external AI agent access to SAP is the Agent Gateway via the A2A protocol, with reference AI Golden Path on the SAP Architecture Center. The SAP Knowledge Graph, Open Resource Discovery (ORD) specification for metadata, and SAP BDC data products provide the context layer that transforms a protocol connection into a business-capable interaction. SAP also offers governed MCP servers for CAP, UI5, Fiori Elements, and has indicated its intent to extent this model to additional development environments, including ABAP development. These are not closed doors, they are the right doors.

SAP’s position in the standards community is that of an active contributor, not a gatekeeper. SAP is a launch partner of the Agent2Agent (A2A) protocol under the Linux Foundation and holds Gold level membership in the Agentic AI Foundation, co-chairing the Agent Identity and Trust workstream alongside the organizations that define how AI agents authenticate, authorize, and interoperate across enterprise boundaries.

A2A and MCP are not external constraints that SAP is grudgingly accommodating. They are protocols SAP uses internally and is actively hardening through standards work. When community and open-source frameworks meet the security floor that enterprise deployment requires, external integration pathways will follow.

The API Policy issued by SAP does not mark the end of openness. The industry has spent two years deploying AI agents against enterprise systems using protocols that the enterprise security community had not finished hardening, against APIs that were never designed for autonomous orchestration, with community tooling that documented attackers had already learned to compromise. Governance was not optional, it was timely.

Anirban Majumdar is Head of the Office of the CTO at SAP.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Anthropic introduces “dreaming,” a system that lets AI agents learn from their own mistakes

Anthropic on Tuesday unveiled a suite of updates to its Claude Managed Agents platform at its second annual Code with Claude developer conference in San Francisco, introducing a new capability called “dreaming” that lets AI agents learn from their own past sessions and improve over time — a step toward the kind of self-correcting, self-improving AI systems that enterprises have demanded before trusting agents with production workloads.

The company also moved two previously experimental features — outcomes and multi-agent orchestration — from research preview into public beta, making them broadly available to developers building on the Claude platform. Together, the three features address what Anthropic says are the hardest problems in running AI agents at scale: keeping them accurate, helping them learn, and preventing them from becoming bottlenecks on complex, multi-step work.

Early adopters are already reporting significant results. Legal AI company Harvey saw task completion rates increase roughly 6x after implementing dreaming. Medical document review company Wisedocs cut its document review time by 50% using outcomes. And Netflix is now processing logs from hundreds of builds simultaneously using multi-agent orchestration.

The announcements come at a moment of extraordinary momentum for Anthropic. CEO Dario Amodei disclosed during a fireside chat at the conference that the company’s growth has outpaced even its own aggressive internal projections.

In the first quarter of 2026, Anthropic saw what Amodei described as 80x annualized growth in revenue and usage — far exceeding the 10x annual growth the company had planned for. API volume on the Claude platform is up nearly 70x year over year, and the average developer using Claude Code now spends 20 hours per week working with the tool.

“We tried to plan very well for a world of 10x growth per year,” Amodei said. “And yet we saw 80x. And so that is the reason we have had difficulties with compute.”

How Anthropic’s dreaming feature teaches AI agents to learn from their own history

Dreaming is the most novel of the three features and the one Anthropic is most eager to distinguish from conventional memory systems. While the company launched agent memory earlier this year — allowing Claude to retain preferences and context within and across individual sessions — dreaming works at a higher level of abstraction. It is a scheduled process that reviews an agent’s past sessions and memory stores, extracts patterns across them, and curates those memories so agents improve over time. It surfaces insights that no single agent session could see on its own: recurring mistakes, workflows that multiple agents converge on independently, and preferences shared across a team of agents.

Alex Albert, who leads research product management at Anthropic, explained the concept in an interview at the conference. He described dreaming as analogous to how people within organizations create skills after working through a task. “They might do a workflow with Claude, and at the end of that workflow, after they’ve iterated and zigzagged a little bit, they want to record that path from A to B,” Albert said. “A very similar thing is happening with dreaming — instead of you manually creating the skill from your experience working with Claude, the model is doing it, so it has that same context for a future session.”

Crucially, dreaming does not modify the underlying model weights. “We’re not changing the model itself through dreaming — it’s not doing updates to the weights or anything like that,” Albert said. Instead, the agent writes learnings as plain-text notes and structured “playbooks” that future sessions can reference, making the entire process observable and auditable by humans. When asked about the trust implications of agents consolidating their own knowledge, Albert acknowledged that “there is a level of trust that you need to place” but noted that all memories are inspectable and that smarter models are getting progressively better at managing this process. “They’re learning to write better notes for their future self,” he said.

A live demo showed AI agents improving overnight without human guidance

During the keynote, the Anthropic team demonstrated all three features live on stage using a fictional aerospace startup called “Lumara” that needed to autonomously land drones on the moon for resource mining. The team configured a multi-agent system with three specialists — a commander agent responsible for overall mission success, a detector agent that identified high-quality landing sites, and a navigator agent that handled safe drone flight and landing — and defined a success rubric requiring soft landings, clear ground, and enough fuel reserves for a return trip to Earth.

An initial simulation across six hypothetical landing sites produced strong but imperfect results. To improve, the presenters triggered a dreaming session directly from the Claude Developer Console. Overnight, the dreaming agent reviewed all past simulation sessions and wrote a detailed descent playbook — a comprehensive set of heuristics drawn from patterns across multiple mission runs. When the team ran a new simulation the following morning with the dreaming-derived playbook in memory, the results improved meaningfully on the sites that had previously underperformed.

“All we had to do was just have Caitlin press a button,” said Angela Jiang, Head of Product for the Claude Platform, referring to her colleague on stage. “All dreaming.”

The demo illustrated how the three features compose together in practice. Multi-agent orchestration split the complex task across specialists with independent context windows. Outcomes provided the rubric against which a separate grader agent evaluated each run. And dreaming extracted lessons across those runs to improve future performance — forming what Anthropic describes as a continuous improvement loop that requires no human intervention between iterations.

Why Anthropic built a separate ‘grader’ agent to check Claude’s own work

The outcomes feature, now in public beta, gives developers a way to define what success looks like using a rubric — a structural framework, a presentation standard, a brand voice, or any other set of criteria — and then lets the agent iterate toward that standard autonomously. What makes outcomes architecturally distinctive is its separation of concerns. When an agent completes its work, a separate grader agent evaluates the output against the developer-defined rubric in its own independent context window. Because the grader operates in a fresh context, it is not influenced by the working agent’s reasoning or accumulated biases from the session.

When the grader identifies gaps between the output and the rubric, it pinpoints specifically what needs to change, and the working agent takes another pass. This loop continues until the rubric criteria are met — without a human needing to review each attempt.

Albert described Anthropic’s broader verification strategy as employing “more test time compute, more models thinking about a problem for longer, to check over the work of another.” He acknowledged that having a model check its own work raises reasonable questions, but said a fresh context window reviewing completed work consistently outperforms asking the same long-running thread to identify its own bugs. “You will get higher success if you give that output to a fresh Claude and say, ‘what bugs do you see?'” he said. “There is still something to the attention” that degrades over very long sessions — a limitation he said Anthropic is actively working to fix in future models.

The approach mirrors strategies already in use at GitHub. Mario Rodriguez, Chief Product Officer at GitHub, described during a separate talk at the conference how Copilot uses a similar advisor pattern with Claude models — pairing a smaller, cheaper model as an executor with a larger model as a mentor. When the smaller model encounters a problem beyond its capability, it calls the larger model for guidance, then continues executing on its own. Rodriguez said the approach delivers near-Opus-level intelligence at significantly lower cost, and that GitHub inserts critique models at three specific points in the coding workflow: after drafting a plan, after a complex implementation, and after writing tests but before running them.

Parallel AI agents can now tackle tasks too complex for a single model thread

Multi-agent orchestration, the third feature moving to public beta, allows a lead agent to decompose a large task into subtasks and delegate each one to a specialist agent — each with its own model, system prompt, tools, and independent context window. Every step in the process is traceable in the Claude Console, showing which agent did what, in what order, and why.

The design gives each sub-agent an isolated context, which Anthropic says produces better results than having a single agent attempt to hold all the complexity in one thread. “Each sub-agent has its own independent thread and context window,” the keynote presenters explained. “This is very intentional — we found that by splitting the work and then merging the results, we get better outcomes.”

Albert offered his own heuristic for when multi-agent architectures make sense versus sticking with a single thread. “Parallel agents are better for investigation,” he said — situations where there is a lot of context that will ultimately be discarded. “If you’re trying to answer a specific question, you don’t need all the search results from the areas where it didn’t find the answer. You just need the answer.” He described spinning up disposable sub-agents for specific retrieval tasks and bringing only the result back to the main thread. Increasingly, he said, the model itself will decide when to parallelize. “In the future, you won’t really care if it’s one agent or multi-agent or whatever’s happening. You just have a Claude that you’re talking to, and it will deploy the right architecture automatically.”

Anthropic’s bigger bet: closing the gap between AI capabilities and real-world adoption

The three features arrive as part of a broader platform push that Anthropic framed throughout the conference as closing “the gap between what AI can do and what it’s actually doing for people.” Ami Vora, Anthropic’s Chief Product Officer, set the theme in her opening keynote, noting that while model capabilities are advancing on an exponential curve, most organizations are still adopting AI on a linear path.

Dianne Penn, who leads product for Anthropic’s research team, described the company’s measure of progress as “task horizon” — how long an AI agent can work autonomously while improving the quality of its deliverables. “This time last year, models could work for minutes,” she said. “Now, most of us have agents running for hours on end. Tomorrow, we’ll have agents that are proactive, always on, and know what to work on without losing the frame.”

The event also included several infrastructure announcements designed to help developers keep pace. Anthropic said it is doubling its five-hour rate limits for Pro, Max, Team, and Enterprise plans, and raising API rate limits considerably. The company announced a partnership with SpaceX to use the full capacity of its Colossus data center to expand compute availability — a direct response to the demand crunch Amodei described.

All three features are built into Claude Managed Agents, which launched in public beta on April 8 as an opinionated harness that bundles best practices including memory, tool integration, and action handling. Anthropic says teams using Managed Agents have shipped 10x faster than those building their own agent infrastructure from scratch. Albert described the platform using an operating system analogy: “With managed agents, you don’t need to think about all the technicalities of how you set up the surrounding system,” he said. “You’re building an application for Macs — you don’t want to go have to re-implement every detail of macOS.”

What dreaming, outcomes, and multi-agent orchestration mean for the future of enterprise AI

The competitive implications are significant. As AI agent platforms from OpenAI, Google, and others compete for developer adoption, Anthropic is betting that production reliability — not just raw model intelligence — will determine which platform wins enterprise budgets. The dreaming feature in particular stakes out new territory: while other platforms offer memory and tool use, the idea of agents systematically reviewing their own histories to extract reusable knowledge goes further toward the kind of continuously improving systems that enterprises need before delegating high-stakes work.

The conference showcased companies already operating at that scale. Mercado Libre, Latin America’s largest e-commerce platform, has 23,000 engineers running Claude Code, has reviewed more than 500,000 pull requests with human oversight, and is aiming for 90% autonomous coding by the third quarter of this year. Shopify has deployed Claude Code across not just engineering but design, product, and data science teams.

But it was Dario Amodei who articulated the most expansive vision for where all of this leads. He described a progression from single agents to multiple agents to whole organizational intelligence — from “a team of smart people in a room” to what he called “a country of geniuses in the data center.” And he reiterated a prediction he made roughly a year ago: that 2026 would see the first billion-dollar company run by a single person. “Hasn’t quite happened yet,” he said. “But we’ve got seven more months.”

Dreaming is available now in research preview. Outcomes and multi-agent orchestration are in public beta and available to all developers on the Claude platform. Whether seven months is enough time for a solo founder to build a billion-dollar business remains an open question — but after Tuesday, they have a few more tools to try.

How Sakana trained a 7B model to orchestrate GPT-5, Claude Sonnet 4 and Gemini 2.5 Pro

Every LangChain pipeline your team hardcodes starts breaking the moment the query distribution shifts — and it always shifts. That bottleneck is what Sakana AI set out to eliminate.

Researchers at Sakana AI have introduced the “RL Conductor,” a small language model trained via reinforcement learning to automatically orchestrate a diverse pool of worker LLMs. Conductor dynamically analyzes inputs, distributes labor among workers, and coordinates among agents.

This automated coordination achieves state-of-the-art results on difficult reasoning and coding benchmarks, outperforming individual frontier models like GPT-5 and Claude Sonnet 4 as well as expensive human-designed multi-agent pipelines. It achieves this performance at a fraction of the cost and with fewer API calls than competitors. RL Conductor is the backbone of Fugu, Sakana AI’s commercial multi-agent orchestration service.

The limitations of manual agentic frameworks

Large language models have strong latent capabilities. But tapping these capabilities to their fullest is a great challenge. Extracting this level of performance relies heavily on manually designed agentic workflows, which serve as critical components in commercial AI products. 

However, these frameworks fall short because they are inherently rigid and constrained. In comments to VentureBeat, Yujin Tang, co-author of the paper, explained the exact breaking point of current systems: “While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases … In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands.” 

Tang noted that achieving “real-world generalization in such heterogeneous applications inherently necessitates going beyond human-hardcoded designs.”

Another bottleneck for building robust agentic systems is that no single model is optimal for all tasks. Different models are fine-tuned to specialize in distinct domains. One model might excel at scientific reasoning, while another is superior at code generation, mathematical logic, or high-level planning. 

Because models have these varying characteristics and complementary skills, manually predicting and hard-coding the ideal combination of models for every query is practically impossible. An optimal agentic framework should be able to analyze a problem and delegate subtasks to the most suitable expert in the pool.

Conducting an orchestra of agents

The RL Conductor is designed to overcome the limitations of rigid, human-designed frameworks. As the name implies, it conducts an orchestra of agents by dividing challenging problems, delegating targeted subtasks, and designing communication topologies for a set of worker LLMs. 

Instead of relying on fixed code or static routing, the Conductor orchestrates these models by generating a customized workflow. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to carry it out, and defines an “access list” that dictates which past subtasks and responses from other agents are included in that agent’s context.

By defining everything in natural language, the Conductor builds flexible workflows tailored to each input. It can construct simple sequential chains, parallel tree structures, or even recursive loops depending on the problem’s demands. 

Importantly, the model learns these strategies not by human design but through reinforcement learning (RL) and reward maximization. During training, the model is given a task, a pool of workers, and a reward signal based on whether its answer and output format are correct.

Through a simple trial-and-error RL algorithm, the model organically discovers which combinations of instructions and communication structures yield the highest reward. As a result, it automatically adopts advanced orchestration strategies such as targeted prompt engineering, iterative refinement, and meta-prompt optimization. 

The model learns to dynamically adjust its strategies and leverage the distinct strengths of its worker agents without any human developer having to hard-code the process.

Conductor in action

To test RL Conductor in action, the researchers fine-tuned the 7-billion parameter Qwen2.5-7B using the framework. During training, the Conductor was tasked with designing agentic workflows of up to five steps. It was given access to a worker pool containing seven different models: three closed-source giants (Gemini 2.5 Pro, Claude-Sonnet-4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B).

The team evaluated the Conductor across a variety of highly challenging benchmarks, comparing it against individual frontier models acting alone, self-reflection agents prompted iteratively to improve their own answers, and state-of-the-art multi-agent routing frameworks like MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie. The small 7B Conductor set new benchmarks across the board. It achieved an average score of 77.27% across all tasks, hitting 93.3% on the AIME25 math benchmark, 87.5% on GPQA-Diamond, and 83.93% on LiveCodeBench, according to the researchers.

Remarkably, it achieved these marks while remaining highly efficient. While baseline models like MoA burned through 11,203 tokens per question, the Conductor used an average of just 1,820 tokens, taking an average of only three steps per workflow.

A closer look at the experimental details shows exactly why the framework is so effective. The Conductor automatically learned to measure task difficulty. For simple factual recall questions, it often solved the problem in a single step or used a basic two-agent setup. However, for complex coding problems, it built extensive workflows involving up to four agents with dedicated planning, implementation, and verification phases.

The Conductor also learned that frontier models have different strengths. To achieve record scores on coding benchmarks, the Conductor frequently assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level planners, and only brought in GPT-5 at the very end to write the final optimized code. In a particularly clever display of adaptability, the Conductor would sometimes completely abdicate its own role, handing the entire planning process over to Gemini 2.5 Pro and allowing it to dictate the subtasks for the rest of the pool.

Beyond math and coding benchmarks, Sakana AI is already putting the underlying architecture to work in front-office utility. “We have been using our Fugu models based on the Conductor technology internally for various practical enterprise applications: software development, deep research, strategy development, and even visual tasks like slide generation,” Tang said.

Bringing orchestration to the enterprise: Sakana Fugu

While the 7B model described in the research paper was an exploratory blueprint and is not publicly available, Sakana AI has productized the Conductor framework into its flagship commercial AI product, Sakana Fugu. Now in its beta phase, Fugu serves as a multi-agent orchestration system accessible through a standard OpenAI-compatible API.

Tang noted Fugu targets “the large market of industries where AI adoption has yet to bring large productivity gains due to the generalization limitations of current hard-coded pipelines, such as finance and defense.”

For enterprise developers, this allows seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks across different vendors. Behind the API interface, Fugu automates complex collaboration topologies and role assignments across a pool of models. To support varying business needs, Sakana released two variants: Fugu Mini, built for low-latency operations, and Fugu Ultra, designed for maximum performance on demanding workloads.

Addressing governance concerns around autonomous agents spinning up invisible workflows, Tang pointed out that the interpretability risks are functionally similar to the hidden reasoning traces of current top-tier closed APIs, and the system is managed with established guardrails to minimize hallucinations. 

For enterprise architects weighing when to deploy RL-orchestration versus traditional routing, the decision often comes down to engineering resources. “We believe the absolute sweet spot comes whenever users and their teams feel they are spending a disproportionate amount of time guiding their underlying agents,” Tang said. However, he cautioned that the framework isn’t necessary for everything, noting that “it’s hard to beat the economic proposition of a local model running directly on the user’s machine for simple queries.”

As the diversity of specialized open- and closed-source AI models continues to grow, static hardcoded pipelines will inevitably become obsolete. Looking ahead, this dynamic orchestration will likely extend beyond text and code environments. “There is indeed a large potential to fill this gap with cross-modal Conductor frameworks becoming the foundation for more autonomous, self-coordinating physical AI systems,” Tang said.

Why AI breaks without context — and how to fix it

Presented by Zeta Global


The gap between what AI promises and what it delivers is not subtle. The same model can produce precise, useful output in one system and generic, irrelevant results in another.

The issue is not the model. It’s the context.

Most enterprise systems were not built for how AI operates. Data is scattered across tools. Identity is inconsistent. Signals arrive late or not at all. Systems record events but fail to connect them into a continuous view.

AI depends on that continuity. Without it, the model fills in the gaps so the result looks polished but lacks relevance. This is where most teams get stuck.

A better model does not fix fragmented, stale, or commoditized data. Gartner estimates organizations lose an average of $12.9 million annually due to poor data quality. AI does not solve that problem, it surfaces it faster and at a greater scale.

The mirror test

There is a fast diagnostic test for this. Give your AI a perfect, high-intent customer signal and see what comes back. If the output is generic or irrelevant, the model needs work. But if the model produces something sharp and useful on clean data, and then falls apart on real production data, the problem is the data.

In practice, it is almost always the second scenario. AI functions like a magnifying glass, so strong data systems become dramatically more powerful, and the weak ones become dramatically more visible. Organizations that have been coasting on fragmented, poorly integrated customer data can no longer hide behind reporting lag and manual interpretation. The AI renders the problem in plain sight.

Context is the new identity layer

This is really where the next evolution gets interesting. Even after you solve the data quality problem, there is still a second shift underway in how customer profiles are built and used.

For years, enterprise data systems stored content: transactions in CRMs, demographics in data warehouses, campaign responses in marketing platforms. These records described what had already happened. They were useful for reporting but were not built for AI.

AI requires context. Context is not a static record. It is a current view of the customer including recent behavior, cross-channel signals, and emerging intent. The thread that connects one interaction to the next. Identity tells you who someone is. Context tells you what they are doing and what they are likely to do next.

Consider a simple example: ask an AI to recommend a beach vacation destination, and it might suggest Hawaii or Florida. Tell it you have three children, and it surfaces family-friendly options. Give it access to your recent search patterns, your affordability signals, and where you have been searching over the past year, and the recommendation changes entirely because the model is no longer working from demographic categories but from a live picture of who you are and what you are doing right now.

Most enterprise systems were built to store state, not maintain context. They capture events, but they don’t maintain continuity between them.

That’s the gap AI exposes.

But for practitioners, the challenge is not conceptual; it is architectural. Context does not live in a single system. It is fragmented across event streams, product analytics tools, CRMs, data warehouses, and real-time pipelines. Stitching that into something an AI system can actually use requires moving from batch-oriented data models to streaming or near-real-time architectures, where signals are continuously ingested, resolved, and made available at inference time.

This is where many AI initiatives stall. The model is ready, but the context layer is not operationalized. Systems are not designed to retrieve the right signals within milliseconds, or to resolve identity across channels in real time. Without that, “context” remains theoretical rather than actionable.

Architectures like Model Context Protocol (MCP) are accelerating this shift by giving AI systems a way to pass memory about a user between applications, essentially threading a continuous line of context around an individual across different interactions. The result is a profile that becomes richer and more predictive over time, one that creates a line of continuity between what someone has done, what they are doing now, and what they are likely to do next.

When that identity layer is strong, the same model produces better outcomes. When it is weak, no model can compensate.

The compounding advantage

Organizations that built first-party data systems and durable identity infrastructure before the AI wave are now benefiting from a compounding effect. Better data trains smarter models. Smarter models attract more consented users. More consented users generate richer behavioral signals.

Competitors without that foundation cannot replicate this, regardless of which model they are running. The gap is structural, not algorithmic, and because identity systems improve incrementally over time, the organizations that started investing earlier have advantages that are genuinely hard to close.

What this means in practice

The practical implication is a shift in where AI investment goes. The organizations getting consistent results from AI are treating it as a processing layer for a living data system, not as a standalone capability to be bolted onto existing infrastructure.

For builders and operators, this translates into a different set of priorities than the last two years of AI experimentation:

First, instrument for real-time signals. Batch pipelines and nightly refreshes are not sufficient when AI systems are expected to respond to user intent as it happens. Teams need event-driven architectures that capture and surface behavioral signals in near real time.

Second, make context retrievable at inference time. It is not enough to store data in a warehouse. Systems must be designed so that relevant context can be resolved and injected into prompts or retrieved by agents within milliseconds.

Third, invest in identity resolution as infrastructure. Connecting fragmented signals across devices and channels so the system understands real individuals rather than anonymous interactions is foundational, not optional.

Fourth, treat governance and consent as part of system design. First-party data built on trust is not just safer; it is more durable and ultimately more valuable than third-party data that competitors can access.

These investments are less visible than a new model launch and are also far harder to copy.

The real race

Models are now interchangeable. The difference will come from who can operationalize context at scale and treat the model as a processing layer, not the advantage.

That advantage comes from years of investment in identity infrastructure, first-party data, and systems that keep customer context current.

The organizations that win won’t be the ones with better prompts. They’ll be the ones whose systems understand the customer before the prompt is ever written.

Neej Gore is Chief Data Officer at Zeta Global.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Scaling AI into production is forcing a rethink of enterprise infrastructure

Presented by NutanixAcross industries, organizations are focused on how to move from AI pilots, proofs of concept, and cloud-based experimentation to deploying it at scale — across real workloads, for real users, in real business environments. VentureB…

GPT-5.5 Instant shows you what it remembered — just not all of it

OpenAI updated the default model for ChatGPT to its new GPT-5.5 Instant, along with a new memory capability that finally shows which context shaped responses — at least some of them. 

This limitation signals that models are starting to create a second, incomplete memory observability layer that could conflict with existing audit systems and agent logs. 

GPT-5.5 Instant replaces GPT-5.3 Instant as the default ChatGPT model and is a version of its new flagship GPT-5.5 LLM. It’s supposed to be more dependable, accurate and smarter than 5.3. 

But it’s the introduction of memory sources, which will be enabled across all models in the platform, that could help enterprises in their projects. 

“When a response is personalized, you can see what context was used, such as saved memories or past chats, and delete or correct it if something is outdated or no longer relevant,” OpenAI said in a blog post

When a user asks ChatGPT something, users can tap the sources button (at the bottom of the response) to see which files or past chats the model tapped to find the answer. Users also have full control over the sources models can cite, and these sources will not be shared if the conversation is sent to others. 

The company said memory sources should make it easier to personalize model responses. Still, OpenAI admitted that the models “may not show every factor that shaped an answer” and promised to make the capability more comprehensive over time. 

What this means is that memory sources offer a semblance of observability in ChatGPT answers, but not full auditability yet. 

Competing memory systems 

Enterprises have a system in place to solve part of the memory and context problem with models and agents.
Models are exposed to context through retrieval-augmented generation (RAG) pipelines; whatever the agent fetches from the vector databases is logged, and the agent’s state is stored in a memory layer. All of this is tracked in application logs, usually in an orchestration or management layer with built-in observability. Ideally, this allows teams to trace failure back through the stack.

The current system is imperfect; sometimes, it’s not easy to trace failure points, but it’s at least internally consistent. For enterprises using ChatGPT, whether the default GPT-5.5 Instant or their model of choice, that’s no longer the case.

The model surfaces its own version with memory sources that are wholly separate from existing retrieval logs — in short, a model-reported context. A problem arises if these cannot be reconciled reliably. And because memory sources only give users part of the picture — it’s unclear what ChatGPT’s limit on citing memory sources is — it becomes even harder to match what GPT-5.5 Instant said it tapped to what it actually did in the production environment.

This situation creates a new failure mode: A competing context log. If something seems wrong, it can create inconsistencies that enterprises have to deal with.

Malcolm Harkins, chief trust and security officer at HiddenLayer, told VentureBeat that memory sources “look like a pragmatic middle ground ” in offering some transparency, but it’s still not easy to see its value.

“For enterprises, it’s directionally useful but insufficient on its own,” Harkins said. “Real value will depend on how it integrates with security, governance, access controls and audit systems.”

A more capable default model 

However, GPT-5.5 Instant handles memory, and OpenAI calls it an improvement over GPT-5.3 Instant. 

Internal evaluations showed GPT-5.5 Instant returned 52.5% fewer hallucinated claims than the previous default model, especially for high-stakes domains such as medicine, law, and finance. Inaccurate claims fell by 37.3% on challenging conversations. The company said the model improved on photo analysis and image uploads, answering STEM questions and knowing when to tap its own knowledge base or use web search. 

Peter Gostev, AI capability at independent model evaluator Arena, explained to VentureBeat in an email that the key result to watch about GPT-5.5 Instant is how it performs on the overall text rankings, especially because its predecessor did not have a strong showing. 

“Since GPT-4o, the strongest-performing OpenAI chat model on the Arena has been GPT-5.2-Chat, which still ranks 12th on the Overall Text Arena months after release,” Gostev said. Notably, users preferred it even over the higher-reasoning GPT-5.2-High variant, which is currently ranked 52nd on the Arena. “By comparison, GPT-5.3-Chat, the previous default model in ChatGPT, was significantly less competitive, ranking 44th overall, 32 places below GPT-5.2-Chat.”

What enterprises need to do about memory sources

Organizations that rely on ChatGPT for some tasks will need to formalize how memory works for their stack. Memory sources are not limited to GPT-5.5 Instant; it is enabled for all models on the ChatGPT platform. 

To address the problem of competing memory sources, enterprises have to audit their memory management. Model-reported context could overlap or contradict these logs, so it’s best to define a clear source of truth. In the event of a failure, administrators know which log to believe. 

It would also be a good idea to decide whether or not to expose memory sources to users. ChatGPT only shows a select number of chats or files it used to complete a request. Some users may find more transparency trustworthy. 

Ultimately, the number one thing for enterprises to remember about memory sources is that what the model reports as its context is not the full picture for auditing. It’s a form of observability, but it cannot withstand a full examination. 

Inside AMEX’s agentic commerce stack: How intent contracts and single-use tokens enforce AI transactions

American Express (Amex) is building a system that lets AI agents shop and pay on behalf of users — but right now it’s only within its own payment network, and still involves a black box that could hinder trust and auditability.

Amex already participates in agentic commerce protocol projects, especially Google’s Agent Pay Protocol (AP2), which focuses on interoperability. Amex’s Agentic Commerce Experiences (ACE) developer kit, on the other hand, touches on something most protocols currently lack: Full transaction control in the payment layer. 

But it still isn’t completely transparent in how it handles validation. ACE uses a closed-loop system — serving as both the card issuer and the payment network — to validate agent-led transactions. 

Luke Gebb, Amex’s EVP and global head of innovation, told VentureBeat that the company believes this model is the missing piece in agentic commerce.  

“Some of what is missing so far is the perspective of a company like ours: We feel that trust and security are critical to advancing this space,” Gebb said. “This is really the first time that an issuer is coming to the table.”

Amex sits in that interesting space: Unlike other financial institutions or card providers like Chase or Bank of America, Amex can route transactions through its American Express Network. Visa and Mastercard are two of the most well-known payment networks, but these companies don’t issue cards themselves and must work with a bank.

The continued black box of agentic commerce 

The ACE kit is just one approach to addressing some of agentic commerce’s biggest problems: trust, control, accountability, validation, and security. 

Consumers generally don’t want rogue agents to run away with their bank accounts and start buying things. Merchants don’t want to be stuck with unpaid items. Banks don’t want to deal with an influx of chargebacks and the potential for fraud. 

Projects like the ACE kit aim to build trust and accountability by verifying an agent’s identity and goals. This can build the trust agentic commerce desperately needs.

Amex claims it offers validation, too, although the process behind that is unclear. It is abstracting how it performs validation, even though it explains at which layer it does it. More traditional systems feature a mix of deterministic checks and a flexible, semantic evaluation that helps match intent and outcome for validation. Amex said agents built with ACE can submit user shopping carts and check them against the agent’s original intent. However, they did not disclose how this works.

Practitioners building to the agentic commerce ecosystem lament that, despite strides in creating a trust layer, many black boxes remain that could hinder widespread adoption.

Raj Ananthanpillai, founder and CEO of identity and verification system provider Trua, told VentureBeat that payment protocols and software kits like Agentic Commerce Suite from Stripe, Google’s Verifiable Intent proof chain, and the ACE developer kit “excel at handling proofs, verifiable authorizations and the mechanics of fund movement, but leave upstream human validation opaque and underdeveloped.”

Ananthanpillai continued: “Without a clear, high-assurance cryptographic link proving that an agent is acting under the explicit authority of a verified human owner, merchants, issuers, and networks face heightened risks of repudiation, massive chargebacks, sanctioned people conducting financial transactions, and fraud.”

The ACE kit

The ACE developer kit solves several running issues with agentic commerce, Gebb said, and gives developers access to integrated services:

  • Agent registration

  • Account enablement

  • Intent intelligence

  • Payment credentials 

  • Cart context

First, it deals with agent registration, establishing identity and trust with both the consumer and company agents. When a transaction begins, the agent acting on behalf of the customer and the merchant’s agent can verify each other’s identities and trust that they are dealing with the correct entity. 

Next comes account enablement, which links the user’s Amex account to their agent and grants the agent permission to act, or, in the case of agentic commerce, buy something.

Intent intelligence creates what Amex calls an intent contract, where the user defines what they want the agent to do. Once the intent is defined, the ACE system generates an Intent ID and a Proof of Intent Token that definitively proves authorization in the event of a dispute.

Amex handles the actual transaction part, where the user pays for the product through a single-use token. ACE establishes payment credentials used for the transaction, bound to intent and constraints. 

“Once the agent has found the item that the customer has asked for, like red shoes, they’ll make a call for the payment credentials, which is a token that has the boundaries that the card member has provided,” Gebb said. “So, for instance, if they said they only wanted to spend $500, that token won’t allow for a purchase of $600 because it has controls built in.”

The last piece is cart context and validation, which Gebb said helps banks and brands compare a user’s cart that their agent submitted to their intent. 

Amex’s approach shows that for agentic commerce to really soar, providers must understand what systems will allow agents to do and who is ultimately accountable if something goes wrong. 

Salesforce launches Agentforce Operations to fix the workflows breaking enterprise AI

Enterprise AI teams are hitting a wall — not because their models can’t reason, but because the workflows underneath them were never built for agents. Tasks fail, handoffs break, and the problem compounds as organizations push agents deeper into back-office systems. A new architectural layer is emerging to address it: workflow execution control planes that impose deterministic structure on processes agents are expected to run.

One of the companies bringing this to the forefront is Salesforce, with a new workflow platform that turns back-office workflows into a set of tasks for specialized agents to complete. Users can upload their processes or use one of the set Blueprints provided by Salesforce, and Agentforce Operations will break it down for agents. 

Salesforce senior vice president of Product, Sanjna Parulekar, told VentureBeat in an interview that the problem is that many enterprise workflows are not built for agents. “What we’ve observed with customers is that a lot of times, the brokenness in a process is probably in your product requirements document,” Parulekar said. “So when that’s uploaded into a product, it doesn’t quite work. We can optimize it and cut out some things and replace it with an agent.”

Without this control panel layer, enterprises could risk deploying agents that increase cost rather than fix their workflow problems.

Making the workflow work for agents, not just humans

Enterprises deploying agents are learning a costly lesson: Their workflows were designed around human judgment gaps, not machine execution. Processes that evolved through years of workarounds — loosely defined steps, implicit decisions, coordination that depends on individuals knowing what to do next — break when agents are asked to follow them literally.

Even with all of an enterprise’s context at its fingertips, AI systems will have difficulty completing tasks if it is not clear what it’s supposed to do. 

Parulekar said her team found that focusing on what makes the process tick and breaking it down into more explicit steps and workflows makes the system more deterministic. Then, when platforms like Agentforce Operations introduce agents, those agents already know their specific tasks.  

“It forces companies to rethink their processes and introduces observability into the mix because of the session tracing model in the system,” she said.

Parulekar said human checks can be built into the system, so the process is more transparent.

What makes this approach different from other workflow automation offerings is that it doesn’t rely on agents to decide what to do next; the system does. Unlike more traditional automation tools that route tasks and agents on probabilistic decision-making, this enforces execution on a more pre-defined, deterministic structure.

The problem it introduces

Codifying a workflow doesn’t fix a broken one. If a process has flawed steps, encoding it for agents locks in the problem at scale. And once workflows are distributed across agents, the challenge shifts from execution to governance: who owns the process, who validates it, and how it evolves when business conditions change.

It puts the onus on teams to take a hard look at what works for them and what doesn’t.

Organizations need to consider that, along with the execution control plane offered by platforms like Agentforce Operations, someone should be made responsible for task completion and success. 

Brandon Metcalf, founder and CEO of workforce orchestration company Asymbl, told VentureBeat in a separate interview that the key to both humans and agents following a workflow is a shared goal. 

“You have to understand the goal or the agent or human won’t complete the task successfully,” Metcalf said. “Someone has to manage that outcome that has to be delivered. It can be a person or an agent.”

The bottleneck has moved. As Metcalf framed it, the question is no longer whether agents can reason through a task, it’s whether the workflow underneath them is coherent enough to execute. For enterprises that built their processes around human judgment and institutional memory, that’s a harder fix than swapping in a smarter model.

Alibaba’s Metis agent cuts redundant AI tool calls from 98% to 2% — and gets more accurate doing it

One of the key challenges of building effective AI agents is teaching them to choose between using external tools or relying on their internal knowledge. But large language models are often trained to blindly invoke tools, which causes latency bottlenecks, unnecessary API costs, and degraded reasoning caused by environmental noise. 

To overcome this challenge, researchers at Alibaba introduced Hierarchical Decoupled Policy Optimization (HDPO), a reinforcement learning framework that trains agents to balance both execution efficiency and task accuracy. 

Metis, a multimodal model they trained using this framework, reduces redundant tool invocations from 98% to just 2% while establishing new state-of-the-art reasoning accuracy across key industry benchmarks. This framework helps create AI agents that are not trigger-happy and know when to abstain from using tools, enabling the development of responsive and cost-effective agentic systems.

The metacognitive deficit

Current agentic models face what the researchers call a “profound metacognitive deficit.” The models have a hard time deciding when to use their internal parametric knowledge versus when to query an external utility. As a result, they blindly invoke tools and APIs, like web search or code execution, even when the user’s prompt already contains all the necessary information to resolve the task.

This trigger-happy tool-calling behavior creates severe operational hurdles for real-world applications. Because the models are trained to focus almost entirely on task completion, they are indifferent to latency. These agents frequently hit exorbitant tool call rates. Every unnecessary external API call introduces a serial processing bottleneck, turning a technically capable AI into a sluggish system that frustrates users and burns through tool budgets.

At the same time, burning computational resources on excessive tool use does not translate to better reasoning. Redundant tool interactions inject noise into the model’s context. This noise can distract the model, derailing an otherwise sound chain of reasoning and actively degrading the final output.

To address the latency and cost issues of blind tool invocation, previous reinforcement learning methods attempted to penalize excessive tool usage by combining task accuracy and execution efficiency into one reward signal. However, this entangled design creates an unsolvable optimization dilemma. If the efficiency penalty is too aggressive, the model becomes overly conservative and suppresses essential tool use, sacrificing correctness on arduous tasks. Conversely, if the penalty is mild, the optimization signal loses its value and does not prevent tool overuse on simpler tasks.

Furthermore, this shared reward creates semantic ambiguity, where an inaccurate trajectory with zero tool calls might yield the same reward as an accurate trajectory with excessive tool usage. Because the training signals for accuracy and efficiency become entangled, the model can’t learn to control tool-use without degrading its core reasoning capabilities.

Hierarchical decoupled policy optimization

To solve the optimization dilemma of coupled rewards, the researchers introduced HDPO. HDPO separates accuracy and efficiency into two independent optimization channels. The accuracy channel focuses on maximizing task correctness across all of the model’s rollouts. The efficiency channel optimizes for execution economy.

HDPO computes the training signals for these two channels independently and only combines them at the final stage of loss computation. The efficiency signal is conditional upon the accuracy channel. This means that an incorrect response is never rewarded simply for being fast or using fewer tools. This decoupling avoids situations where accuracy and efficiency gradients cancel each other out, providing the AI with clean learning signals for both goals.

The most powerful emergent property of this decoupled design is that it creates an implicit cognitive curriculum. Early in training, when the model still struggles with the task, the optimization is dominated by the accuracy objective, forcing the model to prioritize learning correct reasoning and knowledge. As the model’s reasoning capabilities mature and it consistently arrives at the right answers, the efficiency signal smoothly scales up. This mechanism causes the model to first master task resolution, and only then refine its self-reliance by avoiding redundant, costly API calls.

To complement HDPO, the researchers developed a rigorous, multi-stage data curation regime that tackles severe flaws found in existing tool-augmented datasets. Their data curation pipeline covers supervised fine-tuning (SFT) and reinforcement learning (RL) stages.

For the SFT phase, they sourced data from publicly available tool-augmented multimodal trajectories and filtered them to remove low-quality examples containing execution failures or feedback inconsistencies. They also aggressively filtered out any training sample that the base model could solve directly without tools. Finally, using Google’s Gemini 3.1 Pro as an automated judge, they filtered the SFT corpus to only keep examples that demonstrated strategic tool use.

For the RL phase, the curation focused on ensuring a stable optimization signal. They filtered out prompts with corrupted visuals or semantic ambiguity. The HDPO algorithm relies on comparing correct and incorrect responses. If a task is trivially easy where the model always gets it right, or prohibitively hard where the model always fails, there is no meaningful mathematical variance to learn from. The team strictly retained only prompts that exhibited a non-trivial mix of successes and failures to guarantee an actionable gradient signal.

Metis agent: HDPO  in action

To test HDPO in action, the researchers used the framework to develop Metis, a multimodal reasoning agent equipped with coding and search tools. Metis is built on top of the Qwen3-VL-8B-Instruct vision-language model. The researchers trained it in two distinct stages. First, they applied SFT using their curated data to provide a cold-start initialization. Next, they applied RL using the HDPO framework, exposing the model to multi-turn interactions where it could invoke tools like Python code execution, text search, and image search.

The researchers pitted Metis against standard open-source vision models like LLaVA-OneVision, text-only reasoners, and state-of-the-art agentic models including DeepEyes V2 and the 30-billion-parameter Skywork-R1V4. The evaluation spanned two main areas: visual perception and document understanding datasets like HRBench and V*Bench, and rigorous mathematical and logical reasoning tasks like WeMath and MathVista.

On all tasks, Metis achieved state-of-the-art or highly competitive performance, outperforming existing agentic models — including the much larger 30-billion-parameter Skywork-R1V4 — across both visual perception and reasoning tasks.

Equally important is the anecdotal behavior Metis showed in the experiments. For example, when presented with an image of a museum sign and asked what the center text says, standard agentic models waste time blindly writing Python scripts to crop the image just to read it. Metis, however, recognizes that the text is clearly legible in the raw image. It skips the tools entirely and uses a single inference pass.

In another experiment, the model was given a complex chart and asked to identify the second-highest line at a specific data point within a tiny subplot. Metis recognized that fine-grained visual analysis exceeded its native resolution capabilities and could not accurately distinguish the overlapping lines. Instead of guessing from the full image, it invoked Python to crop and zoom in exclusively on that specific subplot region, allowing it to correctly identify the line. It treats code as a precision instrument deployed only when the visual evidence is genuinely ambiguous, not as a default fallback.

The researchers released Metis along with the code for HDPO under the permissive Apache 2.0 license.

“Our results demonstrate that strategic tool use and strong reasoning performance are not a trade-off; rather, eliminating noisy, redundant tool calls directly contributes to superior accuracy,” the researchers conclude. “More broadly, our work suggests a paradigm shift in tool-augmented learning: from merely teaching models how to execute tools, to cultivating the meta-cognitive wisdom of when to abstain from them.”

Writer launches AI agents that can act without prompts, taking on Amazon, Microsoft and Salesforce

Writer, the enterprise AI agent platform backed by Salesforce Ventures, Adobe Ventures, and Insight Partners, today launched event-based triggers for its Writer Agent platform, enabling AI agents to autonomously detect business signals across Gmail, Gong, Google Calendar, Google Drive, Microsoft SharePoint, and Slack — and execute complex multi-step workflows without any human initiating the process.

The release, which also includes a new Adobe Experience Manager connector and a suite of enhanced governance controls such as bring-your-own encryption keys and a Datadog observability plugin, represents Writer’s most aggressive bet yet on fully autonomous enterprise AI. It arrives at a moment when AWS, Salesforce, and Microsoft are all racing to establish their own agentic platforms, and when the question of how much autonomy enterprises will actually hand to AI agents remains deeply unresolved.

“We are launching a series of event triggers that power and drive our playbooks to be more proactively called,” Doris Jwo, Writer’s VP of Product Management, told VentureBeat ahead of the announcement. “We’re building on the ecosystem to actually for these connectors, such as SharePoint, Google Drive, Gong, Gmail, Google Calendar, actually listen for events happening in those platforms, so that the agent can practically know that something happened externally, and then, where relevant, call a certain playbook to be actually run live in real time, without any sort of human intervention required.”

The shift from reactive to proactive AI agents marks a critical inflection point for enterprise software. Until now, most AI assistants — including Writer’s own platform — required a human to initiate every interaction. A marketer had to open a chat window and ask for help. A salesperson had to prompt a research brief. The new event-based triggers flip that dynamic entirely: the system watches for business events and acts on its own.

Why Writer decided humans were the weakest link in enterprise AI workflows

Writer’s push toward autonomous triggers stems from a practical observation its product team made as enterprise customers scaled their use of the platform’s playbooks — the reusable, natural-language workflows that Writer introduced in November 2025 to let business users automate recurring tasks without writing code.

“What we found is, as playbooks continue to get integrated into enterprise workflows, it’s actually humans that become the bottleneck in making sure that playbooks get triggered,” Jwo said. “This really kind of solves that problem, to make sure that that sort of always-on, proactive, autonomous nature of that agent has continued to be built on.”

The mechanics work like this: Writer’s connectors, which already provided read and write access to third-party enterprise tools, now also listen for specific events — an email arriving in Gmail, a sales call completing in Gong, a new file landing in a Google Drive folder, a meeting starting or ending on Google Calendar, a message posted in Slack. When the system detects a qualifying event, it triggers a predefined playbook that executes a multi-step workflow autonomously.

Consider the use case Jwo described for marketing teams already running on Writer’s platform. An email campaign workflow typically begins when a creative brief lands in a Google Drive folder. From there, multiple team members coordinate through Slack to assemble research, build assets, draft copy, review graphics, and package everything for a campaign management tool. Writer’s event-based triggers collapse much of that chain: the moment a brief hits the designated folder, the system automatically fires a cascade of playbooks that assemble the research, generate the assets, and prepare deliverables for human review.

“All the playbooks that our customers have been building with us to build all those each individual pieces now just get automatically triggered the minute that initial brief kind of hits the Google Drive folder,” Jwo said. “That’s, I think, a very common workflow for most of these marketing sort of, like, content-heavy use cases, where it’s multiple parties involved, it’s a lot of assets coming together in a cascade.”

How Writer’s AI reasoning engine separates it from simple automation tools like Zapier

The comparison to Zapier — the popular automation tool that connects thousands of apps through if-this-then-that logic — is inevitable, and Jwo addressed it directly.

“It’s more than just an LLM in the middle,” she said. “It is an agent with reasoning and then access to a really powerful set of tools that includes connectors, that includes its own virtual sandbox, which enables it to do things like write and execute code on the fly and create those assets.”

The distinction matters for understanding where Writer sits in an increasingly crowded landscape. Zapier and similar workflow automation tools require users to manually define rigid logic paths, specifying exact conditions and actions in a deterministic sequence. Writer’s approach uses its Palmyra-powered reasoning engine to process event context and make real-time execution decisions. Users describe their goals in natural language rather than dragging around boxes and defining conditional branches.

“It’s not quite Zapier, because I think it requires a lot more — it’s more rigid,” Jwo said of traditional automation tools. “It requires more manual kind of setup to define the logic and the roles and the conditions for which a workflow has to be run.” Writer’s playbooks, by contrast, allow “a simple idea to turn into something that’s actually executable and repeatable,” she added, noting that builds take “hours and days, not weeks and months.”

This natural-language accessibility has been central to Writer’s strategy since it introduced the Agent platform and playbooks last November. The company has consistently positioned itself as a platform that puts power in the hands of business users — marketers, sales teams, operations leads — rather than requiring engineering resources to build and maintain AI workflows. Writer CEO May Habib made this case forcefully at Davos earlier this year, arguing that the leaders pulling ahead are those entering what she called “rebuild mode” — stripping workflows down to outcomes and eliminating what she described as the “coordination tax” of endless handoffs, status meetings, and alignment emails.

The event-based triggers extend that philosophy to its logical conclusion. If business users can build playbooks in natural language, and those playbooks can now fire automatically based on real-world business events, then the entire loop from signal to action can operate with minimal human involvement.

Inside the governance controls Writer built to make autonomous AI agents safe for regulated enterprises

That level of autonomy raises obvious concerns, and Writer appears to understand that governance is the linchpin of the entire strategy. The company paired its trigger launch with a substantial expansion of its administrative controls — a combination that suggests Writer views enterprise trust as its primary competitive weapon.

The new governance features include Connector Profiles, which allow administrators to configure multiple versions of the same connector with different permissions per team; Writer Agent Profiles for deploying customized agent configurations with specific capability toggles and security settings; AI Studio Observability for auditable tracking of every agent interaction; a Datadog Logs Plugin that forwards every LLM request and response as structured log events; and bring-your-own encryption key support through AWS, Azure, or GCP key management services.

“A really important part of that, and a baseline, sort of foundation for everything that we roll out, is our observability and governance platform,” Jwo told VentureBeat. “When connectors are set up, admins have full control over connector access, what is set up, who has access, which teams exactly are those access granted to, as well as individually, which exact tools do teams are able to call.”

The observability story extends to the individual user level as well. Jwo described Writer Agent’s user experience as built around progressive disclosure — clean initial views that users can expand to inspect the full chain of reasoning behind any agent action. “You can drill down to the actual tool call level,” she said. “You’d actually have the ability to look at specifically what web search results were pulled, what connector was called, what tool called, what succeeded, what failed, how did the agent divert its path to fulfill your goal.”

This transparency architecture reflects a broader conviction Writer has articulated through what it calls “The Agentic Compact” — a framework the company published for responsible AI that emphasizes foundational transparency, auditability, and human oversight. Dan Bikel, Writer’s head of AI, has argued publicly that the industry’s obsession with model scale has created what he calls a “transparency paradox,” leaving businesses with powerful tools they cannot fully understand or control. Writer’s governance-first approach to autonomous triggers represents the operational expression of that philosophy.

Writer also introduced its agent supervision suite in December 2025, offering centralized monitoring, agent approval workflows, global guardrails, and integrations with external observability and security platforms like Datadog, Noma, and Lakera. The event-based triggers now extend that governance framework to cover actions initiated without any human in the loop — a meaningfully harder problem.

Writer takes aim at AWS, Salesforce, and Microsoft in the escalating agentic platform wars

The timing of Writer’s announcement is not accidental. The enterprise agentic AI market has entered a period of intense platform competition, with the largest technology companies in the world staking claims to the same territory Writer occupies.

Jwo acknowledged the pressure directly when asked why a CIO would choose Writer over established vendor relationships with AWS, Salesforce, or Microsoft — all of which have announced agentic platforms of their own.

“At the baseline, I think we have all the pieces to be fully enterprise-grade and ready,” Jwo said. But she argued that Writer’s real advantage lies in accessibility for non-technical users. “A lot of the challenge has been: how do we get business users to actually be able to build these powerful workflows in a way that maybe a technical user, using coding agents, can do very quickly and well, but the typical business user is not accustomed to anything beyond typical prompting to actually create?”

That positioning — enterprise-grade capabilities wrapped in a business-user-friendly interface — has been Writer’s core differentiation since the company’s founding in 2020. It is also the reason Writer has attracted strategic investment from Salesforce Ventures and Adobe Ventures, both of which are building their own AI platforms but apparently see value in Writer’s approach to the business-user segment.

The company’s March 2026 release of Skills — reusable building blocks that encode a team’s specific methodologies, quality standards, and decision frameworks into the Agent platform — reinforced this direction. Skills allow marketing teams, for instance, to capture exactly how their best strategist structures competitive analysis or formats campaign briefs, then make that expertise available to every team member and every playbook across the organization. Combined with event-based triggers, the result is a system where institutional knowledge executes automatically in response to real-world business events.

Writer’s 2026 AI adoption survey, conducted with Workplace Intelligence and covering 2,400 global executives, found that 79% of enterprises face AI adoption challenges despite high investment — and that organizations with strong change management programs are six times more likely to reach production. Writer CMO Diego Lomanto has argued that the real barrier to AI adoption is not technology but trust, writing that “they treat resistance as a training problem when it’s actually a trust problem.” The governance-heavy approach to event-based triggers appears designed to address exactly that dynamic.

Salesforce, SAP, and Workday triggers are next as Writer expands its connector roadmap

Writer’s initial event trigger support covers Gmail, Gong, Google Calendar, Google Drive, SharePoint, and Slack — tools that Jwo described as “generally the most applicable to every end user.” But the company has its eye on deeper enterprise system integration.

When asked about CRM and ERP triggers for systems like Salesforce, SAP, and Workday, Jwo confirmed these are within the scope of the roadmap. “You can imagine, you know, a Salesforce opportunity is created that may trigger a cascade of events that happens,” she said. “You might want to set up the right assets, maybe the right customer environment, all sorts of things can kind of cascade from that.”

The connector ecosystem has been a strategic priority since Writer launched its MCP (Model Context Protocol) gateway in November 2025, providing governed agent access across enterprise systems including Microsoft 365, Google Workspace, HubSpot, Gong, PitchBook, FactSet, and others. The addition of Adobe Experience Manager in this release gives marketing teams direct read/write access to pages, fragments, and digital assets in Adobe’s content management system — a connector that closes the gap between AI-generated content and published output.

Jwo clarified that in most integration scenarios, Writer Agent delivers content in a draft state rather than publishing it directly. “Writer Agent basically accomplishes the majority of the workload — pulling together the assets, making the changes and presenting — and then hopefully a person just has to go through the last three or so final steps to get it out,” she said.

The real question enterprise AI must answer: how much autonomy is too much autonomy

The degree of autonomy enterprises are comfortable granting their AI agents remains one of the most consequential open questions in the industry. Jwo acknowledged that most customers still maintain human checkpoints in their workflows.

“You can also build in instructions into our playbooks to say, ‘Hey, before you move on to a next playbook, make sure that you check with me. I want to take a look, and then if I hit go, then you’re good to go,'” she said. The agent can also be designed with self-QA capabilities, validating outputs against known pitfalls before proceeding.

Writer plans to expand these checkpoint capabilities in the coming quarter, adding the ability to specify not just that a checkpoint is required but which specific person must respond and what types of responses are expected — essentially building a formal approval workflow into the autonomous trigger chain.

Jwo characterized the current system as a hybrid: the platform listens deterministically for predefined events, but the agent applies reasoning to decide what action to take — or whether to act at all. “The agent has the ability to process what happened, understand the context of it, and understand the intent of what you want to do, so it can make that decision,” she said. “You’re just saying, like, ‘Hey, the goal might be feedback is coming in, and we want to triage that in real time. And some things we might not want to action on, some things we do.’ You basically just explain that to the agent.”

She views this release as a stepping stone toward a future where agents are “even more mission-driven, and less governed by even like a set of instructions or roles” — a future where the AI doesn’t just respond to triggers but proactively identifies when action is needed based on broader organizational goals.

For now, Writer is betting that the combination of autonomous triggers, robust governance, and business-user accessibility will be enough to carve out defensible territory in an enterprise AI market where the biggest technology companies in the world are all converging on the same set of capabilities. The company’s argument is that having the foundational pieces is not enough — what matters is making those pieces work together in a way that non-technical business users can build, manage, and trust.

It is, in other words, the same wager Writer has been making since 2020 — that the future of enterprise AI belongs not to the platform with the most powerful model, but to the one that can get an entire organization to actually use it. The difference now is that the agents don’t wait to be asked.

Event-based triggers, new connectors, and enhanced governance controls are available immediately to Writer enterprise customers.