Most verticals aren’t clean, well-oiled SaaS databases; the reality is ugly documents, proprietary schemas, implicit workflows, and long‑running tasks that most general-purpose models struggle with.
This prompted construction project management company Trunk Tools to build a specialized, three-layer architecture (perception, semantics, agents) based on highly detailed data to support high-accuracy, highly relevant industry automation.
Its purpose-built stack has shrunk review cycles from months to days, prevented costly field errors, and given autonomous agents the ability to reason over millions of pages of documentation, Trunk says.
“We really set out to take the data from dispersed systems, pre-process it, structure it, go through our ontology into a knowledge graph, and then train AI models,” said Sarah Buchner, Trunk’s founder and CEO and a former carpenter.
For builders in other verticals, Trunk’s approach could serve as a blueprint for transforming data chaos into agent‑ready, industry-specific workflows.
Foundation LLMs, while powerful, are optimized for breadth, not always depth.
“General-purpose LLMs are trained to be okay at everything, so they’re weak at anything niche,” said Kriti Faujdar, a senior product manager working in AI infrastructure, agentic AI, security, and LLM platforms. For instance: Rare terms, domain-specific reasoning, the unspoken context that any practitioner “just knows.”
Web, app, and software developer Sébastien De Bollivier agreed that the biggest bottleneck is reliability on data that is “jargon-dense, abbreviation-heavy, and format-specific.”
“A GPT-4-class model can understand a French legal contract, but will fumble the specific article references practitioners need to cite,” he said.
Besides, the most valuable enterprise data never made it into pretraining anyway, Faujdar pointed out. It’s sitting in internal systems and proprietary formats. “RAG helps a little,” she said. “But it’s just giving better facts to a model that still can’t reason properly in the domain.”
Pretraining on domain data is critical; enterprises should then fine-tune on good task examples and build their own evals. “A few thousand examples from real practitioners beats millions of scraped, noisy ones,” Faujdar said.
Mixture-of-experts (MoE) can provide specialization without inference costs blowing up. Pairing RAG with fine-tuning also works well; RAG handles the factual long tail while fine-tuning fixes vocabulary and reasoning.
De Bollivier pointed to the advantage of hybrid stacks: A general-purpose model for reasoning and orchestration, a smaller fine-tuned model (or dense retrieval over a curated corpus) for domain-specific extraction. He advised: “Don’t fine-tune to make the model ‘smarter’ about a domain, fine-tune to make it more reliable on the specific output format your workflow requires.”
The trades and construction are certainly industries seeing traction with these techniques, as are legal and healthcare, De Bollivier said. These verticals have “high stakes for errors plus standardized document formats, equaling clear domain-training ROI.”
One honest caveat worth mentioning, Faujdar said: Specialized models often fall apart outside their domain, so they are of little use beyond their area of expertise unless they’re retrained.
In highly specialized domains like construction, “data dumps” into large language models (LLMs) don’t cut it, said Trunk’s CTO Amrish Kapoor. This is because most transformers are probabilistic models: When given an image, they report back that it is “probably” a tree, or “probably” a child playing next to a tree.
This makes them insufficient for high‑precision symbolic interpretation. For instance, in construction documents, a 2-millimeter-wide symbol has a vastly different meaning depending on where it’s placed.
Further, constrained by context limits, probabilistic models struggle with long‑term project memory. “I don’t mean a context window of a few tokens,” Kapoor said. “I’m talking about long term memory that stretches across months and years, because this is how long some of these projects are.”
Instead, Trunk’s three-layer system breaks workflows into:
Perception (reading and extracting data from messy docs like PDFs, drawings, or scans).
A semantic/graph layer (making sense of that data and understanding its relationships).
LLMs and agents on top.
Construction drawings are typically symbolic, Buchner said. A door isn’t always labeled ‘door.’ Sometimes it’s simply an arc on a wall that a trained eye learns to read based on years of practice.
“The perception layer is what teaches AI to read that language,” she said. The semantic layer then gives that information meaning; for instance, connecting the door to the drawing that details it, the spec that governs it, and the trade that installs it. This helps answer project engineers’ critical questions: Not “is there a door here?” but “does this door create a problem down the line?”
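The semantic layer described above can be pictured as a small graph of typed relationships. What follows is a minimal sketch under stated assumptions, not Trunk's ontology or implementation: the class `KnowledgeGraph`, the identifiers (`door:D-101`, `drawing:A-501`, and so on), and the relation names are all hypothetical, chosen only to show how linking a door to its drawing, spec, and trade lets a system ask whether the door creates a downstream problem.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) triples."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self._edges[subject].append((relation, obj))

    def neighbors(self, subject, relation=None):
        """Objects linked from `subject`, optionally filtered by relation."""
        return [o for r, o in self._edges[subject] if relation is None or r == relation]


def build_example_graph():
    # All identifiers below are hypothetical, for illustration only.
    g = KnowledgeGraph()
    g.add("door:D-101", "detailed_in", "drawing:A-501")
    g.add("door:D-101", "governed_by", "spec:door-hardware")
    g.add("door:D-101", "installed_by", "trade:carpentry")
    g.add("spec:door-hardware", "requires", "power:panel-feed")
    return g


def missing_requirements(graph, element, provided):
    """Requirements implied by the element's governing specs that are not in `provided`."""
    missing = []
    for spec in graph.neighbors(element, "governed_by"):
        for requirement in graph.neighbors(spec, "requires"):
            if requirement not in provided:
                missing.append(requirement)
    return missing


graph = build_example_graph()
# Nothing has been provided yet, so the power feed is flagged as missing.
print(missing_requirements(graph, "door:D-101", provided=set()))  # → ['power:panel-feed']
```

The point of the sketch is the traversal: the question is answered by following edges from the door to its spec to the spec's requirements, not by asking a model to guess from a single page.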
Particularly in construction, that shift matters because the cost of a problem compounds with time. “A conflict caught in design is relatively low cost to address,” Buchner said, “whereas the same problem caught in the field might cost tens of thousands of dollars.”
At a high level, the system identifies the document type and begins extracting information based on content (drawing, schedules, paragraph text). This data is then “transformed and augmented” in the platform, which triggers agentic workflows like knowledge graph relationships and end-user workflows.
For instance, an agent might review an architecture bulletin and produce a visual overlay comparing an older version and a newer version (flagging additions and removals), then generate written narratives that describe what those changes are in simple terms. This helps users understand what’s changed and coordinate with trade partners on updated pricing and change orders.
Construction workflows are “ripe with implicit assumptions and connections between data in its myriad of sources,” Buchner said. And the amount of unstructured data is “humanly impossible” to process or make sense of.
Buchner estimated the average high-rise building generates about 3.6 million pages of corresponding documentation. “If you print it into a stack of papers it would be as high as the building itself.”
All three layers of Trunk’s stack — perception, semantic, LLM — are trained on “very specific datasets” from customers with “explicit permissions” and auto‑labeling/IP, Kapoor explained. Customers who don’t want Trunk training on their data can opt out.
Data is deidentified and aggregated, and Trunk also collects “tons more” labeled data through other pipelines like 3D building information modeling (BIM).
Trunk says it only ships agents that achieve around 95% accuracy. The team maintains continuous evaluation pipelines based on ground truth data from customers and experts. They also employ an LLM-as-a-judge model.
“This notion of an LLM as a judge is to score how well you’re doing, both subjectively as well as objectively,” Kapoor said. Objectivity can be an easy ‘right’ or ‘not right,’ but subjectivity requires more nuance.
For instance, when creating an email, narrative, or explanation, an LLM-as-a-judge framework can create a composite score: a numerical value that aggregates different metrics and tests a model’s performance or risk.
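One common way to build such a composite score is a weighted average of per-criterion scores. The sketch below assumes that approach; the criteria names, weights, and the 0.8 pass threshold are hypothetical and are not taken from Trunk's evaluation pipeline.

```python
def composite_score(scores, weights):
    """Weighted average of per-criterion judge scores, each expected in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical judge output for one generated narrative.
judge_scores = {"factual_accuracy": 1.0, "completeness": 0.5, "tone": 0.75}
# Hypothetical weights: accuracy counts twice as much as the other criteria.
weights = {"factual_accuracy": 2.0, "completeness": 1.0, "tone": 1.0}

score = composite_score(judge_scores, weights)
passes = score >= 0.8  # illustrative threshold, an assumption
print(score, passes)  # → 0.8125 True
```

Weighting lets a team encode that some criteria (here, factual accuracy) matter more than subjective ones such as tone, while still reporting a single number per output.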
There can be challenges, though, particularly with latency, Buchner noted; any time the reasoning capacity of underlying models increases, the risk of latency goes up, too. Trunk maintains a set of evaluation criteria to objectively measure latency whenever changes are made to underlying infrastructure, agents, and API calls.
Then, “before we release to customers, we ensure marginal changes to the end-user experience are well worth the performance enhancements,” Buchner said.
Trunk’s platform powers seven AI agents purpose-built for construction tasks such as analyzing request for information (RFI) responses, providing bid overviews, and reviewing drawings and submittals.
The submittal agent, for instance, flags missing, conflicting, or noncompliant information in product specs and RFIs. While it’s an essential step in the construction process, “it’s a super annoying workflow,” Buchner said, because human reviewers have to compare documents “with a bunch of other parts of documents.”
But the agent is able to do this in seconds, and Trunk says it has reduced submittal cycles from between 50 and 60 days to 10, “which has massive schedule and financial implications.”
Trunk is now at a place where these agents are communicating directly with each other, which is “quite exciting,” Buchner said. So, for example, one agent will review an architectural drawing for accuracy, then autonomously hand it over to agents handling RFIs and asking follow-up questions.
“If the drawings have problems, the RFI agent is taking over and is actively reaching out for clarification,” Buchner explained.
Trunk says its customers report savings of 20 to 40 minutes per field question. Buchner said that users in the field know better than anyone how much of a “time suck” it is to go back and forth from office trailers, dig through project documents in scattered systems or printed PDFs, reconcile discrepancies, and return to coordinate with trade partners.
Trunk says its customers report these additional outcomes:
Average 8-minute time savings for single-document retrieval (status checks, location lookups, quantity queries).
Average 20-minute time savings for standard referencing (cross-referencing 2 to 3 spec sections to form an answer).
Average 40-minute time savings for multi-document research (listing and filtering queries, mapping relationships, analyzing RFIs and submittals across 4 to 6 documents).
Average 75-minute time savings for complex tasks (creating RFIs and other communication materials, deep cross-referencing across documents, change tracking).
In one instance, Trunk’s drawing review agent flagged that a structural beam had been moved up 8.5 inches. However, this was not documented by the architect. If the change hadn’t been caught, the project manager would likely have had to strip out and reinstall the right size beam, Buchner said. This rework would have added $10,000 or more to the budget, and “certainly there would have been implications on the schedule.”
Buchner also pointed to other examples: an agent flagged $60,000 in exaggerated pricing with no justification from landscaping subcontractors; identified a fireplace that needed to be sealed prior to drywall installation, saving around $100,000 in labor, materials, and delays; and called out that an electric door required a panel that wasn’t included in electrical drawings.
Trunk’s approach to building agents is applicable to any vertical working with high volumes of unstructured, industry-specific data.
Builders working in specific verticals must understand the industry-specific data challenges their end users face and build technical infrastructure that can transform unstructured data into something an “LLM can traverse and understand,” Buchner said.
“Only then can you build the connections between data points that ultimately feed agentic workflows.”
A lot of money is being invested in foundation models, so enterprises should build modular systems that can leverage the strengths of various models as they continue to improve, Buchner advised.
Then, “build your technical advantage where the generic models are not investing and not performing well,” she said.
Two-thirds of enterprises have hedged their AI model strategy, and the past few weeks of controversy around Anthropic’s Claude Fable 5 model showed why that posture has gone mainstream.
On June 12, a U.S. export-control order pulled Anthropic’s Claude Fable 5 — the most capable model on the market — offline for every customer, with no warning and no timeline. It returned this week wrapped in tighter safeguards, after China’s Z.ai released its open-weights GLM-5.2 into the vacuum. New VentureBeat Pulse Research, which surveyed 145 enterprises across these last few weeks, shows that two-thirds had already hedged their model strategy before the order came down: 51% blend closed frontier models with open-weight models deployed on their own infrastructure, and another 16% are moving core workflows off closed APIs entirely. The remaining third was all-in on closed ecosystems when the lights went out.
The blackout put a spotlight on vendor dependency by showing what happens when the model you rely on disappears. But vendor dependency is only the most visible piece of a deeper problem: Most enterprises lack the monitoring to know when an AI system they’ve put into production stops working correctly.
Just 1 in 10 enterprises has automated monitoring that would catch an AI model drifting, misbehaving, or failing in production. Roughly a quarter would learn of a production failure only when end users — internal or external — report it, or lack the visibility to detect it at all. And 79% of enterprise organizations have already taken a real financial or operational hit from autonomous agents — most often shadow AI, unauthorized agentic work run by enterprises’ own employees on corporate credit cards, outside anyone’s oversight.
We call this the “Control Gap,” or the distance between how aggressively enterprises are deploying AI and how little of it they can see, own, or govern. June’s blackout turned this into a live stress test.
About this data: VentureBeat Pulse Research surveyed 145 qualified respondents at organizations with 100 or more employees in June 2026, with fielding spanning the Fable 5 blackout that began June 12. The sample is self-selected and directional: 41% work in technology/software, 20% are consultants or advisors, and the respondent base skews senior and technical — CIO/CTO/CISOs (18%), directors of engineering/IT (14%), enterprise architects (12%). More than half of the respondents were from companies with 10,000 employees or more.
While our sample is not huge, what you can trust more than the exact percentages is the pattern: Every question in the survey, independently, points the same way, with deployment running ahead of governance, visibility, and cost control.
The full methodology is in the report.
Fable 5 launched June 9 to immediate acclaim — and sticker shock, at $10 per million input tokens and $50 per million output. Three days later, the U.S. government issued an emergency export-control directive barring access by foreign nationals. Anthropic, with no way to verify nationality in real time, suspended the model for everyone.
Z.ai has continued to pick up momentum; on Wednesday it released an open agentic coding environment, called Zcode. OpenAI, meanwhile, previewed its cutting-edge GPT-5.6 line on June 26.
Enterprises had already spent the spring learning what AI dependence costs in dollars. Uber burned through its entire 2026 AI coding budget in four months after Claude Code adoption hit 84% of its roughly 5,000 engineers, Forbes reported. Microsoft canceled most internal Claude Code licenses in its Windows and Microsoft 365 division, steering engineers to its own tooling, according to The Verge.
June added the harder lesson: The model your workflows depend on can vanish overnight, by government order, through no decision of yours or your vendor’s. And Chinese companies like DeepSeek were releasing hugely disruptive, powerful models, driving down costs to a fraction of Western ones.
Brian Craig, senior director of architecture at Liberty IT, the Ireland-based engineering arm of Liberty Mutual, one of the world’s largest insurance companies, saw both lessons collide in real time. Craig is Irish, which meant the export order hit him directly as a foreign-national user.
Onstage at VentureBeat’s AI Impact event in New York on June 24, mid-blackout, I asked him about it. “Fable arrived, and immediately you saw the sticker price of using it, and you went, ‘Ooh, goodness, it better be really good,’” Craig said. “But luckily enough, we didn’t get to use it enough to get to fall in love with it.” Then it was gone.
Craig’s company was built to route around exactly this kind of disruption. Liberty IT runs what it calls an AI backbone — roughly 50 components spanning security, governance, observability, and orchestration, each independently replaceable.
“You can’t lock in right now in one vendor and even one framework,” Craig told the room. “You need to keep being able to have the flexibility with that backbone to be able to hook into different models, different vendors, depending not so much on who’s the flavor of the day, but on what you can feel confident about for the next six months.”
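The flexibility Craig describes, being able to hook into a different model when one becomes unavailable, is often implemented as an ordered fallback behind a single interface. The following is a minimal sketch of that general pattern, not Liberty IT's backbone: `route`, `ModelUnavailable`, and the two stand-in provider functions are hypothetical and make no real API calls.

```python
class ModelUnavailable(Exception):
    """Raised by a provider when its model cannot serve the request."""


def route(prompt, providers):
    """Try each (name, callable) in order and return (name, response) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def closed_frontier(prompt):
    # Stand-in for a hosted model that has been taken offline.
    raise ModelUnavailable("model suspended")


def self_hosted_open_weights(prompt):
    # Stand-in for an open-weight model running on the enterprise's own infrastructure.
    return f"[local] {prompt}"


providers = [
    ("closed-frontier", closed_frontier),
    ("self-hosted", self_hosted_open_weights),
]
name, reply = route("Summarize the claim file.", providers)
print(name, reply)  # → self-hosted [local] Summarize the claim file.
```

Because callers only see `route`, swapping or reordering providers is a configuration change rather than a rewrite of every workflow that depends on a model.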
The survey shows Craig has plenty of company. A 51% majority of enterprises run a hybrid posture — closed frontier models for general reasoning, open-weight models deployed locally for specialized execution — and 16% are making a hard pivot, moving core workflows onto open weights running on their own hybrid or private cloud. The 32% holding a closed commitment are candid about why: The operational overhead of self-hosting still outweighs the savings for them. After June, that calculus has a new variable in it.
Defection is now the active posture, and the target may surprise you. Asked which primary AI vendor they are most likely to downsize or phase out over the next 12 months, respondents named Microsoft first at 30% — most citing cutbacks to Copilot and Azure AI frameworks in favor of direct model access — ahead of the 28% who plan to trim no vendor at all. OpenAI drew 21%, largely on pricing volatility, with Anthropic at 15% and Google at 6%. No vendor faces an exodus. But loyalty by inertia has ended: Among these enterprises, actively cutting at least one provider is now more common than expanding across all of them.
How would an enterprise know if one of its production AI models was drifting, behaving unsafely, or failing to complete tasks? We asked directly. Forty percent say they are very confident they would detect it. The question also asked what that confidence rests on, and respondents split into two camps: 30% rely on humans reviewing critical AI outputs, and just 10% — 14 of the 145 organizations — have automated monitoring and alerting running against production systems. The remaining respondents hold weaker positions still: 32% expect to catch most issues “eventually,” 19% say they would likely hear about a failure from end users first, and 8% report no systematic visibility into production AI behavior at all.
That distinction matters because the two approaches are very different. Human review may seem like the gold standard, but it only reaches the outputs someone designates as important for such a review — and it happens at the pace humans can move at, with the inconsistency any manual process carries. Automated monitoring watches everything the system produces, continuously, and flags anomalies as they happen — for the same reason enterprises stopped depending on manual checks for uptime and security a decade ago.
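As a rough illustration of what "watching everything continuously" can mean in practice, the sketch below tracks a rolling failure rate over recent production outputs and signals an alert when it crosses a threshold. It is a deliberately simple example under assumed parameters (the class name, window size, and threshold are hypothetical); real monitoring would track many more signals than pass or fail.

```python
from collections import deque


class FailureRateMonitor:
    """Alert when the failure rate over the last `window` outcomes exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.1):
        self.outcomes = deque(maxlen=window)  # 1 = failure, 0 = success
        self.threshold = threshold

    def record(self, ok):
        """Record one outcome and return True if an alert should fire."""
        self.outcomes.append(0 if ok else 1)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        failure_rate = sum(self.outcomes) / len(self.outcomes)
        # Wait for a full window so a single early failure does not trigger an alert.
        return window_full and failure_rate > self.threshold


monitor = FailureRateMonitor(window=10, threshold=0.2)
# Ten successful outputs followed by three failures in a row.
outcomes = [True] * 10 + [False] * 3
alerts = [monitor.record(ok) for ok in outcomes]
print(alerts.index(True))  # → 12 (the alert fires on the third consecutive failure)
```

Unlike a sampled human review, every output passes through `record`, so the check scales with volume rather than with reviewer headcount.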
As agentic workloads multiply output volumes far beyond what any review team can read, the manual approach starts to fall behind. The leaders at our June 24 event in New York treat human review as a designed control with automation underneath it. “Nothing gets deployed into production unless it’s a human actually reviewing it and signing off,” Craig said of Liberty’s agentic software factory, where planning, coding, testing, critic, and librarian agents ship features from epic to production.
“It always has to be risk-based. That’s why we work for an insurance company,” Craig added. Todd Johnson, the Morgan Stanley managing director who runs agentic AI across the bank’s end-of-day P&L controller process, described the same principle from finance: “One of our strong principles in our AI governance generally is that there always has to be human accountability, even if there’s a degree of automation.” VentureBeat covered Morgan Stanley’s new results around its P&L resolution agent system separately.
Liberty Mutual and Morgan Stanley chose manual sign-off deliberately, layered on top of observability, identity, and governance infrastructure. Whether the human-review camp has similar infrastructure underneath is more than a single-select question can establish. The 16% who separately named missing observability tooling as their biggest governance barrier are the ones saying outright that it hasn’t been built.
Why does the AI visibility tooling never get built? The respondents’ answers suggest it is an organizational shortcoming. The single most-cited barrier to governing AI across platforms is the absence of a single owner or accountable team, at 32%. Vendor opacity follows at 25%, missing tooling at 16% — and a lack of talent lands dead last at 5%.
The skills exist, but the organizational mandate does not: Only 38% say a central team actually governs AI behavior across their platforms today, 21% say ownership is unclear or actively contested between teams, and 17% say no role holds formal accountability at all.
The AI surface being governed makes the vacuum worse. Fully 85% of enterprises run two or more platforms each claiming to be the “primary” AI layer — ERP, ITSM, productivity suite, data platform, each with its own AI, its own controls, and its own assumptions. 36% describe an open contest between four or more. Just 8% have consolidated to one. Asked in a free-text question what one thing they would fix, respondents converged from different directions on the same answer: a single accountable owner, and a control plane that abstracts cost, drift, and model choice away from the end user.
The cost of the vacuum is showing up on corporate cards.
Asked to name the most severe financial or operational control failure they have experienced from autonomous agents, 49% of enterprises cite shadow AI — departmental teams running unauthorized agentic pipelines on corporate credit cards, bypassing central financial oversight entirely. Another 25% have been hit by an infinite-loop bill, an uncaught recursive workflow racking up thousands in token costs in a single incident, and 6% by an agent that degraded production databases with unthrottled queries. Only 21% report guarded stability, with hard token throttling and budget caps at the infrastructure layer. Add it up: 79% of these enterprises have already paid for an agent control failure in real money or real downtime.
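A hard cap of the kind the 21% describe can be sketched in a few lines. The version below runs in-process for readability, whereas the respondents place theirs at the infrastructure layer; the names `TokenBudget`, `BudgetExceeded`, and `run_agent` are hypothetical, and `step_fn` stands in for one agent iteration.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run would exceed its hard limits."""


class TokenBudget:
    """Hard cap on tokens and steps for one agent run (hypothetical sketch)."""

    def __init__(self, max_tokens: int, max_steps: int):
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.tokens_used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Account for one model call; stop the run if a limit is exceeded."""
        self.steps += 1
        self.tokens_used += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")


def run_agent(step_fn, budget: TokenBudget) -> int:
    """Call step_fn until it reports completion or the budget trips.

    step_fn returns (tokens_consumed, done) for one iteration.
    """
    while True:
        tokens, done = step_fn()
        budget.charge(tokens)
        if done:
            return budget.tokens_used
```

With a guard like this, a workflow that never reports completion is cut off after a bounded number of calls instead of running until someone notices the bill.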
Finally, the economics of tokens suggest the pressure will keep rising. Per-token inference costs are falling 70 to 80% a year, but agentic workloads consume 100 to 500 times the tokens of the LLM tools they replaced, so total spend can climb even as unit prices drop.
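The two ranges can be combined into a back-of-envelope multiplier. This sketch assumes both figures apply to the same workload and that spend scales linearly with tokens, neither of which the survey claims; the function name is illustrative.

```python
def spend_multiplier(price_drop: float, token_growth: float) -> float:
    """Net change in spend when prices fall and token volume grows.

    price_drop: fractional fall in per-token price (0.7 means 70% cheaper).
    token_growth: how many times more tokens the agentic workload uses.
    """
    return (1.0 - price_drop) * token_growth


# Best case: the steepest price drop paired with the smallest token growth.
low = spend_multiplier(0.80, 100)
# Worst case: the shallowest price drop paired with the largest token growth.
high = spend_multiplier(0.70, 500)
print(f"net spend multiplier: {low:.0f}x to {high:.0f}x")
```

Under those assumptions, spend rises roughly 20 to 150 times despite the falling unit price.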
Brian Gracely, senior director of portfolio strategy at Red Hat, told our New York audience the answer starts with right-sizing: “If I’m simply trying to resolve an insurance claim, I don’t need to know about the history of Western civilization in my model. I don’t need to know soccer scores.”
Enterprises are pairing smaller, specialized models with semantic routing, he said, so the platform decides which requests genuinely need frontier-scale reasoning — and which are burning premium tokens on commodity work. (One adjacent data point from the survey underlines the appetite for pragmatism: 73% of enterprises report little or nothing to show for their custom fine-tuning investments of the past 18 months — a reckoning we’ll examine in its own report.)
The survey describes enterprises moving fast on AI with weak controls underneath. 58% are adding more AI initiatives than they retire. 85% run multiple platforms that each claim to be the primary AI layer. Three times as many enterprises rely on human review to catch a failing production model as have automated monitoring in place. And 79% have already paid for an agent control failure — most often unauthorized agent spending on corporate cards, outside IT’s oversight.
On one problem, enterprises have clearly adapted: model dependency. Two-thirds hedge their model strategy, either running open-weight models alongside closed ones (51%) or moving core workflows off closed APIs entirely (16%). The Fable 5 shutdown showed the value of that position — the hedged companies could route around a model that a government order made unavailable overnight.
The remaining problems are internal, and no purchase fixes them: 32% name the lack of a single accountable owner as their top governance barrier, and 17% say no role holds formal accountability for AI at all. Assigning an owner costs nothing and requires no vendor. It still hasn’t happened at most of these companies.
Our coming Q3 wave of research will measure whether June changed this — whether enterprises assigned owners and installed automated monitoring, or just added a second model and moved on.
Disclosure: VentureBeat’s June 24 AI Impact event in New York was sponsored by Red Hat and Intel. Sponsors have no input into VentureBeat Pulse Research survey design, findings, or editorial coverage.
Two-thirds of enterprises have hedged their AI model strategy, and the past few weeks of controversy around Anthropic’s Claude Fable 5 model showed why that posture has gone mainstream.
On June 12, a U.S. export-control order pulled Anthropic’s Claude Fable 5 — the most capable model on the market — offline for every customer, with no warning and no timeline. It returned this week wrapped in tighter safeguards, after China’s Z.ai released its open-weights GLM-5.2 into the vacuum. New VentureBeat Pulse Research, which surveyed 145 enterprises across these last few weeks, shows that two-thirds had already hedged their model strategy before the order came down: 51% blend closed frontier models with open-weight models deployed on their own infrastructure, and another 16% are moving core workflows off closed APIs entirely. The remaining third was all-in on closed ecosystems when the lights went out.
The blackout put a spotlight on vendor dependency, by showing what happens when the model you rely on disappears. But vendor dependency is only the most visible piece of a deeper problem: Most enterprises lack the monitoring to know when an AI system they’ve put into production stops working correctly.
Just 1 in 10 enterprises has automated monitoring that would catch an AI model drifting, misbehaving, or failing in production. Roughly a quarter would learn of a production failure only when end users — internal or external — report it, or lack the visibility to detect it at all. And 79% of enterprise organizations have already taken a real financial or operational hit from autonomous agents — most often shadow AI, unauthorized agentic work run by enterprises’ own employees on corporate credit cards, outside anyone’s oversight.
We call this the “Control Gap,” or the distance between how aggressively enterprises are deploying AI and how little of it they can see, own, or govern. June’s blackout turned this into a live stress test.
About this data: VentureBeat Pulse Research surveyed 145 qualified respondents at organizations with 100 or more employees in June 2026, with fielding spanning the Fable 5 blackout that began June 12. The sample is self-selected and directional: 41% work in technology/software, 20% are consultants or advisors, and the respondent base skews senior and technical — CIO/CTO/CISOs (18%), directors of engineering/IT (14%), enterprise architects (12%). More than half of the respondents were from companies with 2,500 employees or more.
The sample is modest, so trust the pattern more than the exact percentages: Every question in the survey independently points the same way, with deployment running ahead of governance, visibility, and cost control.
The full methodology is in the report.
Fable 5 launched June 9 to immediate acclaim — and sticker shock, at $10 per million input tokens and $50 per million output. Three days later, the U.S. government issued an emergency export-control directive barring access by foreign nationals. Anthropic, with no way to verify nationality in real time, suspended the model for everyone.
Z.ai has continued to pick up momentum; on Wednesday it released an open agentic coding environment, called Zcode. OpenAI, meanwhile, previewed its cutting-edge GPT-5.6 line on June 26.
Enterprises had already spent the spring learning what AI dependence costs in dollars. Uber burned through its entire 2026 AI coding budget in four months after Claude Code adoption hit 84% of its roughly 5,000 engineers, Forbes reported. Microsoft canceled most internal Claude Code licenses in its Windows and Microsoft 365 division, steering engineers to its own tooling, according to The Verge.
June added the harder lesson: The model your workflows depend on can vanish overnight, by government order, through no decision of yours or your vendor’s. Meanwhile, Chinese companies like DeepSeek were releasing powerful, highly disruptive models at a fraction of the cost of Western ones.
Brian Craig, senior director of architecture at Liberty IT, the Ireland-based engineering arm of Liberty Mutual, one of the world’s largest insurance companies, saw both lessons collide in real time. Craig is Irish, which meant the export order hit him directly as a foreign-national user.
Onstage at VentureBeat’s AI Impact event in New York on June 24, mid-blackout, I asked him about it. “Fable arrived, and immediately you saw the sticker price of using it, and you went, ‘Ooh, goodness, it better be really good,’” Craig said. “But luckily enough, we didn’t get to use it enough to get to fall in love with it.” Then it was gone.
Craig’s company was built to route around exactly this kind of disruption. Liberty IT runs what it calls an AI backbone — roughly 50 components spanning security, governance, observability, and orchestration, each independently replaceable.
“You can’t lock in right now in one vendor and even one framework,” Craig told the room. “You need to keep being able to have the flexibility with that backbone to be able to hook into different models, different vendors, depending not so much on who’s the flavor of the day, but on what you can feel confident about for the next six months.”
The survey shows Craig has plenty of company. A 51% majority of enterprises run a hybrid posture — closed frontier models for general reasoning, open-weight models deployed locally for specialized execution — and 16% are making a hard pivot, moving core workflows onto open weights running on their own hybrid or private cloud. The 32% holding a closed commitment are candid about why: The operational overhead of self-hosting still outweighs the savings for them. After June, that calculus has a new variable in it.
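One way to picture a hedged posture in code is a thin layer that tries model providers in priority order and falls through when one is unavailable. This is an illustrative sketch only: Liberty IT has not published its design, and the `ModelUnavailable` exception, the `Provider` alias, and the function name are hypothetical.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by a provider callable when its model cannot be reached."""


Provider = tuple[str, Callable[[str], str]]  # (name, completion function)


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, answer)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")  # remember why we skipped it
    raise ModelUnavailable("all providers failed: " + "; ".join(errors))
```

The point of the design is that callers depend on the wrapper rather than on any one vendor, so swapping or reordering providers is a configuration change and not a rewrite.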
Defection is now the active posture, and the target may surprise you. Asked which primary AI vendor they are most likely to downsize or phase out over the next 12 months, respondents named Microsoft first at 30% — most citing cutbacks to Copilot and Azure AI frameworks in favor of direct model access — ahead of the 28% who plan to trim no vendor at all. OpenAI drew 21%, largely on pricing volatility, with Anthropic at 15% and Google at 6%. No vendor faces an exodus. But loyalty by inertia has ended: Among these enterprises, actively cutting at least one provider is now more common than expanding across all of them.
As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing subtasks to the right tools and skills. Agents can have hundreds of tools and skills and get confused about which one to use for each step of a workflow.
To address this challenge, researchers at Alibaba developed SkillWeaver, a framework that creates an execution graph for a given task and chooses the right skills for each of the nodes. They also introduce Skill-Aware Decomposition (SAD), a novel technique that uses a feedback loop to enable the agent to fetch and vet relevant tool candidates iteratively. This compositional approach and feedback loop distinguish SkillWeaver from other tool-routing frameworks that choose tools in a one-shot fashion.
SkillWeaver applies to real-world AI applications where agents autonomously orchestrate multi-tool ecosystems, such as those built on the Model Context Protocol (MCP), to execute multi-step business operations like downloading datasets, transforming information, and creating visual reports.
In practice, the researchers’ experiments with SkillWeaver show that implementing this retrieve-and-route approach significantly increases accuracy while reducing token consumption by over 99% compared to naively exposing agents to an entire tool library.
For practitioners building AI agents, the main takeaway is that the granularity of task decomposition is the biggest bottleneck to accurate tool retrieval.
Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification that uses structured natural language documentation.
As enterprise agents integrate with massive tool ecosystems, accurately routing user queries to the right skills becomes a difficult task. Exposing an entire library to an LLM to find the right tool is highly inefficient, quickly overwhelms context limits, and consumes hundreds of thousands of tokens.
Most current tool-use frameworks attempt to solve this through API retrieval, documentation matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.
However, this single-skill paradigm is insufficient for enterprise environments because real-world queries are inherently compositional. A standard business request such as “Download the dataset, transform it, and create visual reports” cannot be fulfilled by one tool. It requires breaking the prompt down and sequencing an API client, a data processor, and a visualization tool into a cohesive, multi-step execution plan.
To tackle this, the researchers frame the problem of handling complex tasks that require multiple skills as “compositional skill routing.” Given a complex user prompt and a vast library of tools, an agent must simultaneously figure out how to break the request into a sequence of atomic sub-tasks, how to map each sub-task to the single best available skill, and how to compose those skills into an executable plan.
SkillWeaver orchestrates this process through three distinct stages: Decompose, Retrieve, and Compose. In the first stage, an LLM acts as a task decomposer, breaking the user’s complex query down into a sequence of sub-tasks that each require one skill. In the second stage, the system uses an embedding model to compare each sub-task against the skill library and pull a shortlist of the top candidate tools for each step.
In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks for inter-skill compatibility to ensure the outputs of one tool naturally flow into the inputs of the next. It then creates a final execution plan as a Directed Acyclic Graph (DAG) that maps out dependencies so independent tasks can potentially execute in parallel.
For example, consider a user asking an AI agent to “Download the dataset, transform it, and create visual reports.” In the decompose stage, the decomposer LLM breaks this into three distinct sub-tasks: downloading the dataset, transforming the data, and creating the reports.
In the retrieve stage, the system searches the library and finds candidates like “api-client” or “http-fetch” for task one, “csv-parser” or “etl-pipeline” for task two, and so on. Finally, the compose stage evaluates these options, selects the specific combination of “api-client,” “csv-parser,” and “chart-gen” that are most compatible, and wires them together into a final, ready-to-execute workflow.
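The walkthrough above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the skill descriptions, the `consumes`/`produces` type labels, and the word-overlap `similarity` function, which stands in for the embedding model the researchers use. The Decompose stage is hard-coded where SkillWeaver would call an LLM.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Skill:
    name: str
    description: str
    consumes: str  # input type (hypothetical metadata)
    produces: str  # output type (hypothetical metadata)


LIBRARY = [
    Skill("api-client", "download dataset from api", "url", "csv"),
    Skill("http-fetch", "download raw page over http", "url", "html"),
    Skill("csv-parser", "transform csv data into a table", "csv", "table"),
    Skill("etl-pipeline", "transform warehouse data", "sql", "table"),
    Skill("chart-gen", "create visual reports from a table", "table", "png"),
]


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap, a toy stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def retrieve(subtask: str, k: int = 2) -> list[Skill]:
    """Retrieve stage: shortlist the k skills most similar to the sub-task."""
    ranked = sorted(LIBRARY, key=lambda s: similarity(subtask, s.description), reverse=True)
    return ranked[:k]


def compose(shortlists: list[list[Skill]]) -> list[Skill]:
    """Compose stage: pick the combination whose outputs feed the next inputs."""
    def compatible_links(chain):
        return sum(a.produces == b.consumes for a, b in zip(chain, chain[1:]))
    return list(max(product(*shortlists), key=compatible_links))


# Decompose stage: in SkillWeaver an LLM produces these; hard-coded here.
subtasks = ["download the dataset", "transform the data", "create visual reports"]
plan = compose([retrieve(t) for t in subtasks])
# A linear chain is the simplest DAG: each step depends on the one before it.
dag = {b.name: [a.name] for a, b in zip(plan, plan[1:])}
```

In this toy library, “etl-pipeline” is the closest text match for the second sub-task, but the compose step picks “csv-parser” because it accepts what “api-client” produces. That is the compatibility check doing its job.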
A key challenge of this pipeline is that LLMs often produce generic step descriptions that fail to match the specific, technical vocabulary of the actual skills available in the library. To fix this, SkillWeaver introduces Iterative Skill-Aware Decomposition (SAD), a novel feedback loop. SAD works by having the LLM draft an initial plan, conducting a preliminary search to find loosely matching skills, and then feeding those retrieved skills back into the LLM as hints. This allows the LLM to rewrite its decomposition so that its granularity and vocabulary align with the tools that actually exist.
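The loop described above might look roughly like the following. This is a sketch, not the authors' implementation: the prompt wording is invented here and does not reproduce the paper's templates, and `llm` and `search` are hypothetical callables supplied by the caller.

```python
from typing import Callable


def skill_aware_decompose(
    query: str,
    llm: Callable[[str], list[str]],
    search: Callable[[str], list[str]],
    rounds: int = 2,
) -> list[str]:
    """Draft sub-tasks, retrieve loosely matching skills, then redraft with hints."""
    subtasks = llm(f"Break this request into sub-tasks:\n{query}")
    for _ in range(rounds):
        # Preliminary search: collect skill names that loosely match the draft.
        hints = sorted({name for task in subtasks for name in search(task)})
        prompt = (
            f"Request: {query}\n"
            f"Draft sub-tasks: {subtasks}\n"
            f"Available skills: {hints}\n"
            "Rewrite the sub-tasks so each one maps to exactly one available skill."
        )
        revised = llm(prompt)
        if revised == subtasks:
            break  # the plan has stopped changing
        subtasks = revised
    return subtasks
```

The early exit when the plan stops changing is a design choice made for this sketch to avoid paying for redundant LLM calls.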
To evaluate how SkillWeaver performs in realistic enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of different difficulty levels. To mirror real-world environments, they used a library of 2,209 real-world skills sourced from the public MCP ecosystem, covering 24 functional categories like cloud infrastructure, finance, and databases.
For the core engine, the researchers primarily used a lightweight 7-billion parameter model (Qwen2.5-7B-Instruct) for task decomposition, paired with a standard semantic search retriever (MiniLM with a FAISS index) to find the tools. SkillWeaver was evaluated against three main setups: a brute-force “LLM-Direct” method where they stuffed all the tool names into the prompt of a large model, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.
The experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short when dealing with large tool libraries, but the SAD feedback loop substantially improves results. In the vanilla setup, the 7B model reached a decomposition accuracy (the share of queries for which it predicted the correct number of steps) of only 51.0%. With the SAD feedback loop activated, accuracy rose to 67.7% (with the larger Qwen-Max model, it reached 92%). On “hard” tasks requiring four to five distinct skills, SAD improved accuracy by 50%.
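For readers who want to track the same metric on their own agents, decomposition accuracy as defined above reduces to a step-count match rate. The function name and the toy data below are illustrative and do not come from the paper.

```python
def decomposition_accuracy(predicted_steps: list[int], gold_steps: list[int]) -> float:
    """Share of queries where the predicted number of steps matches the reference."""
    if len(predicted_steps) != len(gold_steps):
        raise ValueError("need one prediction per query")
    matches = sum(p == g for p, g in zip(predicted_steps, gold_steps))
    return matches / len(gold_steps)


# Toy data, not from the paper: four queries, three decomposed to the right length.
acc = decomposition_accuracy([3, 2, 5, 4], [3, 2, 4, 4])

# Reported figures for the 7B model: 51.0% without SAD, 67.7% with it.
absolute_gain = 67.7 - 51.0           # percentage points
relative_gain = absolute_gain / 51.0  # fraction of the starting accuracy
```

The 7B result works out to a gain of about 16.7 percentage points, or roughly a third in relative terms. Whether the 50% figure for hard tasks is relative or absolute is not specified here.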
One fascinating finding was that larger models can actually perform worse when unguided. When tested in the vanilla setup, a larger 14-billion parameter model saw its accuracy plummet below the 7B model’s accuracy because it tended to over-decompose tasks into microscopic, unnecessary steps. Once SAD was introduced, the retrieved tool hints anchored the model back to reality and increased its accuracy. This suggests that aligning an agent with the vocabulary of specific tools is often more impactful than paying for a larger, more expensive LLM.
Another important takeaway is token savings. The LLM-Direct baseline, which used the very large Qwen-Max model, showed that feeding all tools into the prompt of a large model fails. Despite near-perfect task breakdown capabilities, the massive model only retrieved the right tool category 21.1% of the time when flooded with tool options. SkillWeaver’s targeted retrieve-and-route approach vastly outperformed this in accuracy while slashing context window consumption from an estimated 884,000 tokens down to roughly 1,160 tokens per query, a 99.9% reduction. For practitioners, this translates directly to drastically lower API costs and faster response times.
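The reduction figure follows directly from the two token estimates, as this quick check shows. The variable names are illustrative.

```python
full_library_tokens = 884_000  # estimated prompt size with every tool exposed
routed_tokens = 1_160          # approximate prompt size with retrieve-and-route

reduction = 1 - routed_tokens / full_library_tokens
print(f"{reduction:.1%}")
```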
Finally, the traditional ReAct baseline completely failed, achieving 0% decomposition accuracy. Its loop naturally collapses multi-step plans into isolated actions rather than explicitly mapping out a cohesive, multi-tool sequence.
While the researchers have not yet released the source code for SkillWeaver, the framework is built from off-the-shelf components and should be straightforward to reproduce.
Skill-Aware Decomposition (SAD), which is the key innovation at the heart of the framework, is a clever prompt-engineering and retrieval loop. The authors have shared the prompt templates in their paper, and developers can implement it themselves quite easily using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.
As for the retrieval component, the authors built the core framework using all-MiniLM-L6-v2, an open-source embedding model. They found that swapping in a slightly stronger off-the-shelf encoder (BGE-base-en-v1.5) immediately boosted accuracy without any fine-tuning. While an off-the-shelf bi-encoder is great at getting a relevant tool into the top 10 candidates nearly 70% of the time, it struggles to consistently rank the perfect tool at exactly number one, achieving that only about 37% of the time. To bridge this gap, teams will likely need to implement a secondary cross-encoder or LLM-based reranker to re-order those top 10 candidates.
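A two-stage setup of that kind, plus the metric for measuring the gap between top-10 and top-1, can be sketched as follows. The `score` callable is a placeholder for whatever cross-encoder or LLM-based reranker a team chooses; the function names are hypothetical.

```python
from typing import Callable


def rerank(
    subtask: str,
    candidates: list[str],
    score: Callable[[str, str], float],
) -> list[str]:
    """Second stage: reorder the retriever's shortlist with a stronger scorer."""
    return sorted(candidates, key=lambda skill: score(subtask, skill), reverse=True)


def hit_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Share of sub-tasks whose correct skill appears in the top k results."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)
```

Comparing `hit_at_k` at `k=1` before and after reranking shows whether the second stage is earning its extra latency.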
One upfront preparation requirement is vectorizing the tool library and building a FAISS index in advance. In practice, this is a negligible hurdle. Embedding and indexing all 2,209 skills in the benchmark took a mere 15 seconds. Once built, retrieving tools from the index adds less than 15 milliseconds of latency per query. For enterprise environments, syncing the tool index is a trivial background job.
A current limitation in SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps out a compatible DAG for execution, the authors’ pilot study revealed the challenges of multi-step tool chains. For example, if an API call fails in step two, the entire chain breaks. The paper’s core contribution is limited to the routing and planning phase. For a true production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the compose stage to handle real-world API timeouts or malformed outputs.
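What that extra layer could look like, in its simplest form, is a per-step wrapper with retries and an optional fallback skill. This sketch is not part of SkillWeaver; the function names, the retry policy, and the linear-chain executor are all assumptions.

```python
import time
from typing import Any, Callable, Optional


def run_step(
    primary: Callable[[Any], Any],
    fallback: Optional[Callable[[Any], Any]],
    payload: Any,
    retries: int = 2,
    backoff_s: float = 0.0,
) -> Any:
    """Run one plan step with retries, then a fallback skill if one exists."""
    last_error: Optional[Exception] = None
    for attempt in range(retries + 1):
        try:
            return primary(payload)
        except Exception as exc:  # broad on purpose: any tool failure counts
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    if fallback is not None:
        return fallback(payload)
    raise RuntimeError("step failed after retries") from last_error


def run_chain(steps: list[tuple[Callable, Optional[Callable]]], payload: Any) -> Any:
    """Execute a linear plan, feeding each step's output into the next."""
    for primary, fallback in steps:
        payload = run_step(primary, fallback, payload)
    return payload
```

A natural source of fallbacks is the retrieve stage itself, since it already produces a ranked shortlist of alternative skills for each step.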
As enterprise AI systems scale to handle complex workflows, practitioners face the challenge of routing subtasks to the right tools and skills. Agents can have hundreds of tools and skills and get confused on which one to use for each step of a workflow.
To address this challenge, researchers at Alibaba developed SkillWeaver, a framework that creates an execution graph for a given task and chooses the right skills for each of the nodes. They also introduce Skill-Aware Decomposition (SAD), a novel technique that uses a feedback loop to enable the agent to fetch and vet relevant tool candidates iteratively. This compositional approach and feedback loop mechanism distinguishes SkillWeaver from other tool-routing frameworks that choose tools in a one-shot fashion.
SkillWeaver relates to real-world AI applications where agents autonomously orchestrate multi-tool ecosystems, such as the Model Context Protocol (MCP), to execute multi-step business operations like downloading datasets, transforming information, and creating visual reports.
In practice, the researchers’ experiments with SkillWeaver show that implementing this retrieve-and-route approach significantly increases accuracy while reducing token consumption by over 99% compared to naively exposing agents to an entire tool library.
For practitioners building AI agents, the main takeaway is that the granularity of task decomposition is the biggest bottleneck to accurate tool retrieval.
Skills are a key pattern in modern LLM agent architectures. A skill is a modular, reusable tool specification that uses structured natural language documentation.
As enterprise agents integrate with massive tool ecosystems, accurately routing user queries to the right skills becomes a difficult task. Exposing an entire library to an LLM to find the right tool is highly inefficient, quickly overwhelms context limits, and consumes hundreds of thousands of tokens.
Most current tool-use frameworks attempt to solve this through API retrieval, documentation matching, or hierarchical structures that treat routing strictly as a single-skill selection or per-step problem.
However, this single-skill paradigm is insufficient for enterprise environments because real-world queries are inherently compositional. A standard business request such as “Download the dataset, transform it, and create visual reports” cannot be fulfilled by one tool. It requires breaking the prompt down and sequencing an API client, a data processor, and a visualization tool into a cohesive, multi-step execution plan.
To tackle this, the researchers frame the problem of handling complex tasks that require multiple skills as “compositional skill routing.” Given a complex user prompt and a vast library of tools, an agent must simultaneously figure out how to break the request into a sequence of atomic sub-tasks, how to map each sub-task to the single best available skill, and how to compose those skills into an executable plan.
SkillWeaver orchestrates this process through three distinct stages: Decompose, Retrieve, and Compose. In the first stage, an LLM acts as a task decomposer, breaking the user’s complex query down into a sequence of sub-tasks that each require one skill. Once the sub-tasks are clearly defined, the system uses an embedding model to compare each subtask against the skill library to pull a shortlist of the top candidate tools for each step.
In the final stage, a planner evaluates the retrieved candidates based on how well they work together. It checks for inter-skill compatibility to ensure the outputs of one tool naturally flow into the inputs of the next. It then creates a final execution plan as a Directed Acyclic Graph (DAG) that maps out dependencies so independent tasks can potentially execute in parallel.
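The dependency structure described above can be sketched with Python's standard-library graphlib module. The step names and the dependency map below are illustrative assumptions, not taken from the paper; the sketch only shows how a DAG lets independent steps be grouped into batches that could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each key is a step, each value is the set of steps it depends on.
plan = {
    "fetch_sales": set(),
    "fetch_inventory": set(),
    "join_tables": {"fetch_sales", "fetch_inventory"},
    "render_chart": {"join_tables"},
}

def parallel_batches(dependencies):
    """Group steps into batches; every step in a batch has all its dependencies done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are complete
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(plan)
print(batches)
```

Here the two fetch steps have no dependencies, so they land in the same batch and could be dispatched concurrently, while the join and the chart each wait for the batch before them.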
For example, consider a user asking an AI agent to “Download the dataset, transform it, and create visual reports.” In the decompose stage, the decomposer LLM breaks this into three distinct sub-tasks: downloading the dataset, transforming the data, and creating the reports.
In the retrieve stage, the system searches the library and finds candidates like “api-client” or “http-fetch” for task one, “csv-parser” or “etl-pipeline” for task two, and so on. Finally, the compose stage evaluates these options, selects the combination of “api-client,” “csv-parser,” and “chart-gen” as the most compatible, and wires the three skills together into a final, ready-to-execute workflow.
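As a rough illustration of the retrieve and compose stages in this example, the toy sketch below shortlists skills per sub-task and then keeps the best-scoring chain whose outputs match the next skill's inputs. The skill descriptions, the input and output types, and the word-overlap scorer are all assumptions made for the sketch; the paper uses an embedding model for retrieval, and its compatibility check is not reproduced here.

```python
from itertools import product

# Hypothetical skill library: a description plus the data type each skill consumes and produces.
SKILLS = {
    "api-client": {"desc": "download dataset from remote api", "in": "url", "out": "csv"},
    "http-fetch": {"desc": "download raw web page over http", "in": "url", "out": "html"},
    "csv-parser": {"desc": "transform and clean csv data", "in": "csv", "out": "table"},
    "etl-pipeline": {"desc": "transform data between warehouse tables", "in": "table", "out": "table"},
    "chart-gen": {"desc": "create visual reports and charts", "in": "table", "out": "png"},
}

def similarity(a, b):
    """Jaccard word overlap, a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def retrieve(sub_task, k=2):
    """Return the top-k (score, skill name) candidates for one sub-task."""
    scored = sorted(
        ((similarity(sub_task, spec["desc"]), name) for name, spec in SKILLS.items()),
        reverse=True,
    )
    return scored[:k]

def compose(sub_tasks, k=2):
    """Pick the highest-scoring chain in which each output type feeds the next input type."""
    shortlists = [retrieve(task, k) for task in sub_tasks]
    best_score, best_chain = -1.0, None
    for combo in product(*shortlists):
        names = [name for _, name in combo]
        compatible = all(
            SKILLS[a]["out"] == SKILLS[b]["in"] for a, b in zip(names, names[1:])
        )
        score = sum(s for s, _ in combo)
        if compatible and score > best_score:
            best_score, best_chain = score, names
    return best_chain

sub_tasks = ["download the dataset", "transform the data", "create visual reports"]
chain = compose(sub_tasks)
print(chain)  # ['api-client', 'csv-parser', 'chart-gen']
```

In this toy library, "etl-pipeline" scores as well as "csv-parser" on the second sub-task, but it expects a table rather than the CSV that "api-client" produces, so the compatibility check is what settles the choice.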
A key challenge of this pipeline is that LLMs often produce generic step descriptions that fail to match the specific, technical vocabulary of the actual skills available in the library. To fix this, SkillWeaver introduces Iterative Skill-Aware Decomposition (SAD), a novel feedback loop. SAD works by having the LLM draft an initial plan, conducting a preliminary search to find loosely matching skills, and then feeding those retrieved skills back into the LLM as hints. This allows the LLM to rewrite its decomposition so the granularity and vocabulary perfectly align with the actual tools that exist.
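The feedback loop can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: `llm` and `search` are hypothetical callables supplied by the caller, the round limit is arbitrary, and the stand-ins below replace a real model and a real retriever purely so the loop can be run end to end.

```python
def skill_aware_decompose(query, llm, search, rounds=2, k=3):
    """Draft a decomposition, then redraft it with retrieved skills as hints."""
    steps = llm(query, hints=[])  # initial, unguided draft
    for _ in range(rounds):
        hints = []
        for step in steps:
            for skill in search(step, k):  # loose preliminary retrieval per step
                if skill not in hints:
                    hints.append(skill)
        revised = llm(query, hints=hints)  # redraft with skill names as hints
        if revised == steps:  # stop early once the plan stabilizes
            break
        steps = revised
    return steps

# --- Stand-ins for illustration only (assumptions, not real model behavior) ---
KEYWORDS = {
    "api-client": ["download", "connection", "api-client"],
    "csv-parser": ["parse", "clean", "csv-parser"],
    "chart-gen": ["chart"],
}

def fake_search(step, k):
    """Return skills whose keywords appear in the step text."""
    return [name for name, words in KEYWORDS.items() if any(w in step for w in words)][:k]

def fake_llm(query, hints):
    """Over-decompose when unguided; emit one step per hinted skill otherwise."""
    if not hints:
        return ["open connection", "download file", "parse rows", "clean rows", "draw chart"]
    return [f"use {name}" for name in hints]

steps = skill_aware_decompose("Download the dataset, transform it, and create visual reports",
                              fake_llm, fake_search)
print(steps)
```

With these stand-ins, the unguided draft has five generic steps, and the redraft collapses to three steps phrased in the library's own vocabulary, which is the behavior the paper attributes to SAD.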
To evaluate how SkillWeaver performs in realistic enterprise scenarios, the researchers created a custom benchmark called CompSkillBench. It consists of 300 multi-step queries of different difficulty levels. To mirror real-world environments, they used a library of 2,209 real-world skills sourced from the public MCP ecosystem, covering 24 functional categories like cloud infrastructure, finance, and databases.
For the core engine, the researchers primarily used a lightweight 7-billion parameter model (Qwen2.5-7B-Instruct) for task decomposition, paired with a standard semantic search retriever (MiniLM with a FAISS index) to find the tools. SkillWeaver was evaluated against three main setups: a brute-force “LLM-Direct” method where they stuffed all the tool names into the prompt of a large model, a vanilla LLM-based decomposition without SAD, and a ReAct-style agent loop.
The experiments indicate that task decomposition is the main bottleneck. Standard LLM behavior falls short when dealing with large tool libraries, but the SAD feedback loop dramatically moves the needle. In the vanilla setup, the 7B model achieved a decomposition accuracy (i.e., predicting the correct number of steps) of only 51.0%. By activating the SAD feedback loop, accuracy jumped to 67.7% (with the larger Qwen-Max model, accuracy reached 92%). On “hard” tasks requiring four to five distinct skills, SAD improved accuracy by 50%.
One fascinating finding was that larger models can actually perform worse when unguided. When tested in the vanilla setup, a larger 14-billion parameter model saw its accuracy plummet below the 7B model’s accuracy because it tended to over-decompose tasks into microscopic, unnecessary steps. Once SAD was introduced, the retrieved tool hints anchored the model back to reality and increased its accuracy. This suggests that aligning an agent with the vocabulary of specific tools is often more impactful than paying for a larger, more expensive LLM.
Another important takeaway is token savings. The LLM-Direct baseline, which used the very large Qwen-Max model, showed that feeding all tools into the prompt of a large model fails. Despite near-perfect task breakdown capabilities, the massive model only retrieved the right tool category 21.1% of the time when flooded with tool options. SkillWeaver’s targeted retrieve-and-route approach vastly outperformed this in accuracy while slashing context window consumption from an estimated 884,000 tokens down to roughly 1,160 tokens per query, a 99.9% reduction. For practitioners, this translates directly to drastically lower API costs and faster response times.
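The reduction figure follows directly from the two token estimates quoted above, as a quick check shows. The variable names are ours; the numbers are the ones reported in the article.

```python
direct_tokens = 884_000  # estimated prompt size for the LLM-Direct baseline
routed_tokens = 1_160    # approximate per-query size with SkillWeaver

reduction = 1 - routed_tokens / direct_tokens
ratio = direct_tokens / routed_tokens
print(f"reduction: {reduction:.1%}, ratio: {ratio:.0f}x")
# reduction: 99.9%, ratio: 762x
```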
Finally, the traditional ReAct baseline completely failed, achieving 0% decomposition accuracy. Its loop naturally collapses multi-step plans into isolated actions rather than explicitly mapping out a cohesive, multi-tool sequence.
While the researchers have not yet released the source code for SkillWeaver, their work was built on off-the-shelf tools that can easily be reproduced.
Skill-Aware Decomposition (SAD), which is the key innovation at the heart of the framework, is a clever prompt-engineering and retrieval loop. The authors have shared the prompt templates in their paper, and developers can implement it themselves quite easily using standard orchestration libraries like LangChain, LlamaIndex, or even raw Python scripts.
As for the retrieval component, the authors built the core framework using all-MiniLM-L6-v2, an open-source embedding model. They found that swapping in a slightly stronger off-the-shelf encoder (BGE-base-en-v1.5) immediately boosted accuracy without any fine-tuning. While an off-the-shelf bi-encoder is great at getting a relevant tool into the top 10 candidates nearly 70% of the time, it struggles to consistently rank the perfect tool at exactly number one, achieving that only about 37% of the time. To bridge this gap, teams will likely need to implement a secondary cross-encoder or LLM-based reranker to re-order those top 10 candidates.
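The gap between top-10 and top-1 accuracy, and the effect of a second-stage reranker, can be illustrated with a small sketch. The queries, candidate lists, and `toy_score` function are invented for illustration; in practice `score_fn` would be a cross-encoder or an LLM judge, which is an assumption about how a team might fill that slot rather than something the paper prescribes.

```python
def hit_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold skill appears in the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)

def rerank(query, candidates, score_fn):
    """Re-order first-stage candidates with a more expensive pairwise scorer."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)

def toy_score(query, candidate):
    """Placeholder scorer: count name fragments of the skill that occur in the query."""
    return sum(1 for part in candidate.split("-") if part in query)

queries = ["run the csv parser on this file", "generate a bar chart"]
first_stage = [
    ["file-sync", "csv-parser", "pdf-reader"],
    ["bar-code-scan", "chart-gen", "csv-parser"],
]
gold = ["csv-parser", "chart-gen"]

reranked = [rerank(q, cands, toy_score) for q, cands in zip(queries, first_stage)]
print(hit_at_k(first_stage, gold, 1), hit_at_k(first_stage, gold, 3))
print(hit_at_k(reranked, gold, 1))
```

The first stage already has the right skill somewhere in each shortlist, so the reranker only has to order a handful of candidates, which keeps the more expensive scorer off the full library.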
One upfront preparation requirement is vectorizing the tool library and building a FAISS index in advance. In practice, this is a negligible hurdle. Embedding and indexing all 2,209 skills in the benchmark took a mere 15 seconds. Once built, retrieving tools from the index adds less than 15 milliseconds of latency per query. For enterprise environments, syncing the tool index is a trivial background job.
A current limitation in SkillWeaver is the lack of error recovery. While SkillWeaver successfully maps out a compatible DAG for execution, the authors’ pilot study revealed the challenges of multi-step tool chains. For example, if an API call fails in step two, the entire chain breaks. The paper’s core contribution is limited to the routing and planning phase. For a true production deployment, practitioners must build their own error recovery, fallback, and retry mechanisms on top of the compose stage to handle real-world API timeouts or malformed outputs.
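One way practitioners might wrap each step of a plan is a retry-then-fallback helper like the sketch below. It is an illustration of the general pattern, not part of SkillWeaver; the function names, retry count, and backoff schedule are all assumptions, and the failing skills are simulated.

```python
import time

def run_step(primary, fallbacks=(), retries=2, backoff_s=0.0):
    """Try the primary skill with retries, then each fallback skill in order."""
    errors = []
    for skill in (primary, *fallbacks):
        for attempt in range(retries + 1):
            try:
                return skill()
            except Exception as exc:
                errors.append(f"{skill.__name__} attempt {attempt + 1}: {exc}")
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all skills failed: " + "; ".join(errors))

# Simulated skills for illustration only.
calls = {"n": 0}

def flaky_fetch():
    """Fails on the first call, succeeds on the second."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("simulated timeout")
    return "rows"

def broken_fetch():
    raise ValueError("malformed output")

def cached_fetch():
    return "cached rows"

result_retry = run_step(flaky_fetch)
result_fallback = run_step(broken_fetch, fallbacks=[cached_fetch], retries=1)
print(result_retry, "|", result_fallback)
```

A production version would likely catch narrower exception types and validate each step's output before passing it downstream, since a malformed result can break the chain just as surely as a timeout.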
Most enterprise AI deployments so far have focused on coding assistants and customer service bots. Morgan Stanley has deployed agents in one of banking’s most accuracy-critical, deadline-driven workflows instead — profit and loss (P&L) reconciliation — and cut the work in half. The counterintuitive part: it got there by making the system less autonomous, not more.
Humans stay tightly in the loop, and their decisions are iteratively turned into repeatable rules the system can apply on its own.
“It’s much more like a co-worker than a copilot,” Morgan Stanley Managing Director Todd Johnson said at a recent VB AI Impact event. The internal production agentic system, known as FIXR, goes beyond simple, straightforward “gen AI 1.0” tasks. “We think that’s where the opportunity is to really unlock more complex work in the organization.”
Every trading day, Morgan Stanley’s trade desks handle the important work around transactions such as cash equities or debt investments.
And, at the end of each of those days, controllers must reconcile P&L across the finance giant’s Finance, Risk, Operations, and Trade Capture systems. All that data must come together, and, perhaps not surprisingly, hundreds of thousands of attributes frequently fail to match.
Typically, this means controllers must manually investigate each mismatch (or “break”), make decisions on adjustments, then ideally sign off before the number goes to the desk. And all of this while working on a hard morning deadline.
Previously, this could take up to six hours for a single book. Now, FIXR performs the task in two to three hours, Johnson said. Across the roughly 100 controllers who do this work, that adds up to about 1,500 hours saved per week.
After nightly P&L calculations complete, the system automatically analyzes “breaks” and proposes resolutions based on learned rules. Several agents work together:
One interprets past guidance to develop start-of-day resolutions.
One learns from controller behavior and documents the rules they apply.
One converts repeated patterns into durable, automated logic.
Over time, the system can auto-clear certain breaks it has encountered before, suggest solutions for others that may be less familiar, ask for help when it’s unsure, and flag items for human investigation. When items are repeatedly resolved through the same method, it can create firm rules.
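The general pattern of turning repeated human decisions into firm rules can be sketched as follows. This is not FIXR's implementation, which Morgan Stanley has not published; the class, the break "signature" key, and the promotion threshold are hypothetical and chosen only to make the idea concrete.

```python
from collections import Counter

PROMOTE_AFTER = 3  # hypothetical threshold: identical decisions needed before a rule is firm

class BreakResolver:
    def __init__(self):
        self.history = {}  # break signature -> Counter of controller resolutions
        self.rules = {}    # break signature -> firm resolution

    def propose(self, signature):
        """Return (action, resolution) for a break, deferring to humans when unsure."""
        if signature in self.rules:
            return "auto_clear", self.rules[signature]
        seen = self.history.get(signature)
        if seen:
            return "suggest", seen.most_common(1)[0][0]
        return "ask_human", None

    def record_decision(self, signature, resolution):
        """Store the controller's decision and promote stable patterns to rules."""
        counts = self.history.setdefault(signature, Counter())
        counts[resolution] += 1
        # Promote only when controllers have always resolved this break the same way.
        if len(counts) == 1 and counts[resolution] >= PROMOTE_AFTER:
            self.rules[signature] = resolution

resolver = BreakResolver()
sig = ("fx_rate_mismatch", "desk_a")  # hypothetical break signature
first = resolver.propose(sig)
resolver.record_decision(sig, "use_risk_system_rate")
second = resolver.propose(sig)
resolver.record_decision(sig, "use_risk_system_rate")
resolver.record_decision(sig, "use_risk_system_rate")
third = resolver.propose(sig)
print(first, second, third)
```

The proposal moves from asking a human, to suggesting, to auto-clearing as identical decisions accumulate, while any disagreement among controllers keeps the pattern from being promoted.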
Critically, humans don’t leave the loop, but stay fully in it, he said. They review, approve or correct every recommendation, then feed those decisions back to improve the next run. The agent learns daily from controllers what it gets right and wrong and codifies that knowledge as it iterates.
“You still preserve that element of human accountability even as you start to automate,” Johnson said. “Over time you’ll see more and more of those items resolved in an automatic way.”
He emphasized that autonomy requires a great deal of trust; enterprises will not see efficiency gains if everyone’s checking everything an agent does.
The human–agent feedback loop was critical to addressing the challenge of controlled, measured, and repeatable automation. “We recognized that all that intelligence that’s sitting in the mind of a controller is gonna be difficult to get all into an agent on day one,” Johnson said.
It was critical to establish processes first, before getting any AI involved, Johnson said. His team ran a “very thorough” process intelligence assessment that mapped and mined workflows to identify where automation would be the most advantageous: Was the answer agents, traditional automation, or simple re-engineering of an inefficient step?
“If we can fix that first before we add agents to the problem, then we really will be transforming the opportunity,” he said.
The P&L sign-off process was full of manual steps suitable for automation, and agents taking over some of these time-consuming tasks are freeing up controllers for “more value-added analysis” and “deeper risk consideration” work, he said.
Extensibility, though, was just as important as time savings. Johnson’s team chose this particular P&L reconciliation use case because hundreds of controllers were doing this work globally across the business (in the Americas, Europe, Asia).
So start with a use case, prove it, extend it, “and then ultimately the transformation will be as we roll this out more and more across the organization,” Johnson said.
Johnson said the team also deliberately limited how much of the workflow depended on the model’s judgment at all. “If you have an opportunity to make things very prescribed and repeatable, that’s cheaper in terms of token consumption, it’s more repeatable in terms of controls — and have the LLM do the stuff where you don’t need that kind of deterministic workflow,” he said.
As the system sees more controller feedback on a given break type, Morgan Stanley converts that pattern into a fixed rule instead of leaving it to the model.
An interesting (and perhaps fundamental) question being raised at the dawn of the agentic era is: Are agents code or digital employees?
Johnson argues that “they’re probably a little bit of both,” and, as such, require nuance when it comes to governance and oversight. Technical teams must still be responsible for maintaining protections and guardrails like firewalls or encryption, for instance.
But there’s a new dynamic around the “performance element”: Humans using agents are responsible for them because the agents are aiding their business work. For instance, if a senior controller is working with a junior controller, they don’t just relinquish responsibility because someone is helping them out, Johnson noted.
“One of our strong principles in our AI governance generally is that there always has to be human accountability, even if there’s a degree of automation,” he said.
But there typically isn’t “one single person,” and the process is ultimately continuous. To this point, Johnson joked that one “depressing” thing about agentic AI is that it’s going to require ongoing training because models are ever-changing.
“You’re never gonna be able to say: ‘We’ve done all the evaluation and testing that we need to do. Let’s just let it go.’ You’re going to have to have a constant view as it evolves over time.”
Morgan Stanley’s experience mirrors patterns VentureBeat has uncovered across enterprise AI deployments.
In VentureBeat’s recent VB Pulse survey, nearly three-quarters of respondents reported seeing little to no ROI from custom model fine-tuning, describing a “sandbox graveyard” of AI projects that proved too costly to maintain. This suggests that Morgan Stanley’s process-first, buy-and-blend approach may be more sustainable than chasing bespoke models. The survey had 87 respondents and findings should be considered directional.
Governance emerged as another common challenge: 38% of respondents cited the lack of a single accountable owner as their biggest barrier to production AI, while only two of the 87 enterprises surveyed had active monitoring and alerting in place to detect model failures.
Even as the geopolitical conversation around AI continues to grow more fraught following the U.S. government’s actions to limit the new models from Anthropic and OpenAI, Chinese open source darling DeepSeek is back with yet another open release that could once again change AI development around the globe.
Over the weekend, the firm released DSpark, a new, MIT-licensed system designed to make large language models answer faster without changing what the underlying model is trying to say.
The easiest way to think about it is this: most AI chatbots write like someone crossing a river one stepping stone at a time. They choose one small chunk of text, then the next, then the next.
DSpark gives the system a scout that runs a few steps ahead, guesses the likely path, and lets the larger model quickly check which steps are safe. When the guesses are good, the model moves faster. When the guesses are weak, DSpark tries not to waste time checking them.
DeepSeek published the work with a technical paper, model checkpoints and DeepSpec, a codebase for training and evaluating speculative decoding systems. The release is available through DeepSeek’s public GitHub and Hugging Face pages, both under the permissive MIT license, making the new technique broadly usable by developers, researchers and commercial enterprise operations that want to study or adapt the approach.
The system is aimed at one of the most expensive problems in AI deployment: serving large models quickly enough for real users, while using hardware efficiently enough to make the economics work. That matters for consumer chatbots, coding assistants, agentic workflows and enterprise AI systems where users expect long answers to stream quickly rather than crawl out word by word.
DeepSeek is applying DSpark to its own latest frontier open model, DeepSeek-V4.
Specifically, DeepSeek used its new DSpark framework on DeepSeek-V4-Flash, its already speed-optimized 284-billion-parameter mixture-of-experts model with 13 billion active parameters, and DeepSeek-V4-Pro, its more thoughtful and powerful 1.6-trillion-parameter model with 49 billion active parameters. Both support context windows of up to one million tokens.
But the broader significance is that DSpark is not conceptually limited to DeepSeek-V4. DeepSeek’s own tests and released checkpoints cover other open model families, including Alibaba’s open weights Qwen and Google’s open weights Gemma.
That means enterprise teams running open-weight models could, in principle, train or fine-tune DSpark-style draft modules for their own target models. It is not a switch that any API customer can flip from the outside, but it is a method that can travel to other models when the operator controls the weights and serving stack.
In DeepSeek’s live production tests, DSpark improved aggregate throughput by 51% for DeepSeek-V4-Flash at an 80-token-per-second-per-user service target, and by 52% for DeepSeek-V4-Pro at a 35-token-per-second-per-user target. At matched system capacity, DeepSeek reports per-user generation speedups of 60% to 85% for V4-Flash and 57% to 78% for V4-Pro over its prior MTP-1 production baseline.
The different speed claims measure different things. The 60% to 85% figure for V4-Flash, and the 57% to 78% figure for V4-Pro, describe how much faster individual users receive generated tokens when DeepSeek compares DSpark with MTP-1 at matched practical system capacity.
Those are the cleaner “generation speed” numbers. DeepSeek also reports much larger 661% and 406% increases, but these measure aggregate throughput under very strict speed targets: 120 tokens per second per user for V4-Flash and 50 tokens per second per user for V4-Pro.
At those targets, DeepSeek says its older MTP-1 baseline approaches an operational cliff, meaning it can keep only a small number of concurrent requests running while preserving that level of responsiveness.
DSpark avoids more of that collapse, so the percentage difference in total system output becomes much larger. Put simply: the 85% number is closer to “how much faster the ride feels for a user” under comparable conditions, while the 661% and 406% figures are closer to “how much more traffic the road can still carry” when the old system is already bottlenecking.
LLMs usually generate text one token at a time. A token can be a word, part of a word, punctuation mark or other small piece of text. Every new token depends on the text already produced, so the model has to keep pausing, checking the full context and choosing the next piece.
That is accurate, but slow. It is like having a senior editor approve every word before a writer can move to the next one. The editor may be excellent, but the process creates a bottleneck.
Speculative decoding, developed in the early Transformer era, tries to fix that bottleneck. Instead of asking the large model to produce every token one by one, the system uses a smaller or lighter draft component to suggest several likely next tokens. The large model then checks that batch of guesses in parallel. If the draft guessed correctly, the system moves ahead several tokens at once. If the draft made a bad guess, the system rejects the bad token and anything after it, adds a corrected token, and tries again.
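The draft-and-check loop is easier to see in code. The sketch below is a toy greedy variant under stated assumptions: tokens are integers, `target_next` and `draft_next` are made-up stand-ins for a large and a small model, and verification calls the target once per position, whereas a real system would score the whole draft block in a single parallel pass. It is not DSpark or any library's API.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, block=4):
    """Greedy draft-and-verify; the output matches plain greedy decoding of the target."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # 1) Draft: the cheap model proposes a block of tokens.
        draft, ctx = [], list(out)
        for _ in range(block):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) Verify: keep draft tokens while they match the target's own choice.
        for tok in draft:
            expected = target_next(out)
            out.append(expected)  # always the target's token, so output is unchanged
            if tok != expected:   # first mismatch: discard the rest of the draft
                break
        else:
            out.append(target_next(out))  # whole draft accepted: one bonus token
    return out[:len(prompt) + max_new], rounds

def greedy_decode(target_next, prompt, max_new):
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out

# Toy models: the target counts upward modulo 10; the draft agrees except after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10

tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new=12, block=3)
print(tokens, rounds)
```

The speculative output is identical to the baseline's, but it is produced in fewer verification rounds than the 12 sequential steps plain decoding needs, and the one round where the draft guesses wrong still advances by a single corrected token.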
The point is speed without changing the larger model’s intended output. In the standard speculative decoding setup, the draft model is not replacing the target model. It is acting more like an assistant who prepares a rough next sentence for the senior editor to approve or reject.
The idea did not appear out of nowhere with today’s large language models. A key precursor came in 2018, when Mitchell Stern, Noam Shazeer and Jakob Uszkoreit proposed blockwise parallel decoding for deep autoregressive models. Their method predicted multiple future steps in parallel, then kept the longest prefix validated by the main model. That paper established much of the draft-and-check intuition behind later speculative decoding work.
The research line became more explicit in 2022. Heming Xia, Tao Ge and co-authors introduced SpecDec, a draft-and-verify approach for sequence-to-sequence generation. Later that year, Yaniv Leviathan, Matan Kalman and Yossi Matias posted “Fast Inference from Transformers via Speculative Decoding,” which helped define the modern version of the technique for transformer-based language models. DeepMind researchers followed in 2023 with a closely related method called speculative sampling.
Those 2022 and 2023 papers are the clearest ancestors of how speculative decoding is discussed in current LLM inference work: a faster draft process proposes tokens, and the larger target model verifies them in a way designed to preserve the target model’s output distribution.
Since then, the field has moved quickly through several variants, including separate draft models, multi-token prediction heads, tree-based verification, feature-level methods such as EAGLE, self-speculation, Medusa-style extra heads and parallel/blockwise drafters such as DFlash.
The key metric is not how many tokens a draft model can guess. It is how many of those guesses the larger model actually accepts. Long speculative blocks help only if enough of the proposed tokens survive verification. Otherwise, the system spends compute checking guesses that it throws away.
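A standard back-of-the-envelope model makes the point concrete. If each draft token is assumed to be accepted independently with probability alpha, and the draft proposes k tokens per round, the expected number of tokens emitted per round is (1 - alpha^(k+1)) / (1 - alpha). The independence assumption is a simplification, and the acceptance rates below are illustrative values, not measurements from DeepSeek.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per verification round: accepted drafts plus one target token."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.9):
    short = expected_tokens_per_round(alpha, 4)
    long = expected_tokens_per_round(alpha, 16)
    print(f"alpha={alpha}: k=4 -> {short:.2f}, k=16 -> {long:.2f}")
```

With a weak drafter (alpha of 0.5), quadrupling the block length adds almost nothing, because the draft rarely survives past the first couple of tokens; with a strong drafter (alpha of 0.9), the longer block roughly doubles the yield.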
That is the context for DSpark. Speculative decoding was already an established inference technique before DeepSeek’s release, with support in major serving stacks and multiple competing research approaches. But it is still not a solved problem. Speedups depend heavily on the draft model, the workload, the serving setup and the current traffic level. DSpark’s contribution is to improve both sides of the trade-off: it tries to draft more coherent token blocks and then verify only the parts of those blocks that are likely to pay off under real serving conditions.
DSpark tackles two related problems: bad guesses and wasted checking.
First, the system uses what DeepSeek calls semi-autoregressive generation. In plain English, that means DSpark tries to combine speed with a bit more awareness of sequence.
A fully parallel drafter can guess several tokens at once, which is fast, but its later guesses can become less coherent because each position is predicted too independently. A purely step-by-step drafter can keep better track of how one token leads to the next, but it loses much of the speed advantage.
DSpark tries to keep the best of both. It uses a parallel backbone for most of the drafting work, then adds a lightweight sequential head that lets the draft take nearby token relationships into account. In the paper’s example, a parallel drafter might confuse likely phrase endings such as “of course” and “no problem,” producing awkward combinations because it is guessing positions too separately. DSpark’s sequential component helps the system make the later tokens fit the earlier ones.
Second, DSpark adds confidence-scheduled verification. Rather than always asking the target model to check the same number of draft tokens, DSpark estimates which prefix of the draft is likely to survive. A hardware-aware scheduler then adjusts how much of each draft should be verified based on both model confidence and current serving load.
A simple analogy: when a restaurant is quiet, the head chef can inspect more of the prep cook’s work. When the kitchen is slammed, the chef spends attention only on the dishes most likely to be ready. DSpark applies a similar idea to AI serving. Under lighter traffic, the system can afford to check longer draft prefixes. Under heavier traffic, it trims low-confidence trailing guesses before they consume batch capacity that could be used for other users.
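A minimal sketch of the idea, under loudly stated assumptions: the drafter is taken to emit a per-token confidence, the chance that a whole prefix survives is approximated by the running product of those confidences, and the cutoff rises linearly with load. The linear schedule, its constants, and the function name are hypothetical; DeepSeek's actual scheduler is described in its paper and is not reproduced here.

```python
def tokens_to_verify(confidences, load, min_keep=1):
    """Choose how long a draft prefix to send to the target model for verification.

    confidences: the drafter's per-token acceptance estimates, in draft order.
    load: current serving load in [0, 1]; higher load means a stricter cutoff.
    """
    threshold = 0.2 + 0.6 * load  # hypothetical linear schedule
    survive = 1.0
    keep = 0
    for c in confidences:
        survive *= c  # estimated chance the whole prefix up to here is accepted
        if survive < threshold:
            break
        keep += 1
    return max(keep, min_keep)

draft_confidences = [0.95, 0.9, 0.8, 0.6, 0.4]
quiet = tokens_to_verify(draft_confidences, load=0.0)
busy = tokens_to_verify(draft_confidences, load=1.0)
print(quiet, busy)
```

For the same draft, the quiet system checks a longer prefix than the busy one, which trims the low-confidence tail instead of spending batch capacity on it.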
DeepSeek frames this as an answer to a common production trade-off. Static multi-token drafting can look attractive in isolation, but can hurt throughput under high concurrency because the system keeps checking tokens that are likely to be rejected. DSpark’s scheduler makes the verification budget flexible instead of fixed.
DeepSeek tested DSpark offline on Qwen3-4B, Qwen3-8B, Qwen3-14B and Gemma4-12B target models across math, coding and chat benchmarks.
In those tests, the team compared DSpark with DFlash, a parallel drafter, and Eagle3, an autoregressive drafter. The paper reports accepted length per decoding round, a measure of how many tokens survive verification on average.
Across the three Qwen3 model sizes, DSpark improved macro-average accepted length over Eagle3 by 30.9%, 26.7% and 30.0%, respectively. Compared with DFlash, it improved accepted length by 16.3%, 18.4% and 18.3%. The paper also says the gains generalized to Gemma4-12B.
That supports a point raised by developer Daniel Han, who highlighted on X that DeepSeek showed DSpark working beyond DeepSeek’s own V4 models, including Gemma and Qwen. The stronger support, though, comes from DeepSeek’s own benchmarks and released checkpoints.
The offline results also show why workload matters. Structured tasks such as math and code tend to have higher accepted lengths than open-ended chat. That makes intuitive sense: a code completion or math step often has fewer reasonable next moves than a free-form conversation.
For enterprises, this means DSpark-style methods may be especially attractive for coding assistants, data analysis agents, structured workflow automation and other settings where outputs follow more predictable patterns.
One of the most important questions is whether DSpark is a DeepSeek-only optimization or a broader method that can be applied to other models. The answer is: broader method, but not automatic plug-in.
For open-weight models, the path is relatively clear. An enterprise running Qwen, Gemma, Llama, Mistral, Granite, Command-style open weights or another model it hosts itself could train or fine-tune a DSpark-style draft module against that target model.
The team would then measure acceptance on its own workloads and integrate the verification scheduler into its inference stack.
That is different from simply downloading DeepSeek’s DSpark module and attaching it to any model. Speculative decoding depends on alignment between the draft module and the target model. The draft has to learn what the target model is likely to accept. A drafter trained for DeepSeek-V4 will not automatically be the right drafter for a different model, especially one fine-tuned on a company’s internal data or configured for different reasoning behavior.
DeepSpec’s workflow reflects this. The process involves preparing data, regenerating target-model answers, building a target cache, training the draft model and evaluating speculative-decoding acceptance. For domain-specific use, the draft model may need additional fine-tuning, especially if the target model runs in a thinking or reasoning mode.
For proprietary models, the answer depends on what the enterprise controls. If a company owns or fully hosts the model weights and serving stack, it could theoretically train and deploy a DSpark-style drafter. If the model is available only through a hosted API from a vendor, the customer cannot directly add DSpark from the outside. The API provider could implement a similar optimization internally, but the customer generally cannot access the token verification loop, logits, batching behavior or serving scheduler needed to make DSpark work.
That distinction matters for enterprise buyers. DSpark strengthens the case for open or self-hosted AI infrastructure because it gives advanced teams another lever to improve speed and cost. But it also shows why model serving is becoming a specialized discipline. The value is not just in picking a model, but in how intelligently that model is run.
For developers, DeepSpec gives a concrete implementation path for training and evaluating speculative decoding draft models. It includes data preparation, training and benchmark evaluation steps, along with released checkpoints for several open model families. That makes the release useful not only for running DeepSeek-V4 with DSpark, but also for researchers and infrastructure teams studying how to add faster decoding to other open models.
There are real deployment caveats. DeepSpec’s own README says the default Qwen3-4B data preparation setup can require roughly 38 TB of target cache storage, and the default scripts assume a single node with eight GPUs. That makes the release more immediately relevant to AI labs, cloud teams and sophisticated enterprise AI infrastructure groups than to ordinary application developers.
Still, releasing the training pipeline matters. Many inference optimizations appear only as papers, vague benchmarks or closed production claims. DeepSpec gives developers something closer to a set of blueprints: not a finished enterprise product, but a way to reproduce, adapt and evaluate the method.
The release has already drawn fast developer attention. Developer Rafael Caricio published a GitHub pull request documenting single-stream DeepSeek-V4-Flash DSpark work, reporting warmed benchmark anchors of 26.33 tokens per second without speculative decoding, 39.88 tokens per second with MTP-1, and roughly 60 tokens per second with DSpark — about 1.5x over MTP-1 and 2.3x over no-spec decoding.
A later commit in the same thread recorded a five-run mean of 60.31 tokens per second, with a 1.51x gain over MTP-1 and 2.29x over non-speculative decoding.
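The reported multipliers can be checked against the raw throughput figures quoted in the pull request. The variable names are ours; the numbers are the ones cited above.

```python
no_spec, mtp1, dspark = 26.33, 39.88, 60.31  # tokens per second from the reported runs

gain_vs_mtp1 = dspark / mtp1
gain_vs_no_spec = dspark / no_spec
print(f"vs MTP-1: {gain_vs_mtp1:.2f}x, vs no speculation: {gain_vs_no_spec:.2f}x")
# vs MTP-1: 1.51x, vs no speculation: 2.29x
```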
The same work also points to an important practical limit: in realistic multi-turn coding sessions, performance can degrade as draft acceptance falls with growing context. In other words, DSpark can make decoding faster, but acceptance quality still determines how much speed the system actually realizes.
That is a useful reality check. DSpark is not magic. It still depends on how predictable the next tokens are and how well the drafter stays aligned with the target model. But the early implementation work suggests DeepSeek’s claims are not purely academic. Developers are already testing the method in practical serving environments and reporting gains close to the paper’s single-stream expectations.
DSpark shows how much performance remains available in the inference layer, even when the underlying model architecture stays the same. As AI companies compete on model quality, context length and pricing, decoding efficiency is becoming another major battleground.
Faster generation means lower latency for users, higher throughput for providers and better economics for teams serving open models at scale.
DeepSeek’s release is notable because it combines a production-tested method, open code, public checkpoints and a detailed paper. The main innovation is not just drafting more tokens. It is making the system more selective about which speculative work is worth verifying.
For enterprise teams, the broader lesson is that the next wave of AI performance gains will not come only from larger models. It will also come from smarter ways to run the models companies already have — especially when those companies control enough of the stack to tune the model, train a compatible draft module and optimize the serving engine around real workloads.
Long-horizon reasoning exposes a core weakness in AI agents: context windows fill up fast, and retrieval pipelines return noise instead of signal.
To solve this, researchers at the National University of Singapore developed MRAgent, a framework that abandons the static “retrieve-then-reason” approach. Instead, it uses a mechanism that allows an agent to dynamically develop its memory based on accumulating evidence.
This multi-step memory reconstruction is integrated into the reasoning process of the large language model (LLM). While not the only framework in this space, MRAgent significantly reduces token consumption and runtime costs compared to other agentic memory management approaches.
In classic retrieval pipelines, documents are retrieved through vector search or graph traversal and passed on to an LLM for reasoning. This passive approach fails because it cannot combine reasoning with memory access, creating three major bottlenecks:
These systems cannot revise their retrieval strategy mid-reasoning. If an agent fetches a document and discovers a crucial missing cue — a specific date or person — it has no way to issue a new query based on that finding.
Fixed similarity scores and predefined graph expansions return surface-level matches that flood the LLM’s context window with irrelevant noise, degrading reasoning.
Current systems rely heavily on pre-constructed structures such as top-k results and static relevance functions, limiting the flexibility required to scale across unpredictable, long-horizon user interactions.
The researchers argue that to overcome these limitations, developers must shift toward an “active and associative reconstruction process,” a concept inspired by cognitive neuroscience.
Under this paradigm, memory recall unfolds sequentially rather than operating as a passive read-out of a static database. The system starts with small, specific triggers from the user’s prompt, such as a person’s name, an action, or a place. These initial hints point to connecting concepts or categories instead of massive blocks of text.
By following these metadata stepping stones, the agent gathers small pieces of evidence one by one. It uses each new piece of information to guide its next step until it successfully pieces together the full, accurate story.
Instead of viewing memory as a static database, MRAgent (Memory Reasoning Architecture for LLM Agents) treats it as an interactive environment. When processing a complex query, the agent uses the backbone LLM’s reasoning abilities to explore multiple candidate retrieval paths across a structured memory graph.
At each step, the LLM evaluates the intermediate evidence it has gathered and uses it to iteratively optimize its search. It infers new search constraints, pursues the paths with the best information, and prunes irrelevant branches. This allows MRAgent to piece together deeply buried information without filling the LLM’s context with noise.
To make this active exploration computationally efficient and scalable, the framework organizes its database using a “Cue-Tag-Content” mechanism. This operates as a multi-layered associative graph with three node types:
Cues: Fine-grained keywords, such as entities or contextual attributes extracted from user interactions.
Content: The actual stored memory units. These are divided into multi-granular layers, such as episodic memory for concrete events and semantic memory for stable facts and user preferences.
Tags: Semantic bridges that summarize the relational associations between specific Cues and Content.
This structure enables a highly efficient two-stage retrieval process. The LLM first navigates from Cues to candidate Tags. Because Tags explicitly expose the semantic relationships and structural associations of the data, the agent evaluates these short summaries to judge their relevance. The LLM identifies promising traversal paths and discards irrelevant branches before spending compute and prompt tokens to access the detailed, heavy memory contents.
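The two-stage lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MemoryGraph` class, its field names, and the `keep_tag` callback (a stand-in for the LLM's relevance judgment) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryGraph:
    """Hypothetical Cue-Tag-Content store; all names here are illustrative."""
    tags_by_cue: dict[str, set[str]] = field(default_factory=dict)
    content_by_pair: dict[tuple[str, str], list[str]] = field(default_factory=dict)

    def add(self, cue: str, tag: str, content: str) -> None:
        self.tags_by_cue.setdefault(cue, set()).add(tag)
        self.content_by_pair.setdefault((cue, tag), []).append(content)


def two_stage_retrieve(
    graph: MemoryGraph,
    cues: list[str],
    keep_tag: Callable[[str], bool],
) -> dict[tuple[str, str], list[str]]:
    # Stage 1: walk from cues to their short tags and prune the irrelevant
    # ones. No memory content is read at this point.
    pairs = sorted(
        (cue, tag)
        for cue in cues
        for tag in graph.tags_by_cue.get(cue, set())
        if keep_tag(tag)  # assumption: an LLM would make this call
    )
    # Stage 2: only the surviving Cue-Tag pairs pay the cost of loading content.
    return {pair: graph.content_by_pair[pair] for pair in pairs}


graph = MemoryGraph()
graph.add("Nate", "Tournament Victory", "Nate won his third tournament.")
graph.add("Nate", "Tournament Participation", "Nate entered a local tournament.")

result = two_stage_retrieve(graph, ["Nate", "win"], lambda tag: "Victory" in tag)
print(result)
```

The point of the design is that the pruning decision is made on tag text alone, so the content behind a rejected tag is never placed in the prompt.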
For example, a user might ask an AI agent, “How did Nate use the prize money when he won his third video game tournament?”
MRAgent first extracts fine-grained starting cues from the prompt, such as “Nate,” “video game tournament,” and “win.”
The agent maps these initial cues to the memory graph and looks at the available associative Tags connected to them. The agent sees tags like “Tournament Victory” and “Tournament Participation.” Since it is only concerned with what the person did after they won the championship, MRAgent drops the tournament participation tag and pursues the victory tag.
The agent retrieves the episodic content linked to the chosen Cue-Tag pair: three distinct memory episodes in which Nate won a tournament.
MRAgent looks at the three memories, decides one of them in particular is relevant to the query, and discards the other two.
With this information, it updates its cues and starts another round of discovery and pruning. From the new episodic memory it has retrieved, the agent adds “tournament earnings” to its cues and uses that to traverse new tags and home in on new memories. It repeats this process until it gathers enough information to answer the query, which could be something like “Nate saved the money.”
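The walkthrough above amounts to a loop: expand cues into tags, prune, read content, derive new cues, and repeat until the answer is found. The sketch below shows that loop on toy data. It is an assumption-laden illustration rather than MRAgent's code: the `TAGS` and `EPISODES` dictionaries, the `reconstruct` function, and the four callbacks (which stand in for decisions the backbone LLM would make) are all hypothetical.

```python
# Hypothetical toy memory: cue -> tags, and (cue, tag) -> stored episodes.
TAGS = {
    "Nate": ["Tournament Victory", "Tournament Participation"],
    "tournament earnings": ["Spending Decision"],
}
EPISODES = {
    ("Nate", "Tournament Victory"): [
        "Nate won his first tournament.",
        "Nate won his third tournament and received tournament earnings.",
    ],
    ("Nate", "Tournament Participation"): ["Nate signed up for a tournament."],
    ("tournament earnings", "Spending Decision"): ["Nate saved the money."],
}


def reconstruct(cues, keep_tag, keep_episode, new_cues, is_answer, max_rounds=5):
    """Iteratively gather evidence; each callback stands in for an LLM decision."""
    evidence, seen, frontier = [], set(), list(cues)
    for _ in range(max_rounds):
        next_frontier = []
        for cue in frontier:
            if cue in seen:
                continue
            seen.add(cue)
            for tag in TAGS.get(cue, []):
                if not keep_tag(tag):
                    continue  # pruned before any content is read
                for episode in EPISODES.get((cue, tag), []):
                    if not keep_episode(episode):
                        continue  # discard memories that do not help
                    evidence.append(episode)
                    if is_answer(episode):
                        return evidence  # stop once the query is answerable
                    next_frontier.extend(new_cues(episode))
        if not next_frontier:
            break  # nothing new to explore
        frontier = next_frontier
    return evidence


evidence = reconstruct(
    cues=["Nate", "video game tournament", "win"],
    keep_tag=lambda tag: tag != "Tournament Participation",
    keep_episode=lambda ep: "third" in ep or "saved" in ep,
    new_cues=lambda ep: ["tournament earnings"] if "tournament earnings" in ep else [],
    is_answer=lambda ep: "saved" in ep,
)
print(evidence[-1])
```

In this toy run the first round keeps only the third-tournament episode, which contributes the new cue "tournament earnings"; the second round follows that cue to the episode that answers the question.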
MRAgent operates alongside several other frameworks addressing agentic memory building. Alternatives include A-MEM, a graph-based agentic memory framework, and MemoryOS, a hierarchical memory framework. Other persistent memory frameworks include LangMem and Mem0.
The researchers tested MRAgent on the LoCoMo and LongMemEval industry benchmarks. These test the abilities of agents to resolve queries on long-horizon tasks and conversations across dozens of sessions and hundreds of turns of dialogue. The backbone models used were Gemini 2.5 Flash and Claude Sonnet 4.5. The system was tested against standard RAG, A-MEM, MemoryOS, LangMem, and Mem0.
MRAgent consistently outperformed every baseline across both models and all question types by a significant margin.
However, for enterprise developers, the most critical metric is often computational cost. In the LongMemEval tests, MRAgent slashed prompt token consumption to just 118k per sample. By comparison, A-MEM consumed 632k tokens, and LangMem burned through 3.26 million tokens per query. MRAgent also effectively halved the runtime compared to A-MEM, dropping from 1,122 seconds to 586 seconds.
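To put those figures in proportion, the short calculation below derives the ratios from the numbers reported above. The variable names and the `ratio` helper are chosen for this example only; no figures beyond those quoted are used.

```python
# Figures as reported for the LongMemEval tests.
PROMPT_TOKENS = {"MRAgent": 118_000, "A-MEM": 632_000, "LangMem": 3_260_000}
RUNTIME_SECONDS = {"MRAgent": 586, "A-MEM": 1_122}


def ratio(baseline: float, reference: float) -> float:
    """How many times larger the baseline is than the reference."""
    return baseline / reference


for name in ("A-MEM", "LangMem"):
    multiple = ratio(PROMPT_TOKENS[name], PROMPT_TOKENS["MRAgent"])
    print(f"{name}: {multiple:.1f}x the prompt tokens of MRAgent")

speedup = ratio(RUNTIME_SECONDS["A-MEM"], RUNTIME_SECONDS["MRAgent"])
print(f"A-MEM runtime: {speedup:.2f}x that of MRAgent")
```

Run as written, this works out to roughly 5.4x and 27.6x for prompt tokens and about 1.91x for runtime, consistent with the runtime being "effectively halved."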
What makes MRAgent efficient in practice is its on-demand behavior. Evaluating tags and pruning irrelevant paths before retrieval saves money and context space. Furthermore, the system autonomously evaluates its accumulated context and inherently knows when to stop searching, completely avoiding redundant data exploration.
While MRAgent is highly effective, the Cue-Tag-Content structure needs to be prepared before the agent can query it. Developers must figure out how to architect the underlying memory database to enable the LLM to efficiently navigate associative items and prune irrelevant paths without exploding compute costs.
Fortunately, developers do not have to manually label or structure this data. The authors designed MRAgent with an automated distillation pipeline that uses LLMs to process raw interaction histories and automatically populate the memory graph. For a developer, the job is to implement and orchestrate this automated ingestion pipeline, rather than manually tag data.
You need to set up a background job or streaming pipeline that passes raw user interactions through prompt templates to extract this metadata before storing it in your graph database.
However, the authors emphasize that this is a lightweight construction phase and MRAgent intentionally keeps ingestion simple.
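As a rough picture of what such an ingestion job might look like, the sketch below passes each raw turn through a prompt template, parses the reply, and stores the result under Cue-Tag pairs. Everything in it is an assumption made for illustration: the `PROMPT_TEMPLATE` wording, the JSON reply shape, the `fake_llm` placeholder (which extracts capitalized words instead of calling a model), and the in-memory dictionary standing in for a graph database. It is not the authors' distillation pipeline.

```python
import json
from collections import defaultdict

# Hypothetical prompt; a real pipeline would use its own template.
PROMPT_TEMPLATE = (
    "Extract memory metadata from the conversation turn below.\n"
    'Reply with JSON: {{"cues": [...], "tag": "...", "content": "..."}}\n\n'
    "Turn: {turn}"
)


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; treats capitalized words as cues."""
    turn = prompt.split("Turn: ", 1)[1]
    cues = [word.strip(".,") for word in turn.split() if word[:1].isupper()]
    return json.dumps({"cues": cues, "tag": "Untitled Event", "content": turn})


def ingest(turns, llm=fake_llm):
    """Build a (cue, tag) -> [content] mapping from raw interaction turns."""
    graph = defaultdict(list)  # stand-in for a graph database
    for turn in turns:
        reply = json.loads(llm(PROMPT_TEMPLATE.format(turn=turn)))
        for cue in reply["cues"]:
            graph[(cue, reply["tag"])].append(reply["content"])
    return dict(graph)


memory = ingest(["Nate won the tournament."])
print(memory)
```

Swapping `fake_llm` for a real model client and the dictionary for a graph store is where most of the engineering work in a production version would likely sit.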
The authors have released the code on GitHub.
Industrialized factories changed how the world produced physical goods: more output, lower costs, faster than anything that came before. Now a similar shift is happening with software.
LLMs have lowered the barrier to writing code, increased individual output, and pushed organizations to think about software development as a production system. The standard software development lifecycle and CI/CD practices that have held for decades won’t hold up under that pressure. That’s where the software factory comes in — and like physical factories, it needs more than speed to actually work.
The idea of a “software factory” started to solidify over the past year. Luca Rossi’s “The Era of the Software Factory” made the case plainly: AI is not just changing how fast people write code — it’s changing the whole production system around software.
The concept can mean different things: a collection of coding agents and skills files; faster CI/CD; better review systems; or more automation around software delivery. A better frame is to think of it less as a tool category and more as a set of principles. A software factory can’t just be a loose collection of prompts, agents, and plugins. It needs a platform that defines how work moves through the system and how code is generated, reviewed, tested, traced, deployed, and improved when something goes wrong.
Otherwise all you’re doing is putting yet another one-off machine into an empty room and calling it a factory.
There are a few forces all hitting at the same time.
Companies have always wanted more software than engineers can produce. That’s why tools like Excel exist: they fill the gap left by software that many companies wish they could build.
AI has also lowered the barrier of entry to creating code, and this is the part everyone focuses on. Code creation is now easier, though not always cheaper or better, as evidenced by many high-profile companies fretting over their high AI bills. The barrier to writing functional code has effectively collapsed.
More importantly, a single engineer can generate more code than they could just a few years ago. That changes the bottleneck: it’s no longer “How fast can someone write this?” or even, in some cases, “Can someone understand how to code?” Instead it becomes, “Should this be written?”
The harder question is whether we can actually create end products that are durable and reliable and don’t just build tech debt. Or are we just putting out more AI slop faster than ever? That’s where the danger lies.
All of this sounds great. Factories, after all, made production faster and more consistent.
They made it possible to build more cars and products, less expensively, which led to more people being able to afford cars and products. Putting environmental impacts aside, you could argue this was positive.
But like many things in engineering, there are always tradeoffs, and in this case, there are new risks.
When you increase the output of one person with machinery, digital or otherwise, you also increase the mistakes that can be made either by the individual or the machinery. Code can now be produced at an industrial scale. Even smaller organizations can suddenly have codebases ballooning to the size of tech company codebases from a decade ago.
The data is already showing problems. Faros AI found that while task throughput per developer is up 33.7% and PR merge rate is up 16.2%, the incidents-to-PR ratio has risen 242.7% and bugs per developer are up 54%. Google’s DORA research found that more AI adoption was actually associated with worse delivery stability.
As a fractional head of data, I’ve been brought in to fix these exact issues. In the past year alone, I’ve worked on two projects where AI-generated data infrastructure slowly started to morph over time.
Between multiple engineers trying to move quickly and a lack of standards, these projects became unruly. Codebases tend to go through some level of evolution, but as different styles blend, the LLMs in turn start to create their own mutations. Codebases developed five to six different styles within months, a process that previously took years. Layer by layer, the engineers would slowly stop understanding exactly what was going on.
The pattern echoes what happened a decade ago with self-service tooling: early productivity gains that masked downstream complexity.
And that’s why the software factory can’t just be about speed.
There are several key principles to consider when building a software factory.
Platform over tools: Many teams are slowly implementing AI into their coding workflows at the edges — adding a PR review agent or a skills file into their repos. But building an actual software factory requires a platform, not a collection of tools at the edges. A platform provides a unified foundation where tools aren’t scattered in separate corners. Instead, they actively share data, talk to each other, and work as a single cohesive system — standards, processes, and the work itself all connected.
Rerunnability and traceability: A real platform requires the ability to go back into any run, identify what went wrong, and rerun it, which is why one-off agents don’t make a factory. The system needs to support taking a serial ID, looking it up, and tracing exactly how it got to the output it produced. This is why state machines make more sense than loops for AI workflows: they make it far easier to rerun a process and understand what happened at each step.
Safety and guardrails: Factories are not safe places. Neither is a software factory. As more people develop on these platforms, better guardrails and safety measures need to be built in. Testing and quality control need to be pushed to the front of the process — catching bugs at the lowest possible stage reduces the cost to fix them and limits the blast radius.
Standardization: At the enterprise level, every codebase has its own flavor. Layering a code assistant on top without standards produces an amalgamation of styles. Standardization has to be built into the process from the start.
Quality control: In older manufacturing models, quality control happened at the end of the line. The product was built, inspected, defects found, and fixed later. Toyota’s approach was different. Quality was pushed into the process itself — workers were expected to stop the line when something was wrong. The goal wasn’t to catch defects at the end; it was to prevent them from flowing downstream in the first place.
The same is true for the software factory. QC needs to be baked into the entire process, starting with how the spec is written. That means integrating static code analysis that catches obvious errors and providing templates to LLMs so they know the structure the code should follow. Without that, the bottleneck becomes the final review — or teams just push out more AI slop.
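Two of the principles above, traceable state machines and early static checks, can be combined in one small sketch. It is illustrative only: the state names, the `RUNS` trace store, the canned `generate` step standing in for an LLM call, and the docstring rule used as a "template" check are assumptions for this example, not a prescribed design. The static check itself uses Python's standard `ast` module.

```python
import ast
import uuid

STATES = ["generate", "static_check", "done"]  # hypothetical pipeline order
RUNS: dict[str, list[dict]] = {}  # run ID -> ordered trace of step records


def generate(spec: str) -> str:
    """Placeholder for an LLM call; returns canned code for the sketch."""
    return f"def handler():\n    {spec!r}\n    return 42\n"


def static_check(code: str) -> str:
    """Fail early: reject code that does not parse or lacks a docstring."""
    tree = ast.parse(code)  # raises SyntaxError on invalid code
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node) is None:
            raise ValueError(f"{node.name} has no docstring")
    return code


STEPS = {"generate": generate, "static_check": static_check}


def run(payload: str, start: str = "generate") -> str:
    """Execute states in order, recording every input and output under a run ID."""
    run_id = uuid.uuid4().hex
    trace = RUNS[run_id] = []
    state = start
    while state != "done":
        record = {"state": state, "input": payload}
        trace.append(record)
        try:
            payload = STEPS[state](payload)
        except Exception as exc:
            record["error"] = repr(exc)  # stop the line and keep the evidence
            break
        record["output"] = payload
        state = STATES[STATES.index(state) + 1]
    return run_id


def rerun(run_id: str, state: str) -> str:
    """Replay a past run from `state`, using the input recorded in its trace."""
    record = next(r for r in RUNS[run_id] if r["state"] == state)
    return run(record["input"], start=state)


first = run("Return the answer")
second = rerun(first, "static_check")
print([step["state"] for step in RUNS[first]], [step["state"] for step in RUNS[second]])
```

Because every step's input is stored against the run ID, a failed or suspicious run can be looked up and replayed from the exact state where it went wrong, rather than regenerated from scratch.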
Improving the speed of your code output is not actual productivity if the downstream issues aren’t managed. A company is not more productive because it produces millions of cars, only to see them all fall apart within 100 miles. It’s also not more productive if all it does is produce an endless stream of proofs-of-concept that never enter production.
It’s easy to talk about lines of code and how much faster your team is moving. Actual productivity is when the software factory takes ephemeral tokens and turns them into durable outputs.
The software factory that wins isn’t the one that generates the most code. It’s the one that generates the fewest defects downstream.