MassMutual’s AI strategy: 12-month contracts, 30% productivity gains, zero lock-in

Enterprise AI teams face a dilemma: The best models today might not be the best models a year from now. MassMutual’s answer is to stop making long-term bets — and build infrastructure that can swap models as the market shifts.

“The world of AI today is extremely dynamic,” Sears Merritt, MassMutual CIO, explained in a new VB Beyond the Pilot podcast. “We wanted to make sure we were positioned to ride that wave of dynamism.”

The strategy appears to be paying off in a big way. MassMutual has measured a roughly 30% increase in developer productivity, while AI-powered contact center workflows have reduced resolution times from 10 minutes to one and cut costs from dollars to cents.

But the broader lesson for IT leaders may be less about the results and more about how the company is thoughtfully building its AI infrastructure and keeping users at the center.

Maintaining optionality for the possibilities of tomorrow

MassMutual works with vendors at the leading edge, but keeps those relationships on a clock. “Those relationships are capped so that we maintain optionality for best-of-breed tools as things mature in this space, and at some point, settle down and stabilize,” Merritt said. 

That philosophy extends to open-source models. Merritt says his team is “100%” looking at open-source tools, and sees the technology playing a big role in how MassMutual (and similar companies) use AI. 

“We’re certainly going to need frontier models and leading edge capabilities to do what today is impossible, and tomorrow will be possible,” he said. 

Measuring outcomes from the start

MassMutual’s AI efforts fall into two broad categories.

The first focuses on enablement: Putting productivity-enhancing tools such as Copilot and virtual assistants into the hands of all employees. The second involves what Merritt describes as “deepen and focus” initiatives, where teams target a specific workflow or business process that will have a strong impact on advisors, policyholders, or employees.

Rather than focusing on adoption metrics, these projects begin with predefined success criteria. “Everything we do is measured,” Merritt said. “There’s always a success metric that we define upfront to determine whether or not we’re going to scale up some of these things.”

The company is also deliberately encouraging experimentation, giving employees access to a range of best-in-class models, “token-consumptive workflows” and other possible capabilities so they can weigh the benefits relative to “simpler, lower cost” large language models (LLMs). 

At the same time, MassMutual is collecting increasingly detailed analytics around usage patterns, developer workflows, model performance, and costs. The goal is to reduce spending while also building operational intelligence to eventually route workloads to the right model based on cost, response quality, and user experience.

Those insights will eventually drive optimization decisions around model routing, prompt selection, response times, and infrastructure design.

“We’re gaining access to analytics that let us, in a very granular way, look at usage patterns, developer workflows, and begin to make sense of who’s using what, when, and for what types of tasks,” Merritt said.
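As a rough illustration of what routing on cost, response quality, and user experience could look like, here is a minimal Python sketch. It is not MassMutual's implementation: the model names, metric values, and thresholds are all invented for the example, and the rule (cheapest model that clears a quality floor and a latency ceiling) is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Observed metrics for one candidate model (all values are illustrative)."""
    name: str
    cost_per_request: float  # dollars
    quality: float           # 0..1, e.g. aggregated from user feedback
    latency_seconds: float


def route(candidates: list[ModelStats], min_quality: float, max_latency: float) -> ModelStats:
    """Pick the cheapest model that meets the quality floor and latency ceiling.

    Falls back to the highest-quality model when nothing qualifies.
    """
    eligible = [
        m for m in candidates
        if m.quality >= min_quality and m.latency_seconds <= max_latency
    ]
    if eligible:
        return min(eligible, key=lambda m: m.cost_per_request)
    return max(candidates, key=lambda m: m.quality)


# Hypothetical catalog; a real one would be fed by the usage analytics.
MODELS = [
    ModelStats("fast-small", cost_per_request=0.002, quality=0.70, latency_seconds=0.5),
    ModelStats("slow-large", cost_per_request=0.020, quality=0.92, latency_seconds=2.5),
]

drafting = route(MODELS, min_quality=0.6, max_latency=1.0)
high_stakes = route(MODELS, min_quality=0.9, max_latency=5.0)
print(drafting.name, high_stakes.name)  # fast-small slow-large
```

The point of the sketch is that the thresholds are per-workload inputs, so the same catalog can send low-stakes drafting to the cheap model and quality-sensitive work to the expensive one.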

Why MassMutual sometimes chooses the more expensive model

Another interesting aspect of MassMutual’s approach is how it evaluates AI quality. Rather than focusing exclusively on benchmarks or token costs, the company uses what Merritt calls a “trust score” framework.

The process combines user feedback with operational metrics to understand how employees perceive AI-generated responses and whether those responses actually improve outcomes. 

The contact center rebuild put that framework to the test. During development, employees were given access to two different LLMs. One generated responses in near-real-time, but its quality was noisier. The other, more expensive option took several additional seconds to respond but consistently delivered higher-quality answers.

Conventional wisdom and the speed of business might suggest users would prefer the former, but they overwhelmingly chose quality. Merritt’s team asked users about the quality of response, their preferred model, and their overall thoughts on the experience. 

Most of the time, users said: “We want the more expensive one. We’re willing to wait, but the quality difference is so high that the two extra seconds actually is worth it to us.” 

That feedback ultimately determined which model MassMutual deployed.

“We factored that experience piece into the decision-making, and that led us to say, on a relative basis, the costs were immaterial, so we’re going to use the more complex model,” Merritt said.  

Listen to the full podcast to hear more about: 

  • Why Mythos “completely changed” the cybersecurity landscape — not the type of threats, but the rate at which those threats appear; 

  • How a team of AI engineers modernized MassMutual’s mainframe in 7 days (a process that previously would have taken 3 months); 

  • Why MassMutual specifically avoided tokenmaxxing to rein in AI use and spending and has been going “unlimited” to shield itself from cost blowups; 

  • How a “multi-harness type of environment” will support agentic AI. 

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.

AI is about to replace the interface. Business leaders aren’t ready

Presented by Snowflake


As AI agents become capable of reasoning across systems and taking action, software is evolving from something employees operate into something that understands intent. Instead of navigating disparate applications and dashboards, a single system will increasingly ask: What are you trying to accomplish?

That sounds like a user experience breakthrough. It is. But the more important implication is organizational. When software no longer relies on humans to provide context, companies can no longer assume that knowledge lives in employees’ heads or is buried inside disconnected applications. The company itself has to become machine-readable.

The winners in the AI era won’t simply deploy more intelligent models. They’ll build the data foundations, semantic context, and governance frameworks that allow machines to understand how the business works and act on that understanding with confidence.

Context is becoming infrastructure

For years, companies treated context as a human layer on top of data. The data platform held the records, then the BI tool visualized them, and the analyst interpreted them. And finally, the business leader made the judgment call. Agents collapse those layers.

When an executive asks, “Why is customer churn rising in our enterprise segment?” an effective agent needs to know far more than where the customer data lives. It needs to understand how the company defines churn, which accounts count as enterprise, whether product usage data is more reliable than survey data, which renewal events matter, what the sales team has logged, what support tickets suggest, and whether the answer differs by geography or product line.

This is why semantics — the definitions, relationships, rules, and assumptions that give data meaning — are moving from a technical concern to a boardroom issue. A semantic layer used to sound like plumbing for data teams. In an agentic enterprise, it becomes the shared language between humans and machines.

If every department teaches its own agent a different version of the business, companies will get inaccuracy at scale. The organizations that pull ahead will be the ones that create a common business knowledge base: consistent definitions, governed access, documented workflows, clear lineage, and enough flexibility to evolve as the business changes. In that world, context is treated as infrastructure, rather than just a nice-to-have.

From dashboards to decisions

The first wave of enterprise AI largely gave us assistants and copilots that answer questions. Useful, but still limited. You ask a question, get a response, and then return to the work of stitching systems together yourself.

The next era of AI will be different. Agents will move beyond coordinating answers, and start getting actual work done. A sales leader starting the day will not need to open a CRM, a forecasting tool, a support dashboard, and a Slack thread to understand what changed overnight. They will simply ask an agent what needs attention. The agent will identify which accounts are at risk, explain why, summarize recent customer interactions, draft follow-up actions, and perhaps initiate the next workflow.

The dashboard does not disappear because charts become useless. It disappears because static reporting becomes too slow for how businesses need to operate. The center of gravity shifts from “show me what happened” to “help me decide what to do next.”

The new governance problem: agents that act

As long as AI is mostly answering questions, governance is about controlling what it can access. That is already difficult. Employees have different permissions, sensitive data needs protection, and answers must be traceable to trusted sources. As agents begin taking action, governance becomes even more consequential.

It’s one thing for an agent to summarize a customer complaint. It’s another for it to issue a refund, reorder inventory, or send an email to a customer. This is where many companies will be tempted to choose between two imperfect paths.

One path is to tightly constrain agents from the start: define the data sources, tools, workflows, and actions they can access. This is easier to manage and measure. It also risks limiting the creativity of employees who understand their workflows best.

The other path is to let teams experiment freely: connect agents to the tools and data they use every day, and allow new use cases to emerge organically. This can produce faster adoption and unexpected innovation. It can also create real risk: stale data, inappropriate access, duplicated workflows, runaway costs, or automated actions no one fully understands.

The right answer is not maximum control or maximum freedom. It’s to prioritize governed flexibility. Companies need architectures where governance is embedded from the beginning. An agent should know not only what it can read, but what it can do, when it needs approval, how its reasoning is inspected, and how its performance is evaluated over time. In other words, governance cannot be a review meeting after the pilot. It has to be part of the system design.
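To make "governed flexibility" concrete, here is a minimal Python sketch of a per-agent policy that records what an agent may read, what it may do, and which actions need human approval, and that logs every decision for later inspection. The class, field, and action names are hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass
class AgentPolicy:
    """Hypothetical policy: what an agent may read, do, and do only with sign-off."""
    readable_sources: set[str] = field(default_factory=set)
    allowed_actions: set[str] = field(default_factory=set)
    approval_required: set[str] = field(default_factory=set)
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def can_read(self, source: str) -> bool:
        return source in self.readable_sources

    def check_action(self, action: str) -> Decision:
        """Deny by default, escalate sensitive actions, and log every decision."""
        if action not in self.allowed_actions:
            decision = Decision.DENY
        elif action in self.approval_required:
            decision = Decision.NEEDS_APPROVAL
        else:
            decision = Decision.ALLOW
        self.audit_log.append((action, decision.value))
        return decision


support_agent = AgentPolicy(
    readable_sources={"tickets", "orders"},
    allowed_actions={"summarize_complaint", "issue_refund"},
    approval_required={"issue_refund"},
)

for action in ("summarize_complaint", "issue_refund", "reorder_inventory"):
    print(action, "->", support_agent.check_action(action).value)
```

Because the policy is data rather than a review meeting, teams can extend an agent's allowed actions without bypassing the approval and audit path.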

The boundary between builder and user is collapsing

One of the least appreciated consequences of agentic AI is that it will blur the line between people who use software and people who create it. When employees can describe a workflow in natural language and have an agent help build it, software development becomes less confined to engineering teams. A marketer can create a campaign analysis workflow. A finance manager can automate variance explanations. An HR leader can build a policy assistant. A support manager can design a triage process.

These employees are not becoming software engineers in the traditional sense, but they are becoming builders. That changes the talent model. Technical fluency will matter more because employees need to understand what’s possible, what’s risky, and how to evaluate an AI-generated result. Judgment becomes the most important skill.

The winners will be the people who know how to ask better questions, inspect evidence, refine workflows, and combine domain expertise with enough technical understanding to move from idea to execution.

For business leaders, this means AI adoption extends beyond an IT rollout, and is actually an organizational redesign. The distance between insight and action will shrink, and companies will need to rethink who is empowered to build, approve, and operate the workflows that run the business.

Software economics will change too

The shift from interfaces to agents will also challenge how companies buy and measure software, and change how software is priced. Per-seat licensing is giving way to consumption models, where costs reflect actual usage. For most organizations this is a better deal. You pay for value delivered, not licenses that may sit idle.

But it also changes the accountability calculus. When costs are fixed per seat, budget conversations happen once a year. When costs scale with usage, they require continuous oversight. Without visibility into how agents are used and what they produce, costs can rise quickly.

The answer is to build measurement in from the start, connecting AI usage to business outcomes, whether that is deals closed, tickets resolved, or cycle times reduced. The companies that succeed will treat AI cost management as part of operational excellence, not procurement cleanup. The question should not be, “How many tokens did we use?” It should be, “What business outcome did that intelligence produce?”

Your customers may stop using your interface

While the internal implications of agents are significant, the external ones may be even larger. Today, companies obsess over the customer experience inside their applications: the homepage, the navigation, the checkout flow, the dashboard, the mobile screen. Those things will still matter. But increasingly, customers may interact with businesses through their own agents rather than directly through a company’s app or website.

If a procurement agent compares suppliers, a travel agent books a trip, or a financial agent evaluates products, the customer may never see the interface a company spent years perfecting. The agent will care less about visual design and more about whether the company’s data, policies, pricing, inventory, documentation, and transaction systems are accessible, structured, trustworthy, and machine-readable.

That means the competitive surface area changes. A company’s brand may still be emotional, but its operational interface will increasingly be data. Businesses that expose confusing, inconsistent, or poorly governed information will be harder for agents to work with. Businesses with clean semantics, reliable APIs, governed data, and clear policies will become easier to choose, easier to transact with, and easier to trust.

The interface does not vanish only inside the enterprise. It may vanish between enterprises, too.

The real AI readiness test

Most executives know they need an AI strategy, but fewer have internalized what that really requires. AI readiness is not the number of pilots launched, the number of models tested, or the number of employees with access to a chatbot. It is whether the organization’s knowledge, data, permissions, workflows, and decision logic are ready for machines to reason over them safely.

For decades, enterprise software forced humans to become translators between business intent and machine logic. AI is reversing that relationship. Machines are beginning to adapt to human intent. But they can only do that if the enterprise has done the work to make its own context legible.

The future of software is not another screen. It is a system that understands the business well enough to help run it. And that means the next great interface will not look like an interface at all.

Baris Gultekin is VP of AI at Snowflake.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.

Researchers trained an open source AI search agent, Harness-1, that outperforms GPT-5.4 on recalling relevant information

Researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and the open-source, AI-native vector database platform Chroma have unveiled Harness-1, a 20-billion-parameter open-source search agent built atop OpenAI’s gpt-oss-20B model that fundamentally redesigns how AI executes complex retrieval tasks.

Harness-1 achieves a massive leap in performance, averaging 73% on its ability to correctly recall relevant information from a curated dataset. That outperforms even GPT-5.4 (70.9%) and beats the next most accurate open-source search agent, Tongyi DeepResearch 30B, by 11.4 percentage points. (While GPT-5.5 has also been out for more than a month, the researchers didn’t test against this model as it wasn’t available when they were building theirs.)

Crucially for developers, the model and its environment are available immediately under the highly permissive Apache 2.0 license, with the model code and weights on Hugging Face.

Harness-1 also serves as proof-of-efficacy of another effort, Tinker, the distributed, web-based AI model training and fine-tuning API developed by Thinking Machines. Tinker was used specifically to train and run inference for Harness-1, highlighting how interactive infrastructure is actively enabling the next generation of autonomous models.

So how did the researchers do it?

Benchmarks Decoded (and Why Harness-1 Could Help Enterprises Tremendously)

To actually put these models to the test, the researchers evaluated Harness-1 and its competitors across eight highly complex search benchmarks. Rather than asking simple trivia questions, these tests required the AI to act like a real researcher sifting through diverse, dense data sources.

The benchmarks spanned several different domains, including open web searches, complex financial filings from the SEC, technical patent databases from the USPTO, and “multi-hop” question-answering tasks where the AI had to logically piece together scattered clues from multiple different documents to arrive at the correct answer.

When the results came in, Harness-1 dominated the open-source competition in its ability to find and curate the right facts. Even more impressively, this relatively small 20-billion-parameter model went toe-to-toe with massive, expensive proprietary AI systems. It outperformed heavyweights like GPT-5.4, Sonnet-4.6, and Kimi-K2.5, models thought to have hundreds of billions or trillions of parameters. Only one giant frontier model, Opus-4.6, managed to narrowly edge it out in overall average performance.

Harness-1 achieves its performance gains by offloading the exhaustive “bookkeeping” of a search session out of the model’s working memory and into a structured software environment.

As enterprise use cases grow more sophisticated, demanding that models autonomously sift through thousands of corporate documents or financial filings, these systems frequently succumb to “search amnesia”—forgetting their original queries, looping over rejected documents, or losing track of the specific claims they are trying to verify.

Until now, the prevailing solution to this amnesia has been brute force. Engineers typically force models to constantly reread an ever-expanding, append-only transcript of their own actions, piling every search, read, and thought back into a massive context window.

Harness-1 introduces a paradigm shift away from this method, proving that the bottleneck for true artificial autonomy isn’t necessarily the size of the model, but how efficiently its working environment manages state. It highlights once more, as Anthropic’s Claude Code has also done, that the raw model is arguably less important than the harness — or set of conditions — through which it runs.

Technology: Doing the Paperwork in the Environment

To understand the technical leap of Harness-1, consider a real-world analogy.

Imagine hiring a brilliant research assistant and placing them in an empty room without a desk, notepads, or filing cabinets. You ask them to write a comprehensive report on a highly complex topic, which requires them to read dozens of books while keeping every single quote, citation, and dead-end search perfectly memorized in their own head. Eventually, no matter how intelligent the assistant is, their cognitive load will max out, and they will start dropping facts or losing the thread of the assignment.

This is exactly how traditional search agents operate today. They are trained as policies over growing transcripts, meaning the model searches, reads, searches again, and appends everything into its own context window.

As lead researcher Patrick (Pengcheng) Jiang of the University of Illinois noted on X: “At some point the model is not just ‘searching’ anymore. It is also being asked to be a memory system, a note taker, a verifier, and a librarian.”

Harness-1 solves this by giving the AI a desk and a filing cabinet—what the research team calls a “state-externalizing harness.”

This harness is an active, surrounding environment that takes over the routine bookkeeping, maintaining a recoverable working memory that includes a candidate pool of documents, an importance-tagged curated evidence set, compact evidence links, and verification records.

By separating semantic choices from structural state management, the AI is freed up to do what it does best.

The policy still decides what to search, determines which documents to keep, and knows when to stop, while the environment simply holds the state.
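To picture what "the environment simply holds the state" might mean in code, here is a minimal Python sketch of an environment-side working memory with a candidate pool, an importance-tagged curated set, and verification records. It is an illustration under assumptions, not the research team's released harness: every class, method, and field name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessState:
    """Hypothetical environment-side working memory for one search session."""
    candidates: dict[str, str] = field(default_factory=dict)  # doc_id -> text
    rejected: set[str] = field(default_factory=set)
    curated: dict[str, str] = field(default_factory=dict)     # doc_id -> importance tag
    # (claim, doc_id, supported) records linking claims to evidence
    verifications: list[tuple[str, str, bool]] = field(default_factory=list)

    def add_candidates(self, docs: dict[str, str]) -> list[str]:
        """Store search results, skipping documents already seen or rejected."""
        new_ids = [d for d in docs if d not in self.candidates and d not in self.rejected]
        for doc_id in new_ids:
            self.candidates[doc_id] = docs[doc_id]
        return new_ids

    def reject(self, doc_id: str) -> None:
        """Drop a candidate and remember the rejection so it is not revisited."""
        self.candidates.pop(doc_id, None)
        self.rejected.add(doc_id)

    def record_verification(self, claim: str, doc_id: str, supported: bool) -> None:
        self.verifications.append((claim, doc_id, supported))

    def curate(self, doc_id: str, importance: str) -> None:
        """Promote a known candidate into the evidence set with an importance tag."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown candidate: {doc_id}")
        self.curated[doc_id] = importance

    def summary(self) -> str:
        """Compact view the policy could read instead of a full transcript."""
        return (
            f"candidates={len(self.candidates)} curated={len(self.curated)} "
            f"rejected={len(self.rejected)} verified={len(self.verifications)}"
        )


state = HarnessState()
state.add_candidates({"d1": "10-K excerpt", "d2": "press release"})
state.reject("d2")
new_ids = state.add_candidates({"d2": "press release", "d3": "patent abstract"})
state.record_verification("revenue grew year over year", "d1", True)
state.curate("d1", "high")
print(new_ids, state.summary())  # ['d3'] candidates=2 curated=1 rejected=1 verified=1
```

In this sketch the policy still makes every semantic call (what to search, what to reject, what to promote); the environment only guarantees that a rejected document does not resurface and that the session can be summarized in a line.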

Training Harness-1: A Masterclass in Data Efficiency

The training pipeline for Harness-1 represents a fundamental shift in how the AI industry approaches agentic learning.

Historically, developers have treated search agents as policies operating over massive, ever-growing transcripts, forcing reinforcement learning (RL) algorithms to simultaneously optimize both semantic reasoning and the raw memorization of a search state.

Harness-1’s creators took a radically different approach: because their custom “harness” handles all the routine bookkeeping—like maintaining evidence links, candidate pools, and verification records—the training process only needed to teach the model how to operate this structured interface.

This division of labor drastically simplified what the underlying 20-billion parameter model actually needed to learn.

The process began with a remarkably narrow Supervised Fine-Tuning (SFT) stage. Rather than scraping petabytes of new behavioral data, the team generated just 899 filtered trajectories using a GPT-5.4 teacher agent that was plugged into the exact same harness environment the student model would eventually use.

The goal of this SFT phase was not to inject vast amounts of domain knowledge into the model, but simply to teach it the mechanical rhythms of a good researcher: how to format tool calls, how to tag documents by importance, and the discipline of verifying a claim before promoting it to the final curated set.

Following SFT, the model underwent Reinforcement Learning (RL) using an algorithm called CISPO, applied over full search episodes capped at 40 turns.

The team designed a highly specific terminal reward function that explicitly separated discovery from selection. The model was rewarded not just for finding a relevant document, but for successfully promoting it into the final answer set, while being penalized if it found the answer but failed to curate it.

The researchers also instituted a “tool diversity” bonus; without this specific incentive, they found the policy would quickly collapse into a lazy, search-heavy strategy where it spammed queries but bypassed the harder work of reading and verifying the text.
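The article does not give the reward formula, so the following Python sketch is only one way an episode-level reward with these properties could be shaped. The weights, the linear form, and the function and parameter names are all assumptions made for illustration; they are not the researchers' actual reward.

```python
def terminal_reward(
    relevant: set[str],
    discovered: set[str],
    curated: set[str],
    tools_used: list[str],
    miss_penalty: float = 0.5,
    diversity_bonus: float = 0.1,
) -> float:
    """Hypothetical end-of-episode reward separating discovery from selection."""
    if not relevant:
        return 0.0
    # Credit only documents that were both relevant and promoted to the final set.
    selected = relevant & curated
    # Penalize relevant documents the agent found but never curated.
    found_not_curated = (relevant & discovered) - curated
    selection_score = len(selected) / len(relevant)
    penalty = miss_penalty * len(found_not_curated) / len(relevant)
    # Small bonus for each distinct tool beyond the first, to discourage search-only runs.
    bonus = max(diversity_bonus * (len(set(tools_used)) - 1), 0.0)
    return selection_score - penalty + bonus


balanced = terminal_reward(
    relevant={"a", "b"},
    discovered={"a", "b", "c"},
    curated={"a"},
    tools_used=["search", "search", "read", "verify"],
)
search_only = terminal_reward(
    relevant={"a", "b"},
    discovered={"a", "b"},
    curated=set(),
    tools_used=["search"] * 4,
)
print(round(balanced, 2), round(search_only, 2))  # 0.45 -0.5
```

Under these assumed weights, an episode that finds both relevant documents but curates neither scores worse than one that curates a single document using a mix of tools, which is the behavior the described incentives aim for.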

What makes Harness-1 truly innovative compared to prior work is its unprecedented data efficiency. The entire model was trained on roughly 4,400 unique items—899 SFT trajectories and 3,453 RL queries.

In stark contrast, competing open-source models required vastly larger datasets to achieve worse results: Context-1 utilized over 17,200 training items, while Search-R1 relied on a staggering 221,300 items to learn search behaviors.

By proving that a smarter external cognitive architecture can replace brute-force data scaling, Harness-1 suggests that the future of agentic AI lies in building better environments for models to work within, rather than just training larger models on more data.

Product: Enterprise Applicability and Generalization

From a product perspective, Harness-1 is delivered as a highly capable 20B agent merged into the openai/gpt-oss-20b base architecture.

For enterprise tech stacks, the applicability is massive because businesses need AI to execute multi-step research across proprietary databases without hallucinating or running up exorbitant compute bills.

Harness-1 manages its frontier-level performance at what the creators describe as “Context-1-level cost and latency.” Because the context window is strictly managed by the budget-aware harness rather than continuously expanding, enterprises can deploy this agent autonomously without incurring the exponential token costs typically associated with long-horizon AI tasks.

Even more impressively, Harness-1 proves it can generalize well beyond its training data. According to the research team, it was incredibly cheap to train, utilizing just 899 filtered supervised fine-tuning (SFT) trajectories and a mere 3,453 reinforcement learning (RL) queries.

“Instead of training the model to survive a giant append-only transcript, we train it to use a structured search interface: search, curate, revisit, verify, and submit,” Jiang explained.

This leanness proves a critical point for the AI industry: developers do not necessarily need petabytes of new behavioral data if they build a better cognitive framework for the model to operate within.

Licensing: The Power of Apache 2.0

One of the most significant aspects of the Harness-1 release is its licensing. In plain language, Apache 2.0 is a highly permissive, enterprise-friendly software license that fundamentally enables commercialization.

Unlike “copyleft” licenses (such as the GPL) that can force companies to open-source their own proprietary software if they integrate the code, or “research-only” licenses that ban commercial use entirely, Apache 2.0 gives businesses the green light to freely build, modify, and monetize the technology.

For developers and startups, this means Harness-1 can be seamlessly integrated into commercial enterprise search products, internal data retrieval tools, or customer-facing AI applications without fear of legal reprisal.

The only major requirement is that users must include the original copyright notice and explicitly state any significant modifications they make to the source code, positioning Harness-1 as a highly viable foundational building block for the enterprise.

Community Reactions: A Resounding Validation

The announcement has clearly struck a nerve within the developer community, validating the very real pain points engineers face when building agentic systems. Jiang’s multi-part announcement thread on X quickly garnered massive traction, pulling in over 256.1K views, 3.7K likes, 2.9K bookmarks, and nearly 300 reposts within a matter of days.

This high engagement underscores a growing consensus in the AI space that brute-forcing context windows is a losing battle.

When Jiang posted on X, “I’ve been wondering: maybe search agents are bad at search partly because we make them do all the paperwork in their head,” the resonance was immediate.

For developers who have spent the last year wrestling with AI agents that confidently forget their primary instructions halfway through a database search, the Harness-1 approach feels like a desperately needed course correction.

Ultimately, the community sentiment highlights a shift in industry priorities. Developers are moving away from asking how large an AI model’s context window can get, and instead asking how efficiently an AI model’s environment can manage that context for it. By offloading the paperwork, Harness-1 is proving that smaller, smarter systems can outmaneuver the giants—provided they have the right desk to work at.

When Claude changed, everything changed: Managing AI blast radius in production

Our system did one thing, and it did it well: It turned natural-language questions into API calls.

The users were analysts, account managers, and operations leads. They knew what data they needed, but assembling it manually meant pulling from four dashboards, two BI tools, and a Salesforce report builder. With our system, they typed the request in plain English. A request like “Compile a report on sales volume for January through March 2026 for the Northeast region, broken down by city” was translated into an API call that the system could act on:

json

{
  "description": "User requested sales volume for the given date range, here is the API call to get the response",
  "api_call": "/api/sales_volume",
  "post_body": {
    "start_date": "2026-01-01",
    "end_date": "2026-03-31",
    "region": "northeast"
  }
}

The rest of the pipeline was conventional engineering. The system dispatched the call to the right backend (we had integrations with internal reporting portals, Salesforce, and several homegrown services), applied a large language model (LLM)-generated JSON query to filter and shape the response, and delivered it via email, as a Drive document, or rendered as a chart in the browser.

By mid-2025, the system was generating several hundred reports a month. These reports were consumed by leadership and analysts and circulated to external stakeholders. It had become the default way most teams pulled ad-hoc data.

The contract between the LLM and the rest of the system was a structured JSON object as described in the above example.

We built it on Claude Sonnet 3.5 in early 2025. We upgraded to 3.7 without incident, and to 4.0 without incident. By the time Sonnet 4.5 shipped, we had grown complacent about the stability and predictability of LLMs in solving what we believed was a simple problem. Model upgrades had become routine, like bumping a minor version of a well-behaved library.

Then we rolled out 4.5. For a meaningful percentage of requests, the model began folding the contents of post_body into the description field. Two failure modes followed.

First, the filter parameters never reached the API. Our system read post_body as the source of truth for the request payload, and that field came back empty. The API call was made without the date range or region filter. Depending on the specific API being called, the backend either returned unfiltered sales volume (all dates, all regions) or returned a 500 error.

Second, the model started asking clarifying questions in its response. This was new. Earlier versions always took a best-effort approach to an ambiguous request and returned a structured object. Sonnet 4.5, being more cautious, would sometimes respond with a question instead. Our system had no path for this. It had been built on the assumption that every model invocation would result in an API call. There was no human-in-the-loop component and no state to hold a partially completed request. This caused downstream systems to break in multiple ways.
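Both failure modes could have been stopped at the boundary by a guard that refuses to dispatch anything that does not satisfy the contract. The Python sketch below is not the code we ran; the function name, the exception, and the per-endpoint list of required filters are assumptions made for illustration.

```python
import json

REQUIRED_FIELDS = ("description", "api_call", "post_body")


class ContractViolation(Exception):
    """Raised when the model's reply cannot be safely dispatched."""


def parse_llm_reply(raw: str, required_params: tuple[str, ...]) -> dict:
    """Validate a reply against the three-field contract; fail closed on any doubt."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers free-text replies such as clarifying questions.
        raise ContractViolation("reply is not JSON") from exc
    if not isinstance(reply, dict):
        raise ContractViolation("reply is not a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in reply]
    if missing:
        raise ContractViolation(f"missing fields: {missing}")
    body = reply["post_body"]
    if not isinstance(body, dict):
        raise ContractViolation("post_body is not an object")
    # An empty or partial post_body must never reach the backend as "no filter".
    absent = [p for p in required_params if not body.get(p)]
    if absent:
        raise ContractViolation(f"post_body lacks filters: {absent}")
    return reply


FILTERS = ("start_date", "end_date", "region")  # hypothetical per-endpoint requirement

good = json.dumps({
    "description": "Sales volume for the requested range",
    "api_call": "/api/sales_volume",
    "post_body": {"start_date": "2026-01-01", "end_date": "2026-03-31", "region": "northeast"},
})
folded = json.dumps({
    "description": "Sales volume, start_date=2026-01-01, end_date=2026-03-31, region=northeast",
    "api_call": "/api/sales_volume",
    "post_body": {},
})
question = "Which fiscal calendar should I use for Q1?"

for raw in (good, folded, question):
    try:
        parse_llm_reply(raw, FILTERS)
        print("dispatch")
    except ContractViolation as err:
        print(f"blocked: {err}")
```

A guard like this does not fix the model's behavior; it converts a silent wrong answer into a visible, countable failure.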

We rolled back to 4.0. That was harder than it should have been: Between the 4.0 and 4.5 deployments, our team had added new API integrations, all of which were qualified against 4.5. Reverting the model meant requalifying every one of them against 4.0 under time pressure.

Why traditional engineering discipline fails here

Software engineering rests on the ability to bound the effect of a change. When you upgrade a driver or library, you read the release notes to see whether to expect breaking changes. Unit tests circumscribe what could possibly have moved. You can leverage the following property: The system being changed is deterministic enough that its behavior can be predicted, or at least sampled densely enough to give you confidence. The blast radius is bounded by construction.

LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.

This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.

Anatomy of the failure

The post-mortem revealed that our prompt had always been under-specified. We had told the model to return a JSON object with three fields. We had described what each field was for. We did not explicitly state that the description must be a natural-language string and must not contain serialized representations of other fields.

Earlier versions of the model inferred this constraint from context. Sonnet 4.5, evidently better at being “helpful” in its formatting choices, decided that asking for clarification or placing the request body in the description made the response more useful. From the model’s perspective, this was a reasonable interpretation of an ambiguous instruction. However, this violated the assumptions under which our system was built.

The bug was not in the model. The bug was in our assumption that the model would continue to fill in our specification gaps as it always had. Three successful upgrades had trained us to believe those gaps were safe.

Structured output modes and tool-use APIs would have caught this specific failure at the schema level. We weren’t using them for engineering reasons outside the scope of this article. But schemas only constrain syntax, not semantics. A schema cannot specify that a clarifying question shouldn’t appear in a system with no path for clarification, or that a date range should never silently default to all-time. Schemas solve the easier half of the problem.
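The gap between syntax and semantics can be shown with a hand-rolled check. This is an illustrative sketch only: it does not use any real schema library, and the filter names `region` and `date_range` are assumptions made for the example.

```python
def passes_schema(response: dict) -> bool:
    """Syntactic check: required keys exist and have the expected types."""
    return (
        isinstance(response.get("description"), str)
        and isinstance(response.get("post_body"), dict)
    )


def passes_semantics(response: dict, required_filters: set) -> bool:
    """Semantic check: the payload carries every filter the request needs."""
    return required_filters.issubset(response["post_body"].keys())


# A schema-valid reply that still drops the date range the user asked for.
reply = {"description": "Sales volume for EMEA", "post_body": {"region": "EMEA"}}
```

Here `passes_schema(reply)` holds while `passes_semantics(reply, {"region", "date_range"})` does not, which is the kind of silent default a schema alone cannot rule out.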

The evals-first architecture

The discipline that closes this gap is to treat the evaluation suite — not the prompt — as the formal specification of the system. The prompt is an implementation of the spec. The model is an interpreter. The evals are the spec itself, and any model or prompt change is valid if and only if it passes them.

In practice, an eval is a triple: An input, a property the output must satisfy, and a scoring function. For our system, the eval that would have caught the 4.5 regression looks roughly like this:

python

def test_description_contains_no_serialized_payload(response):
    # The description must be plain prose, never a serialized request.
    desc = response["description"].lower()
    forbidden = ["curl", "post_body", "{", "http://", "https://"]
    assert not any(token in desc for token in forbidden), \
        f"description leaked structured content: {response['description']}"

A few hundred such properties, some written by hand for known-important invariants, some generated as regression tests from real production traffic, some scored by an LLM-as-judge for fuzzier qualities like tone, become a gate. Model upgrades and prompt changes should be treated as pull requests that must turn the suite green before they merge.
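The triple structure described above can be sketched as a small data type plus a gate function. The names `Eval` and `run_gate` are hypothetical, and `call_model` stands in for whatever client invokes the model; none of this is taken from the system in the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Eval:
    name: str
    model_input: str                  # the input
    prop: Callable[[dict], bool]      # the property the output must satisfy
    score: Callable[[bool], float]    # the scoring function


def run_gate(evals: List[Eval], call_model: Callable[[str], dict],
             threshold: float = 1.0) -> bool:
    """Return True only if the mean score across all evals meets the threshold."""
    scores = [e.score(e.prop(call_model(e.model_input))) for e in evals]
    return bool(scores) and sum(scores) / len(scores) >= threshold


def no_payload_in_description(response: dict) -> bool:
    desc = response["description"].lower()
    return not any(tok in desc for tok in ("curl", "post_body", "{"))


# One-entry suite; float(True) is 1.0 and float(False) is 0.0.
suite = [Eval("description_is_prose", "sales in EMEA last quarter",
              no_payload_in_description, float)]
```

An empty suite fails the gate by design, so a misconfigured pipeline cannot pass by running nothing.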

Evals are expensive to build and maintain. They drift as your product changes. LLM-as-judge scoring introduces its own variance in outcomes. And the suite can only catch failure modes you have thought to specify — you cannot eval your way to safety against a category of failure you have never imagined. We learned this lesson the hard way: Nobody on our team had ever written an assertion that said “the description field should not contain a curl command,” because nobody had thought the model would put one there.

Evals are not a silver bullet. They give you the ability to bound the blast radius of a change in the only way available when the underlying function is a black box: By densely sampling the input-output behavior you actually care about, and refusing to deploy when that behavior moves.

The roadmap

The engineering community has yet to develop a body of knowledge for writing effective evals. There are no widely accepted standards for what ‘coverage’ means in natural language input spaces. CI/CD systems were not built to gate probabilistic test outcomes. As agents take on more autonomous work — writing code, moving money, scheduling infrastructure changes — the gap between “the model passed our smoke tests” and “we know what this system will do in production” becomes the central engineering problem of the next several years.
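One possible way to gate a probabilistic outcome is to run each eval many times and require a confidence bound on the pass rate, rather than a single green run. The sketch below uses the Wilson score lower bound; the function names, the 95% bar, and the z value are assumptions chosen for illustration, not an accepted standard.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed pass rate."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def gate(passes: int, trials: int, min_rate: float = 0.95) -> bool:
    """Allow a deploy only when the bound on the true pass rate clears min_rate."""
    return wilson_lower_bound(passes, trials) >= min_rate
```

With these assumed settings, 20 passes out of 20 is not enough evidence to clear a 95% bar, while 100 out of 100 is.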

The teams that close that gap will be the ones who stop treating evals as a quality-assurance afterthought and start treating them as the actual specification of what their system is.

Vijay Sagar Gullapalli is Founding AI Engineer at Adopt AI and a USPTO-patented inventor.

Sarat Mahavratayajula is a Senior Software Engineer at Sherwin-Williams.

AI agents are learning on the job — just not for your whole team

When someone on a team corrects an AI agent — better prompts, better feedback, better context — that improvement disappears the moment a colleague opens the same tool. The correction doesn’t transfer, and the next person starts from zero.

The problem compounds in multi-agent workflows, where teams expect agents to share context across users and tasks. Without a shared memory layer, every team member effectively trains a different version of the same agent — and those versions never sync.

That gap shows up in the numbers. According to Asana’s own research, 75% of knowledge workers use AI on the job, but only 5% of companies have reported productivity gains. 

“Model providers are getting really, really good at improving reasoning and retry loops, but what they’re not good at is bringing the enterprise work context in a way that human beings can reason about for shared memory,” Asana Chief Product Officer Arnab Bose told VentureBeat. 

Asana had been building toward an agentic platform that centers context and shared memory. Its Agentic Work Management platform ensures that if any team member corrects an agent, that correction applies to everyone else on the team. 

“That context graph is automatically provided to agents operating inside Asana’s system so you don’t have to have every human member of the team become an expert at prompt engineering or context engineering,” Bose said. 

Bose said the shared memory architecture matters beyond Asana’s own product; it’s the design decision enterprises need to make for any multi-agent system.

Shared memory also becomes important when enterprises begin moving from simple single agents to multi-agent workflows that need to share context and behaviors. 

Memories for a multi-agent, multi-platform workflow

The models powering agents are stateless by design, so memory becomes a dedicated layer outside of a context window. While this area of AI innovation is marching towards maturity, the question of what gets stored, who controls it, and how it stays consistent when different agents and users write to the same instance remains largely unsolved.

This is manageable for use cases with only one user. However, in enterprise agentic workflows, the idea is for agents to work with the entire team. On most platforms, agents still act for individuals, which leads to repeated tasks, inconsistent versions of reality, and mistakes that spread. Agents can also end up contradicting each other.
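The difference between personal and team memory can be pictured as a question of which scope a correction is stored under. This toy sketch is not any vendor's design: `MemoryStore` and the `user:` and `team:` scope keys are hypothetical, and it ignores the harder problem of concurrent, conflicting writes.

```python
from collections import defaultdict
from typing import Dict, List


class MemoryStore:
    """Toy memory layer: corrections are stored under a scope key."""

    def __init__(self) -> None:
        self._notes: Dict[str, List[str]] = defaultdict(list)

    def remember(self, scope: str, correction: str) -> None:
        self._notes[scope].append(correction)

    def recall(self, *scopes: str) -> List[str]:
        # Merge every scope the caller belongs to, in the order given.
        return [note for s in scopes for note in self._notes.get(s, [])]


store = MemoryStore()
store.remember("user:alice", "Use ISO dates")            # visible to Alice only
store.remember("team:finance", "Round to two decimals")  # visible to the team
```

A colleague who recalls `"user:bob", "team:finance"` sees the team correction but not Alice's personal one.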

Sriharsha Chintalapani, co-founder and CTO of Collate, said in an email to VentureBeat that the lack of shared memory is a major obstacle for multi-agent workflows particularly around consistency.

“Agents are sensitive to the quality of their prompts,” Chintalapani said. “Someone with a strong understanding of the task will generally get more accurate results than someone less experienced. Partly that’s because they’re able to construct more detailed prompts, but also because they’re able to give the agent better feedback. The agent remembers the corrections it’s received and applies that knowledge to successive prompts. The more accurate the feedback, the better the agent will perform for that user.”

He added that organizations should stop treating shared memory solely as a prompt engineering problem and think of building systems that repeat context across every conversation.

Neej Gore, chief data officer at Zeta Global, said in a separate email that shared context becomes a living memory that “compounds intelligence across the enterprise.”

The opportunity may lie in building AI agents that retrieve memory relationally, pulling in relevant context based on what’s being asked — an approach Chintalapani says few organizations outside the largest model providers are equipped to build.

Personal versus team agents

AI agents have already proliferated across enterprises, but many of them operate as personal agents doing work specific to individual users. Most prompts start from one person, files are uploaded by one account, and even agents living in a company-wide system mostly learn individual user preferences. 

Most enterprise AI workflow platforms recognize that memory is important but approach it through different lenses. For example, Microsoft’s Copilot takes an individual-first approach by learning a user’s role within the organization, tone preferences and working patterns, which are then stored as personal memories for the agent to apply across the different Microsoft 365 surfaces.

For engineering and orchestration teams evaluating agentic platforms, the shared memory question is now a procurement criterion — not just a technical nicety. An agent that learns only for the person using it will require ongoing individual upkeep. One connected to a team-wide memory layer builds institutional knowledge automatically.

OpenAI’s Codex update lets agents build interactive enterprise workspaces via Sites and role-specific plugins

Agentic AI is moving rapidly from the developer terminal to the corporate world.

On Tuesday, OpenAI announced a major update of its agentic AI platform Codex, introducing domain-specific workflows, a rapid, semi-private web hosting feature for enterprises called “Sites,” and an in-place editing tool named “Annotations.”

The release marks a deliberate strategy to transform Codex from a specialized programming assistant into an everyday operating environment for business professionals.

Non-developers—including financial analysts, marketers, operators, and researchers—now constitute approximately 20% of the platform’s 5 million weekly users and are adopting the technology three times faster than traditional engineers, according to research shared by OpenAI with VentureBeat and other outlets.

OpenAI is capitalizing on this shift to position Codex as the premier application for white-collar task automation. The timing of the announcement is highly strategic, arriving just as Microsoft, OpenAI’s primary investor turned business rival, kicks off its annual Build developer conference in San Francisco this week, where a slate of competing enterprise productivity tools is expected, and hot on the heels of Anthropic’s rapid adoption among knowledge workers via its Claude Cowork and Claude Code platforms.

Annotations enable more precise agentic AI spreadsheet edits and updates

For business users, the most critical technical upgrade is the elimination of full-document regeneration. Previously, instructing an AI to update a specific chart or spreadsheet calculation often meant the model had to rewrite the entire file, which frequently broke custom formatting or introduced hallucinations.

OpenAI addresses this through Annotations, a localized context-scoping mechanism. As demonstrated in the company’s release materials, the platform maps a document’s underlying data schema.

When a user highlights a specific segment—such as a block of cells in a financial model—Codex isolates those exact data arrays.

If an analyst prompts the system to “Add a chart of revenue, EBITDA, and net income over the selected years,” the model executes the code strictly within that boundary, generating the visualization while leaving the surrounding cell dependencies, styles, and unselected formulas completely untouched.

New role-specific Plugins for enterprise functions that bundle skills and external SaaS app connections

To further anchor Codex in daily enterprise operations, OpenAI has introduced modular software bundles and a rapid-prototyping hosting environment.

The company is rolling out six role-specific plugins that aggregate 62 popular business applications (including Snowflake, Figma, and Salesforce) and 110 automated skills straight out of the box.

  • Data Analytics: Unifies cloud environments like Snowflake, Databricks Genie, Hex, and Tableau to translate natural language inquiries into data reports and change-analysis dashboards.

  • Creative Production: Connects Figma, Canva, Shutterstock, Picsart, and Fal to generate and iterate on ad variations, campaign boards, and e-commerce assets directly from text briefs.

  • Sales: Integrates pipeline infrastructure across Salesforce, HubSpot, Slack, Outreach, Clay, Rox, and Actively to automate follow-up communications, close plans, and account risk reviews.

  • Product Design: Bridges Figma and Canva environments to audit live user journeys and transform static wireframes into clickable prototypes.

  • Public Equity & Investment Banking: Syncs institutional market feeds—including Moody’s, Daloopa, Datasite, FactSet, LSEG, S&P, PitchBook, and Hebbia—to streamline financial modeling, competitive landscaping, and pitch book preparation.

These integrations allow distinct departments—from data analytics and creative production to sales and investment banking—to automate complex, multi-step workflows without requiring IT to build custom API connections.

Sites allow users to spin up dynamic, hosted webpages they can share with their colleagues

Concurrently, the new Sites feature introduces an interactive canvas that converts static data inputs or text documents into functional, web-hosted internal applications.

Rolling out in preview for Business and Enterprise tiers, Sites allow cross-functional teams to bypass front-end development.

Financial leaders, for example, can transform a static spreadsheet into an interactive scenario planner shared via a secure workspace URL, allowing executives to tweak assumptions in a live web app rather than clicking through document tabs.

Instead of static decks, Sites promise to keep enterprises updated on their latest metrics and important information in an easily digestible way.

Availability & deployment

A critical operational distinction in this rollout centers on exactly where these new features can be executed. Codex’s existing infrastructure runs natively across multiple surfaces, including IDE extensions and the terminal command line.

However, the release documentation notes that Sites are rolling out “through the Codex app” and that plugins are managed via a “Codex plugin directory”.

An OpenAI spokesperson confirmed that Plugins and Sites are available in the CLI and desktop app, while Sites are hosted by OpenAI.

Licensing and pricing

These updates operate entirely within OpenAI’s closed, proprietary enterprise licensing model. Unlike open-source frameworks, enterprise clients do not maintain code-level ownership over Codex’s integration nodes.

Instead, system administrators manage deployment through centralized workspace settings, giving them explicit authority to enable or disable hosted “Sites” and restrict underlying application permissions.

These new capabilities deploy seamlessly on top of Codex’s existing commercial framework. Users will continue to access the agent via established baseline subscription tiers—such as the individual “Plus” plan ($20/month) or the high-volume “Pro” plan ($100/month)—or through a separate, seat-free pay-as-you-go model that draws down pre-purchased utility credits.

The AI agent bottleneck isn’t model performance — it’s permissions

Enterprise AI agents are stalling — not because of model performance, but because of permissioning. Every agentic workflow eventually hits the same wall: what is this agent allowed to touch, on whose behalf, and how does the system know?

Workday’s answer is to make its existing system of record the governance layer for agents. Gerrit Kazmaier, the company’s president for product and technology, told VentureBeat in an interview that customers often struggle when they cobble together solutions for their agents. 

“Sana makes sure the integrity of the approvals and security model is always adhered to,” Kazmaier said. “Frankly, that’s where we see customers struggling when they try to build do-it-yourself AI by just accessing raw data, so the richness of the security model gets lost, and the results become overly broad.”

Workday, which launched Sana in March, expanded its partnership with Google to bring its Sana agent system of record to Gemini Enterprise, so agents built on Sana are also discoverable there.

Architecting accuracy

Kazmaier said the biggest hurdle they faced was ensuring agent accuracy, especially for HR and finance users. 

“Almost right is not acceptable,” Kazmaier said. “Think about paying people correctly, closing the books or managing work schedules reliably.”  

Accuracy is harder to evaluate here than in most AI contexts. Policy configurations, role-based security, and organizational hierarchies are deeply interrelated — a small error compounds. And unlike most generative AI outputs, HR and finance queries often lack a correction loop. By the time a paycheck processes incorrectly or an interview is scheduled wrong, the damage is done.

Workday addressed this by building Gemini in as its base reasoning layer, then adding its context engine and business process logic on top. Workday also added verification and classification models that “interrogate” outputs before execution. 

Accuracy and identity, it turns out, are the same question: does the system know enough about the agent, the authorizing human, and the current state of the record to act correctly?

Workday’s advantage is that it can infer its customers’ organizational structures from the data they provide. Already, third-party identity providers like Okta verify their information by checking Workday, so its context is the system of record for many enterprises. Kazmaier said the Sana Self-Service Agent uses Gemini as the conversational surface to trigger the workflow. The user is then authenticated and authorized through Workday’s identity and security model. Sana agents will only act on behalf of that user and work within their current permissions. 

Audit trails follow the same logic: Gemini retains only interaction logs, while the main audit remains within Workday and its customer. 
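The rule that an agent acts only on behalf of a user and within that user's current permissions can be sketched in a few lines. This is a generic illustration, not Workday's implementation: the `User` and `Agent` types and the permission strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class User:
    user_id: str
    permissions: Set[str]


@dataclass
class Agent:
    acting_for: User
    audit_log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def perform(self, action: str) -> bool:
        """Run an action only if the authorizing user holds that permission now."""
        allowed = action in self.acting_for.permissions
        # Every attempt is recorded, whether or not it was allowed.
        self.audit_log.append((self.acting_for.user_id, action, allowed))
        return allowed
```

Because the check reads the user's permissions at call time, revoking or granting a permission takes effect on the agent's next action without any change to the agent itself.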

For many practitioners in HR and finance, keeping the permission and governance layer inside the agent system of record is essential in regulated environments. 

“It has to live in the system of record, that’s not a preference, that’s the only way it works,” said Dan Obendorfer, director of product at Würk, in an email to VentureBeat. “If your permissions are defined somewhere outside of where the data actually lives, you’ve already lost.”

Kadan Stadelmann, chief technology officer and co-founder of Compance.AI, made the same point separately. “Without agent ownership, performance, costs or actions, chaos ensues.”

MIT’s MeMo lets teams swap in a better LLM without retraining — and performance jumps 26%

Enabling LLMs to acquire new knowledge after training remains a major hurdle for enterprise AI — current solutions are either too expensive, too slow, or constrained by context window limits.

MeMo, a framework from researchers at multiple universities, encodes new knowledge into a dedicated smaller memory model that operates separately from the main LLM.

The modular architecture works with both open- and closed-source models and sidesteps the complexity of RAG pipelines and full model retraining.

Experiments show that MeMo handles complex queries reliably even when retrieval pipelines are noisy. It avoids the catastrophic forgetting associated with direct fine-tuning and provides a cost-effective pathway for continuous knowledge updates.

The challenge of updating LLM memory

Large language models are frozen after training and their internal knowledge remains static until they undergo subsequent, computationally massive updates.

Currently, developers rely on three main approaches to integrate external knowledge into an LLM, each with distinct drawbacks:

Non-parametric methods, such as retrieval-augmented generation (RAG) and in-context learning, retrieve relevant documents from an external database and insert them directly into the model’s prompt. While popular, these methods are limited by context window sizes. 

As Armando Solar-Lezama, a co-author of the paper, told VentureBeat, “Vector databases have a fundamentally difficult job of encoding the full semantics of a chunk of text in a single vector, and then match that vector to a query, even when the relevance of the chunk… may only be apparent in the context of other chunks.” 

The researchers note that the semantic similarity of embeddings often does not correspond to what a user’s query actually requires. Processing thousands of retrieved tokens also creates substantial computational overhead and inference latency. Most problematically, RAG systems are highly sensitive to noise. Irrelevant or poorly retrieved passages often degrade the model’s final response.

Parametric methods, like continual pretraining or supervised fine-tuning, attempt to internalize new knowledge directly into the LLM’s weights. Updating modern, massive LLMs is prohibitively expensive and typically impossible for proprietary, closed-source models hidden behind APIs. Fine-tuning is also prone to causing catastrophic forgetting. Forcing the model to adapt to new corporate data often erodes its previously acquired reasoning capabilities and safety guardrails.

Latent memory methods, such as context compression, offer a middle ground. They compress knowledge into compact “soft tokens” or representations that are added to the model’s context during inference. The fatal flaw here is “representation coupling.” The compressed memory is strictly bound to the model architecture that produced it; you can’t transfer a latent memory trained on an open-source model to a closed-source one.

How MeMo works

The MeMo (Memory as a Model) framework introduces a modular architecture featuring two separate components. The MEMORY model is a small language model trained specifically to encode new knowledge into its parameters. The EXECUTIVE model is a frozen, off-the-shelf LLM that functions as the reasoning engine. When a user asks a question, the EXECUTIVE model treats the MEMORY model as an external oracle, issuing targeted sub-queries to gather facts and synthesizing those facts into a final answer.

The core design principle driving MeMo is the concept of “reflections.” Reflections are targeted question-answer (QA) pairs designed to capture every possible angle of a knowledge corpus. Rather than forcing the AI to process a massive, unstructured document corpus during training, MeMo uses a GENERATOR model to distill the raw text into thousands of targeted QA pairs. The MEMORY model is then fine-tuned on this dataset to answer questions using only its parametric knowledge without the need to read retrieved context.

At inference time, the interaction between the two models follows a structured, three-stage protocol:

1. The EXECUTIVE model decomposes a user’s complex query into a set of atomic sub-questions. The MEMORY model answers each independently to establish the basic facts.

2. Using those initial clues, the EXECUTIVE model issues follow-up queries to narrow down candidate entities until it confidently converges on a specific target. 

3. Finally, the EXECUTIVE model queries the MEMORY model for supporting facts about that target entity and synthesizes the retrieved snippets into a cohesive answer.
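The three stages above can be sketched with stand-ins for both models. Everything here is a toy under stated assumptions: `DictMemory` replaces the trained MEMORY model with a lookup table, `ToyExecutive` hard-codes what a real LLM would generate, the facts are invented, and stage 2 issues a single follow-up rather than iterating until convergence.

```python
from typing import Dict, List


class DictMemory:
    """Stand-in for the MEMORY model: answers from a fixed lookup table."""

    def __init__(self, facts: Dict[str, str]) -> None:
        self.facts = facts

    def ask(self, question: str) -> str:
        return self.facts.get(question, "unknown")


class ToyExecutive:
    """Stand-in for the EXECUTIVE model with hard-coded behaviour."""

    def decompose(self, query: str) -> List[str]:
        return ["Who wrote the report?", "Which year was it published?"]

    def follow_up(self, clues: List[str]) -> str:
        return f"Which report did {clues[0]} publish in {clues[1]}?"

    def synthesize(self, query: str, target: str, support: str) -> str:
        return f"{target}: {support}"


def run_protocol(query: str, executive: ToyExecutive, memory: DictMemory) -> str:
    # Stage 1: decompose into atomic sub-questions, answered independently.
    clues = [memory.ask(q) for q in executive.decompose(query)]
    # Stage 2: use the clues to converge on one target entity.
    target = memory.ask(executive.follow_up(clues))
    # Stage 3: gather supporting facts about the target and synthesize.
    support = memory.ask(f"What does {target} conclude?")
    return executive.synthesize(query, target, support)


memory = DictMemory({
    "Who wrote the report?": "Ada",
    "Which year was it published?": "2024",
    "Which report did Ada publish in 2024?": "the Q3 risk review",
    "What does the Q3 risk review conclude?": "exposure fell",
})
```

Note that the executive never sees raw documents; it only exchanges natural-language questions and answers with the memory object, which is why the executive can be swapped without touching the memory.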

This architecture merges the strengths of the three existing AI memory paradigms while bypassing their pitfalls. It leverages off-the-shelf frontier models by keeping memory storage separate from reasoning, guaranteeing compatibility with both open-weight and closed API models. It internalizes knowledge directly into parameters, but isolates the updates to a smaller, dedicated MEMORY model to protect the reasoning engine. Finally, it creates a queryable memory artifact that is not tied to any specific model and can be used with different LLM families.

Handling continual knowledge updates

Managing an AI’s memory requires continuous updates as company policies change and new reports are published. Normally, updating a model’s parameters requires retraining it from scratch on both the old and the new data combined. As the knowledge base grows, this cumulative retraining cost becomes unmanageable.

To handle continual updates efficiently, MeMo relies on a technique called “model merging.” Instead of a massive joint retraining phase, MeMo trains a new, independent MEMORY model exclusively on the newly added documents. The system derives a “task vector” representing the parameter changes learned from the fresh data. These updates are then mathematically merged into the weights of the original MEMORY model.
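The task-vector idea can be illustrated with simple weight arithmetic. This sketch assumes the newly trained model starts from a shared base checkpoint and that merging is plain addition with an optional scale; the merge rule MeMo actually uses may differ, and the function names are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(base: Weights, tuned: Weights) -> Weights:
    """Parameter change learned from the new documents: tuned minus base."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}


def merge(memory: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add a (scaled) task vector to the existing MEMORY weights."""
    return {k: [m + scale * d for m, d in zip(memory[k], delta[k])] for k in memory}


# Tiny illustrative tensors, flattened to lists.
base = {"w": [0.0, 1.0]}
tuned_on_new_docs = {"w": [0.5, 1.0]}
existing_memory = {"w": [2.0, 3.0]}
merged = merge(existing_memory, task_vector(base, tuned_on_new_docs))
```

Only the small model trained on the new documents is ever fit; the existing memory weights are updated by addition rather than by retraining on the combined corpus.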

This approach reduces the computing hours required to keep the system current while avoiding the interference that causes catastrophic forgetting. 

This efficiency comes with a trade-off: model merging incurs an 11% to 19% accuracy drop compared to a full retrain, depending on the reasoning model used.

MeMo in action

To measure real-world effectiveness, the research team evaluated MeMo against several industry benchmarks that require complex, multi-hop reasoning across multiple documents.

The researchers used Qwen2.5-32B-Instruct as the GENERATOR model to distill raw text into reflections. For the primary MEMORY model, they deployed Qwen2.5-14B-Instruct. They also validated the approach on smaller 1-2B parameter models across different architectures, including Gemma3-1B. 

For the EXECUTIVE reasoning model, they tested both the open-weight Qwen2.5-32B and Google’s proprietary Gemini 3 Flash.

They benchmarked MeMo against a “Perfect Retrieval” upper bound (where the exact correct documents are manually provided) and several advanced retrieval systems, including traditional BM25 search, dense vector retrieval, and state-of-the-art graph-based RAG (HippoRAG2). They also tested “Cartridges,” a recent method that loads a trained KV-cache onto the model during inference.

MeMo dominated in long-document reasoning. On the NarrativeQA benchmark, MeMo achieved 53.58% accuracy paired with Gemini 3 Flash, according to the researchers. HippoRAG2 maxed out at 23.21%.

Enterprise systems frequently need to synthesize complex answers, such as traversing overlapping regulatory frameworks written independently by different bodies, or consolidating insights across a massive codebase and external documentation. Traditional RAG systems falter here because they hit context window limits and fail to connect concepts spanning hundreds of pages. MeMo succeeds because those connections are mapped and internalized inside the MEMORY model during training. It is “like having your very own Malcolm Gladwell that can connect the story of the Beatles with the story of Bill Gates to make an argument about the nature of expertise,” Solar-Lezama said.

The experiments revealed another major advantage: upgrading the reasoning engine requires zero retraining. Simply switching the EXECUTIVE model from the open-source Qwen to the proprietary Gemini 3 Flash boosted MeMo’s performance by 26.73% on NarrativeQA and 11.90% on the MuSiQue benchmark. For practitioners, this means you can train a MEMORY model securely on your private data and instantly plug it into the latest commercial APIs, continuously upgrading system intelligence without incurring new training costs. 

The research team described the integration as requiring no additional setup: “The base (or Executive) LLM that teams are already using in RAG can be configured to query the Memory model directly. These queries are done in natural language, similar to sending a message request to an API, with no additional setup required.” 

MeMo also handles noisy data exceptionally well. When researchers deliberately flooded the dataset with irrelevant documents (up to twice the amount of the useful information), HippoRAG2’s performance dropped by 11.55%. MeMo’s performance remained relatively stable, dropping less than 2%. Enterprise knowledge bases are typically messy, filled with duplicate documents and outdated policies. Standard RAG systems struggle with this noise, pulling incorrect paragraphs into the prompt and causing hallucinations. Because MeMo’s EXECUTIVE model interacts with a synthesized oracle rather than raw document chunks, it remains highly robust against disorganized corporate data.

Limitations and trade-offs

For engineering teams looking to deploy MeMo, there are several key limitations to consider.

Unlike traditional RAG systems that quickly index raw documents into a vector database, MeMo requires an upfront training cost for each new corpus. The data generation pipeline used to synthesize the training reflections is computationally expensive. For example, the team noted that “generating the full reflection QA dataset took approximately 240 GPU-hours on NVIDIA H200s,” while training a 14B parameter MEMORY model “took approximately 180 H200 GPU-hours.” As Solar-Lezama said, “Reducing the training cost is one of the most significant open research problems in order to make this a workhorse technique.”

Because the MEMORY model is a fixed-size neural network, its ability to internalize knowledge is bounded by its representational capacity. While the researchers did not hit a hard limit during their benchmarking, they hypothesize that “sufficiently large or information-dense corpora will exceed what a fixed-size MEMORY model can correctly compress and represent.”

Finally, because MeMo synthesizes answers from parametric memory rather than retrieving exact text snippets, it obscures the provenance of the information. This makes it difficult to attribute specific claims to original source documents, which poses a critical compliance issue for enterprise applications requiring strict audit trails.

Deciding between MeMo and traditional RAG comes down to a heuristic of “lookup vs. synthesis,” alongside data volatility. The researchers advise that “traditional RAG would be preferred when answers live in a single document or when there is a well-defined source… MeMo would be preferred when the task shifts from lookup to synthesizing an answer from information scattered across multiple chunks.” If your knowledge corpus changes rapidly (e.g., daily feeds) and you require exact source citations, RAG remains the better option due to the upfront training cost of MeMo. If your corpus consists of generalized domain knowledge that evolves slowly relative to its volume, MeMo offers vastly superior reasoning. Teams can also adopt a hybrid routing architecture in production: sending “lookup” queries to a standard vector database and “synthesis” queries to the MEMORY model.
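The hybrid routing idea can be sketched as a single dispatch function. The classifier below is a deliberately crude keyword heuristic used only for illustration; `route`, `naive_is_lookup`, and the two backend callables are hypothetical, and a production system would likely need a far better way to tell lookup from synthesis.

```python
from typing import Callable


def route(query: str,
          is_lookup: Callable[[str], bool],
          vector_search: Callable[[str], str],
          memory_model: Callable[[str], str]) -> str:
    """Send lookup-style queries to retrieval and synthesis-style queries to memory."""
    return vector_search(query) if is_lookup(query) else memory_model(query)


def naive_is_lookup(query: str) -> bool:
    # Crude keyword heuristic, for illustration only.
    return query.lower().startswith(("what is", "where is", "find", "show"))
```

Keeping the classifier as a parameter means it can be replaced, for instance by a small model, without changing either backend.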

“Looking further out, I would expect memory models to become a standard architectural component alongside retrieval,” Daniela Rus, co-author of the paper and director of the MIT Computer Science and Artificial Intelligence Lab (CSAIL), told VentureBeat, “in the same way that caching and indexing are standard components of any serious data system today.”

Pinterest cut AI costs 90% by gutting a frontier model’s vision layer

At 620 million monthly users, calling a frontier model for every image recommendation isn’t a strategy — it’s a bill. Pinterest CTO Matt Madrigal solved it by gutting Qwen3-VL’s vision layer and rebuilding it with proprietary embeddings, cutting costs 90% and boosting accuracy 30%.

Madrigal’s team has been heavily investing in customizing open-source models “foundationally in-house.”

“If you’ve got really unique data that you can then fine-tune an open source model with, data quality will, frankly, outweigh or overcome model size,” Madrigal explained in a recent VB Beyond the Pilot podcast.

How Pinterest customized Qwen for visual discovery

Pinterest, which has around 620 million monthly active users, has long applied open-source models to visual search and discovery, going back to Google’s BERT and OpenAI’s CLIP. The company built its own Pin CLIP by fine-tuning CLIP, incorporating proprietary visual embeddings and image metadata.

Pinterest’s conversational shopping assistant, Navigator 1, was built on Qwen3-VL and customized in “pretty significant” ways. Madrigal’s team essentially “ripped out” Qwen’s vision encoder layer and fine-tuned the model on proprietary multimodal embeddings. This lets the team capture metadata around pins and images in embeddings that are precomputed offline and regularly retrained on new information to deliver personalized experiences.

“Open-source models, especially with open Apache licenses where you can truly tweak a lot of open weights and customize for unique use cases — that’s where we’ve found open source to be so powerful for us,” Madrigal said. 

Bringing their own embeddings gives the model context around metadata, pins, and images. Notably, it also improves performance at inference time. Without these embeddings, developers would have to call the encoder on each returned image at runtime, one at a time. That would make inference latency “20 times worse,” Madrigal said.
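The general pattern behind that trade-off, computing embeddings offline and looking them up at request time, can be sketched in a few lines. This is a toy illustration and not Pinterest’s implementation: `CountingEncoder` is a hypothetical stand-in for an expensive vision encoder, and the in-memory dictionary stands in for whatever embedding store a real system would use.

```python
from typing import Dict, Iterable, List

Embedding = List[float]


class CountingEncoder:
    """Hypothetical stand-in for an expensive image encoder; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image_id: str) -> Embedding:
        self.calls += 1
        # Toy deterministic "embedding" so the example runs without a model.
        return [float(len(image_id)), float(sum(map(ord, image_id)) % 97)]


def precompute(image_ids: Iterable[str], encoder: CountingEncoder) -> Dict[str, Embedding]:
    """Offline batch job: encode every known image once and store the result."""
    return {image_id: encoder(image_id) for image_id in image_ids}


def embeddings_at_request(
    image_ids: Iterable[str],
    store: Dict[str, Embedding],
    encoder: CountingEncoder,
) -> List[Embedding]:
    """Request path: read from the store, encoding only images it has not seen."""
    results = []
    for image_id in image_ids:
        if image_id not in store:
            store[image_id] = encoder(image_id)  # slow path, taken only on a miss
        results.append(store[image_id])
    return results


if __name__ == "__main__":
    encoder = CountingEncoder()
    store = precompute(["pin_1", "pin_2", "pin_3"], encoder)
    embeddings_at_request(["pin_1", "pin_3"], store, encoder)
    print(encoder.calls)  # 3: the request triggered no new encoder calls
```

The call counter makes the point concrete: once the offline job has run, serving a request for known images costs dictionary lookups rather than one encoder call per image.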

“If it’s something that’s going to be critical for our end users, that’s going to drive engagement, that will have to scale to over 600 million monthly active users, we’re going to either probably build it or we’re going to leverage open source and customize the heck out of it,” he said. 

How a taste graph captures evolving interests

To guide users from inspiration to purchase, Madrigal’s team built a “taste graph”: a dynamic representation of what individual users actually like, not just what they click on. “It’s this representation of billions of people’s evolving tastes,” he said. 

People go to Google or other search engines when they have a clear picture of what they want; Pinterest is for when they’re still in the discovery phase, Madrigal said. Pinterest’s goal is to encourage “lateral exploration” and turn discovery into intent (that is, clicking through ads or making purchases).

Under the hood, the architecture combines a graph structure with representation learning. User embeddings capture a user’s evolving tastes and are continually updated based on activity, new content, and new signals. “It’s not a social graph,” Madrigal said. “It’s much more of a preference graph: What’s going to inspire you? What are you trying to do next?”

For instance, one user may be into mid-century modern designs; another may prefer a Nantucket aesthetic. Those preferences will be captured in user embeddings, and the taste graph will surface specific, relevant products as a result.
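One common way to keep a user embedding current, sketched below, is to nudge it toward the embedding of each item the user engages with and then rank candidates by similarity. This illustrates the general idea only: the exponential-moving-average update, the `rate` parameter, the two-dimensional vectors, and the item names are assumptions made for the example, and the sketch covers the embedding piece alone, not the graph structure Madrigal describes.

```python
import math
from typing import Dict, List

Vector = List[float]


def update_user_embedding(user: Vector, item: Vector, rate: float = 0.2) -> Vector:
    """Move the user vector toward an engaged item (exponential moving average)."""
    return [(1 - rate) * u + rate * i for u, i in zip(user, item)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_items(user: Vector, catalog: Dict[str, Vector]) -> List[str]:
    """Order catalog items from most to least similar to the user's taste."""
    return sorted(catalog, key=lambda name: cosine(user, catalog[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: axis 0 is "mid-century", axis 1 is "Nantucket".
    catalog = {"walnut_credenza": [1.0, 0.0], "shingle_style_lamp": [0.0, 1.0]}
    user = [0.0, 0.0]
    user = update_user_embedding(user, catalog["walnut_credenza"])
    print(rank_items(user, catalog)[0])  # walnut_credenza
```

Because each update discounts older activity by a factor of `1 - rate`, a user whose interests drift from one aesthetic to another will see the ranking follow after a few interactions.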

“You go from the upper funnel, inspiration discovery, all the way through lower funnel intent,” Madrigal said. 

Listen to the full podcast to hear more about:

  • How Pinterest uses sandboxes to encourage creativity in a way that is secure and contained; 

  • Why a continuous feedback loop can prevent visual AI slop; 

  • The importance of constant benchmarking to gauge user engagement, performance, latency, and other factors. 

You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.