AI agents need context everywhere they run, even where the cloud can’t follow

The competitive edge in enterprise AI is shifting to context: which platform can give an agent the right memory, the right retrieval and the right data at the moment of decision.

Couchbase on Tuesday announced its AI Data Plane, combining persistent agent memory, real-time context retrieval and an enterprise-managed MCP server in a single operational platform. 

Couchbase’s roots are in caching and high-transaction databases — an architecture the company argues makes it better suited for agent memory than vendors that came to the problem from search or analytics. The AI Data Plane runs identically across cloud, on-premises and disconnected edge environments, extending agent memory and local vector search to devices with no network connection.

“How do you make sure that the intelligence that you get out of these models are the ones that databases specialize in?” Gopi Duddi, CTO at Couchbase, told VentureBeat. “How can you get that value out of storage systems, which are still going to be databases?”

What the AI Data Plane delivers

The AI Data Plane packages three components designed to replace the fragmented stacks most enterprises are currently running.

Agent memory: A unified persistence layer for conversational context, structured operational data and vector embeddings. Couchbase says the guardrails are what distinguish it from standalone memory services: token constraints per session, time-to-live limits on stored memories and metering controls that cap compute consumption per agent session.

Enterprise MCP server: An enterprise-supported self-managed server for standardized model-context protocol integration, shipping as part of the platform rather than requiring a separate service.

Agent catalog: A function-level catalog of discoverable agent tooling built by Couchbase. Duddi distinguished it from metadata catalogs like Databricks Unity or AWS Glue — describing it, in his words, as closer to a glorified MCP that surfaces agent functions as callable tools within the platform.
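The agent memory guardrails described above (a token constraint per session and a time-to-live on stored memories) can be pictured with a small sketch. This is not Couchbase's API: the `SessionMemory` class, its method names and the whitespace-based token count are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryItem:
    text: str
    tokens: int
    expires_at: float


@dataclass
class SessionMemory:
    """Hypothetical per-session store with a token cap and a TTL on entries."""
    max_tokens: int
    ttl_seconds: float
    items: List[MemoryItem] = field(default_factory=list)

    def _purge(self, now: float) -> None:
        # Drop memories whose time-to-live has elapsed.
        self.items = [m for m in self.items if m.expires_at > now]

    def used_tokens(self, now: Optional[float] = None) -> int:
        self._purge(time.time() if now is None else now)
        return sum(m.tokens for m in self.items)

    def remember(self, text: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        tokens = len(text.split())  # crude whitespace count, for illustration only
        if self.used_tokens(now) + tokens > self.max_tokens:
            return False  # session token constraint reached; the write is refused
        self.items.append(MemoryItem(text, tokens, now + self.ttl_seconds))
        return True
```

A real system would use the model's tokenizer and would probably evict or summarize old memories rather than refuse the write; the point is only that both limits are enforced at write time, inside the store.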

Memory-first architecture takes agent context to the disconnected edge

Couchbase’s lineage and core architecture are what Duddi says give it an edge when it comes to context.

“We were a cache before we became a database,” Duddi said.

Writing to memory is 10x faster than writing to disk, Duddi said — a speed advantage he argues separates Couchbase from NoSQL databases that layer memory workloads on top of disk-based storage.

Couchbase isn’t the only data technology that has its roots in a caching layer. Redis is similarly rooted in caching and also recently announced an agentic AI context layer. Duddi argued that Couchbase is different in that it maintains an ACID-compliant (Atomicity, Consistency, Isolation, and Durability) database, which matters for transactional workloads. Couchbase also has a long history across multiple deployment modalities.

That architecture extends to the edge through Couchbase Lite, the platform’s on-device runtime. It runs SQL, full-text search and vector search locally without a network connection, using a proprietary sync mechanism to replicate bidirectionally back to cloud or between edge nodes when connectivity returns. The target environments are retail floor operations, field service, industrial deployments and regulated settings where agent data cannot leave the device.

Duddi cited hotel reservations as an early example: multiple agents serving customers concurrently, each pulling local context and running vector search on-device, with shared session memory synchronizing centrally. The practical benefit is token efficiency. Rather than every agent independently retrieving and processing the same data, the platform caches shared context so concurrent sessions draw on it without burning tokens repeatedly.
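The token-efficiency argument can be illustrated with a minimal shared cache: the first agent session pays for a retrieval and later sessions reuse the result. The `SharedContextCache` class and the `retrieve` callback are hypothetical names, not part of any Couchbase product.

```python
from typing import Callable, Dict


class SharedContextCache:
    """Hypothetical cache shared by concurrent agent sessions."""

    def __init__(self, retrieve: Callable[[str], str]) -> None:
        self._retrieve = retrieve          # the expensive lookup (search, model call, etc.)
        self._cache: Dict[str, str] = {}
        self.retrievals = 0                # how many times the expensive path ran

    def get(self, key: str) -> str:
        if key not in self._cache:
            self.retrievals += 1
            self._cache[key] = self._retrieve(key)
        return self._cache[key]
```

With three agents asking for the same key, `retrievals` ends at 1 rather than 3. A production version would also need expiry and invalidation, which this sketch leaves out.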

Agora’s view from production

Agora, a platform that helps developers embed real-time voice, video and conversational AI into enterprise applications, has run Couchbase in production since February 2024.

The initial use case was its Signaling product, managing channel setup and state synchronization for live calls. Expanding into conversational AI agents brought stricter requirements: memory-first architecture, full JSON support for storage and query, cross-datacenter replication for high availability and enterprise-grade vendor support.

“Couchbase was the best fit based on these criteria,” Patrick Ferriter, SVP of Product at Agora, told VentureBeat.

Agora is now extending that relationship to support context retrieval for conversational AI agents.

“This will simplify the architecture and deliver enterprise grade RAG with predictable lower latency required for conversational AI use cases,” Ferriter said.

For data professionals trying to figure out the best approach to context, there is no one answer. On platform selection, Ferriter was direct.

“It depends on the preference and goals of the organization, including timing,” Ferriter said. “If they want something enterprise grade and optimal for immediate production and scale vs. having to optimize and maintain an open-source solution with community support. We wanted the former and that is why we looked at an expanded partnership with Couchbase.”

Competitive context: following the right trend

The context layer has become a crowded space in 2025.

Oracle added a memory core to its database in March to provide a context layer. Redis added a context layer in May, as did vector-native database vendor Pinecone.

“Couchbase is following this trend, not setting it, but it’s the right one to follow,” Devin Pratt, Research Director for AI, Automation, Data and Analytics at IDC, told VentureBeat. “Its real edge is reach, running the same platform from cloud to edge to mobile, which is how enterprises actually operate. The test now is to scale against bigger names.”

For teams navigating the vendor landscape, Pratt’s framing is direct. “Match the tool to the workload. Consolidate where it makes sense, use a specialized engine like a graph database where relationship-heavy reasoning earns it, and let governance drive the call rather than treating memory as plumbing,” Pratt said.

Mistral launches OCR 4, turning document extraction into a full enterprise AI play

Mistral AI on Tuesday released OCR 4, a document intelligence model that moves beyond raw text extraction to return structured representations of entire documents — complete with bounding boxes, block-type classification, and per-word confidence scores. The release marks Mistral’s fourth generation of optical character recognition technology in roughly 15 months and lands at a moment when the company’s pitch for European AI sovereignty has never been more commercially relevant.

The model supports 170 languages across 10 language groups, accepts PDF, DOC, PPT, and OpenDocument formats, and can be deployed as a single container on an organization’s own infrastructure — a capability Mistral is positioning directly at enterprises in regulated industries that cannot route sensitive documents through U.S.-jurisdiction cloud APIs.

“Mistral OCR 4 extracts and structures content from a wide range of documents,” the company said in its announcement. “Where previous generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document.”

The model is available immediately through the Mistral API, Document AI in Mistral Studio, Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document support coming soon. Pricing starts at $4 per 1,000 pages, dropping to $2 per 1,000 pages through a batch API discount.

OCR 4 treats every document as a semantic map, not a wall of text

The central engineering shift in OCR 4 is structural. Rather than outputting a flat stream of extracted text — the paradigm that has defined OCR for decades — the model returns a layered representation in which every block is localized with a bounding box, classified by type (title, table, equation, signature, and others), and scored for confidence at both the page and word level.

Mistral says bounding boxes were its most-requested capability. The reason is straightforward: without location data, downstream systems cannot trace an extracted fact back to its source on a specific page. That traceability gap has been a persistent friction point for enterprises building retrieval-augmented generation (RAG) pipelines, compliance workflows, or any application where “where did this number come from?” is a question that needs an auditable answer.

Block classification addresses a related problem. A paragraph tagged as a “title” can segment a document into hierarchical chunks for semantic search. A block tagged as a “table” can be routed to a structured-data pipeline rather than a text summarizer. A block tagged as a “signature” can trigger a redaction workflow in a compliance system.
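The routing idea in the paragraph above can be sketched in a few lines. The `Block` fields and the pipeline names are assumptions made for illustration; they are not Mistral's response schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Block:
    """Hypothetical shape of one classified block; not Mistral's actual schema."""
    page: int
    block_type: str                           # e.g. "title", "table", "signature"
    bbox: Tuple[float, float, float, float]   # x0, y0, x1, y1 on the page
    text: str


# Block type -> downstream handler, following the three examples in the text.
ROUTES = {
    "title": "semantic_chunker",
    "table": "structured_data_pipeline",
    "signature": "redaction_workflow",
}


def route(block: Block) -> str:
    # Anything without a dedicated handler falls through to plain text processing.
    return ROUTES.get(block.block_type, "text_pipeline")
```

Keeping the page number and bounding box on each block is what lets a downstream answer point back to its location in the source document.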

These are not novel ideas in isolation, but packaging them as first-class outputs of the OCR model itself — rather than requiring a separate layout-analysis stage — removes an integration layer that enterprise teams have historically had to build and maintain themselves.

The confidence scores serve a dual purpose. At scale, they allow organizations to programmatically route low-confidence regions to human reviewers and auto-approve high-confidence extractions, building what the industry calls human-in-the-loop verification without requiring a person to review every page of every document. In production systems, OCR is rarely the end goal — it is the first step in a larger pipeline.
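A minimal sketch of that confidence-based triage follows. The 0.9 threshold and the `(word, confidence)` pair format are assumptions; a real deployment would tune the threshold on its own documents.

```python
from typing import List, Tuple


def triage(words: List[Tuple[str, float]], threshold: float = 0.9) -> Tuple[str, List[str]]:
    """Send a region to human review if any word falls below the threshold."""
    low = [word for word, confidence in words if confidence < threshold]
    if low:
        return "human_review", low   # reviewers see only the doubtful words
    return "auto_approve", []
```

For example, `triage([("Total", 0.99), ("1,204.50", 0.71)])` flags only the amount for review, while a region where every word clears the threshold is approved without a person looking at it.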

Developers building RAG systems, agent workflows, or document automation often spend more time reconstructing layout and structure than on the downstream AI logic itself. OCR 4 aims to eliminate that reconstruction step, and if it delivers on that promise, the value accrues not just in OCR cost savings but in reduced engineering hours across the entire document pipeline.

Independent reviewers preferred Mistral’s output 72 percent of the time, but benchmarks tell a complicated story

Mistral reports that OCR 4 achieved a 72% average win rate in a head-to-head human evaluation against leading competitors, conducted by independent annotators across more than 600 real-world documents in over 12 languages. The model also achieved the top overall score on OlmOCRBench at 85.20 and scored 93.07 on OmniDocBench.

But the company itself urges caution in interpreting those numbers. In its release, Mistral took the unusual step of auditing and publicly disclosing the specific types of scoring artifacts it encountered, including ground-truth errors in the reference annotations, equivalent LaTeX notation scored as mismatches, column-reading-order assumptions, and header/footer attribution issues. “We therefore treat the aggregate score as directional rather than definitive,” the company said — a notably transparent stance from a vendor announcing a product.

That transparency is well-timed. On the public OlmOCRBench leaderboard, some researchers have noted that OCR 4 currently ranks third, behind open models like Chandra OCR 2. And some open-weight models self-report higher OmniDocBench composite scores — PaddleOCR-VL-1.6 claims 96.33 — though those results have not been independently reproduced on the public leaderboard.

Early enterprise feedback has been favorable nonetheless. Aidan Donohue, an AI engineer at financial AI firm Rogo, said the company benchmarked OCR 4 against leading agentic document parsers on a chart-dense financial QA dataset and “reached equivalent accuracy at roughly 8x lower cost and 17x lower latency.” Ivan Mihailov, an AI engineer at intellectual property management firm Anaqua, said OCR 4 is “roughly 4x faster per page than our incumbent provider.” 

Enterprise buyers, however, should run their own evaluations rather than relying on any vendor’s benchmark numbers. The practical question is not which model scores highest on a leaderboard, but which model produces the fewest errors on your specific documents, in your specific languages, at a price and latency that fit your workflow.

The Anthropic export ban gave Mistral’s sovereignty pitch the proof point it needed

Mistral’s release lands in a geopolitical context that could hardly be more favorable for its strategic positioning.

On June 12, Anthropic was forced to disable all access to its newest AI models, Fable 5 and Mythos 5, after the U.S. Commerce Department used national security export controls to bar the company from distributing the models to any foreign national. Enterprise clients in finance, healthcare, SaaS, and critical infrastructure found their core intelligence services abruptly disabled, without prior warning or effective recourse. As of June 24, both models remain offline, with prediction markets giving only 57% odds of restoration before July 1.

That episode validated a warning Mistral CEO Arthur Mensch has been sounding for over a year. As Business Insider reported, Mensch warned at London Tech Week in June 2025 about American AI companies “having the keys” for their models, calling it a scenario where European companies are “giving leverage to their providers.” He added: “At some point, you need to be able to turn it off or turn it on, and you don’t want to leave it to another country.”

The argument gained further urgency as Mensch’s broader sovereignty pitch escalated in recent months. As reported by CNBC in late May, Mensch told the outlet: “Europe is lagging behind when it comes to [the] buildout of infrastructure, and so we are investing to close that gap.” 

At the same time, Mensch pushed back against Pope Leo XIV’s call for AI to be “disarmed,” arguing that Europe cannot afford to fall behind U.S. tech giants. “We’re all for ​peace, but if you look at our rivals and adversaries in the world, they’re using artificial ​intelligence … we do need to have our own capabilities,” Mensch told reporters.

OCR 4’s single-container, self-hosted deployment model is the product-level expression of that argument. A U.S.-headquartered provider offering EU data residency means documents are stored in Frankfurt but governed by U.S. law. Mistral, incorporated in France and operating under EU jurisdiction, offering on-premise containerized deployment, means documents never leave the customer’s infrastructure at all. The EU AI Act’s fine enforcement provisions take effect August 2, adding regulatory pressure to the compliance calculus for European enterprises evaluating document AI vendors.

Baidu’s free, open-weight OCR model arrived one day earlier — and the contrast is revealing

Mistral’s release did not arrive in isolation. Just one day before OCR 4 launched, Baidu shipped Unlimited-OCR on June 22 — a 3-billion-parameter MIT-licensed model that tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward.

Baidu’s model uses a technique called Reference Sliding Window Attention (R-SWA) that, as a top Hacker News commenter explained, splits the AI’s focus into two paths: maintaining full attention on the original document image while restricting memory of generated text to a tight, moving window. The result is constant KV cache size and the ability to transcribe 40-plus pages in a single forward pass. The model gathered 1,800 GitHub stars in its first 24 hours and racked up more than 479 upvotes on Hacker News, where the discussion thread ran to 109 comments.
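The two-path attention pattern, as paraphrased above, can be sketched as a visibility rule: every generated token sees all of the reference (document image) tokens, plus only the most recent few generated tokens. This is an illustration of that description only, not Baidu's implementation; the function name and the sequence layout are assumptions.

```python
from typing import List


def rswa_visible(n_ref: int, step: int, window: int) -> List[int]:
    """Positions visible to the generated token at 0-based `step`.

    Assumed layout: n_ref reference tokens first, then the generated tokens.
    """
    reference = list(range(n_ref))             # full attention on the document
    start = max(0, step - window + 1)          # sliding window over generated text
    recent = [n_ref + j for j in range(start, step + 1)]
    return reference + recent
```

Once generation passes the first `window` steps, the visible set stays at `n_ref + window` entries no matter how long the output grows, which is the intuition behind the constant cache size.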

The two releases frame what some analysts are calling the June 2026 document-AI split: self-hosted long-horizon parsing with open weights versus structured managed extraction with enterprise features.

Baidu’s model is free under an MIT license, runs on standard GPU hardware, and has no managed API or enterprise SLA. Mistral’s model is a commercial product with per-page pricing, bounding boxes, confidence scores, block classification, multi-platform distribution, and self-hosted deployment options for enterprise customers. 

Unlimited-OCR may be the better tool for a research team digitizing scanned dissertations on a single GPU. OCR 4 is built for the IT procurement process — the world of SLAs, data processing agreements, and compliance audits.

Beyond Baidu, the broader OCR competitive field includes Google Document AI, Amazon Textract, Azure Document Intelligence, ABBYY Vantage, and a growing number of open-weight models. 

On the Hacker News thread for Unlimited-OCR, practitioners offered a candid assessment of the state of the art. Joss82, who has worked on document parsing for 10 years, wrote bluntly: “OCR still sucks in 2026.” Meanwhile, one user named SyneRyder reported success with Claude for OCR of hundreds of pages of handwritten documents, noting the model delivered results with “no corrections required” and even pointed out a continuity error in the source text. These practitioner reports underscore a key tension in the market: performance varies wildly depending on the specific document type, language, and quality of the source material.

The real play is not OCR — it is an enterprise AI stack with document intelligence as the on-ramp

Step back far enough, and Mistral’s OCR 4 release is not really an OCR story. It is an enterprise go-to-market story built on top of a $4.4 billion global intelligent document processing market that is forecast to grow at a 33.1% compound annual growth rate through 2030, according to Grand View Research.

For Mistral, OCR is a wedge into enterprise AI budgets. The model feeds directly into Mistral’s Search Toolkit, the company’s open-source composable search framework announced at the AI Now Summit. In that architecture, OCR 4 serves as the ingestion layer for retrieval-augmented generation and enterprise search pipelines, converting raw documents into citation-ready, structurally classified input. The logic is clear: once an enterprise adopts OCR 4 for document extraction, Mistral’s broader model suite — including Medium 3.5 for reasoning and the Vibe agentic platform for task execution — becomes the natural next step in the stack. 

That pipeline ambition is critical context for understanding Mistral’s current fundraising trajectory. Bloomberg recently reported that the company is in early discussions to raise about €3 billion ($3.5 billion) at a valuation of roughly €20 billion — nearly double the €11.7 billion valuation from its September Series C round. To date, Mistral has raised only about $4 billion, a fraction of what its largest U.S. rivals have taken in. OCR 4 and its associated enterprise revenue pipeline are part of how the company plans to justify that higher valuation, with Mistral targeting €1 billion in revenue for 2026, up from €200 million in 2025, according to Le Monde.

Mistral is a company with roughly 1,000 employees and ambitions to compete with labs that have raised 40 times as much capital. It cannot win a general-purpose model arms race against OpenAI and Anthropic. What it can do is build a differentiated enterprise stack around sovereignty, structured document intelligence, and agentic workflows — and use that stack to capture European enterprise budgets that are increasingly wary of U.S. provider dependency. 

The pricing structure reinforces that strategy: at $2 per 1,000 pages in batch mode, the cost of processing a 100,000-page corporate archive falls to $200, making large-scale digitization projects economically viable in ways they may not have been with token-based vision-language model pricing.
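The arithmetic behind that figure is simple enough to check directly. The function below is just a calculator for the per-1,000-page prices quoted in the article; the function name is made up for the example.

```python
def ocr_cost_usd(pages: int, usd_per_1000_pages: float) -> float:
    """Cost of processing `pages` pages at a flat per-1,000-page price."""
    return pages / 1000 * usd_per_1000_pages


batch_cost = ocr_cost_usd(100_000, 2.0)     # batch API price
standard_cost = ocr_cost_usd(100_000, 4.0)  # standard price
```

At the batch rate the 100,000-page archive comes to $200; at the standard $4 rate the same archive would cost $400.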

Whether Mistral can execute that vision at scale — against Google, Amazon, Microsoft, and a surging open-source ecosystem — remains an open question. But the Anthropic export control crisis is still unresolved, European data sovereignty regulations are tightening, and a potential €20 billion funding round is on the horizon. The company is holding an OCR 4 production webinar on July 7 at 6:00 PM CET.

Two weeks ago, the argument for building AI infrastructure outside the reach of U.S. export controls was theoretical. Then the U.S. government flipped a switch, and Anthropic’s most advanced models went dark for every non-American on the planet. Mistral did not cause that crisis — but it spent the last year building the product that makes it matter.

Stanford researchers will discuss their agentic ‘scientists’ that are on course to reshape drug discovery at VB Transform 2026

Drug discovery is notoriously inefficient. Pharmaceutical projects span years, moving from one specialized human team to the next through disconnected workflows that result in knowledge loss during each handoff. 

A shocking 90% to 95% of drug discovery projects reportedly fail — one of the highest failure rates of any industry. A single successful drug can take over a dozen years and up to $1 billion from initial discovery to patient distribution, according to published reports. 

Generative AI is being used to solve some of the challenges, but Stanford researchers have moved the ball forward with agentic AI. 

A team led by James Zou, associate professor of Biomedical Data Science at Stanford University, has deployed thousands of autonomous AI “scientist” agents in a virtual biotech that simulates the full lifecycle of drug development. The agents handle everything from initial discovery through safety testing and clinical trial design, while maintaining the continuity that’s lacking in today’s drug discovery processes, according to Zou.

The project uses a hierarchical orchestration framework. At the top sits a chief scientist officer agent that acts as a planner, delegating tasks to teams of specialized agents, Zou told VentureBeat during a call ahead of his upcoming session at VB Transform 2026.

While one team of agents focuses on discovery, another manages safety, and others handle specialized analytical tasks. Because these agents operate within a unified, hierarchical ecosystem, they retain the full context of a project, maintaining continuity from the first molecule identified to the final clinical outcome.
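The hierarchical pattern described above, a planner that delegates to specialized teams while one shared record carries context across handoffs, might look roughly like this. The `ChiefScientist` class and the function-based agents are hypothetical stand-ins, not the Stanford team's framework.

```python
from typing import Callable, Dict, List, Tuple

# An agent takes a task plus the project history so far and returns a result.
Agent = Callable[[str, List[str]], str]


class ChiefScientist:
    """Hypothetical planner that delegates tasks and keeps one shared project log."""

    def __init__(self, teams: Dict[str, Agent]) -> None:
        self.teams = teams
        self.project_log: List[str] = []   # context that survives every handoff

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        for team, task in plan:
            # Each team receives a copy of everything recorded before its turn.
            result = self.teams[team](task, list(self.project_log))
            self.project_log.append(f"{team}: {result}")
        return self.project_log
```

In this toy version the safety team starts with the discovery team's notes already in hand, which is the continuity the article contrasts with human handoffs.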

The “brain” of the system relies on a vast amount of primary data. The agents are granted access to data sources ranging from genomics and FDA chemistry data to clinical trial databases using a model context protocol.

The team has invested heavily in agent-native and agent-friendly data, allowing the AI to synthesize complex information more effectively. The system relies on a mixture of models: Zou noted that Claude often serves as the backbone for coding and data analysis, alongside other models that include some fine-tuned for specialized use cases.

Zou is raising money at a roughly $1 billion valuation for his startup, Human Intelligence, based on the research.

During Zou’s session at VB Transform on July 15, titled How 10,000 agentic scientists in Stanford’s lab are set to revolutionize medical research and discovery, he will share valuable insights including strategies for managing context and long-running, multi-step workflows in a multi-agent system, the process of transforming and indexing raw enterprise data to make it agent native, and how to use human auditing and experimental reward signals to verify agent actions.

Another VB Transform session focused on the value of agentic context is Building a trustworthy agentic AI foundation: How Zillow accelerated engineering by 40%, with Zillow’s SVP of engineering and technology, Toby Roberts, and Glean’s CEO, Arvind Jain.

Interested in attending VB Transform 2026? Register here. A select number of complimentary passes are also available to senior technology leaders. Contact us to get yours.

Anthropic’s Claude Code Artifacts update brings live, shared dashboards and interactive workspaces to enterprises

Anthropic announced a potentially game-changing new feature for users of Claude Code on the Claude Team and Enterprise subscription plans: Artifacts.

This update turns a Claude Code session’s work into a live, interactive, shareable custom HTML webpage. A Claude Code user can plug in live code and multiple data sources and have the result surface at an interactive URL that they can send to teammates, whether it is a dashboard, an app design, or some other product meant for internal use.

These teammates and the original user can watch the webpage update in real time as Claude Code goes about its work autonomously or under the user’s guidance, and as the connected data sources and codebases change.

While Anthropic first introduced Artifacts to its consumer web chatbot in the summer of 2024—where it evolved from a manual toggle feature to a generally available tool for publishing code snippets and games to the web—integrating this capability directly into the Claude Code command-line interface (CLI) and desktop app bridges the gap between deep, back-end engineering and the non-technical stakeholders who need to understand it.

Product and Technology: The End of the Status Update

At its core, Claude Code Artifacts acts as a dynamic translation layer. Built directly from the unbroken context of a user’s session, the agent uses the local repository codebase, connected monitoring tools, and conversational reasoning to spin up specialized web pages.

Engineers no longer need to wire up external data sources or stand up temporary infrastructure; the AI builds the UI from what already exists.

Crucially, these web pages are not static exports. As the AI works through a terminal session, the open webpage refreshes in-place, updating charts and text instantly at the exact same URL. Every update publishes a new version history, allowing teammates to roll back or track the agent’s progress securely on desktop or mobile.

The Battle of Live, Interactive, Shared AI Work Surfaces: Anthropic’s Claude Code Artifacts vs. OpenAI’s Codex Sites

Anthropic’s update comes more than two weeks after OpenAI released a massive update to its own Codex platform, introducing a strikingly similar enterprise hosting feature called “Sites”.

This tit-for-tat product cadence highlights a rapidly escalating battle over the enterprise workspace across functions and beyond developers themselves, though there are some important technical and philosophical distinctions worth pointing out for enterprises considering either.

As revealed in their respective developer documentation webpages, OpenAI is building a platform-as-a-service; Anthropic is building a stateless canvas.

OpenAI’s Sites is designed to generate durable, full-stack web applications. According to the platform’s documentation, Codex Sites hosts projects that output as Cloudflare Worker-compatible ES modules.

Crucially, Sites supports persistent backend infrastructure: agents can automatically wire up “D1” relational databases for structured data (like user progress or saved records) and “R2” object storage for file uploads. An OpenAI Site can support public sign-ins, integrate with external identity providers, and allow for highly specific access controls tailored to specific workspace groups.

It utilizes a two-stage publishing process—saving a reviewable candidate linked to a Git commit before officially deploying to production. In short, it is a production environment designed to replace functional internal SaaS tools.

Anthropic’s Claude Code Artifacts, by contrast, deliberately avoids the backend. The newly released documentation is blunt about its limitations: “An artifact is a capture of work, not an application”.

Each Artifact is a single, self-contained HTML page capped at a rendered size of 16 MiB. To guarantee organizational security, Claude wraps the published file in a strict Content Security Policy (CSP) that blocks all external network requests.

This means the page cannot load external scripts, fonts, or stylesheets, and fetch, XHR, and WebSocket calls are completely blocked. All CSS and JavaScript must be inlined, and images must be embedded as data URIs. Artifacts cannot store form input, call an API at view time, or serve multiple routes.
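To make those constraints concrete, here is a sketch of what building such a page involves: the CSS is inlined, the image is embedded as a data URI, and the result is checked against the 16 MiB cap. The function name is hypothetical, and encoded byte length is used as a stand-in for “rendered size,” which is an assumption; this is not how Claude Code produces Artifacts.

```python
import base64
import html

MAX_ARTIFACT_BYTES = 16 * 1024 * 1024  # the 16 MiB cap cited in the article


def build_self_contained_page(title: str, css: str, png_bytes: bytes) -> str:
    """Return a single HTML document with no external references."""
    # Images are embedded as data URIs instead of being fetched by URL.
    data_uri = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    safe_title = html.escape(title)
    page = (
        "<!doctype html>\n"
        f'<html><head><meta charset="utf-8"><title>{safe_title}</title>\n'
        f"<style>{css}</style></head>\n"   # CSS is inlined, not linked
        f"<body><h1>{safe_title}</h1>\n"
        f'<img alt="chart" src="{data_uri}"></body></html>\n'
    )
    # Encoded length stands in for "rendered size" here (an assumption).
    if len(page.encode("utf-8")) > MAX_ARTIFACT_BYTES:
        raise ValueError("page exceeds the 16 MiB artifact cap")
    return page
```

Base64 inflates binary data by about a third, so the practical budget for embedded images is closer to 12 MiB than 16.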

This technical limitation is actually Anthropic’s deliberate philosophical position: While OpenAI wants to spin up persistent software portals for the whole company, Anthropic is keeping Claude Code firmly anchored in ephemeral, highly secure technical workflows. Claude Artifacts are not meant to be software; they are meant to replace whiteboard diagrams, manual bug walkthroughs, and status reports with secure, self-updating visual tools that never leak live data outside the corporate boundary.

Licensing and Enterprise Security: Keeping the Codebase Private

Because these agents sit at the nexus of proprietary company data and live codebases, licensing and access controls are a primary concern.

Both Anthropic and OpenAI have opted for closed, proprietary licensing models for these new visual workspaces. For end users and developers, the distinction is critical. Unlike permissive open-source software (such as MIT or Apache 2.0) or strict copyleft licenses (like GPL)—which grant developers the legal freedom to inspect, modify, and self-host the underlying code—neither Claude Code Artifacts nor Codex Sites can be independently forked or hosted.

Enterprise clients do not maintain code-level ownership over Anthropic’s rendering engine or Codex’s integration nodes; both operate strictly within their respective creators’ managed infrastructures.

To make this vendor-managed approach palatable to enterprise compliance teams, both companies have heavily prioritized organizational security. Anthropic ensures every artifact is private to its author by default and strictly cannot be made public to the broader internet. When an engineer chooses to share a link, it is viewable exclusively by authenticated members of their specific organization. System administrators retain ultimate authority, managing access through org-level toggles, role-based scoping, and explicit retention policies, while maintaining oversight through a centralized compliance API.

OpenAI takes a similarly gated approach with Codex Sites, rolling the feature out primarily for ChatGPT Business and Enterprise workspaces. Like Anthropic, OpenAI relies on system administrators to manage deployment through centralized workspace settings, requiring an admin to explicitly enable Sites via role-based access control (RBAC) for Enterprise tiers.

However, because Codex Sites functions more like a hosted web application, its access controls are slightly more granular. When an engineer prepares to share a deployed URL, they can apply specific access modes: restricting the site to just themselves and workspace admins, opening it to all active users in the workspace, or limiting access to custom user groups.

Furthermore, to prevent sensitive data leaks, OpenAI provides a dedicated Sites panel to manage runtime environment variables and secrets securely, ensuring those keys do not have to be committed to local source files.

Reactions and Reflections

The introduction of visual, self-updating UI layers to command-line agents is fundamentally altering how developers view their own workflows. As AI handles the raw syntax and automates the reporting, the friction of communicating technical work to stakeholders is vanishing.

Boris Cherny, the Lead and creator of Claude Code, highlighted the sheer utility of the update in a post on X earlier today:

“I’ve been using Artifacts in Claude Code for everything: visual explanations of tricky code, system diagrams, quick previews of a few animation options, data analyses and dashboards I share with the team,” Cherny wrote. “They are a game changer for how I work with Claude. Can’t wait to hear what you think!”

This sentiment is practically demonstrated in Anthropic’s launch materials. In one scenario, an engineer prompts Claude Code to investigate user drop-offs since a previous software release.

In a matter of seconds, the agent runs a SQL query, builds an interactive drop-off funnel dashboard, and diagnoses that “Pro accounts stall at the export sheet.” The AI then proposes UI fixes, updates the live charts as the code is refactored, and generates a secure link that a manager can open immediately on a mobile device.

By turning the terminal into a live, collaborative canvas, Anthropic is proving that the most valuable output of an AI coding assistant isn’t just the code itself—it is the context, the reasoning, and the ability to share that work instantly.

AWS enters the context layer race with a graph that learns from agents, not manual curation

Building a context layer between enterprise data stores and AI agents is bespoke work, with no standard service to automate or maintain the graphs over time. Amazon is making a direct play to change that.

Amazon on Wednesday entered the space, announcing a series of three products it’s positioning as a context intelligence stack for AI agents. The centerpiece is AWS Context, a new knowledge graph service that gets smarter through agent usage over time. AWS also announced the general availability of Amazon S3 Annotations and a preview of skill assets in AWS Glue Data Catalog.

The context layer is now a contested architectural category with no shortage of options from different vendors. AWS is entering that market with a different architectural premise: that the graph should learn from how agents use it automatically, without human re-curation.

“Your agents now get smarter without you having to rebuild anything from scratch,” said Swami Sivasubramanian, vice president of Agentic AI at AWS, during his AWS Summit NYC keynote.

“This service automatically builds a knowledge graph from all your existing data,” he said. “This service infers relationships across your data sets, business rules, and domain knowledge, and makes all of it available to your agents and your organization at runtime.”  

AWS Context builds a self-learning knowledge graph from existing data

Building and maintaining a context graph by hand is a problem AWS says it has seen repeatedly in customer deployments.

AWS Context maps relationships across existing data automatically: what tables exist, what columns mean, how sources relate and which sources are authoritative. It combines semantic search with graph-level reasoning and infers relationships across datasets, business rules and domain knowledge, making all of it available to agents at runtime.

“The knowledge graph improves itself over time as it learns which sources produce correct results and which parts get used,” Sivasubramanian said. 

Data stewards manage the graph through the AWS Management Console, reviewing inferred relationships, promoting them to production and attaching business definitions and usage rules. Every query inherits the calling user’s IAM and Lake Formation permissions, making agent data access auditable by identity through controls enterprises already rely on.

All metadata is published in Apache Iceberg format to Amazon S3 Tables, queryable via Athena, Redshift, Spark or any Iceberg-compatible engine, with no proprietary APIs. Third-party catalog connections are supported, so context from systems outside AWS can be pulled into the same graph. Agents query through agentic search APIs and MCP tools across Bedrock AgentCore, EKS or any MCP-compatible framework.

Context is more than just a single service

Context is a complicated space and AWS is layering multiple services to help enterprises build context across the data stack.

Amazon S3 Annotations. This service enables users to attach rich business context at the storage layer, directly to individual S3 objects. 

AWS Glue Data Catalog skill assets. Glue skill assets attach domain knowledge at the catalog layer, linking runbooks, query patterns and usage rules to data assets across the estate. 

AWS Context then synthesizes both into the knowledge graph that agents query at runtime, combining semantic search with graph-level reasoning across structured and unstructured sources. Each layer feeds the next.

AWS is entering a highly competitive context space

Snowflake announced its context approach earlier this month with its Horizon Context and Cortex Sense services. Microsoft is providing context via its Fabric IQ platform that provides a semantic ontology for data. Redis has developed a context platform that optimizes data for retrieval. Vector database vendor Pinecone has its Nexus context offering that compiles enterprise data into task-specific artifacts before agents ever query them.

AWS’s structural argument is straightforward: for enterprises already running S3, Glue and Lake Formation, AWS Context extends an existing identity model with no data movement required. The pitch is zero-integration friction — not just cost consolidation.

“Context makes agents more powerful and as the whole world is building agents, every agentic platform vendor needs a context capability,” Holger Mueller, VP and principal analyst at Constellation Research, told VentureBeat.

Mueller noted that AWS is no exception. “The concern — as with all context offerings — is going to be performance, especially for transactional data,  we will see,” he said.

Databricks says it solved the decades-old data pipeline problem that’s been slowing AI agents

For decades, data professionals have struggled with the challenge of managing both operational and analytical databases in a unified approach that doesn’t introduce latency and performance degradation.

Agents made the problem structural. A system that reasons continuously and acts on live data cannot tolerate a pipeline between itself and the information it needs to act on.

At the Data + AI Summit on Tuesday, Databricks announced two products aimed at collapsing that infrastructure. Lakehouse//RT delivers millisecond query latency directly on governed Delta and Iceberg tables, eliminating the dedicated real-time serving tier that enterprises have maintained alongside their lakehouses. LTAP, short for Lake Transactional/Analytical Processing, stores Postgres-native transactional data in Delta and Iceberg format from the point of write, removing the ETL pipelines that have connected operational and analytical systems for decades.

Reynold Xin, co-founder of Databricks, described a simpler data stack as “the holy grail for agents” in a briefing with VentureBeat, arguing that as users vibe code more applications, the agents reasoning analytically on top of those apps need the underlying infrastructure out of the way to move fast. 

“The agents really prefer a much simpler stack, because they can move way faster,” he said.

LTAP bets on storage-layer unification where HTAP tried engine convergence

Many vendors have tried various approaches over the decades to unify analytical and transactional data.

Back in 2014, analyst firm Gartner coined the term HTAP, short for Hybrid Transactional/Analytical Processing, to describe vendors attempting to unify the two types of databases. MemSQL (now known as SingleStore), SAP HANA and Oracle’s MySQL HeatWave are among the many HTAP offerings on the market.

LTAP is Databricks’ answer to HTAP, using the Lakebase architecture to unify data at the storage layer rather than the engine level. Lakebase is Databricks’ serverless cloud-based PostgreSQL database service that became generally available in February.

“HTAP to us is kind of more of a failure of the industry rather than a success,” Xin said. 

The LTAP approach goes to the storage layer instead of the query layer. Lakebase previously stored Postgres data in Postgres format on object storage, requiring conversion before the Lakehouse’s analytical engines could use it efficiently. With LTAP, transactional data lands directly in Delta or Iceberg format, sharing the same copy that analytical workloads read. Postgres remains the transactional engine. Spark and the Lakehouse remain the analytical engine.

“The whole point is, hey, you use the best tool for the job at the query engine level, we just make sure underlying storage is a single copy of the data,” Xin said.

The central engineering challenge is latency. Object storage carries response times in the seconds range, far too slow for OLTP workloads that require sub-millisecond performance. Lakebase handles this through a caching layer between Postgres compute instances and object storage. The key design decision is where the column conversion happens: idle CPU capacity in that caching layer performs the row-to-column conversion before data lands in object storage. 

“When you convert data from row to column, it compresses more than 10 times, typically, so now you substantially reduce the network cost of that basic caching layer between that caching layer and the object stores,” Xin said.
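
To see why column layout tends to compress well, consider a toy version of the row-to-column step. This is a minimal sketch and not Databricks' implementation: the table and its values are invented, and run-length encoding stands in for the richer encodings that columnar formats such as Parquet apply.

```python
from itertools import groupby


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists (keys taken from the first row)."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


# Hypothetical transactional rows, as a row-oriented engine would write them.
rows = [
    {"order_id": 1, "region": "east", "status": "paid"},
    {"order_id": 2, "region": "east", "status": "paid"},
    {"order_id": 3, "region": "east", "status": "paid"},
    {"order_id": 4, "region": "west", "status": "paid"},
]

columns = rows_to_columns(rows)
print(columns["region"])                     # ['east', 'east', 'east', 'west']
print(run_length_encode(columns["region"]))  # [['east', 3], ['west', 1]]
print(run_length_encode(columns["status"]))  # [['paid', 4]]
```

Once values of the same column sit next to each other, repeated values collapse into short runs. The actual ratio depends on the data, so the toy example should not be read as reproducing the 10x figure.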

Lakehouse//RT delivers millisecond query latency on live lakehouse data without a separate serving tier

Lakehouse//RT is Databricks’ answer to the dedicated real-time serving tier — the separate system enterprises have maintained alongside their lakehouses to handle low-latency queries, at the cost of data copies, split governance and pipeline complexity agents cannot work around. Key capabilities of Lakehouse//RT include:

Reyden compute engine: Built specifically for high-concurrency, low-latency serving, Reyden queries Delta and Iceberg tables directly without moving data out of the lakehouse.

Latency and throughput: Lakehouse//RT delivers sub-100ms latency at 12,000 queries per second, with response times as low as 10ms on smaller datasets and up to 16x better performance than existing dedicated serving stacks.

Governance and data access: Every query runs within Unity Catalog’s governance framework with no separate permissions layer, no data copies and no ingestion pipelines.
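
One way to read the latency and throughput figures above together is Little's Law, which says the average number of requests in a system equals the arrival rate multiplied by the average time each request spends there. The arithmetic below is a back-of-envelope sketch, not a Databricks figure, and it treats 100 ms as the average latency even though "sub-100ms" only bounds it from above.

```python
def in_flight_queries(queries_per_second: int, latency_ms: int) -> float:
    """Little's Law: average concurrency = throughput x average latency."""
    return queries_per_second * latency_ms / 1000


# 12,000 queries per second at an assumed 100 ms average latency.
print(in_flight_queries(12_000, 100))  # 1200.0
```

Under that assumption, the serving engine would be handling on the order of 1,200 queries at any given moment, which gives a rough sense of the concurrency the claim implies.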

Analysts see the agentic framing and open format approach as the real differentiators

The problem both products address is well-documented among enterprise data teams, but analysts draw a distinction between the pain point and the specific claim Databricks is making.

“Enterprises have had HTAP, streaming, cloud warehouses, and operational stores for years,” Stephanie Walter, Practice Leader for AI Stack at HyperFRAME Research, told VentureBeat. “What is different is the agentic AI framing.”

Walter noted that agents need live operational data, historical context, governance, retrieval, and write-back in the same workflow. 

“That is a strong architecture argument, but Lakebase still has to prove it can meet the latency, reliability, and operational maturity CIOs expect,” she said.

Mike Leone, analyst at Moor Insights and Strategy, said the path to genuine differentiation is more specific than the unification concept itself. He also noted that open analytics on a data lake is table stakes now, with many vendors providing some sort of service.

“The less common move is letting the transactional writes land in open formats too, so the operational database isn’t sitting in a proprietary box while only the analytics half is open,” Leone told VentureBeat.

He added that the open format approach, paired with Lakehouse//RT querying live data directly off the lake, is what gives the architecture a credible case for retiring a whole row of specialized systems.

The technical claim that will face the most scrutiny is also the most central one. “The piece I’d still want their engineers to walk through is how both engines truly share one copy without a quiet conversion step doing the syncing in the middle,” Leone said.

What this means for enterprises

For data engineers evaluating their stack for agentic workloads, the question is no longer which best-of-breed tool to run for each job — it’s whether running separate tools at all is still defensible.

Enterprises that built separate operational databases, real-time serving tiers and analytical lakehouses could previously treat the gaps between them as a maintenance burden. Agents surface those gaps as an operational risk: a system reasoning across governance boundaries will find the inconsistencies faster than any human team.

The market is moving away from specialized serving layers faster than most vendor roadmaps anticipated. According to VB Pulse Q1 2026, a three-wave longitudinal survey of organizations with 100 or more employees, intent to adopt hybrid retrieval tripled from 10.3% to 33.3% across the quarter while standalone vector database adoption declined across every tracked vendor. The same consolidation logic is now hitting the real-time serving tier.

The traditional approach — best-of-breed tools for each workload type, pipelines between them — was built for human-speed analytical consumption. Agent workloads don’t tolerate that architecture.

“The pain they’re pointing at, all the copying and syncing between operational and analytical systems, is real and expensive, and anyone running this at scale feels it,” Leone said.

Satya Nadella warns that AI could hollow out entire industries, echoing the damage done by globalization

Microsoft CEO Satya Nadella published a sweeping essay on Sunday laying out what he describes as the defining economic challenge of the AI era: the risk that a handful of frontier models will absorb the expertise of entire industries and commoditize it, leaving businesses stripped of their competitive moats.

“The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see,” Nadella wrote in the piece, titled “A frontier without an ecosystem is not stable,” which he posted on X. “If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries.”

The essay is unusually philosophical for a sitting CEO of a $3 trillion technology company. But it arrives at a moment when the theoretical risks Nadella describes are becoming tangible — and, critically, when Microsoft itself is grappling with the very dynamics he warns about.

Nadella introduces “token capital” as the new currency of enterprise AI strategy

At the center of Nadella’s essay sits a conceptual framework built on two pillars he calls “human capital” and “token capital.” Human capital, he writes, “comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people,” while token capital refers to “the firm’s AI capability it builds and owns.”

The two are not in tension, he insists. “Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable!” he writes. “I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles.”

This framing is a deliberate counterweight to the narrative that AI will simply replace human workers or, at the enterprise level, dissolve the intellectual property that differentiates one company from another. Nadella is arguing that the real danger is not AI’s capability but its tendency to centralize — and that the solution requires a fundamentally new architecture for how businesses interact with the technology.

He describes the real opportunity as “not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound.” The key test of a company’s sovereignty in this new era, he writes, is whether it can “switch out a ‘generalist’ model without losing the ‘company veteran’ expertise built into their learning system.”

This is the essay’s most actionable claim — and its most provocative. Nadella is telling enterprises they need to decouple their institutional intelligence from whatever frontier model they happen to be running, creating portable knowledge systems that survive vendor changes.

Why Nadella is comparing AI concentration to the outsourcing crisis that gutted industrial economies

Nadella draws a pointed historical parallel to make his warning concrete. “Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing,” he writes. “The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them.”

The globalization analogy is not accidental. It reframes the AI concentration debate from a narrow technology question into a political-economy argument — one that regulators, policymakers, and voters can grasp. By invoking the social costs of offshoring, Nadella is signaling that the stakes extend well beyond the enterprise technology stack. He is warning that if the AI industry fails to distribute value broadly, the political system will intervene to force the issue.

“In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country,” he writes. He grounds this in an older platform philosophy: “This is the ethos I’ve grown up with where platforms enable more value on top than is captured inside, and where every company can continuously innovate and build value of its own.” It is a direct echo of the Windows-era argument, updated for the age of inference — and it carries a similarly self-interested subtext, given that Microsoft’s cloud business sits squarely in that platform layer.

Microsoft’s own runaway AI costs reveal the gap between Nadella’s vision and operational reality

What makes Nadella’s essay so striking is its timing. He published it on a day when Reuters reported that Microsoft shareholders filed a proposed class-action lawsuit in Seattle federal court, accusing the company of inflating its stock price by failing to disclose slowing growth in its Azure cloud business and the need to spend billions of dollars on AI infrastructure. The suit names Nadella and Chief Financial Officer Amy Hood among the defendants.

As the Yahoo Finance report on the lawsuit noted, Microsoft allegedly “aggressively promoted its AI developments, specifically its ‘Copilot’ assistant and close financial alliance with ChatGPT creator OpenAI, to artificially boost investor optimism,” while understating infrastructure strain and capital risks. Microsoft also reported $37.5 billion of capital spending in its second quarter, up nearly 66% from a year earlier and above the $34.3 billion that analysts projected.

Microsoft’s internal cost pressures around AI have surfaced in other concrete ways this year. The company is canceling the majority of its internal Claude Code licenses in its Experiences and Devices division, effective June 30, 2026. Monthly usage rates reached 84 to 95% by April 2026, and per-engineer API costs ranged between $500 and $2,000 monthly, according to Windows Forum. The cancellation came after Microsoft exhausted portions of its annual AI budget due to token-based billing, as Fortune had reported in May.

The Claude Code episode illustrates, at the micro level, the exact dynamic Nadella describes at the macro level. When a company’s AI usage is metered by the token — the fundamental unit of compute that powers model inference — the more productive the tool becomes, the more expensive it gets. The term “token capital” in Nadella’s essay carries a double meaning: it refers both to a firm’s proprietary AI capability and, implicitly, to the actual tokens consumed in running it. Building a learning loop that compounds is aspirational. Paying the bills for that loop is operational reality.

Uber, Meta, and Amazon are all hitting the same AI spending wall — and it validates Nadella’s warning

Microsoft is not alone in this bind. Uber burned through its entire 2026 AI coding tools budget in just four months after incentivizing employees to adopt the technology through an internal leaderboard ranking teams by total AI tool usage. Uber has since instituted a monthly $1,500 cap per employee per agentic coding tool, according to TechCrunch. At Meta, an employee created a leaderboard called “Claudeonomics” to track which workers consumed the most AI tokens. Amazon, meanwhile, has pushed employees to “tokenmaxx” — use as many AI tokens as possible.

The emerging pattern is clear: enterprises adopted AI coding tools aggressively, saw genuine productivity gains, and then discovered that the consumption-based economics of frontier models created budget crises that traditional software licensing never would have. Bryan Catanzaro, vice president of applied deep learning at Nvidia, captured the tension bluntly in an interview with Axios: “For my team, the cost of compute is far beyond the costs of the employees,” he said.

These cost dynamics land differently in the context of Nadella’s essay. He prescribes a three-layer architecture — evaluation, reinforcement learning, and retrieval — designed to sit between a company’s workforce and whatever frontier model it subscribes to. Companies, he argues, need to build “private evals” that “capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!),” alongside “private reinforcement learning environments” that “let models grow stronger on real traces from inside the organization” and a knowledge base that “makes institutional memory queryable and use of tokens more efficient.” He calls the resulting system “a hill climbing machine” that, “unlike most assets, it compounds.”
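
The essay does not specify an implementation, but the "private evals" idea can be sketched as an evaluation set that lives outside any one model, so that candidate models can be swapped and compared on the same business-specific cases. Everything in the example is hypothetical: the two stand-in "models" are plain functions, and exact-match scoring is a simplifying assumption.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # any function from prompt to answer
EvalCase = Tuple[str, str]     # (prompt, expected answer)


def score(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of cases where the model's answer matches the expected one exactly."""
    if not cases:
        return 0.0
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def pick_best(models: Dict[str, Model], cases: List[EvalCase]) -> str:
    """Return the name of the highest-scoring model on the private eval set."""
    return max(models, key=lambda name: score(models[name], cases))


# Stand-in models so the example runs without calling any external API.
def candidate_a(prompt: str) -> str:
    return "30 days"


def candidate_b(prompt: str) -> str:
    answers = {"refund window?": "45 days", "escalation owner?": "support-lead"}
    return answers.get(prompt, "unknown")


# Hypothetical company-specific cases that no public benchmark would contain.
cases = [("refund window?", "45 days"), ("escalation owner?", "support-lead")]
models = {"candidate_a": candidate_a, "candidate_b": candidate_b}

print(score(candidate_a, cases))  # 0.0
print(pick_best(models, cases))   # candidate_b
```

Because the cases are owned by the company rather than embedded in a model, replacing one model with another changes only the `models` dictionary, which is the portability property the essay describes.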

Other Big Tech CEOs are echoing Nadella’s fears about AI models devouring enterprise knowledge

Nadella’s concerns do not exist in isolation. Other technology leaders have been raising similar warnings throughout 2026, though none have offered as prescriptive a response.

Snowflake CEO Sridhar Ramaswamy warned in a February podcast that the biggest software companies risk being reduced to mere data sources. “The big model makers want to create a world in which all of the data for all of the enterprises is easily available to them,” Ramaswamy said, describing everything else as “a dumb data pipe that feeds into that big brain.” He added that Snowflake needs to operate with a “fear” that enterprises would abandon software-specific AI agents in favor of all-inclusive agents that hoover up data from everywhere.

Box CEO Aaron Levie struck a similar note in a January LinkedIn post. AI models can now perform high-level knowledge work across nearly every profession, from law to strategy to scientific research, he argued. “The question that we will have to wrestle with is, in a world where everyone has access to the same expert intelligence, how does a company differentiate?” he wrote.

The combined effect of these statements is a shared diagnosis from three very different corners of the enterprise technology market: the current trajectory of AI development threatens to collapse competitive differentiation across entire industries. Nadella’s essay stands apart from the others because it moves beyond diagnosis and proposes a specific architectural remedy. But the prescription is impossible to separate from the prescriber’s interests.

Microsoft sits in precisely the platform layer that Nadella’s framework would make indispensable — the company builds its own frontier models, operates the cloud infrastructure those models run on, and maintains deep partnerships with the leading independent AI labs. A world in which every enterprise builds a proprietary learning loop on top of commodity foundation models is, conveniently, a world in which Microsoft sells the picks and shovels to all of them.

Nadella’s Scout controversy and shareholder lawsuit reveal the tension inside Microsoft’s own AI strategy

The essay also arrives just ten days after Nadella publicly rebuked one of his own executives for outlining a plan to “make people addicted” to a new AI tool called Scout. Microsoft corporate vice president Omar Shahine had written an internal memo describing a three-phase plan to transform Scout “from addictive app to agentic platform,” with the first phase focused on features that “make people depend on it daily.” Nadella responded on an internal message board: “This is absolutely a non-goal! If anything we are doing the exact opposite. We want to make sure AI empowers and adds real value to human endeavor and broad economic growth!”

The Scout incident and Sunday’s essay together suggest Nadella is actively constructing a public philosophy of AI that emphasizes broad value creation over extractive engagement — whether or not every corner of Microsoft has internalized that message. One anonymous Microsoft employee told 404 Media, as the Post reported, that the leaked Scout document was “very troubling,” adding: “It feels like one of those ‘saying the quiet part out loud’ moments.”

For technical decision-makers evaluating Nadella’s essay, the practical implications are significant. He is arguing that choosing an AI model matters less than building the learning infrastructure around it. He is arguing that the ability to swap models without losing institutional intelligence is the critical test of AI sovereignty. And he is warning that companies that fail to build these systems will find their expertise absorbed and commoditized by the models themselves. “You can offload a task, or even a job, but you can never offload your learning,” Nadella writes. “The future of the firm is the ability to compound that learning across people and AI.”

The question Nadella’s essay cannot answer is whether Microsoft will practice what its CEO preaches

Whether Nadella’s vision materializes depends on a question his essay carefully sidesteps: whether the platform providers who build and host the frontier ecosystem will resist the temptation to capture the value flowing through it. Nadella insists that “platforms enable more value on top than is captured inside.” But Microsoft’s own trajectory this year — the ballooning capital expenditures, the Claude Code budget crisis, the shareholder lawsuit alleging concealed costs, the internal memo about making users addicted — suggests the economics of restraint are harder than the philosophy of restraint.

Nadella ends his essay with the claim that broad value distribution “is the stable equilibrium we should build together.” He may be right. Ecosystems have historically outperformed walled gardens over long time horizons. But stable equilibria require every major player to forgo short-term extraction in favor of long-term compounding — and right now, the AI industry is burning through budgets in four months and spending 66% more on infrastructure than analysts expected. The CEO of the world’s most valuable technology company has written an eloquent argument for why the AI economy needs to work differently. The open question is whether his own company’s balance sheet will let him prove it.

PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x

Most enterprise RAG pipelines start the same way: a text parser converts web pages and documents into plain text so they can be chunked and indexed for retrieval. That conversion step destroys retrieval signals — and according to new research, it’s responsible for the majority of wrong answers.

A research team from UC Berkeley, Princeton University, EPFL and Databricks published a paper this week introducing PixelRAG, a system that skips that conversion entirely. Instead of parsing pages into text, PixelRAG renders them as screenshots, indexes those images and feeds retrieved tiles directly to a vision-language model reader. Tested across 30 million screenshot tiles covering all of Wikipedia, it outperforms text-based RAG across six benchmarks, improving accuracy by up to 18.1% over text-based baselines.

Parsers are the wrong place to look for fixes, according to the research team.

“Improving parsers is an endless process because every website requires special handling,” Yichuan Wang, lead author and UC Berkeley doctoral student, told VentureBeat. “Our goal was to explore whether recent advances in VLMs make it possible to bypass that entire problem and build a retrieval system that works across websites without site-specific engineering.”

HTML parsers destroy the retrieval signals that enterprise RAG depends on

The goal of the researchers was to develop a clean end-to-end architecture.

“Modern web RAG pipelines often involve rendering, parsing, cleaning, chunking, and many other handcrafted stages,” Wang said. “Every stage introduces potential cascade errors and abstractions that move us further away from the original webpage. We were interested in whether we could eliminate most of that complexity and operate directly on the rendered page.”

Wang also noted that parsing inevitably loses information. Images, visual hierarchy, typography, emphasis (e.g., bold text), tables, and layout are either discarded or converted into imperfect textual approximations. 

“No matter how good a parser becomes, some information is fundamentally lost during the conversion,” he said.

The research identifies three ways text-based RAG loses the answer before it reaches the reader. All three were measured on SimpleQA, a standard benchmark of 1,000 factual Wikipedia questions:

  • Parser loss (36.6% of failures). HTML-to-text conversion destroys structured content so completely that no text chunk in the corpus contains the answer.

  • Rank loss (55.2% of failures). The answer exists in the corpus but gets outranked by keyword-dense infoboxes that land at rank 1 for 75.9% of queries, pushing answer-bearing paragraphs to rank 20 or lower.

  • Reader loss (8.2% of failures). The correct content reaches the reader but flattened structure causes misattribution.

How PixelRAG works 

Unlike a standard LLM that reads only text, a vision-language model takes images as input alongside text, meaning it can read a rendered web page the way a human does, with layout and structure intact. “For many structured information extraction tasks, we believe modern VLMs have an inherent advantage because they can reason jointly over both content and layout rather than relying on a flattened text representation,” Wang said.

PixelRAG is built around that principle, replacing the text parsing pipeline with a four-stage system that operates entirely on rendered screenshots.

  • Rendering. Pages are rendered using Playwright, a browser automation library, at a fixed 875-pixel viewport and sliced into 1024-pixel-tall tiles. Wikipedia’s 7 million articles produce roughly 30 million tiles. Assets are cached locally and rendered entirely offline.

  • Indexing. Each tile is encoded as a single 2048-dimensional vector using Qwen3-VL-Embedding-2B and stored in a FAISS approximate nearest-neighbor index. The full index runs to approximately 120 GB in fp16 and supports incremental updates without full re-indexing.

  • Training. The retrieval model is fine-tuned on synthetic contrastive data generated from the datastore, using dynamic hard-negative mining to filter false negatives. LoRA, a lightweight fine-tuning method that updates a small fraction of model weights, is applied to both the language model backbone and the visual encoder. Training on approximately 40,000 pairs completes in under three hours on a single H100.

  • Storage. Raw screenshot tiles for Wikipedia require 5.6 TB, but a render-on-demand approach eliminates persistent storage: embed all tiles, delete the screenshots and re-render pages on demand at query time. The vector index requires approximately 120 GB.
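
The index size quoted in the list above follows directly from the tile count, vector width and precision. The arithmetic below is a rough check that counts raw vectors only and ignores any overhead the approximate nearest-neighbor structure adds, which is an assumption on our part.

```python
TILES = 30_000_000      # screenshot tiles, per the article
DIMENSIONS = 2048       # values per tile embedding
BYTES_PER_VALUE = 2     # fp16 stores each value in 2 bytes

index_bytes = TILES * DIMENSIONS * BYTES_PER_VALUE
print(index_bytes / 1e9)  # 122.88 (decimal GB)

RAW_SCREENSHOT_BYTES = 5.6e12  # 5.6 TB of raw tiles, per the article
print(round(RAW_SCREENSHOT_BYTES / TILES / 1e3))  # 187 (average KB per tile)
```

Each tile's embedding occupies 4 KB against an average of roughly 187 KB for the screenshot itself, which is why discarding the images and re-rendering on demand shrinks the footprint so sharply.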

Six benchmarks, 10x agent token savings and one unsolved problem

Researchers tested PixelRAG across six benchmarks spanning factual Wikipedia QA, table-based queries, multimodal QA and live news retrieval. They said it outperformed text-based RAG on all six, including on tasks where questions are answerable from text alone. On SimpleQA it reached 78.8% accuracy versus 71.6% for the strongest text parser, a 7.2-point lead. On structured table queries it scored 48.8% versus 42.5%, a 6.3-point lead that is larger in relative terms. Teams need Qwen3-VL-4B class models or above to see the benefit. Smaller models trail text retrieval by more than 12.5 percentage points.

The agent cost advantage is the strongest near-term case for PixelRAG. In benchmark testing, an AI agent using PixelRAG as its search backend ran on 3.6 million prompt tokens versus 37.5 million for text retrieval, at 2 to 4 times lower cost than alternatives including Google, while achieving higher accuracy. Image compression can cut that token budget by a further third.

Visual chunking is the main unsolved problem. Text-based RAG systems have spent years refining how to split documents into meaningful retrieval units based on topic, section or semantic content. PixelRAG currently has no equivalent: it slices pages by fixed pixel height, so a table or paragraph can be split across two tiles with no awareness of content boundaries.
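To make the boundary problem concrete, here is a minimal sketch that flags page elements cut by fixed-height slicing. The element names and pixel positions are invented for illustration, and the helper is hypothetical rather than part of PixelRAG.

```python
TILE_HEIGHT_PX = 1024  # tile height reported in the article


def split_elements(elements, tile_height_px=TILE_HEIGHT_PX):
    """Return names of elements whose vertical extent crosses a tile boundary.

    `elements` is a list of (name, top_px, bottom_px) tuples; bottom is exclusive.
    """
    split = []
    for name, top, bottom in elements:
        first_tile = top // tile_height_px
        last_tile = (bottom - 1) // tile_height_px
        if last_tile > first_tile:
            split.append(name)
    return split


# Hypothetical layout of one rendered page.
page = [
    ("intro paragraph", 0, 300),
    ("infobox table", 900, 1400),
    ("history section", 1400, 2048),
]
print(split_elements(page))
```

In this made-up layout only the table straddles the 1024-pixel boundary, so half of it lands in each of two tiles and neither tile's embedding sees the whole table.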

“The text retrieval community has spent years studying chunking strategies, while visual retrieval has received much less attention,” Wang said. “We think this is an important area for future research.”

What this means for enterprises

The retrieval quality problem PixelRAG addresses reflects a broader market shift already underway. VB Pulse Q1 2026 data from qualified enterprise respondents found intent to adopt hybrid retrieval tripling from 10.3% in January to 33.3% in March, the fastest-growing strategic position in the dataset. PixelRAG’s own authors point to hybrid deployment as the most practical near-term path — layering visual retrieval on top of existing text systems rather than replacing them.

For teams already running RAG pipelines, the path to those savings is more straightforward than a ground-up rebuild.

“A practical path is to use PixelRAG as an enhancement layer alongside existing text retrieval systems,” Wang said. “Hybrid retrieval that combines both text and visual search is straightforward and is likely how many production deployments would evolve.”
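The article does not say how text and visual results would be merged in a hybrid deployment. One common option is reciprocal rank fusion, sketched below as an assumption rather than as the authors' method. The document IDs are made up, and k=60 is the conventional default for this technique.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID for a deterministic order on ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a text retriever and a visual retriever.
text_hits = ["doc_a", "doc_b", "doc_c"]
visual_hits = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([text_hits, visual_hits]))
```

Because fusion uses only ranks, it needs no calibration between text similarity scores and image similarity scores, which is one reason it suits layering a new retriever on an existing one.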

Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens accumulate from retrieved documents, reasoning traces and conversation history, and the more memory and compute that growing context demands. Most existing solutions either degrade model accuracy, require the full context to load before compression begins, or produce memory savings that don’t translate into real speedups in standard serving infrastructure.

A research team from NYU, Columbia, Princeton, the University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a novel fix. The researchers introduce the concept of Latent Context Language Models, or LCLMs, a family of encoder-decoder compression models that compress input context before it reaches the decoder. The models are open-sourced on HuggingFace.

Unlike KV cache compression methods — the dominant approach in the field, which still materialize the full KV cache before evicting entries — LCLMs compress the input token sequence before decoder prefill, so higher compression ratios directly reduce decoder-side compute and memory. The paper reports LCLMs at 16x compression produced output 8.8 times faster than KV cache baselines on the RULER long-context benchmark.

“These ballooning contexts take up memory and compute, and they are becoming a computational bottleneck for LLMs,” Micah Goldblum, co-lead advisor on the project and a researcher at Columbia University, told VentureBeat. “Our goal was to train language models end-to-end that can handle very long contexts efficiently and accurately. If you can make such a language model, everything becomes cheaper and faster.”

What LCLMs can do

LCLMs let models process much longer contexts than would otherwise be practical, at a fraction of the memory and compute cost, without the accuracy degradation that makes most compression methods a poor tradeoff in production.

At 4x compression, the paper reports accuracy of 91.76% on the RULER benchmark, compared to 94.41% with no compression at all. That is less than a 3-point drop for cutting context to a quarter of its original size. At 16x compression, where 93.75% of input tokens are removed, accuracy fell to 75.06%. Every KV cache method tested at the same compression ratio scored lower.
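The relationship between compression ratio and sequence-length reduction is simple arithmetic: a ratio of r leaves 1/r of the positions. The short sketch below reproduces the figures quoted above from the reported RULER accuracies; the function name is only illustrative.

```python
def fraction_removed(ratio):
    """Share of input positions eliminated at a given compression ratio."""
    return 1.0 - 1.0 / ratio


# Compression ratio -> RULER accuracy (%), as reported in the article.
RULER_ACCURACY = {1: 94.41, 4: 91.76, 16: 75.06}

for ratio, acc in RULER_ACCURACY.items():
    drop = RULER_ACCURACY[1] - acc
    print(
        f"{ratio:>2}x: {fraction_removed(ratio):.2%} removed, "
        f"{acc:.2f}% accuracy, {drop:.2f} points below uncompressed"
    )
```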

The gains hold on shorter inputs too. On GSM8K math word problems, where the full prompt is compressed rather than just retrieved documents, LCLMs outscored every other method tested regardless of compression ratio.

How it was built

The architecture pairs a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes those in place of the original tokens. Training ran across more than 350 billion tokens.
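The sequence-length bookkeeping can be illustrated with a toy example. The sketch below is not the paper's encoder, which is a trained 0.6B model. It uses plain mean pooling as a stand-in and simplifies by mapping each run of 16 token vectors to a single latent, purely to show how the decoder ends up with a shorter input.

```python
def compress_blocks(embeddings, ratio):
    """Average each run of `ratio` vectors into one vector.

    Mean pooling is a stand-in only; a learned encoder would replace it.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    latents = []
    for start in range(0, len(embeddings), ratio):
        block = embeddings[start:start + ratio]
        dim = len(block[0])
        latents.append(
            [sum(vec[i] for vec in block) / len(block) for i in range(dim)]
        )
    return latents


# 64 toy token embeddings of dimension 2 (values are arbitrary).
tokens = [[float(t), float(t) * 2.0] for t in range(64)]
latents = compress_blocks(tokens, ratio=16)
print(len(tokens), "->", len(latents))
```

The decoder would then attend over 4 positions instead of 64, which is where the prefill compute and memory savings come from.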

The training recipe mixes three data types:

  • Continual pre-training data with compressed and uncompressed spans interleaved throughout

  • Supervised fine-tuning data covering reasoning and long-context tasks

  • An auxiliary reconstruction task that pushes the encoder to retain fine-grained detail

The combination addresses a tradeoff that limited earlier compression work, where preserving reconstruction accuracy came at the cost of general task performance.

An architecture search identified the optimal configuration. The paper found that scaling the decoder matters more than scaling the encoder.

Where it fits in an agentic stack

An LCLM is not an abstract research concept. It is designed to work with an existing stack. “You can simply swap out LCLMs for any existing LLM,” Goldblum said. “Whenever you retrieve data such as documents and want to dump it into your model’s context, simply run those documents through the LCLM’s compressor first.”

He noted that the paper demonstrates how to build agents that selectively decompress useful text.

“Think about this like a human skimming content before zooming in on relevant details,” Goldblum said.

Goldblum also cautioned that teams integrating the approach into existing agentic pipelines will need to tune their RAG systems accordingly.

“We also haven’t worked on online compression of reasoning traces,” he said. “The naive approach of just occasionally compressing the trace while generating it might work, but that remains to be determined.”

What this means for enterprises

Context windows are growing faster than inference infrastructure can keep up, and enterprises are already spending to fix it. VB Pulse Q1 2026 survey data from 100-plus employee organizations shows hybrid retrieval adoption intent tripling from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority by March, reaching 28.9% of qualified respondents.

Three things stand out for teams evaluating production fit:

  1. Inference cost scales with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports LCLMs at 16x compression remain within memory bounds at that context length.

  2. RAG pipeline integration requires tuning. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale.

  3. Reasoning trace compression is unsolved. For agents running long reasoning chains, context growth from the trace is a separate problem from document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression might work but has not been tested.
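A back-of-the-envelope estimate shows why context length drives memory. The sketch below uses the standard KV cache size formula with a hypothetical decoder shape (36 layers, 8 key-value heads, head dimension 128, fp16) chosen for illustration only; these numbers are not taken from the paper. The 141 GB figure is the H200's published memory capacity.

```python
def kv_cache_bytes(n_positions, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for every position in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_positions


# Hypothetical decoder shape, for illustration only (not from the paper).
CONFIG = dict(n_layers=36, n_kv_heads=8, head_dim=128)
H200_MEMORY_BYTES = 141e9  # H200 ships with 141 GB of HBM3e

context_tokens = 1_000_000
full = kv_cache_bytes(context_tokens, **CONFIG)
compressed = kv_cache_bytes(context_tokens // 16, **CONFIG)

print(f"uncompressed: {full / 1e9:.1f} GB")
print(f"16x compressed: {compressed / 1e9:.1f} GB")
print("uncompressed cache fits on one H200:", full < H200_MEMORY_BYTES)
```

Under these assumed dimensions the uncompressed cache alone exceeds the GPU's memory before weights and activations are counted, while the 16x-compressed sequence needs a small fraction of it. Real numbers depend on the actual model configuration and serving stack.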

The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM.

“The biggest things our architectures do is give your model access to much larger contexts, but they also unlock multiscale approaches where your model can skim vast amounts of text or code super fast and then only zooms in and fully reads a small portion of the most useful text,” Goldblum said.

Enterprise AI agents keep creating data silos. Microsoft’s Build answer is Microsoft IQ and Rayfin.

Every new AI agent your team deploys starts from scratch: no memory of how the business works, where data lives, or what rules apply. And as agentic coding tools spin up applications faster than anyone can govern them, each one risks becoming another silo outside your data layer entirely. Microsoft is addressing both problems directly at Build 2026.

According to VentureBeat’s VB Pulse Q1 2026 RAG Infrastructure Market Tracker, hybrid retrieval intent among 100-plus employee organizations tripled from 10.3% in January to 33.3% in March, a signal that enterprises have moved past expanding RAG coverage and are now focused on the architecture underneath it. Shared business context is the part retrieval does not solve.

On the context side, Microsoft is expanding Fabric IQ, its existing business data context layer, into a broader unified system called Microsoft IQ, adding three additional context sources covering how the organization works, what it knows and real-time global signals from the web, so any agent can tap all four as a single foundation. On the application side, Rayfin, a new open-source SDK and CLI, deploys agent-built applications directly to Fabric as a governed production backend, routing application data into the same platform rather than spinning up new silos.

Amir Netz, CTO of Microsoft Fabric, reached for a film analogy to explain where the data platform fits. The green screen of cascading code in “The Matrix” wasn’t atmosphere; it was the layer that built the world Agent Smith operated in.

“Our job in the world of data is creating reality for agents based on data,” Netz told VentureBeat.

Microsoft IQ unifies four context sources into a single agent foundation

Microsoft IQ brings together four context sources that until now existed separately, designed so a developer can connect a new agent to all four in a single integration step.

Work IQ. Captures how the organization operates day to day, drawing on email, documents, meetings and schedules to give agents an understanding of people, teams and workflows.

Foundry IQ. Manages institutional knowledge, curating and indexing knowledge bases so agents understand what it means to work within the organization, what rules apply and what procedures to follow.

Fabric IQ. Models the live operational state of the business through data, defining entities, relationships and business rules grounded in real-time signals from Fabric Real-Time Intelligence. Ontologies, the layer that captures that operational context, are expected to reach GA in the coming months.

Web IQ. Adds real-time global context from the web, giving agents a current picture of the world outside the organization alongside its internal data.

“The agents are going to become highly informed virtual employees,” Netz said. “That’s where the world is heading.”

Rayfin routes agent-built applications into the same data foundation

Building shared context solves one half of the problem. The other is what happens when agents start generating applications. Every new app needs a backend, and without a governed deployment path each one creates a new data silo outside the context layer entirely.

Rayfin provides an enterprise-grade backend and deploys agent-built applications directly to Fabric, so application data lands in Microsoft OneLake by default and feeds back into the Microsoft IQ context layer rather than accumulating outside it.

Microsoft positions Rayfin against Supabase and Neon, the Postgres-compatible backends that agentic coding tools default to. The differentiator is governance: Rayfin routes the entire application fleet through Fabric’s unified data and compliance layer rather than creating isolated silos.

Netz described the relationship as bidirectional. The agent building a Rayfin application draws from the organization’s ontology. The data that application generates then enriches that ontology for the next agent.

Every major data platform is chasing the same answer, but execution is unproven

Microsoft is not the only platform building a shared context layer for agents. Snowflake announced its own semantic context capabilities this week. Pinecone has its Nexus platform, which expands the vector database into a knowledge engine, and Redis has developed its Iris context and memory platform.

Microsoft’s approach further reinforces the trend that RAG and model availability aren’t the issue anymore.

“Fabric IQ and Rayfin are important because the enterprise AI challenge is no longer just about the model availability,” Robert Kramer, managing partner at KramerERP, told VentureBeat. “The real question is whether Microsoft simplifies execution and strengthens trust or adds another layer to an already complex environment.”