From NVIDIA to Porsche to Snowden: The Race to Fuel the Future With Better Access to the Past

In many fields, big improvements are coming from a new strategy: closing the gap between what systems already know and what information they can use to make decisions. Using the past to fuel acceleration into the future.


I’m going to start blogging again, because I have things to say and I want to connect with more people, including through my two pieces of software (links above). Communication in public creates addressable surface area and opportunities. Cheers, Marshall


Remember the stories about stock trading companies racing to build offices close to regional Internet connection points so their real-time trading could beat competitors by seconds?

That’s what comes to my mind when I see some new developments where organizations are working on closing the gap between what they already know and the act of making decisions.

New developments from late last week and over the weekend, in AI and electric vehicles among other fields, point to a key architectural insight: the power of investing in informational transparency. A network that can perceive, recall, and comprehend its own state – and transmit that perception to where decisions are made – can outperform a network with superior raw capacity.

This is something I’ve long fantasized about: drawing from the past to help build the future. A bunch of new examples of this came out just this weekend and late last week!

Five fresh examples

[This post was researched and written using my network intelligence system Hawkeye and analytical system WUWT.]

NVIDIA’s 4x speed-up

For example, NVIDIA’s new Dynamo framework achieved a 4x (!) reduction in time-to-first-token by bringing context to AI inference requests: specifically, exposing session-level context that inference servers previously couldn’t see. NVIDIA cites Stripe, Ramp, and Spotify as example customers whose AI agents it’s serving better as a result.

Here’s a “three-hop simplified explanation” of what NVIDIA is doing:

Hop 1

Airport Boarding Priority Lanes

Airlines sort passengers into boarding groups based on status, seat class, and needs — first-class passengers board before economy, and passengers needing extra time get flagged early. The gate agent uses this structured information to sequence and allocate resources efficiently, rather than processing everyone identically in arrival order.

Hop 2

HTTP Headers Signal Server Behavior

When your browser requests a webpage, it attaches metadata headers — like cache preferences, accepted content types, and authorization tokens — that instruct servers how to handle the request before any real processing begins. Servers and intermediary caches read these structured signals to decide whether to serve a cached response, route to a specialist service, or prioritize the request, without the client needing to renegotiate from scratch each time.
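To make the header idea concrete, here’s a minimal sketch of how a server or cache might branch on request metadata alone, before touching the body. The decision strings and thresholds are invented for illustration; real caches follow the HTTP caching rules (RFC 9111) in far more detail.

```python
# Sketch: a server or intermediary cache reads structured header
# signals to decide how to handle a request -- hypothetical logic,
# not any real cache's implementation.

def handle(headers: dict) -> str:
    """Decide how to treat a request from its metadata alone."""
    if headers.get("Cache-Control") == "no-cache":
        return "revalidate with origin server"
    if "Authorization" in headers:
        return "bypass shared cache (private request)"
    if headers.get("Accept") == "application/json":
        return "serve cached JSON representation"
    return "serve cached default representation"

print(handle({"Accept": "application/json"}))
# → serve cached JSON representation
```

The point is that the routing decision happens before any expensive processing – exactly the pattern the inference-serving story below borrows.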

Hop 3

Agents Hint Inference Infrastructure Directly

Agent harnesses — the orchestration wrappers that manage AI task execution — can now attach structured metadata to outgoing requests, telling the inference router things like ‘this session is mid-reasoning chain, don’t reset context’ or ‘this request is blocking ten downstream tasks, prioritize it.’ The cache manager also reads these hints to decide whether to reuse a prior computation or bypass the cache entirely, allowing the infrastructure to optimize routing and resource allocation based on real workload semantics rather than treating every request as identical.
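The hint pattern described above can be sketched in a few lines. This is a toy model, not NVIDIA Dynamo’s actual API – the field names (`session_id`, `reset_context`, `blocking_downstream`) and the router logic are invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of harness-supplied hints steering an
# inference router. Field names are illustrative, not Dynamo's API.

@dataclass
class InferenceRequest:
    prompt: str
    hints: dict = field(default_factory=dict)

def route(req: InferenceRequest, kv_cache_locations: dict) -> str:
    """A toy router that reads structured hints before scheduling."""
    session = req.hints.get("session_id")
    if session in kv_cache_locations and not req.hints.get("reset_context"):
        # Session affinity: reuse the worker already holding this context.
        return f"route to worker holding KV cache for {session}"
    if req.hints.get("blocking_downstream", 0) > 0:
        # This request gates other tasks, so jump the queue.
        return "schedule at high priority on any free worker"
    return "schedule normally"

locations = {"sess-42": "worker-3"}
req = InferenceRequest("next step", hints={"session_id": "sess-42"})
print(route(req, locations))
# → route to worker holding KV cache for sess-42
```

The design choice worth noticing: the harness’s knowledge (session state, downstream dependencies) travels with the request, so the infrastructure never has to guess.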


Pre-fill as a service

Separately but similarly, Moonshot AI (Kimi) and Tsinghua University researchers have proposed a system they call “pre-fill as a service” – cross-data-center key-value (KV) caches that use new scheduling strategies and model-level architectural shifts to deliver context over commodity Ethernet for LLMs to use in inference.

Here’s a “three-hop simplified explanation” of this proposal:

Hop 1

Compression Makes Streaming Possible

When you stream a video, the raw uncompressed footage would require hundreds of Mbps, but smart compression algorithms identify redundant visual information and strip it away, reducing the stream to a manageable 5-10 Mbps without losing perceptible quality. The key insight is that not every frame needs every pixel recalculated — patterns and context carry forward, so you only transmit what changed.

Hop 2

AI Memory Has a Bandwidth Problem

Large language models maintain a ‘KV Cache’ — a running memory of every previous word’s computed context that must be transferred continuously during generation, similar to how a video stream must carry all active scene data. Full ‘attention’ mechanisms require every token to reference every other token, creating memory transfer rates so large (tens of gigabits per second) that they can only practically live within a single high-speed server cluster, not across distant locations connected by ordinary network cables.

Hop 3

Hybrid Attention Enables Distributed AI

Hybrid attention models solve this bottleneck by mixing expensive full-attention layers (which capture rich long-range relationships) with linear-complexity layers (which summarize context efficiently, like compression codecs), shrinking the KV Cache bandwidth requirement from ~60 Gbps down to ~3 Gbps. Standard commodity Ethernet infrastructure comfortably handles 3 Gbps, meaning AI inference can now be split across geographically separate datacenters — unlocking distributed, cost-effective deployment that was physically impossible when every layer demanded full uncompressed memory transfer.


Digital twins in data center governance

Similarly, data center infrastructure management software firm Sunbird says its second-generation software offers structural solutions to key problems cited in research from the Uptime Institute: data degradation and governance issues that make it hard for data centers to use what they know from the past to manage present and future operations. Sunbird uses automated governance to block any operational change that hasn’t first been recorded to a Digital Twin. In other words, you can be confident that your data about past changes is accurate, because changes weren’t allowed unless they were recorded.
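The record-before-change guarantee is a simple invariant, sketched below. Class and field names are hypothetical; Sunbird’s actual product is of course far richer than this toy.

```python
# Minimal sketch of "no change unless recorded": an operational
# change is rejected unless it was first written to the twin's
# change log. Names are hypothetical, for illustration only.

class DigitalTwin:
    def __init__(self):
        self.change_log = []   # the auditable record of intent
        self.state = {}        # the twin's model of the facility

    def record(self, change: dict) -> None:
        self.change_log.append(change)

    def apply(self, change: dict) -> bool:
        if change not in self.change_log:
            return False       # blocked: never recorded, never applied
        self.state[change["asset"]] = change["new_value"]
        return True

twin = DigitalTwin()
change = {"asset": "rack-12-psu", "new_value": "replaced"}
print(twin.apply(change))      # → False (blocked: not yet recorded)
twin.record(change)
print(twin.apply(change))      # → True  (recorded first, so allowed)
```

Because nothing reaches `state` without passing through `change_log`, the historical record can’t silently drift from reality – which is the whole point.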

Here’s a three-hop explanation of the problem Sunbird is working to solve:

Hop 1

A Chain’s Weakest Link

A chain can only bear as much weight as its weakest link—adding more strong links doesn’t compensate for one bad one. In any system, the component with the lowest capacity or quality determines the ceiling for the entire system’s performance, regardless of how robust everything else is.

Hop 2

Digital Twins Need Accurate Inputs

A digital twin is a live virtual replica of a physical system—like a data center—that simulates real-world behavior to support decisions. Its usefulness depends entirely on the accuracy of the sensor readings, asset records, and operational data feeding it; a twin built on incomplete or incorrect data produces unreliable simulations, no matter how much raw data volume it ingests.

Hop 3

DCIM’s Binding Constraint: Data Quality

John O’Brien of Uptime Institute argues that Data Center Infrastructure Management (DCIM) platforms functioning as digital twins hit their ceiling not because they lack enough data, but because the data they have is inaccurate, inconsistent, or incomplete. Following the weakest-link logic from Hop 1 and the accuracy-dependency from Hop 2, O’Brien identifies data quality—not data volume—as the true binding constraint that limits how effectively DCIM can model, predict, and optimize a physical data center.


Cars in the Know

Likewise in EVs, Swedish state-owned power company Vattenfall’s new collaboration with Volkswagen includes 200 bidirectional chargers and aims to turn parked vehicles into income-earning, grid-aware agents by giving them price signals and consumption data from the network that they weren’t incorporating before.
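The vehicle-to-grid idea reduces to a small decision rule once the price signal reaches the car. The thresholds below are invented for illustration and have nothing to do with the actual Vattenfall/VW pilot.

```python
# Toy sketch of vehicle-to-grid decisions: a parked EV charges,
# discharges, or idles based on a grid price signal it now receives.
# All thresholds are hypothetical.

def decide(price_eur_per_kwh: float, soc: float) -> str:
    """soc = battery state of charge, 0.0 to 1.0."""
    if price_eur_per_kwh < 0.10 and soc < 0.9:
        return "charge"      # cheap power: fill the battery
    if price_eur_per_kwh > 0.30 and soc > 0.4:
        return "discharge"   # expensive power: sell back to the grid
    return "idle"

print(decide(0.05, 0.5))     # → charge
print(decide(0.35, 0.8))     # → discharge
```

Without the price signal, the car can only ever choose “charge”; with it, the same battery becomes a revenue-earning grid asset.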

Similar principles can be seen in this weekend’s write-up of Formula E auto racing, where Porsche is leading thanks to software that uses data to predict energy needs across the remaining laps, and competitor behavior, in order to manage its energy storage and use!
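At its simplest, that kind of race energy management is a budgeting problem: spread what’s left in the battery over what’s left of the race. The sketch below is my own illustration with invented numbers, not Porsche’s actual strategy software.

```python
# Illustrative sketch of race energy budgeting: divide remaining
# usable battery energy across remaining laps, keeping a reserve.
# All numbers are invented for illustration.

def energy_budget_per_lap(remaining_kwh: float, laps_left: int,
                          reserve_kwh: float = 1.0) -> float:
    usable = max(remaining_kwh - reserve_kwh, 0.0)
    return usable / laps_left

budget = energy_budget_per_lap(remaining_kwh=21.0, laps_left=10)
print(f"{budget:.1f} kWh per lap")   # → 2.0 kWh per lap
```

The real system adds prediction of competitor behavior on top of this budget, but the core is the same: knowledge of the system’s own state, delivered to the decision point in time to matter.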

The idea: Close the gap between what the system knows internally and what it can surface to its own decision-making apparatus.

These are a lot of engineering stories, but any network can benefit from these principles. Here’s a strategic framework for thinking about it.

A Framework for Thinking About This

Transformation leader Dave Snowden offers a taxonomy of organizational knowledge states around this in a post this weekend – systems have relationships with their knowledge that are either intact (knowledge flows to where it is needed), entrained (accessible but requiring active effort to surface), or severed (the mechanism for moving knowledge has been destroyed). There’s no such thing as a greenfield opportunity for change, he says; the past is always present – it’s just a question of how you relate to it.

He shares a beautiful story, offered by a member of his own network, about a Māori decision-making ritual in which elders sit in front of the decision makers: the past is in front of you, because you can see it; the future is behind you, where you cannot see it. I’m tempted to call that an intact system, but that would be a politically naive thing to say.

Or in tech geek speak, as they say at NVIDIA about how they think about closing the gap: “The biggest optimization surface in agentic inference is the gap between what the harness knows and what the infrastructure can see.”

It’s not just about transparent vs. non-transparent systems, though. In Snowden’s terms, structural and cultural challenges can make this transmission of knowledge less intact (knowledge flowing to where it is needed) and more entrained (accessible but requiring active effort to surface).

The competitive landscape may increasingly sort on this issue. The organizations that build effective transmission capacity – that make their networks genuinely and effectively self-aware rather than merely self-monitoring – will compound advantages that architecturally-blind competitors cannot match through raw capacity alone. Self-knowledge, it turns out, is not a luxury feature for mature systems. It may be a prerequisite for the next generation of intelligent operation at any scale.

Personal notes:

New podcast interview: Cognitive levers, combinatorial possibilities, symphonic thinking, and compound learning were some of the topics I discussed with long-time futurist leader Ross Dawson on the latest episode of Humans+AI.

A great network: I just renewed my annual subscription to the Exponential View newsletter and network. Why? Because it’s a very connected group of global innovators with deep experience in designing and leading exploratory transformations. They take social impact seriously, too. It’s a great place to learn about the cutting edge and future of AI. I love that the team there makes extensive use of both pen (even quill) and paper AND OpenClaw! Speaking of building on the past, I often put them in context by saying that founder Azeem Azhar was one of the very first users of LinkedIn and now briefs world leaders on the implications of exponential growth curves in technology.

Media about building on the past: Today’s discussion brings to mind this groovy community-made video explaining Creative Commons from 2003!

