Sunday, May 17, 2026

Stop prompting harder and start engineering the harness.

Agent HarnessPhysical AIDurable SessionsPrompt EngineeringRoboticsHardware Supply ChainToken OptimizationDeterministic AI

May 17 · 5 videos

The model is no longer the bottleneck.

Reliability comes from the harness, not the prompt.

Tejas Kumar says 2026 is the year of the harness.

Caitlin Kalinowski warns of a memory price meteor.

Cloudflare cut token usage by 99.9 percent.

Physical AI is the next frontier.

“The agent harness is everything around the model that gives it grounding in reality. It's literally the thing that ties it to a stable environment.”

Harnesses in AI: A Deep Dive: Tejas Kumar, IBM

Tejas Kumar · AI Engineer · 20 min

Watch on YouTube →

Tejas Kumar argues that agent reliability depends on deterministic code surrounding the model rather than better prompting. This harness allows legacy models like GPT-3.5 to outperform frontier models at lower costs.

A harness provides the deterministic guardrails needed to ground stochastic models in reality.
Focusing on engineering the environment allows GPT-3.5 Turbo to handle complex browser tasks reliably.
The framework includes tool registries, context compaction, and hard iteration caps.
Verification steps should rely on tool history instead of model self-reporting to prevent hallucinations.
Harnesses enable enterprise-grade security for sensitive RAG applications.
Reliability is the 'name of the game': businesses need agents that perform consistently regardless of the model.

Fighting AI with AI: Lawrence Jones, Incident

Lawrence Jones · AI Engineer · 17 min

Watch on YouTube →

Lawrence Jones explains why human debugging is no longer possible for complex AI systems with thousands of prompts. He advocates for fighting AI with AI using specialized internal tools for agents.

Complex AI systems have hit a tractability wall where humans cannot manually trace reasoning errors.
Incident.io uses a Red-Green TDD cycle for prompt engineering to ensure reliability.
Internal tools serialize complex UI traces into filesystems that agents like Claude Code can navigate.
Parallelized analysis pipelines cluster failure patterns across thousands of customer investigations.
Evals are stored in YAML files alongside code to treat them as standard unit tests.
A single AI SRE investigation can involve hundreds or thousands of individual prompts.

Why Your AI UX Is Broken (and It's Not the Model's Fault): Mike Christensen, Ably

Mike Christensen · AI Engineer · 18 min

Watch on YouTube →

Mike Christensen critiques the industry standard of using Server-Sent Events for AI UX. He proposes Durable Sessions to solve data loss during network handovers.

Direct HTTP streaming via SSE is fragile and leads to data loss during Wi-Fi to 5G transitions.
Durable Sessions decouple the agent layer from the client using stateful pub/sub channels.
Bidirectional transport allows clients to signal cancellations, preventing wasted token spend.
Ably handles 2 trillion operations at scale to support resilient multi-surface experiences.
Human support agents can join mid-session with full history to improve resolution times.
Reliability across surfaces is the primary differentiator between a demo and a professional product.

Why we’re at the beginning of the AI hardware boom: Caitlin Kalinowski (ex-OpenAI, Meta, Apple)

Caitlin Kalinowski · Lenny's Podcast · 99 min

Watch on YouTube →

Caitlin Kalinowski discusses the shift from digital AI to Physical AI and robotics. She warns hardware startups about upcoming supply chain shocks in the memory market.

Digital task acceleration is saturating, making the physical world the next frontier for AI.
A looming memory price meteor could increase DRAM costs by 2x to 6x due to data center demand.
Hardware development follows a 4-5 compile model where major builds are strictly limited.
Vertical integration is the primary defense against supply chain instability and vendor failure.
Specialized non-humanoid robots for manufacturing are the immediate future of industrialization.
Set strict KPIs like cost and weight early; changing targets mid-cycle is effectively a total restart.

AIE Singapore Day 2 ft. Google DeepMind, OpenClaw, Adaption, Arize, Cloudflare, Robot Company & more

Sarah Hooker · AI Engineer · 568 min

Watch on YouTube →

This session provides a blueprint for moving from AI vibes to production-grade agentic infrastructure. Experts emphasize craft and deterministic safety over brute-force scaling.

AI is a musical instrument, not a calculator; it requires deliberate, intentional practice to master.
Design is the ultimate competitive edge in an age where AI produces 'the weighted average' of the internet.
Cloudflare Code Mode achieved a 99.9 percent reduction in context window usage for API calls.
Tusk's Fence provides OS-level deterministic safety for agentic operations.
The Molmo-Act-2 dataset includes 700 hours of bimanual teleoperation data for robotics.
Companies that ignore AI face a J-Curve of disruption from lean, AI-native competitors.

References

PeopleTejas Kumar (https://x.com/TejasKumar_) · Lawrence Jones (https://x.com/lawrjones) · Mike Christensen · Caitlin Kalinowski · Steve Jobs · Mark Zuckerberg · Sam Altman · Palmer Luckey · Shelly Goldberg · SallyAnn DeLucia · Geoff Huntley · Sarah Hooker · JJ Geewax

ToolsAgent Harness Framework · Verify-Before-Success · GPT-3.5 Turbo · Claude Code · Red-Green Prompt Eval Cycle · UI-to-Filesystem Serialization · Two-Stage Fleet Analysis · Ably Channels · Durable Sessions · Orion AR glasses · Cloudflare Code Mode · Tusk Fence · Molmo-Act-2