Friday, May 22, 2026

The model is no longer the bottleneck.

Agentic Engineering · Federated AI · Inference Speed · Model Ownership · Hardware Bottlenecks · Context Engines · AI Evals · Synthetic Data · Governance · On-device AI

May 22 · 23 videos

Anthropic signed a $45B compute deal with SpaceX.

Google I/O 2026 hit 3.2 quadrillion tokens per month.

Cerebras reached 1,200 tokens per second.

The consensus is clear: the model is a commodity.

The value has shifted to the agent harness.

Iris authors nearly 50% of CrewAI's pull requests.

Evals are broken but mandatory.

Build the thing that builds the thing.

“In the old mode, engineers built the thing. And in this harness engineering, we build the thing that builds the thing.”

AI Dev 26 x SF | Ara Khan: Evals Are Broken Use Them Anyway

Ara Khan · DeepLearningAI · 24 min

Watch on YouTube →

Ara Khan explains why evaluations are the only way to move past vibe-based development. Evals are flawed but necessary for engineering feedback loops.

AI development moves so fast that 2 years feels like 27 years in any other industry.
Any metric you hand someone gets optimized at the expense of everything else, so avoid hill-climbing into a local maximum of overfitting.
Wait a few weeks for the dust to settle before integrating a new model into production.
Competitive advantage is found by out-optimizing the harness for a specific model rather than just switching to the newest model.
The goal is not to have the highest benchmark score, but to pass the vibe check of being a sensible tool.
Terminal Bench uses 89 real-world software engineering tasks for evaluation.
Agentic evaluation runs for complex tasks can take 30 to 45 minutes.

Fast Models Need Slow Developers: Sarah Chieng, Cerebras

Sarah Chieng · AI Engineer · 18 min

Watch on YouTube →

Sarah Chieng discusses how 1,200-token-per-second models change developer habits. High speed requires active steering to avoid generating technical debt at scale.

Codex Spark generates 1,200 tokens per second, a 20x increase over standard models.
Avoid the slop habit of running massive, unverified agent swarms now that AI output speed exceeds human reading speed.
Inject taste into AI output by generating many candidates, for example 75 versions, and cherry-picking the best.
The four-file external memory system (Agents, Plan, Progress, Verify) combats context burn.
Model speed is a new vertical for competition alongside intelligence and cost.
Memory movement causes 50% to 80% of inference latency, known as the Memory Wall.
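
A minimal sketch of the four-file external memory idea, assuming the four roles map to Markdown files on disk. The file names, the one-line purposes, and the helper functions are all hypothetical; the talk summary names only the four roles (Agents, Plan, Progress, Verify).

```python
import tempfile
from pathlib import Path

# Hypothetical file names and purposes; only the four roles come from the talk.
MEMORY_FILES = {
    "AGENTS.md": "Standing instructions and conventions for the agent.",
    "PLAN.md": "The current plan, broken into steps.",
    "PROGRESS.md": "A running log of completed steps.",
    "VERIFY.md": "Checks to run before a step counts as done.",
}


def init_memory(root: Path) -> None:
    """Create any missing memory file with a one-line description."""
    root.mkdir(parents=True, exist_ok=True)
    for name, purpose in MEMORY_FILES.items():
        path = root / name
        if not path.exists():
            path.write_text(f"# {name}\n{purpose}\n", encoding="utf-8")


def log_progress(root: Path, entry: str) -> None:
    """Append one completed step to the progress log."""
    with (root / "PROGRESS.md").open("a", encoding="utf-8") as handle:
        handle.write(f"- {entry}\n")


def build_preamble(root: Path) -> str:
    """Concatenate the four files so a fresh session can restart from disk."""
    return "\n\n".join(
        (root / name).read_text(encoding="utf-8") for name in MEMORY_FILES
    )


root = Path(tempfile.mkdtemp())
init_memory(root)
log_progress(root, "step 1 done")
preamble = build_preamble(root)
```

Because the state lives in files rather than in the conversation, a new session can be seeded with `preamble` instead of replaying the whole history, which is one plausible reading of how the scheme limits context burn.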

AI Dev 26 x SF | Andi Partovi: Why Every Agent Needs a Simulation Sandbox

Andi Partovi · DeepLearningAI · 14 min

Watch on YouTube →

Andi Partovi argues that autonomous agents need simulation sandboxes to test non-deterministic actions. This bridges the gap between demos and production safety.

The gap between “works in demo” and “works at scale” is the biggest hurdle in the agentic AI industry.
Agents are inherently non-deterministic and operate in partially observable environments.
Simulation acts as an insurance policy against the headline risk of agents deleting databases.
Simulation-driven development creates a data flywheel for fine-tuning and reinforcement learning.
A single agent interaction might take 10 turns to see a delayed reward.
Testing systems for AI agents need to operate at scale and with high repetition.
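
The points above about non-determinism, delayed reward, and high repetition can be illustrated with a toy harness. This is a sketch under stated assumptions, not Partovi's system: the agent, its 2% chance of a destructive action, and the fake database are invented for illustration.

```python
import random


def toy_agent(rng: random.Random) -> str:
    """Hypothetical stochastic agent that rarely picks a destructive action."""
    return "drop_table" if rng.random() < 0.02 else "read_rows"


def run_episode(seed: int, turns: int = 10) -> dict:
    """Play one seeded episode against an in-memory stand-in for a database."""
    rng = random.Random(seed)
    db_intact = True
    for _ in range(turns):
        if toy_agent(rng) == "drop_table":
            db_intact = False  # only the simulated database is harmed
            break
    # Delayed reward: granted only if the agent survives every turn.
    return {"seed": seed, "reward": 1 if db_intact else 0}


def failure_rate(episodes: int) -> float:
    """Repeat the episode across many seeds and report the failure share."""
    results = [run_episode(seed) for seed in range(episodes)]
    return 1 - sum(r["reward"] for r in results) / episodes


rate = failure_rate(500)
```

Seeding each episode makes a failing run reproducible, and the logged episodes are the kind of data a fine-tuning or reinforcement learning loop could later consume.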

AI Dev 26 x SF | João Moura: Building Recurring, Governed, and Embedded Enterprise Workflows

João Moura · DeepLearningAI · 26 min

Watch on YouTube →

João Moura describes the shift to embedded enterprise workflows. He highlights how an internal agent now authors nearly half of his company's pull requests.

The software industry is shifting from manually built code to organically formed systems driven by agents.
The Iris agent authors nearly 50% of all pull requests at CrewAI.
Enterprise moats lie in operationalizing recurring, governed, and embedded workflows.
The discovery phase of knowing what to do first is the biggest unlock for non-tech enterprise customers.
Enterprises care as much about the auditability of the process as they do about the final output.
Iris identified 130 hardcoded colors in the CrewAI app that a human designer missed.

Lobster Trap: OpenClaw in Containers from Local to K8s and Back: Sally Ann O'Malley, Red Hat

Sally Ann O'Malley · AI Engineer · 21 min

Watch on YouTube →

Sally Ann O'Malley advocates for containerizing AI agents using Podman and Kubernetes. This solves reproducibility and security issues for persistent digital coworkers.

Containers provide the same isolation for AI as they do for any other Linux application.
The Forever Claw mindset treats your agent as a persistent digital coworker rather than an ephemeral script.
A team of 10 engineers at Nvidia used OpenClaw in Kubernetes to perform the work of 16 people.
AI-driven model evaluations can increase team capacity by approximately 60%.
Standardizing agent environments allows new hires to spin up a baseline agent in seconds.
Shift engineering focus from tedious code to dreaming bigger as AI handles raw coding tasks.

AI Dev 26 x SF | Luke Kim: The Agent Data Stack: Why Every AI Agent Needs Its Own Data Stack

Luke Kim · DeepLearningAI · 19 min

Watch on YouTube →

Luke Kim proposes a dedicated data stack for AI agents to handle high-frequency workloads. This prevents agents from overwhelming production databases or causing outages.

Agents operate on 24/7 loops and create orders of magnitude more load than human users.
The traditional Modern Data Stack is fundamentally broken for the AI agent era.
Agents should never have direct network access to backend production systems.
Use a sidecar architecture to provide agents with local data acceleration via DuckDB and Apache Arrow.
Agentic workloads are driving significant load increases on infrastructure providers like GitHub.
Spice AI allows developers to set up an isolated data layer on a laptop in 5 minutes.

AI Dev 26 x SF | Manos Koukoumidis and Stefan Webb: VibeML: Build your AI model in hours, not months

Manos Koukoumidis · DeepLearningAI · 25 min

Watch on YouTube →

Manos Koukoumidis and Stefan Webb explain the transition from renting generic APIs to owning specialized models. Specialized models offer massive cost and latency advantages.

Enterprises are moving from generic intelligence they rent to specialized intelligence they own.
Specialized models can offer 10x to 100x lower costs and latency than generic frontier models.
A specialized 0.8B parameter model can outperform massive models like Claude 3.5 Opus in specific tasks.
Renting intelligence through APIs offers no competitive moat as any competitor can use the same prompts.
A leading healthcare provider saw a 20% quality improvement and 70% cost reduction using specialized models.
Ownership of model weights allows for deployment on-device or on-prem to ensure data privacy.

AI Dev 26 x SF | Daniel Beutel: Flower SuperGrid Agents

Daniel Beutel · DeepLearningAI · 30 min

Watch on YouTube →

Daniel Beutel introduces Flower SuperGrid for federated AI. This allows intelligence to move to the data source for privacy-compliant learning.

The location of data is becoming the primary constraint on where AI must live.
Flower SuperGrid is emerging as the industry standard for Federated AI.
Federated systems are evolving into decentralized grids rather than simple hub-and-spoke models.
Competitive advantage will come from the ability to learn from sensitive, siloed data.
SuperGrid Agents enable multi-agent collaboration across different organizational boundaries.
Federated AI is necessary infrastructure for enterprise-grade agents in data-restricted environments.

AI Dev 26 x SF | Or Dagan: Optimizing Accuracy, Cost, and Latency in Real-World Agents

Or Dagan · DeepLearningAI · 18 min

Watch on YouTube →

Or Dagan presents Maestro, a system for optimizing the trade-offs between accuracy, cost, and latency. It uses action models to find the most efficient execution paths.

The goal of agent design is to reach the Pareto frontier where accuracy cannot increase without increasing cost.
Maestro trains an Action Model to predict the success and cost probabilities of specific execution paths.
Scaling inference-time compute through ensembles allows smaller models to match GPT-5 performance.
The Minimax model reached 60% initial accuracy before scaling with Maestro.
Ensembles achieved a 20% reduction in latency compared to single-model execution.
Manual optimization of agents is a sunk cost trap because it must be redone for every new model release.
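
To make the Pareto frontier concrete, here is a small sketch that filters candidate agent configurations down to the non-dominated ones. It is not Maestro's Action Model; the configuration names and the accuracy and cost figures are made up for illustration.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a: Config, b: Config) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers, for illustration only.
candidates = [
    Config("small-single", 0.60, 1.0),
    Config("small-ensemble", 0.78, 3.0),
    Config("large-single", 0.75, 5.0),
    Config("large-ensemble", 0.82, 12.0),
]
frontier = pareto_frontier(candidates)
```

In this toy data the large single model drops out because the small ensemble is both cheaper and more accurate, which mirrors the claim that ensembles of smaller models can displace a larger one.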

Google’s AI endgame is here: everything you missed at I/O 2026

Fireship Narrator · Fireship · 5 min

Watch on YouTube →

Fireship reviews Google I/O 2026 and the pivot to the agentic Gemini era. Google is moving from search results to becoming an interface for reality.

Google serves 3.2 quadrillion tokens per month as of 2026.
The TPU-T is specialized for training while the TPU-I is dedicated to inference.
Gemini Omni is a comprehensive world model capable of simulating reality.
Neural Expressive is a generative UI system that builds interfaces on demand based on prompts.
The Antigravity IDE demonstrated building an entire operating system from scratch in 12 hours.
Gemini 3.5 Flash is now 30 times more expensive than the 1.5 version.

AI Dev 26 x SF | Andrew Filev: Multi Model Pipelines: How to Get Better AI Results for Less

Andrew Filev · DeepLearningAI · 17 min

Watch on YouTube →

Andrew Filev discusses multi-model pipelines to reduce the high cost of AI agents. Routing tasks to cheaper models can save 60% on operational costs.

The role of the software engineer is shifting from writing code to building systems that write code.
Active AI coding agents can cost an enterprise $2,000 per engineer monthly if relying on high-end models.
The Plan-Implement-Review (PIR) pipeline uses high-reasoning models for planning and cheaper models for execution.
Multi-model strategies can reduce the cost of a single pull request review from $12.00 to $2.50.
Model diversity improves output quality by introducing different perspectives to the review process.
Dumb coding is largely solved according to data from SWE-bench Pro.
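
A rough sketch of how Plan-Implement-Review routing changes cost. The per-token prices, the token counts, and the choice to send the review stage to the high-reasoning model are assumptions, not figures from the talk; only the $12.00 and $2.50 review costs come from the summary above.

```python
# Hypothetical prices in dollars per million tokens; not from the talk.
PRICE_PER_MTOK = {"high-reasoning": 15.0, "cheap": 1.0}

# Planning on the strong model and execution on the cheap one, as described.
# Routing review to the strong model is an assumption.
ROUTES = {"plan": "high-reasoning", "implement": "cheap", "review": "high-reasoning"}


def pipeline_cost(tokens_by_stage: dict[str, int], routes: dict[str, str]) -> float:
    """Sum the cost of each stage at the price of the model it is routed to."""
    return sum(
        tokens * PRICE_PER_MTOK[routes[stage]] / 1_000_000
        for stage, tokens in tokens_by_stage.items()
    )


def percent_saved(before: float, after: float) -> float:
    """Relative saving, in percent, when cost falls from before to after."""
    return 100 * (before - after) / before


# Made-up token counts for one task.
tokens = {"plan": 20_000, "implement": 200_000, "review": 30_000}
single_model = pipeline_cost(tokens, {stage: "high-reasoning" for stage in tokens})
routed = pipeline_cost(tokens, ROUTES)

# The reduction implied by the quoted $12.00 -> $2.50 review cost.
quoted_saving = percent_saved(12.00, 2.50)
```

The saving comes almost entirely from the implementation stage, which dominates token volume in this toy setup. The quoted drop from $12.00 to $2.50 works out to roughly a 79% reduction per review.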

Chip design from the bottom up: Reiner Pope

Reiner Pope · Dwarkesh Patel · 80 min

Watch on YouTube →

Reiner Pope breaks down chip design as a battle against communication costs. He explains how systolic arrays enable the scaling of AI workloads.

Moving data between logic gates is significantly more costly than the logic operations themselves.
Systolic arrays (Tensor Cores) allow compute to scale quadratically while communication scales linearly.
Precision scaling is a major performance lever: moving from FP8 to FP4 offers a 4x improvement in density.
A modern advanced chip contains approximately 100 billion transistors.
The cost of an ASIC tape-out is roughly $30 million.
FPGAs have a 10x relative cost and efficiency penalty compared to specialized ASICs.
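
The quadratic-versus-linear claim can be written out for an idealized square array. This assumes an n-by-n grid of multiply-accumulate units fed n operands per cycle along each of two edges, a simplification not spelled out in the summary:

```latex
\[
\text{MACs per cycle} = n^2,
\qquad
\text{operands entering per cycle} = 2n,
\qquad
\frac{\text{compute}}{\text{communication}} = \frac{n^2}{2n} = \frac{n}{2}.
\]
```

Doubling n therefore quadruples compute while only doubling edge traffic. The FP8-to-FP4 figure is consistent with a common first-order model in which multiplier area grows with the square of operand width; that model is an assumption here, not something the summary states:

```latex
\[
\frac{A_{\mathrm{FP8}}}{A_{\mathrm{FP4}}} \approx \left(\frac{8}{4}\right)^{2} = 4.
\]
```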

AI Dev 26 x SF | Diamond Bishop: The Next 100 Agents. Building the Agent Native Office

Diamond Bishop · DeepLearningAI · 26 min

Watch on YouTube →

Diamond Bishop outlines the infrastructure needed to scale to hundreds of production agents. He argues that intelligence is no longer the bottleneck for enterprise adoption.

The primary bottleneck has moved from model intelligence to agent infrastructure.
Fewer than 30% of enterprise agents reached production last year due to infrastructure failures.
The Agent Level Bitter Lesson suggests general methods using off-the-shelf models will outperform custom tweaks.
Durable background execution via tools like Temporal is required for scaling agents.
Every piece of customer-facing functionality must have an agent-friendly interface like MCP or LLMs.txt.
UX designers should focus on machine-readability as much as human visual appeal.

AI Dev 26 x SF | Paul Everitt: The Shift to Agentic Engineering

Paul Everitt · DeepLearningAI · 28 min

Watch on YouTube →

Paul Everitt critiques the current state of AI productivity and introduces agentic engineering. He emphasizes building the systems that allow agents to operate reliably.

Productivity gains from AI are currently averaging 10% in DX studies, far below the 10x hype.
Agentic Engineering focuses on building the harness that allows agents to operate effectively.
There is a 67-point gap between how management and engineers perceive the value of AI.
Only 3% of developers had high confidence in the accuracy of AI-generated results last year.
The phase defect rate in AI-generated code is approximately 50%.
Harness engineering involves spec-driven development and rigorous red-green testing loops.

AI Dev 26 x SF | Andrew K. Davies: Deterministic Memory: How to Build an AI That Cannot Lie

Andrew K. Davies · DeepLearningAI · 21 min

Watch on YouTube →

Andrew K. Davies argues for deterministic semantic memory to prevent AI hallucinations. He suggests that agents must have mathematically verifiable retrieval systems.

Current AI interactions are built on a polite lie of social familiarity without true memory.
E8 lattice quantization allows for mathematically verifiable and provenance-backed retrieval.
Identity creates responsibility: when an agent signs its work, the quality increases.
Surveying AI agents representing customer personas yields a 100% response rate.
Treating agents as persistent employees with history improves reliability over treating them as disposable tools.
Agents should be given a slow thinking budget of up to 1 million tokens for research.

AI Dev 26 x SF | Thierry Damiba: Edge to Cloud Video Anomaly Detection

Thierry Damiba · DeepLearningAI · 14 min

Watch on YouTube →

Thierry Damiba demonstrates edge-to-cloud vector search for video anomaly detection. This architecture reduces bandwidth by 90% by focusing on unusual behaviors.

Real-time surveillance is shifting from manual classification to automated anomaly detection.
Qdrant Edge performs local vector searches to identify clips that differ from a cloud-synced baseline.
This methodology reduces cloud bandwidth requirements by 90% for video surveillance.
The system achieves a 94% recall rate and an AUROC of 0.96 on 13 different anomaly types.
High recall is more valuable than low false positives in security applications.
Semantic search allows operators to interact with video footage using natural language.
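
The edge-side filtering idea can be sketched without any vector database: embed each clip, compare it to baseline embeddings, and upload only clips that resemble nothing in the baseline. This does not use the Qdrant Edge API; the three-dimensional vectors, the clip names, and the 0.8 threshold are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_anomalous(clip: list[float], baseline: list[list[float]],
                 threshold: float = 0.8) -> bool:
    """Flag a clip whose nearest baseline neighbour falls below the threshold."""
    best = max(cosine_similarity(clip, ref) for ref in baseline)
    return best < threshold


# Toy embeddings standing in for a cloud-synced baseline of normal activity.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
clips = {
    "walking": [0.95, 0.05, 0.0],
    "fence_climb": [0.0, 0.2, 1.0],
}

# Only anomalous clips leave the device, which is where bandwidth is saved.
to_upload = [name for name, vec in clips.items() if is_anomalous(vec, baseline)]
```

Lowering the threshold uploads fewer clips but risks missing events. Given the stated preference for recall over false positives, a deployment would likely lean toward a higher threshold.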

AI Dev 26 x SF | Brandon Waselnuk: Building the Context Engine AI Agents Need

Brandon Waselnuk · DeepLearningAI · 25 min

Watch on YouTube →

Brandon Waselnuk explains why context is the primary bottleneck for engineering agents. A context engine can synthesize tribal knowledge to improve agent performance.

LLMs are syntactic geniuses but act like day-one hires with no organizational context.
A Context Engine synthesizes data from Slack, Jira, Notion, and GitHub into a unified understanding.
Agents with a context engine can complete tasks 80% faster than naive agents.
Token costs were reduced by 50% in specific tests when using high-signal context.
Human wall clock time is the ultimate metric for measuring the ROI of AI context integration.
Context engines resolve data conflicts using expert graphs to avoid search bias.

AI Dev 26 x SF | Jerry Liu: My Agent Can't Read a PDF?

Jerry Liu · DeepLearningAI · 31 min

Watch on YouTube →

Jerry Liu discusses the difficulty of making AI agents read complex PDFs. He introduces benchmarks and parsers that recover semantic data from what is essentially a set of machine printing instructions.

The PDF format is a collection of machine printing instructions rather than semantic data.
High-fidelity extraction of tables and charts is a prerequisite for agentic alpha.
ParseBench is an open-source benchmark consisting of 2,000 human-verified enterprise pages.
LlamaIndex has processed over 1 billion pages for its 300,000 users.
The industry is shifting from deterministic workflows to generalized agent prompting.
Proprietary context and business logic are the primary differentiators for AI startups.

How The Best Companies Defend Against Mediocrity And Rot

Eric Ries · Y Combinator · 50 min

Watch on YouTube →

Eric Ries discusses how governance structures protect companies from mission rot. He advocates for models that prioritize long-term value over short-term extraction.

Successful companies become targets for extraction by short-term profit seekers.
Industrial Foundation companies are six times more likely to survive for 50 years than traditional C-Corps.
Shareholder value should be viewed as the exhaust of a well-run engine, not the fuel.
Legal defaults often force boards to act as auctioneers rather than mission guardians.
Anthropic uses a Perpetual Purpose Trust to ensure its safety mission is not traded for gains.
Jeff Lawson lasted only 199 days as CEO of Twilio after his super-voting rights expired.

These 6 Behaviors Quietly Teach People Your Worth

Rob Dial · The Mindset Mentor Podcast · 17 min

Watch on YouTube →

Rob Dial defines confidence as a skill set built through action and self-trust. He provides behaviors to rewire the nervous system for resilience.

Confidence is a cumulative skill set earned through consistent action rather than an innate trait.
The do-to-say ratio measures self-trust by ensuring actions match words with 100% fidelity.
True confidence is the belief that you can handle whatever happens rather than expecting to win.
Visualization and high-energy incantations can shift identity and counter negative brain bias.
Intentionally seeking discomfort through micro-disciplines like cold plunging builds a conqueror identity.
Success is 99% failure according to Soichiro Honda.

Gemini Nano on device: Florina Muntenescu and Oli Gaymond, Google DeepMind

Florina Muntenescu · AI Engineer · 19 min

Watch on YouTube →

Florina Muntenescu and Oli Gaymond explain how Gemini Nano provides system-level AI on Android. This allows developers to use local models without shipping massive binaries.

Gemini Nano is a 3-4GB model optimized for mobile hardware and shared via the AI Core service.
Centralizing the model in the OS eliminates the need for developers to add massive binaries to APKs.
Hybrid Inference allows apps to default to local execution and fall back to the cloud when needed.
On-device APIs currently require flagship hardware from approximately the last 24 months.
The OS handles resource management, giving foreground apps execution priority for AI tasks.
Users are willing to trade battery life for genuinely useful AI features, similar to GPS usage.
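
The local-first, cloud-fallback pattern behind Hybrid Inference can be sketched in a few lines. This is a language-agnostic illustration in Python, not the Android AI Core API; the exception type, both backends, and the 40-character limit are hypothetical stand-ins.

```python
from typing import Callable


class LocalModelUnavailable(Exception):
    """Raised by the hypothetical on-device backend when it cannot serve a request."""


def hybrid_generate(prompt: str,
                    local: Callable[[str], str],
                    cloud: Callable[[str], str]) -> tuple[str, str]:
    """Try the on-device model first and fall back to the cloud on failure."""
    try:
        return "local", local(prompt)
    except LocalModelUnavailable:
        return "cloud", cloud(prompt)


def fake_local(prompt: str) -> str:
    """Toy on-device backend that rejects prompts over 40 characters."""
    if len(prompt) > 40:
        raise LocalModelUnavailable("prompt too long for the toy local model")
    return f"[on-device] {prompt}"


def fake_cloud(prompt: str) -> str:
    """Toy cloud backend that accepts anything."""
    return f"[cloud] {prompt}"


short_result = hybrid_generate("Summarize this note", fake_local, fake_cloud)
long_result = hybrid_generate("x" * 41, fake_local, fake_cloud)
```

Returning the backend label alongside the text lets the app tell the user, or its own telemetry, whether a request left the device.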

Elon Musk’s $45 Billion Deal to Save Anthropic

Josh · Limitless Podcast · 29 min

Watch on YouTube →

This episode covers the $45 billion deal between SpaceX and Anthropic. It highlights the shifting alliances in the race for compute and infrastructure.

Anthropic will pay SpaceX $1.25 billion monthly through 2029 for access to Colossus GPU clusters.
The deal provides SpaceX with $15 billion in annual recurring revenue, 80% of its 2025 total.
SpaceX is filing for a historic $1.75 trillion IPO with a 30% float for retail investors.
Anthropic is projected to reach profitability by next month, a first for frontier labs.
Claude Mythos has demonstrated the capability to generate zero-day exploits against Apple M5 chips.
Andrej Karpathy has joined Anthropic to lead R&D efforts in reinforcement learning.
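
The deal figures quoted above are mutually consistent. This assumes the $45 billion headline is the sum of the monthly payments, which the summary does not state:

```latex
\[
1.25 \times 12 = 15 \ \text{(billion USD per year)},
\qquad
\frac{45}{1.25} = 36 \ \text{months},
\qquad
\frac{15}{0.80} = 18.75 \ \text{(billion USD, implied 2025 SpaceX revenue)}.
\]
```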

DeepSeek’s New AI Is A Game Changer

Károly Zsolnai-Fehér · Two Minute Papers · 7 min

Watch on YouTube →

Károly Zsolnai-Fehér reviews DeepSeek's new visual primitive mechanism. This approach allows AI to point at objects rather than just describing them.

DeepSeek's Thinking with Visual Primitives uses a pointing mechanism instead of purely linguistic descriptions.
The system requires 90% fewer visual tokens than current state-of-the-art models.
Visual primitives like bounding boxes and traces allow for precise counting and topological reasoning.
The full DeepSeek AI model contains 671 billion parameters.
Policy distillation allows a student model to learn from a diverse set of specialized expert models.
Efficiency in token usage translates directly to lower hardware and operational costs.

References

People: Ara Khan · Sarah Chieng (@sarahchieng) · Andi Partovi · João Moura · Sally Ann O'Malley · Luke Kim · Manos Koukoumidis · Stefan Webb · Daniel Beutel · Or Dagan · Andrej Karpathy · Sundar Pichai · Demis Hassabis · Andrew Filev · Reiner Pope · Ron Minsky · Dan Pontecorvo · Diamond Bishop · Steve Yegge · Yann LeCun · Paul Everitt · Daron Acemoglu · Grady Booch · Simon Willison · Addy Osmani · Andrew K. Davies · Isaac Newton · Thierry Damiba · Neil Kanungo · Brandon Waselnuk · Jerry Liu · Andrew Ng · Eric Ries · Jeff Lawson · Sol Price · August Krogh · Dario Amodei · Milton Friedman · Rob Dial · Wim Hof · Tony Robbins · Florina Muntenescu (@FMuntenescu) · Oli Gaymond · Elon Musk (@elonmusk) · Antonio Gracias · Michael Truell · Károly Zsolnai-Fehér (https://cg.tuwien.ac.at/~zsolnai/)

Tools: Cline · Claude 3.5 Sonnet · Harbor · Modal · DeepSeek V4 Flash · Codex Spark · Claude 3.5 Opus · Veris AI · CrewAI · OpenClaw · Podman · Kubernetes · OpenShift · Spice AI · DuckDB · Apache Arrow · VibeML · OUMI · Flower SuperGrid · Maestro · Minimax · Gemini Omni · Neural Expressive · Anti-gravity IDE · Zencoder · Gemini Flash · SWE-bench Pro · Temporal · MCP · LLMs.txt · OnMemory.ai · Qdrant Edge · NVIDIA Jetson · Twelve Labs · Unblocked · LlamaIndex · LlamaParse · Gemini Nano · AI Core · LiteRT · Cursor · SpaceX · Anthropic · DeepSeek · GPT-5 · Claude Code · Claude Mythos · Colossus

Papers: Thinking with Visual Primitives · ParseBench