Saturday, May 30, 2026
Stop prompting and start architecting for agentic failure.
May 30 · 3 videos
Three experts said it today.
None of them coordinated.
That's the signal.
Nick Nisi deleted 9,447 lines of agent skills.
Accuracy jumped from 77% to 97%.
Zed trained Zeta2 using 100,000 production examples.
Philipp Schmid says engineers must stop being traffic controllers.
The era of agentic harnesses has begun.
“I'm the bottleneck: I haven't written a line of code myself in probably eight months.”
How I deleted 95% of my agent skills and got better results
Nick Nisi · AI Engineer · 17 min
Nick Nisi explains how WorkOS uses a TypeScript-based harness called Case to enforce reliability. He argues that rigid architectural gates are superior to long-winded prompts.
- Deleting 95% of documentation-based skills improved accuracy from 77% to 97%.
- The Case harness uses a state machine to manage implementer, verifier, and reviewer agents.
- Agents are prevented from faking test results by requiring SHA-256 hashes of actual outputs.
- Providing 10,000 lines of documentation bloated eval times from 6 minutes to 68 minutes.
- Success is measured by delta scores and pass rates rather than subjective prompt quality.
- Treat every agent failure as a bug in the harness rather than a failure of the model.
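To make the gate idea concrete, here is a minimal sketch of a phase state machine whose verify step only advances when a reported hash matches output the harness captured itself. It is written in Python for brevity and is not the Case harness: the `Phase` names, the `Evidence` shape, and the `advance` function are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    IMPLEMENT = "implement"
    VERIFY = "verify"
    REVIEW = "review"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Evidence:
    """Hypothetical record of a test run (not the real Case data model)."""
    claimed_sha256: str  # hash the agent reports for the test output
    raw_output: bytes    # output bytes captured by the harness itself


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def evidence_is_genuine(evidence: Evidence) -> bool:
    # Recompute the hash from harness-captured bytes, so a claimed
    # result with no matching output is rejected.
    return sha256_hex(evidence.raw_output) == evidence.claimed_sha256


# Allowed forward transitions; DONE and FAILED are terminal.
TRANSITIONS = {
    Phase.IMPLEMENT: Phase.VERIFY,
    Phase.VERIFY: Phase.REVIEW,
    Phase.REVIEW: Phase.DONE,
}


def advance(phase: Phase, evidence: Optional[Evidence] = None) -> Phase:
    """Move to the next phase; the VERIFY gate requires genuine evidence."""
    if phase not in TRANSITIONS:
        raise ValueError(f"no transition out of {phase}")
    if phase is Phase.VERIFY:
        if evidence is None or not evidence_is_genuine(evidence):
            return Phase.FAILED
    return TRANSITIONS[phase]
```

The point of the design is that the gate lives in code rather than in a prompt: an agent cannot talk its way past `advance`, it can only supply output whose hash checks out.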
How We Built Zeta2: Training an Edit Prediction Model in Production
Ben Kunkle · AI Engineer · 10 min
Ben Kunkle describes the production pipeline for Zeta2, a specialized model for real-time code edits. The focus is on high-efficiency distillation and data validation.
- Zeta2 was trained on 100,000 high-quality examples processed through a teacher model.
- A repair step corrects common failure modes like ignoring boundaries or undoing user keystrokes.
- The settlement heuristic measures predictions against code after a 10-second user pause.
- Training focuses on the Goldilocks zone where the student model almost gets the answer right.
- Zed uses 50 parallel predictions per example to identify the most valuable training data.
- Model performance is validated on a 15% sample of production traffic rather than on offline evals.
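One way to picture the Goldilocks selection is to sample many student predictions per example and keep only the examples the student gets right some of the time. The sketch below assumes exact-match scoring against the teacher output and a 10% to 90% pass-rate band. Both are illustrative guesses, as are the `Example` type and the `sample` callback, and none of it is Zed's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str
    target: str  # teacher-produced edit, treated here as the reference


def pass_rate(predictions: List[str], target: str) -> float:
    """Fraction of sampled predictions that exactly match the target."""
    if not predictions:
        return 0.0
    return sum(p == target for p in predictions) / len(predictions)


def select_goldilocks(
    examples: List[Example],
    sample: Callable[[str, int], List[str]],  # hypothetical student sampler
    k: int = 50,        # predictions per example, as in the talk
    low: float = 0.1,   # assumed lower bound of the band
    high: float = 0.9,  # assumed upper bound of the band
) -> List[Tuple[Example, float]]:
    """Keep examples the student sometimes, but not always, gets right."""
    kept = []
    for ex in examples:
        rate = pass_rate(sample(ex.prompt, k), ex.target)
        if low <= rate <= high:
            kept.append((ex, rate))
    return kept
```

Examples the student always solves teach it nothing new, and examples it never solves may be too far out of reach, so the band keeps the middle.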
Why (Senior) Engineers Struggle to Build AI Agents
Philipp Schmid · AI Engineer · 10 min
Philipp Schmid discusses the mental shift required for senior engineers to build effective AI agents. He advocates for moving from deterministic control to outcome verification.
- Engineers must shift from being traffic controllers to dispatchers who define goals.
- Text and context should be treated as the primary state instead of rigid data structures.
- The “build to delete” mindset assumes better models will eventually replace custom logic.
- Errors should be treated as inputs to preserve progress in tasks lasting up to 15 minutes.
- Traditional unit tests are being replaced by evaluations to measure reliability over time.
- Agents fail when they only see function schemas without the implicit context humans possess.
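The "errors as inputs" idea can be sketched as a tool-call step that never aborts the task: a failure is appended to the running text context as an observation, next to successful results, so the agent can react to it on the next turn. The `run_step` function, the `Tool` type, and the message format below are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]  # hypothetical tool signature: one string in, one out


def run_step(tools: Dict[str, Tool], name: str, arg: str, context: List[str]) -> None:
    """Run one tool call and record the outcome, success or failure, in context."""
    try:
        result = tools[name](arg)
        context.append(f"tool={name} ok: {result}")
    except Exception as exc:
        # Deliberately broad: any failure (including an unknown tool name)
        # becomes an observation instead of ending the task.
        context.append(f"tool={name} error: {type(exc).__name__}: {exc}")
```

Because the context is plain text, earlier progress survives a failed call, which matters most on long-running tasks where restarting from scratch is expensive.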
References
People: Nick Nisi · Ben Kunkle · Philipp Schmid · Ryan Leuppolo
Tools: Case Harness · Zeta2 · WorkOS · Zed · Google DeepMind