Saturday, May 30, 2026

Stop prompting and start architecting for agentic failure.

AI Agents · Agentic Harness · Model Distillation · Zeta2 · Code Edit Prediction · Software Engineering · Reliability · Evals

May 30 · 3 videos

Three experts said it today.

None of them coordinated.

That's the signal.

Nick Nisi deleted 9,447 lines of agent skills.

Accuracy jumped from 77% to 97%.

Zed trained Zeta2 using 100,000 production examples.

Philipp Schmid says engineers must stop being traffic controllers.

The era of agentic harnesses has begun.

“I'm the bottleneck: I haven't written a line of code myself in probably eight months.”

How I deleted 95% of my agent skills and got better results

Nick Nisi · AI Engineer · 17 min

Nick Nisi explains how WorkOS uses a TypeScript-based harness called Case to enforce reliability. He argues that rigid architectural gates are superior to long-winded prompts.

Deleting 95% of documentation-based skills improved accuracy from 77% to 97%.
The Case harness uses a state machine to manage implementer, verifier, and reviewer agents.
Agents are prevented from faking test results by requiring SHA-256 hashes of actual outputs.
Providing 10,000 lines of documentation bloated eval times from 6 minutes to 68 minutes.
Success is measured by delta scores and pass rates rather than subjective prompt quality.
Treat every agent failure as a bug in the harness rather than a failure of the model.
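
The gating ideas in the list above can be sketched in a few lines. This is a minimal illustration, not the Case harness itself: Case is described as TypeScript-based, and the stage names, the `Harness` class, and the rule that a claimed hash must match a hash the harness recorded from the real test output are all assumptions made for this example.

```python
import hashlib
from enum import Enum


class Stage(Enum):
    IMPLEMENT = "implement"
    VERIFY = "verify"
    REVIEW = "review"
    DONE = "done"
    FAILED = "failed"


# Hypothetical transition table: any move not listed here is rejected.
TRANSITIONS = {
    Stage.IMPLEMENT: {Stage.VERIFY},
    Stage.VERIFY: {Stage.REVIEW, Stage.FAILED},
    Stage.REVIEW: {Stage.DONE, Stage.FAILED},
}


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class Harness:
    """Toy state machine that gates implementer, verifier, and reviewer steps."""

    def __init__(self) -> None:
        self.stage = Stage.IMPLEMENT
        self.recorded_hash = None

    def _advance(self, target: Stage) -> None:
        if target not in TRANSITIONS.get(self.stage, set()):
            raise ValueError(
                f"illegal transition {self.stage.value} -> {target.value}"
            )
        self.stage = target

    def record_test_run(self, actual_output: str) -> None:
        # Assumption: the harness, not the agent, captures the real test output.
        self.recorded_hash = sha256_hex(actual_output)
        self._advance(Stage.VERIFY)

    def verify(self, claimed_hash: str) -> bool:
        # The agent's claim only counts if it matches the recorded hash.
        ok = claimed_hash == self.recorded_hash
        self._advance(Stage.REVIEW if ok else Stage.FAILED)
        return ok

    def review(self, approved: bool) -> None:
        self._advance(Stage.DONE if approved else Stage.FAILED)


honest = Harness()
honest.record_test_run("3 passed, 0 failed")
honest.verify(sha256_hex("3 passed, 0 failed"))
honest.review(approved=True)

cheater = Harness()
cheater.record_test_run("1 passed, 2 failed")
cheater_ok = cheater.verify(sha256_hex("3 passed, 0 failed"))
```

In this sketch a fabricated result fails at the verify gate regardless of how the prompt was worded, which is the sense in which a failure becomes a harness bug to fix rather than a prompt to lengthen.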

How We Built Zeta2: Training an Edit Prediction Model in Production

Ben Kunkle · AI Engineer · 10 min

Ben Kunkle describes the production pipeline for Zeta2, a specialized model for real-time code edits. The focus is on high-efficiency distillation and data validation.

Zeta2 was trained on 100,000 high-quality examples processed through a teacher model.
A repair step corrects common failure modes like ignoring boundaries or undoing user keystrokes.
The settlement heuristic measures predictions against code after a 10-second user pause.
Training focuses on the Goldilocks zone where the student model almost gets the answer right.
Zed uses 50 parallel predictions per example to identify the most valuable training data.
Zed samples 15% of production traffic to validate model performance instead of relying on offline evals.
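
One way to picture the Goldilocks selection described above is as a filter on how often the student model already gets an example right. The sketch below is an assumption-heavy illustration, not Zed's pipeline: the `Example` type, the exact-match scoring, and the `low` and `high` thresholds are all hypothetical, and the talk summary does not say how the 50 predictions are actually scored.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    example_id: str
    target: str  # the edit accepted as correct (e.g. from the teacher model)
    predictions: list = field(default_factory=list)  # student samples


def match_rate(example: Example) -> float:
    """Fraction of student predictions that exactly match the target edit."""
    if not example.predictions:
        return 0.0
    hits = sum(1 for p in example.predictions if p == example.target)
    return hits / len(example.predictions)


def goldilocks(examples, low=0.1, high=0.9):
    """Keep examples the student solves sometimes, but not reliably.

    low and high are illustrative thresholds, not values from the source.
    """
    return [ex for ex in examples if low <= match_rate(ex) <= high]


# 50 sampled predictions per example, mirroring the number in the summary.
too_easy = Example("easy", "x = 1", ["x = 1"] * 50)
too_hard = Example("hard", "x = 1", ["y = 2"] * 50)
near_miss = Example("near", "x = 1", ["x = 1"] * 20 + ["x = 2"] * 30)

selected = goldilocks([too_easy, too_hard, near_miss])
```

Examples the student always solves teach it nothing new, and examples it never solves may be too far out of reach, so only the middle band is kept for training in this sketch.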

Why (Senior) Engineers Struggle to Build AI Agents

Philipp Schmid · AI Engineer · 10 min

Philipp Schmid discusses the mental shift required for senior engineers to build effective AI agents. He advocates for moving from deterministic control to outcome verification.

Engineers must shift from being traffic controllers to dispatchers who define goals.
Text and context should be treated as the primary state instead of rigid data structures.
The “build to delete” mindset assumes better models will eventually replace custom logic.
Errors should be treated as inputs to preserve progress in tasks lasting up to 15 minutes.
Traditional unit tests are being replaced by evaluations to measure reliability over time.
Agents fail when they only see function schemas without the implicit context humans possess.
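
The "errors as inputs" point above can be sketched as an agent loop that appends a failure to the context instead of aborting the task. Everything here is hypothetical: `run_agent`, the action dictionary shape, and `scripted_model` (a deterministic stand-in for a real model call) are invented for illustration and do not come from the talk.

```python
def run_agent(choose_action, tools, goal, max_steps=5):
    """Loop until the model returns a final answer.

    Tool errors are appended to the text context so earlier progress is kept.
    """
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = choose_action(context)
        if action["type"] == "final":
            return action["answer"], context
        try:
            result = tools[action["tool"]](**action["args"])
            context.append(f"result: {result}")
        except Exception as exc:
            # Do not abort: the error becomes an observation the model can use.
            context.append(f"error: {type(exc).__name__}: {exc}")
    return None, context


def divide(a, b):
    return a / b


def scripted_model(context):
    """Stand-in for a model: retries with new arguments after seeing an error."""
    last = context[-1]
    if last.startswith("goal:"):
        return {"type": "tool", "tool": "divide", "args": {"a": 10, "b": 0}}
    if last.startswith("error:"):
        return {"type": "tool", "tool": "divide", "args": {"a": 10, "b": 2}}
    return {"type": "final", "answer": last.split(": ", 1)[1]}


answer, trace = run_agent(scripted_model, {"divide": divide}, "compute 10 / 2")
```

The context here is a plain list of text lines, in keeping with the idea of treating text as the primary state; a deterministic pipeline would instead have raised on the first division and lost the run.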

References

People: Nick Nisi · Ben Kunkle · Philipp Schmid · Ryan Leuppolo

Tools: Case Harness · Zeta2 · WorkOS · Zed · Google DeepMind