Agents

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

The paper correctly prioritizes out-of-distribution rank transfer, but its proposed score conflates predictive validity with risk-adjusted utility.

From Static Templates to Dynamic Runtime Graphs

The paper offers a useful template/realized-graph/trace distinction and reporting protocol, but lacks a reproducible survey methodology.

Behavior as the Credential: Why Static Auth Fails AI Agents

Traditional authorization hands a token to an AI agent and hopes for the best. But when an agent is hijacked via prompt injection, the static token offers zero defense. This post argues that behavior must become the credential—a real-time enforcement mechanism that treats observability as authorization, catching semantic anomalies that RBAC and ABAC simply cannot detect.

I Shipped Code I Don't Understand — My Professional Coming Out Moment

As a Senior IC, I carry a massive expectation to be the guardian of the code. But I’ve almost entirely stopped looking at or reviewing code. My workflow has completely flipped, from writing and reviewing syntax to writing specs, running agents, and validating the results. This is my professional coming out moment.

Context Plumbing: From Request-Response to Event Sourcing for Agents

We are watching the AI industry commit the original sin of the web all over again. For the last two years, we’ve obsessed over Context Engineering, treating Agents like static, PHP-era websites. When a user asks a question, the system performs a “database fetch” on demand, pulling context just in time to generate an answer. We haven’t reinvented software; we’ve just replaced the mouse click with a prompt, keeping the same brittle, pull-based architecture underneath....