Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
The paper correctly prioritizes out-of-distribution rank transfer, but its proposed score conflates predictive validity with risk-adjusted utility.
The paper offers a useful template/realized-graph/trace distinction and reporting protocol, but lacks a reproducible survey methodology.
DataComp-LM establishes a controlled benchmark for dataset research and finds that aggressive model-based quality filtering is more effective than conventional source mixing.
The Chinchilla paper shows that model parameters and training tokens should scale in approximately equal proportions, enabling smaller, better-trained models.
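To make the equal-proportions claim concrete, here is a minimal sketch. It rests on two assumptions that the summary above does not state: the common approximation that training compute is about 6 × parameters × tokens, and the widely quoted rule of thumb of roughly 20 training tokens per parameter. The function name `compute_optimal_split` and its default value are illustrative only.

```python
import math


def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, training tokens).

    Assumes C ~= 6 * N * D and a fixed ratio D = tokens_per_param * N.
    Both are rules of thumb, not exact fits from the paper.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Substitute D = r * N into C = 6 * N * D and solve for N.
    params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    base_params, base_tokens = compute_optimal_split(5.88e23)
    big_params, big_tokens = compute_optimal_split(4 * 5.88e23)
    print(f"base: {base_params:.3g} params, {base_tokens:.3g} tokens")
    print(f"4x compute: {big_params / base_params:.2f}x params, "
          f"{big_tokens / base_tokens:.2f}x tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what "approximately equal proportions" means in practice.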
As a Senior IC, I carry this massive expectation to be the guardian of the code. But I’ve almost entirely stopped looking at or reviewing code. My workflow has completely flipped: from writing and reviewing syntax to writing specs, running agents, and validating results. This is my professional coming-out moment.
The weighted sum approach to hybrid search is fragile and breaks in production. This article introduces the Nomination → Union → Selection architecture: a deterministic pipeline that combines dense and sparse search without magic weights, then uses a Cross-Encoder to rerank results with surgical precision. Learn how to build RAG systems that scale.
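A rough sketch of the Nomination → Union → Selection flow, as a reading of the summary above rather than the article's actual code. The two toy scorers stand in for real dense and sparse retrievers, and the reranker is a plain function standing in for a Cross-Encoder; all names here (`nominate_union_select`, `keyword_overlap`, `char_trigram_jaccard`) are hypothetical.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def keyword_overlap(query: str, text: str) -> float:
    """Toy sparse scorer: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def char_trigram_jaccard(query: str, text: str) -> float:
    """Toy stand-in for a dense scorer: Jaccard overlap of character trigrams."""
    def grams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    q, t = grams(query), grams(text)
    return len(q & t) / len(q | t) if q | t else 0.0


def top_k(query: str, docs: Dict[str, str], scorer: Scorer, k: int) -> List[str]:
    # Sort by score descending, then by id so ties resolve deterministically.
    ranked = sorted(docs, key=lambda doc_id: (-scorer(query, docs[doc_id]), doc_id))
    return ranked[:k]


def nominate_union_select(query: str, docs: Dict[str, str],
                          dense: Scorer, sparse: Scorer, reranker: Scorer,
                          k: int = 3, n: int = 2) -> List[str]:
    # Nomination: each retriever proposes its own top-k candidates.
    dense_ids = top_k(query, docs, dense, k)
    sparse_ids = top_k(query, docs, sparse, k)
    # Union: merge and deduplicate; no score weighting is involved.
    candidates = list(dict.fromkeys(dense_ids + sparse_ids))
    # Selection: a single reranker scores every (query, document) pair.
    reranked = sorted(candidates, key=lambda d: (-reranker(query, docs[d]), d))
    return reranked[:n]
```

The point of the design is that the retrievers' raw scores are never compared with each other. They only decide who enters the candidate pool, and the reranker alone decides the final order.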
Dense vectors are magical at capturing semantics, but they fail when you need exact matches. This article unpacks the Vocabulary Mismatch Problem and introduces SPLADE—a neural approach that combines the precision of keyword search with the intelligence of transformers. Learn why sparse embeddings matter and how to architect hybrid search for production.
Building production RAG systems is fundamentally an ETL (Extract, Transform, Load) challenge. We explore why documents must be treated as hierarchical data structures, not string soup. Discover structure-aware splitting, metadata injection, and multi-resolution indexing strategies that improve data quality and reduce hallucinations.
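As a small illustration of structure-aware splitting with metadata injection, the sketch below splits a Markdown document at its headings and attaches each chunk's heading breadcrumb. It assumes ATX-style `#` headings and is not taken from the article; `Chunk`, `split_by_headings`, and `with_metadata` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    path: List[str]  # heading breadcrumb, outermost first
    text: str        # body text under the innermost heading


def split_by_headings(markdown: str) -> List[Chunk]:
    chunks: List[Chunk] = []
    path: List[str] = []
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered body, if any, under the current breadcrumb.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(path=list(path), text=body))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at this level or deeper, then push the new one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


def with_metadata(chunk: Chunk) -> str:
    """Prefix the chunk text with its breadcrumb before indexing."""
    return " > ".join(chunk.path) + "\n" + chunk.text
```

Each chunk carries the section it came from, so a retrieved passage such as "Run the installer." is indexed along with the breadcrumb "Guide > Install" instead of floating free of its context.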
We are watching the AI industry commit the original sin of the web all over again. For the last two years, we’ve obsessed over Context Engineering, treating Agents like static, PHP-era websites. When a user asks a question, the system performs a “database fetch” on demand, pulling context just in time to generate an answer. We haven’t reinvented software; we’ve just replaced the mouse click with a prompt, keeping the same brittle, pull-based architecture underneath....
Everyone talks about the Neural Network, but the Tokenizer is the unsung hero of LLMs. This post explains what a Tokenizer actually does, why we use Byte Pair Encoding (BPE), and how these tokens bridge the gap between rigid integers and meaningful vector embeddings in models like GPT-4.
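For a feel of what BPE does, here is a toy trainer. It is a simplification made for illustration: it works on characters within whitespace-separated words, whereas production tokenizers typically operate on bytes and add pre-tokenization rules. The function names are hypothetical and the tie-breaking rule is an arbitrary choice made to keep the output deterministic.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: str, num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_words: Counter = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    rules = train_bpe("low low low lower lowest", num_merges=2)
    print(rules)
    print(encode("lowest", rules))
```

In a real model, each resulting symbol is then mapped to an integer ID, and that ID indexes a row of the embedding matrix.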