Why it matters
Static benchmarks made sense for static software. Agents that adapt to users, rewrite their own harnesses, and shift behavior over time break that assumption. This talk is about what evaluation looks like when the system you're measuring keeps changing underneath you. Vincent Koc traces the arc from prompt engineering
My takeaway: Static benchmarks made sense for static software. Agents that adapt to users, rewrite their own harnesses, and shift behavior over time break that assumption. This talk is about what evaluation looks like when the system you're measuring keeps changing underneath you. Vincent Koc traces the arc from prompt engineering