Why it matters
Anthropic evaluates Mythos Preview against ExploitBench, ExploitGym, and an updated smart-contract exploitation benchmark, showing a step change in models' ability to turn vulnerabilities into working exploit chains.
My takeaway: This is important for red-team planning because the benchmark focus moves beyond finding or reproducing bugs toward exploit primitives, sandbox escapes, and end-to-end unauthorized code execution. That shift raises the bar for capability evaluation and defensive release gating.