AI Engineer YouTube · June 6, 2026

Evals Are Broken, Use Them Anyway — Ara Khan, Cline

Why it matters

Cline started at 43% on Terminal Bench. The improvements that followed came not from switching to a better model, but from container CPU and memory settings, raised timeouts, and prompt engineering techniques that are specific to Anthropic model families and do not transfer to Codex or Gemini. Ara Khan's argument is that benchmark numbers are n
