Topic

Model Evaluation

Safety evaluations, system cards, preparedness, and security measurement for frontier models.

system card, evaluation, preparedness, benchmark, frontier risk
Evergreen Overview

Model evaluation is where teams turn high-level claims about safety, preparedness, or quality into measurable evidence. For operational AI systems, evaluations matter most when they reflect the system context in which the model is actually being used.

What evaluations should cover
  • Capability, misuse, and safety behavior under realistic tasks
  • System cards, preparedness reporting, and evidence for launch decisions
  • Regression testing so known failures do not quietly reappear
Where programs fall short
  • Benchmarks that do not match the deployed workflow
  • Safety claims without repeatable evidence
  • No connection between findings, mitigations, and re-testing
Who this page is for
  • Teams building evaluation pipelines
  • Leaders interpreting evidence for safe deployment
  • Security and policy teams interpreting model documentation
References

Current notes, events, and source material

These items are included because they add useful evidence, framing, implementation detail, or upcoming context for teams working in this area.

AI Engineer YouTube May 30, 2026 video

How I deleted 95% of my agent skills and got better results — Nick Nisi, WorkOS

Claude would fake running tests by touching the expected output file. Nick Nisi, DX engineer at WorkOS, fixed it by SHA-256 hashing the actual test output and verifying it cryptographically. His principle: make it easier to do the real work than to lie about it, and enforce that through code and state machines, not promp
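The hashing idea can be sketched in a few lines of Python. This is not the code from the talk: the function names, the sample output string, and the idea that the harness records the digest when it runs the tests itself are all assumptions made for illustration.

```python
import hashlib
import hmac


def digest_output(output: bytes) -> str:
    """Return the SHA-256 hex digest of raw test output."""
    return hashlib.sha256(output).hexdigest()


def verify_output(output: bytes, recorded_digest: str) -> bool:
    """Check claimed output against a digest recorded by the harness."""
    # compare_digest performs a constant-time comparison of the two digests.
    return hmac.compare_digest(digest_output(output), recorded_digest)


# Hypothetical flow: the harness runs the tests and records the digest itself.
real_output = b"3 passed, 0 failed\n"
recorded = digest_output(real_output)

# Genuine output verifies; an empty file created with `touch` does not.
assert verify_output(real_output, recorded)
assert not verify_output(b"", recorded)
```

The point of the design is that the agent cannot satisfy the check by creating a file with the right name; it has to produce bytes that hash to the value the harness recorded.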

AI Engineer YouTube May 30, 2026 video

How We Built Zeta2: Training an Edit Prediction Model in Production — Ben Kunkle, Zed

To validate settled data, Zed ran 10 frontier model predictions per example and measured Levenshtein distance to the final state. For 100,000 training examples that is a million frontier model requests, which is prohibitively expensive. The fix: Zeta 2's student model now approaches teacher quality, so they run it 50 t
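As a rough illustration of the measurement described above, the sketch below computes Levenshtein distance between each prediction and the final state and averages the result. Averaging is an assumption; the blurb does not say how Zed aggregates the distances, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def mean_distance(predictions: list[str], final_state: str) -> float:
    """Average edit distance from each prediction to the final state."""
    return sum(levenshtein(p, final_state) for p in predictions) / len(predictions)


assert levenshtein("kitten", "sitting") == 3
assert mean_distance(["abc", "abd"], "abc") == 0.5
```

A lower mean distance suggests the example's final state is one the model can reach, which is one plausible reading of "settled".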

AI Engineer YouTube May 30, 2026 video

Why (Senior) Engineers Struggle to Build AI Agents — Philipp Schmid, Google DeepMind

A `deleteItem` endpoint is obvious to the developer who built it. An agent only sees the function schema and docstring. Philipp Schmid from Google DeepMind argues this is why senior engineers struggle most: they carry years of implicit context that agents do not, and design tools assuming it. He names four other shifts

AI Engineer YouTube May 29, 2026 video

Reachy Mini: the $300 open source robot you can actually hack — Andres Marafioti, Hugging Face

Qwen3-TTS shipped at 0.8x real time: one second of audio took 1.2 seconds to generate. Andres Marafioti from Hugging Face spent two weeks fixing it. The culprits were no streaming, 500 autoregressive steps per audio packet with a CPU GPU round trip on each, and a dynamic KV cache that blocked compilation. Static KV cac

AI Engineer YouTube May 29, 2026 video

Why your agents need decision traces, not just documents — Zach Blumenfeld, Neo4j

A knowledge base tells a financial analyst agent the risk factors. A context graph tells it whether to reject or accept, because it also carries past decision traces, the reasoning behind them, and how similar cases resolved. Zach from Neo4j walks through how context graphs extend a standard RAG setup with three layers

AI Engineer YouTube May 29, 2026 video

Reverse engineering a Viking VOIP phone protocol with Claude Code — Boris Starkov, Eleven Labs

A Viking VoIP phone sat in the ElevenLabs San Francisco office for a year. Three senior engineers and ChatGPT could not get it working. Boris from ElevenLabs cracked the undocumented protocol with Claude Code in a couple of days: brute-forced all 676 possible two-letter command combinations, found 80 valid ones, then s
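The figure of 676 is simply 26 × 26. A minimal sketch of that enumeration follows; the uppercase alphabet, the `probe` callback, and the stand-in `fake_probe` are assumptions for illustration, since the blurb does not describe how commands were sent to the phone.

```python
from itertools import product
from string import ascii_uppercase
from typing import Callable, Iterable


def two_letter_commands() -> list[str]:
    """All 26 * 26 = 676 two-letter combinations (uppercase is an assumption)."""
    return ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]


def find_valid(commands: Iterable[str], probe: Callable[[str], bool]) -> list[str]:
    """Keep the commands for which the probe reports a valid response."""
    return [cmd for cmd in commands if probe(cmd)]


def fake_probe(cmd: str) -> bool:
    """Stand-in for illustration only; a real probe would talk to the device."""
    return cmd.startswith("A")


commands = two_letter_commands()
valid = find_valid(commands, fake_probe)
assert len(commands) == 676
assert len(valid) == 26
```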

AI Engineer YouTube May 28, 2026 video

How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust

Traditional observability answers one question: is the system up? Phil Hetzel from Braintrust argues that question is not the right one for agents. An individual agent trace can exceed a gigabyte. A single span can hit 20 megabytes. The data is semistructured, packed with unstructured text, and still arrives in real ti

AI Engineer YouTube May 28, 2026 video

Most Enterprise Agentic Projects Are Doomed, Here's Why — Jess Grogan-Avignon & Jack Wang, Accenture

Jess Grogan-Avignon and Jack Wang at Accenture built an agentic application in two weeks. Getting it to production took another 12 months. Not because the code was wrong. Because the infrastructure team, the security team, the AI gateway team, the data governance team, and the application team all had to align before a

AI Engineer YouTube May 28, 2026 video

Context Graphs for Explainable, Decision-Aware AI Agents — Andreas Kollegger & Zaid Zaim, Neo4j

Prescribing drug X is correct 99% of the time for symptom Y. For the 1% where it is fatal, statistical reasoning does not help you. Andreas Kollegger calls this reference class validation: before the agent acts, it has to know which group it is in. Context graphs give agents the why. Not just knowledge and tools but th

AI Engineer YouTube May 27, 2026 video

Comprehend First, Code Later: The AI Skill I Rely On Daily — Priscila Andre de Oliveira, Sentry

Priscila Andre de Oliveira analyzed 116 of her own Claude sessions from daily work at Sentry. 67% were comprehension. 2% were code generation. Working in a codebase with 15 years of history, around 100 PRs merged per day, and 100,000 organizations depending on it, the unlock is not generation but understanding. She bui

AI Engineer YouTube May 27, 2026 video

Why Rust is the Ideal Language for Vibe-Coding — Daniel Szoke, Sentry

TypeScript is easy for models to write because it imposes few constraints. Those same missing constraints let models introduce data races that compile, run, and only fail intermittently. A thread safety bug in Rust does not compile. The compiler names the unsound type, explains why it cannot be sent between threads, an

AI Engineer YouTube May 27, 2026 video

The maturity phases of running evals — Phil Hetzel, Braintrust

Most teams approach evals like unit tests and try to cover every possible failure. Phil Hetzel from Braintrust argues that is the wrong frame: enumerate your known failure modes, cover those specifically, and ship. The goal is a flywheel where production traces surface what is going wrong, feed back into offline experi

AI Engineer YouTube May 26, 2026 video

What the Best Agents Share — Mardu Swanepoel, Flinn AI

Harvey, Cursor, Manus, and Claude operate in completely different domains but share four patterns: focus modes that constrain the action space to improve output quality, transparent execution that surfaces tool calls and reasoning to build user trust, personalization that optimizes for speed to understanding rather tha

AI Engineer YouTube May 26, 2026 video

Stop babysitting your agents... — Brandon Walsenuk, Unblocked

Same prompt. Same agent. Same model. Without a context engine: 2.5 hours, 20.9 million tokens, multiple rounds of human correction, and code that compiled but would have broken the entire system if it shipped. With one: 25 minutes, 10.8 million tokens, and a senior engineer who gave one nitpick and approved the merge.

AI Engineer YouTube May 25, 2026 video

Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts performance by 22%. A competing lab once took a Kaggle benchmark, reran it with their own compaction settings, and published much better results. Neither number was wrong. Both were useless.

AI Engineer YouTube May 25, 2026 video

Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust

At most traditional enterprises, GenAI got handed to the ML platform team because it had AI in the name. Phil Hetzel from Braintrust argues that was the wrong move, not because data scientists lack value, but because Anthropic and OpenAI already ran the data pipeline. What is left is prompt and context engineering, dis

AI Engineer YouTube May 25, 2026 video

Bounded Autonomy: Between Free Will and Determinism — Angus J. McLean, Oliver

Angus McLean spent time building a complex agent application to generate his CV. Four letters beat it: HTML. He puts the improvement at 100x. The talk is from Oliver's AI Director, where agents generate around 4,000 creative assets a day for 200 plus brands, assets you have probably seen and had no idea were AI. The co

AI Engineer YouTube May 24, 2026 video

How Google DeepMind Runs Agents at Scale — KP Sawhney & Ian Ballantyne, Google DeepMind

Google DeepMind employees have worse token quotas than paying customers. That is not a mistake. KP Sawhney explains: customers get priority, and if an internal team spikes usage on a cluster someone monitoring 24/7 will just call and ask them to stop. This panel covers how DeepMind thinks about agents at scale from the

AI Engineer YouTube May 24, 2026 video

Scaling the Next Paradigm of Heterogeneous Intelligence — Adrian Bertagnoli, Callosum

A mixture of Qwen3-VL 8B and Kimi K2.5 beat the state of the art on VideoWebArena, outperforming the leading GPT and Gemini models by 18 and 25 percent while costing 3.7 times less and running 3 times faster. The reason it worked is that visual web navigation decomposes into subtasks that do not all need a frontier m

AI Engineer YouTube May 23, 2026 video

Your Agent Is an Infinite Canvas — RL Nabors, Dressed for Space

RL Nabors built a comic reader that renders inside Claude. Full panels, navigation, transcript mode, design matched to the original site. No browser tabs. She is reading her own web comic archive entirely through an agent, and it looks like the website. The talk is a case against chat as the permanent UI of agentic sof

AI Engineer YouTube May 23, 2026 video

The Missing Primitive for Agent Swarms — Lou Bichard, Ona

Stripe called theirs Minions. Ramp called theirs Inspect. Both are internal infrastructure for running fleets of background agents, and both teams built it from scratch. Lou Bichard's argument is that this shouldn't keep happening. The talk breaks down what agent swarm infrastructure actually needs: a runtime (largely

AI Engineer YouTube May 23, 2026 video

Prompt to Pipeline: Building with Google's Gen Media Stack — Paige & Guillaume, Google DeepMind

A public domain book, a notebook, and three gen media models. Guillaume from Google DeepMind fed The Wind in the Willows into Gemini, generated character portraits with Nano Banana, animated chapter scenes with Veo, and scored each chapter with Lyria, all live in the workshop. The full three-hour session covers more ground. Paige

AI Engineer YouTube May 22, 2026 video

Fast Models Need Slow Developers — Sarah Chieng, Cerebras

Codex Spark, a model Cerebras built with OpenAI, generates code at 1,200 tokens per second. The Sonnet and Opus families run at 40 to 60. At that 20x difference, a context window that used to take ten minutes to fill now takes 30 seconds, and every habit built around slow generation starts producing technical debt at a

AI Engineer YouTube May 22, 2026 video

Lobster Trap: OpenClaw in Containers from Local to K8s and Back — Sally Ann O'Malley, Red Hat

Sharing a good agent setup usually means handing someone a pile of markdown, config files, and YAML and hoping they reproduce what you have. The answer in this demo is a container image: spin up a sub agent in two seconds from a Podman command, flip a flag for Kubernetes, and your personal setup becomes the team baseli

AI Engineer YouTube May 22, 2026 video

AI on Android: Ask me Anything — Florina Muntenescu & Oli Gaymond, Google DeepMind

Gemini Nano on device weighs three to four gigabytes. Shipping that per app is not realistic, which is why AICore puts it in the system once and every app shares it. Foreground apps get top priority. Background batch jobs queue and run overnight on charge. The developer never manages any of that. The tradeoff is reach

AI Engineer YouTube May 21, 2026 video

Cooking with Agents in VS Code — Liam Hampton, Microsoft

One codebase, three problems, three agents running at the same time. Liam Hampton from Microsoft demos the full loop in VS Code: a local agent with Claude Opus writing and fixing unit tests with him in the loop, a background agent using a git work tree to build a front end from a GitHub issue without him touching it, a

AI Engineer YouTube May 21, 2026 video

Scaling Agents on Kubernetes with acpx and ACP — Onur Solmaz, OpenClaw

OpenClaw receives 300 to 500 pull requests per day. Most arrive AI generated, most are not mergeable, and every one of them is signal about something broken in the codebase. Onur Solmaz built acpx to process them without him in the loop. acpx is a headless CLI for the Agent Client Protocol. It replaces PTY scraping wit

AI Engineer YouTube May 21, 2026 video

Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face

An agent-written RMSNorm kernel hit 1.88x speedups on H100s. A fine-tuned Qwen3 0.6B hit 35% on LiveCodeBench. Neither result required a systems engineer. Just coding agents with the right skills loaded. Ben Burtenshaw from Hugging Face walks through three levels: using Claude Code interactively to write and benchmark C

AI Engineer YouTube May 20, 2026 video

Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind

Draw arrows on a map and ask Gemini to generate a picture of what you see. It produces the Golden Gate Bridge. Not because it matched pixels, but because the image generation model is built on top of Gemini's world understanding and knows what those arrows are pointing at. Patrick Löber walks through the full any-to-an

AI Engineer YouTube May 20, 2026 video

Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse

Without a skill, Claude Code adds Langfuse using stale pre-training context, ships broken instrumentation, then catches the failure and fetches current docs to fix it. The resulting trace captures two LLM calls with no visibility into what the agent actually did. Marc Klingen covers the six learnings from building a sk

AI Engineer YouTube May 20, 2026 video

From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google

FunctionGemma ships at 270 million parameters and processes nearly 2,000 tokens per second prefill on a Pixel 7. Out of the box, on a fixed set of app intents, it hits 46% accuracy. Fine-tuned on a synthetically generated dataset, it clears 90% on eight of ten functions. Cormac Brick covers the two options developers

AI Engineer YouTube May 19, 2026 video

What Breaks When You Build AI Under Sovereignty Constraints - Bilge Yücel, deepset GmbH

If you send EU citizen data to an embedding API hosted in Virginia, you have already violated GDPR. That is one hidden assumption. Most production AI systems have dozens more, baked into the architecture long before anyone asked whether the system was sovereign. Bilge Yücel walks through the four sovereignty pillars (d

AI Engineer YouTube May 19, 2026 video

Don't Build Slop (4 Levels of AI Agent Maturity) - Ara Khan, Cline

The prompt for GPT-5.3 is one-third the size of the one written for GPT-5. Frontier models are so capable that longer system prompts cause sensory overload and degrade performance. The rule Ara Khan keeps returning to: every single thing you add to an agent risks making it worse. The talk breaks agent-building into fou

AI Engineer YouTube May 19, 2026 video

Personalization in the Era of LLMs - Shivam Verma, Spotify

Spotify represents Ariana Grande and Bruno Mars as sequences of six tokens. The first two are shared because both are pop artists. The remaining tokens diverge to capture what makes each distinct. That is a Semantic ID, and it is how Spotify teaches open-weight LLMs to reason over a catalog of 100 million tracks the sa

AI Engineer YouTube May 18, 2026 video

Rewiring the State — Eoin Mulgrew, 10 Downing Street

The Cabinet Office was about to spend one and a half million pounds on an outside law firm to analyze the UK statute book. One engineer embedded with the in-house legal team for two weeks instead. The tool now lives with that team and can be run whenever they want. Eoin Mulgrew from the Number 10 data science team uses

AI Engineer YouTube May 18, 2026 video

Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind

Guillaume Vernade from Google DeepMind takes a public domain book and runs it through the full gen media stack live. Gemini reads the whole text and writes image prompts for each character and chapter. Imagen generates the portraits. Veo animates them into video clips using those images as first frames. Lyria composes

AI Engineer YouTube May 18, 2026 video

Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic

Why self-evaluation is a trap and adversarial evaluator agents work better; why context compaction doesn't cure coherence drift but structured handoffs do; how to decompose work into testable sprint contracts; how to grade subjective output with rubrics an LLM can actually apply; and how to read traces as your primary

AI Engineer YouTube May 17, 2026 video

Why Your AI UX Is Broken (and It's Not the Model's Fault) — Mike Christensen, Ably

SSE ties a response stream to a single connection. The user refreshes the page, walks out of WiFi range, or opens a second tab and the in-progress response is gone. Abort and resume are mutually exclusive for the same reason: the only signal a client can send over a one-way pipe is closing it, so the agent cannot tell

AI Engineer YouTube May 16, 2026 video

Beyond Code Coverage: Functionality Testing with Playwright — Marlene Mhangami, Microsoft

When an LLM writes your tests, it tends to write tests that confirm what the code does rather than tests that verify what the user experiences. Your test suite goes green. The app still breaks in ways none of those tests would catch. Marlene Mhangami from Microsoft makes the case for flipping the order: get the agent t

AI Engineer YouTube May 16, 2026 video

How to Leverage Domain Expertise — Chris Lovejoy, Notius Labs

Granola's first employee was a writer who still reviews meeting note outputs and tweaks prompts directly. Chris Lovejoy says that is not a gap in the org chart. There is no objectively perfect meeting note, so you need someone with taste doing both the assessment and the improvement. He frames this as one of three patt

AI Engineer YouTube May 16, 2026 video

Connecting the Dots with Context Graphs — Stephen Chin, Neo4j

Ask a vector RAG system about a patient's emphysema care plan and it returns generic advice: respiratory therapy, deep breathing. Give it a graph grounded in that patient's actual history and it knows they smoke, knows they've had an operation, and gives recommendations that reflect it. The information existed in both

AI Engineer YouTube May 15, 2026 video

Agents Don't Do Standups: Building the Post-Engineer Engineering Org — Mike Spitz, PFF

PFF ran a three-month case study: two engineers against a team of ten, same codebase, same customers. The two shipped five times a day. The ten shipped once every five days. Output measured by ticket complexity came out at 10x. Customer satisfaction went up, not down. Mike Spitz, their CTO, started with one reframe: st

AI Engineer YouTube May 15, 2026 video

Combine Skills and MCP to Close the Context Gap — Pedro Rodrigues, Supabase

Agents working with Postgres will confidently create a view over a table with row-level security enabled and silently bypass that security in the process. Not because they can't reason. Because they don't know about the security_invoker flag, and nobody told them. Pedro Rodrigues from Supabase ran this exact test: same

AI Engineer YouTube May 15, 2026 video

How Building with AI Can Double the Throughput of Your Engineering Team — Brian Scanlan, Intercom

Intercom hit 2x engineering throughput in under a year. Not by prompting better. By treating Claude Code like a new hire: onboarding it to a Rails monolith built over 15 years, writing skills for every recurring task, connecting it to production systems and internal tooling, and going all in on one platform instead of

AI Engineer YouTube May 14, 2026 video

Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize

Most agents get tested by running a few queries and checking if it looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis age

AI Engineer YouTube May 14, 2026 video

Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft

Agents drift. Models change, prompts get tweaked, edge cases accumulate, and the gap between what your agent does and what you need it to do widens without you noticing. Amy and Nitya walk through Microsoft Foundry's observability stack: tracing built on OpenTelemetry, built-in evaluators for quality, safety, and agent

AI Engineer YouTube May 14, 2026 video

Make your own event-sourced agent harness using stream processors — Jonas Templestein, Iterate

The abstraction is three things: state, a synchronous reducer that derives state from events, and an after-append hook for side effects. The split matters: when your program restarts after 100 events, you want to catch up state without replaying LLM requests. Everything that happens (streaming chunks, tool calls, error
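The three-part abstraction can be sketched as follows. This is an illustrative reading of the description, not the speaker's implementation: the `Event`, `State`, and `Harness` types, the event kinds, and the `catch_up` method name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    kind: str
    payload: str


@dataclass(frozen=True)
class State:
    messages: tuple[str, ...] = ()
    pending_tool_calls: int = 0


def reduce_event(state: State, event: Event) -> State:
    """Pure, synchronous reducer: derives the next state from one event."""
    if event.kind == "message":
        return State(state.messages + (event.payload,), state.pending_tool_calls)
    if event.kind == "tool_call":
        return State(state.messages, state.pending_tool_calls + 1)
    if event.kind == "tool_result":
        return State(state.messages, max(0, state.pending_tool_calls - 1))
    return state  # unknown event kinds leave state unchanged


@dataclass
class Harness:
    after_append: Callable[[State, Event], None]
    events: list[Event] = field(default_factory=list)
    state: State = State()

    def append(self, event: Event) -> None:
        """Record a new event, update state, then run side effects."""
        self.events.append(event)
        self.state = reduce_event(self.state, event)
        self.after_append(self.state, event)

    def catch_up(self, log: list[Event]) -> None:
        """Rebuild state from a stored log without firing side effects."""
        for event in log:
            self.events.append(event)
            self.state = reduce_event(self.state, event)


side_effects: list[str] = []
live = Harness(after_append=lambda s, e: side_effects.append(e.kind))
live.append(Event("message", "hi"))
live.append(Event("tool_call", "search"))

# After a restart, replaying the log restores state with no new side effects.
restarted = Harness(after_append=lambda s, e: side_effects.append(e.kind))
restarted.catch_up(live.events)
assert restarted.state == live.state
assert side_effects == ["message", "tool_call"]
```

Keeping the reducer free of I/O is what makes the restart cheap: side effects such as LLM requests live only in the hook, so replaying events never repeats them.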

AI Engineer YouTube May 13, 2026 video

Your Agent Can Now Train Models — Merve Noyan, Hugging Face

Open-source models have caught up. GLM 5.1 is leading the Artificial Analysis intelligence index over closed models, and the gap is closing fast with each release cycle. The practical upside beyond benchmarks: full weight access means you can quantize, fine-tune, and deploy to edge devices or browsers without data leav

AI Engineer YouTube May 13, 2026 video

Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take

LLMs can explain things clearly but can't play chess reliably. Take Take Take (Magnus Carlsen's app) solved this by separating concerns: Stockfish handles position evaluation, tactical and positional detectors extract concepts like forks, pins, and structural weaknesses, and the LLM's only job is translating those stru

AI Engineer YouTube May 13, 2026 video

CI/CD Is Dead, Agents Need Continuous Compute and Computers — Hugo Santos and Madison Faulkner

Traditional CI/CD was built for humans pushing one or two diffs a week. Scale to thousands of autonomous agents opening PRs continuously and you get runner saturation, cold Docker builds on every branch, cache thrash, and a merge queue that starts behaving like a serialized database lock where time-to-commit becomes th

AI Engineer YouTube April 29, 2026 video

Build & deploy AI-powered apps — Paige Bailey, Google DeepMind

Got a massive idea but stuck in the "just talking about it" phase? This session cuts the fluff and dives straight into how to build and prototype at lightning speed using AI Studio Build and Antigravity for free. It breaks down Google DeepMind's AI tech stack so viewers know exactly which tools to use, when to reach fo

AI Engineer YouTube April 29, 2026 video

Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

A new class of small models is emerging with the ability to reliably follow instructions and call tools while running on-device under 1 GB of memory. In this talk, we'll break down how to post-train frontier small models using the LFM2.5 recipe: on-policy preference alignment, agentic reinforcement learning, and curric

AI Engineer YouTube April 28, 2026 video

Building your own software factory — Eric Zakariasson, Cursor

Most of us are pair-programming with one agent and stopping there. There's a lot more on the table. This workshop is about going from one agent to many. We'll start with codebase setup, the foundational work that makes agents effective on their own. Then we'll scale up to running agents in parallel, kicking off async w

AI Engineer YouTube April 28, 2026 video

Why building eval platforms is hard — Phil Hetzel, Braintrust

An eval platform is not just a test runner. You are building shared definitions of "good," reliable data pipelines, labelling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make eva

AI Engineer YouTube April 28, 2026 video

One Login to Rule Them All: Cross-App Access for MCP — Garrett Galow, WorkOS

Connecting a coding agent to multiple services often means facing a dozen OAuth consent screens, a dozen token lifecycles, and a dozen chances for something to break. Despite having Single Sign-On, users still find themselves signing in repeatedly. This talk explores how Cross-App Access leverages a three-way trust bet

AI Engineer YouTube April 27, 2026 video

Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind

Open models are getting smaller, faster, and far more capable. In this talk, Cassidy Hardin walks through the latest advances in the Gemma family, with a focus on Gemma 4 and what it enables for developers building on-device and open-weight AI systems. She covers the architecture behind Gemma’s dense, effective, and mi

AI Engineer YouTube April 27, 2026 video

Lessons from Scaling GitHub's Remote MCP Server — Sam Morrow, GitHub

GitHub operates one of the most heavily-utilised MCP servers in the ecosystem, with over 4 million downloads of the stdio server alone. Discover the architectural decisions, technical challenges and lessons learned while building and scaling a remote MCP server on production infrastructure. The session walks through th

AI Engineer YouTube April 27, 2026 video

Bringing MCPs to the Enterprise — Karan Sampath, Anthropic

MCPs are often flaky, face multiple security vulnerabilities, and are generally hard to scale. Most enterprises struggle to use more than single digit numbers of MCPs due to issues with security, observability, and access control. In this talk, we'll explore the approaches and learnings we at Anthropic have been taking

AI Engineer YouTube April 26, 2026 video

Collaborative AI Engineering — Maggie Appleton, GitHub Next

Agentic engineering so far has been a solo story: one developer and a dozen agents moving at warp speed. But speed without thoughtful planning and team alignment is just wasting tokens. When everyone on a team is directing agents alone in their personal CLI tools with no shared context, you get duplicate work, conflict

AI Engineer YouTube April 24, 2026 video

Full Walkthrough: Workflow for AI Coding from Planning to Production — Matt Pocock (@mattpocockuk )

A hands-on workshop covering the full lifecycle of AI-assisted development, from turning ambiguous requirements into agent-ready plans to running autonomous coding agents that ship production features. You'll learn to stress-test vague briefs into structured PRDs, slice work into thin "tracer bullet" vertical slices, a

AI Explained YouTube April 24, 2026 video

GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies

GPT 5.5 full analysis, plus DeepSeek V4 paper highlights, comparisons with Mythos, a vibe-coded game w/ GPT Image 2, and 50 data-points you wouldn’t get from just reading the headlines.

AI Engineer YouTube April 21, 2026 video

AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!

April 21, 2026 (all times in EST)
  • 9:00am - Welcome to Day 2
  • 9:10am - David House, G2i: Transforming Programming Mindsets: Case Studies in Agentic Coding Adoption
  • 9:35am - Sarah Chieng, Cerebras: Help! We're DEEP in (latency) Debt
  • 10:00am - Lech Kalinowski, CallStack: Ambient Generative AI: Deploying Latent Dif

AI Engineer YouTube April 20, 2026 video

AIE Miami Keynote & Talks ft. OpenCode. Google Deepmind, OpenAI, and more!

April 20, 2026 (all times in EST)
  • 9:00am - Welcome to AI Engineer Miami
  • 9:10am - Gabe Greenberg, G2i: Opening Remarks
  • 9:15am - Dax Raad, OpenCode: Keynote
  • 9:40am - Dexter Horthy, HumanLayer: Everything We got Wrong About RPI
  • 10:05am - Max Stoiber, OpenAI: Coming Soon
  • 10:30am - Morning Break
  • 11:00am - B

Anthropic Frontier Red Team April 7, 2026 news

Assessing Claude Mythos Preview’s cybersecurity capabilities

Claude Mythos Preview is a new general-purpose language model that is strikingly capable at computer security tasks. This post provides technical details for researchers and practitioners who want to understand exactly how we have been testing this model, and what we have found over the past month. We hope this will sh