Model Evaluation
Safety evaluations, system cards, preparedness, and security measurement for frontier models.
Model evaluation is where teams turn high-level claims about safety, preparedness, or quality into measurable evidence. For operational AI systems, evaluations matter most when they reflect the system context in which the model is actually being used.
- Capability, misuse, and safety behavior under realistic tasks
- System cards, preparedness reporting, and evidence for launch decisions
- Regression testing so known failures do not quietly reappear
- Benchmarks that do not match the deployed workflow
- Safety claims without repeatable evidence
- No connection between findings, mitigations, and re-testing
- Teams building evaluation pipelines
- Leaders interpreting evidence for safe deployment
- Security and policy teams interpreting model documentation
Current notes, events, and source material
These items are included because they add useful evidence, framing, implementation detail, or upcoming context for teams working in this area.
NeurIPS 2026
NeurIPS 2026 is the fortieth annual Conference on Neural Information Processing Systems, with the primary dates listed for Sydney, Australia, and additional satellite locations in Atlanta and Paris.
OpenAI DevDay 2026
OpenAI DevDay 2026 is scheduled for September 29 in San Francisco and is OpenAI’s primary developer event for platform updates.
ICML 2026
ICML 2026 takes place at COEX in Seoul, South Korea, with tutorials, main conference sessions, and workshops covering core machine learning research.
Data + AI Summit 2026
Data + AI Summit 2026 is Databricks’ global data and AI conference in San Francisco and online, with 800+ sessions across data engineering, analytics, ML, governance, and agent applications.
How I deleted 95% of my agent skills and got better results — Nick Nisi, WorkOS
Claude would fake running tests by touching the expected output file. Nick Nisi, DX engineer at WorkOS, fixed it by SHA-256 hashing the actual test output and verifying it cryptographically. His principle: make it easier to do the real work than to lie about it, and enforce that through code and state machines, not promp
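As a rough illustration of the hashing idea in this item, the sketch below has the harness run the test command itself and hash the bytes the command actually printed, so creating or touching an output file cannot satisfy the check. The helper names `run_and_hash` and `verify_claim` are hypothetical, and the talk's real implementation is not shown in this summary.

```python
import hashlib
import subprocess
import sys


def run_and_hash(cmd):
    """Run cmd and return (exit_code, SHA-256 hex digest of stdout + stderr)."""
    result = subprocess.run(cmd, capture_output=True)
    digest = hashlib.sha256(result.stdout + result.stderr).hexdigest()
    return result.returncode, digest


def verify_claim(cmd, claimed_digest):
    """Accept a 'tests passed' claim only if a fresh run exits 0 and
    its output hashes to the digest the agent reported."""
    code, digest = run_and_hash(cmd)
    return code == 0 and digest == claimed_digest


if __name__ == "__main__":
    # Stand-in for a real test command; any deterministic command works here.
    cmd = [sys.executable, "-c", "print('2 passed')"]
    _, digest = run_and_hash(cmd)
    print(verify_claim(cmd, digest))    # True: the digest matches a real run
    print(verify_claim(cmd, "0" * 64))  # False: a made-up digest is rejected
```

This only works when the test output is deterministic; timestamps or durations in the output would need to be stripped before hashing.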
How We Built Zeta2: Training an Edit Prediction Model in Production — Ben Kunkle, Zed
To validate settled data, Zed ran 10 frontier model predictions per example and measured Levenshtein distance to the final state. For 100,000 training examples that is a million frontier model requests, which is prohibitively expensive. The fix: Zeta 2's student model now approaches teacher quality, so they run it 50 t
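The measurement described in this item can be sketched as follows, assuming plain character-level Levenshtein distance and a simple mean over predictions. The function names and the aggregation are assumptions made for illustration; the summary does not say how Zed combines the distances.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_distance(predictions, final_state):
    """Average edit distance from each predicted edit to the settled final state."""
    return sum(levenshtein(p, final_state) for p in predictions) / len(predictions)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))        # 3
    print(mean_distance(["abc", "abd"], "abc"))    # 0.5
    # Cost arithmetic from the item: 10 predictions for each of 100,000 examples.
    print(10 * 100_000)                            # 1000000
```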
Why (Senior) Engineers Struggle to Build AI Agents — Philipp Schmid, Google DeepMind
A `deleteItem` endpoint is obvious to the developer who built it. An agent only sees the function schema and docstring. Philipp Schmid from Google DeepMind argues this is why senior engineers struggle most: they carry years of implicit context that agents do not, and design tools assuming it. He names four other shifts
How Braintrust turns customer requests into code with Codex
How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.
A shared playbook for trustworthy third party evaluations
OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.
Reachy Mini: the $300 open source robot you can actually hack — Andres Marafioti, Hugging Face
Qwen3-TTS shipped at 0.8x real time: one second of audio took 1.2 seconds to generate. Andres Marafioti from Hugging Face spent two weeks fixing it. The culprits were no streaming, 500 autoregressive steps per audio packet with a CPU-GPU round trip on each, and a dynamic KV cache that blocked compilation. Static KV cac
Boston Children’s uses AI to unlock new diagnoses
Boston Children’s Hospital uses OpenAI technology to improve patient care, reduce operational burden, and help diagnose more than 40 rare disease cases.
Strengthening societal resilience with Rosalind Biodefense
OpenAI launches Rosalind Biodefense, expanding trusted access to GPT-Rosalind for vetted developers and U.S. government partners advancing biodefense, public health, and pandemic preparedness through frontier AI.
Cloud CISO Perspectives: How to build an AI-ready security program for the public sector
From industrial control systems to decades-old municipal databases, here’s our CISO guidance to prep AI-ready security programs for the public sector.
New Claude Opus 4.8: 15 Things You May’ve Missed
The ‘best’ generally available AI model just dropped, but there is plenty I bet you missed about what it is, how it performs, and what the release tells us. 15 highlights from the 244-page system card, plus private testing, leader interview and more.
Why your agents need decision traces, not just documents — Zach Blumenfeld, Neo4j
A knowledge base tells a financial analyst agent the risk factors. A context graph tells it whether to reject or accept, because it also carries past decision traces, the reasoning behind them, and how similar cases resolved. Zach from Neo4j walks through how context graphs extend a standard RAG setup with three layers
Reverse engineering a Viking VOIP phone protocol with Claude Code — Boris Starkov, Eleven Labs
A Viking VoIP phone sat in the ElevenLabs San Francisco office for a year. Three senior engineers and ChatGPT could not get it working. Boris from ElevenLabs cracked the undocumented protocol with Claude Code in a couple of days: brute-forced all 676 possible two-letter command combinations, found 80 valid ones, then s
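The 676 figure is 26 × 26, the number of two-letter combinations. A minimal sketch of that enumeration follows; it assumes uppercase ASCII letters and a hypothetical `send` callable that reports whether the device accepted a command, neither of which is specified in this summary.

```python
from itertools import product
from string import ascii_uppercase


def two_letter_commands():
    """Every two-letter combination: 26 * 26 = 676 candidates."""
    return ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]


def probe_all(send):
    """Try each candidate and keep those the device accepts.

    `send` is a hypothetical callable taking a command string and
    returning True when the device responds as if the command is valid.
    """
    return [cmd for cmd in two_letter_commands() if send(cmd)]


if __name__ == "__main__":
    print(len(two_letter_commands()))  # 676

    # Fake device for demonstration only: accepts commands starting with "A".
    def fake_send(cmd):
        return cmd.startswith("A")

    print(len(probe_all(fake_send)))   # 26
```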
How Endava builds an agentic organization with Codex
Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks to hours.
MUFG aims to become AI-native with OpenAI
MUFG uses ChatGPT Enterprise to build an AI-native organization, improve workflows, and deliver new AI-powered financial services at scale.
How agent o11y differs from traditional o11y — Phil Hetzel, Braintrust
Traditional observability answers one question: is the system up? Phil Hetzel from Braintrust argues that question is not the right one for agents. An individual agent trace can exceed a gigabyte. A single span can hit 20 megabytes. The data is semistructured, packed with unstructured text, and still arrives in real ti
Most Enterprise Agentic Projects Are Doomed, Here's Why — Jess Grogan-Avignon & Jack Wang, Accenture
Jess Grogan-Avignon and Jack Wang at Accenture built an agentic application in two weeks. Getting it to production took another 12 months. Not because the code was wrong. Because the infrastructure team, the security team, the AI gateway team, the data governance team, and the application team all had to align before a
OpenAI’s Frontier Governance Framework
Explore OpenAI’s Frontier Governance Framework and how our AI safety, security, and risk practices align with emerging EU and California regulations.
Context Graphs for Explainable, Decision-Aware AI Agents — Andreas Kollegger & Zaid Zaim, Neo4j
Prescribing drug X is correct 99% of the time for symptom Y. For the 1% where it is fatal, statistical reasoning does not help you. Andreas Kollegger calls this reference class validation: before the agent acts, it has to know which group it is in. Context graphs give agents the why. Not just knowledge and tools but th
Cisco and OpenAI redefine enterprise engineering with Codex
Cisco and OpenAI are redefining enterprise engineering with Codex, helping Cisco scale AI-native development, accelerate AI Defense work, and automate defect remediation.
Election information and safeguards in 2026
Ahead of global elections, we’re helping people access information, supporting cyber defenders, and increasing AI transparency.
Warp’s big bet on building open source with GPT-5.5
Warp uses GPT-5.5 and OpenAI models to coordinate coding agents across local, cloud, and open-source development workflows.
Comprehend First, Code Later: The AI Skill I Rely On Daily — Priscila Andre de Oliveira, Sentry
Priscila Andre de Oliveira analyzed 116 of her own Claude sessions from daily work at Sentry. 67% were comprehension. 2% were code generation. Working in a codebase with 15 years of history, around 100 PRs merged per day, and 100,000 organizations depending on it, the unlock is not generation but understanding. She bui
Building self-improving tax agents with Codex
See how OpenAI, Thrive, and Crete built a self-improving tax agent with Codex, automating filings, improving accuracy, and accelerating workflows.
Introducing Google AI Threat Defense to help you outpace the adversary
AI Threat Defense is a comprehensive AI-powered cybersecurity solution, an always-on security platform to outpace AI-driven attacks.
Why Rust is the Ideal Language for Vibe-Coding — Daniel Szoke, Sentry
TypeScript is easy for models to write because it imposes few constraints. Those same missing constraints let models introduce data races that compile, run, and only fail intermittently. A thread safety bug in Rust does not compile. The compiler names the unsound type, explains why it cannot be sent between threads, an
The maturity phases of running evals — Phil Hetzel, Braintrust
Most teams approach evals like unit tests and try to cover every possible failure. Phil Hetzel from Braintrust argues that is the wrong frame: enumerate your known failure modes, cover those specifically, and ship. The goal is a flywheel where production traces surface what is going wrong, feed back into offline experi
ACM CAIS 2026
ACM CAIS 2026 is a research-focused conference on compound AI architectures, optimization, deployment, and agentic AI systems in San Jose, California.
Run Frontier AI at Home — Alex Cheema, EXO Labs
Running GLM 5.1, a trillion parameter model released the day before this workshop, across four Mac Studios costs around $40,000 in hardware and tops out at roughly 20 tokens per second. Alex Cheema from EXO Labs thinks both numbers have about 100x left in them. The workshop covers what that 100x looks like across the s
What the Best Agents Share — Mardu Swanepoel, Flinn AI
Harvey, Cursor, Manus, and Claude operate in completely different domains but share four patterns: focus modes that constrain the action space to improve output quality, transparent execution that surfaces tool calls and reasoning to build user trust, personalization that optimizes for speed to understanding rather tha
Stop babysitting your agents... — Brandon Walsenuk, Unblocked
Same prompt. Same agent. Same model. Without a context engine: 2.5 hours, 20.9 million tokens, multiple rounds of human correction, and code that compiled but would have broken the entire system if it shipped. With one: 25 minutes, 10.8 million tokens, and a senior engineer who gave one nitpick and approved the merge.
OpenAI, Grupo Folha and Grupo UOL announce strategic content partnership
OpenAI partners with Grupo Folha and Grupo UOL to bring trusted Brazilian journalism to ChatGPT, expanding access to news with attribution and transparency.
Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind
On SWE-Bench Pro, six frontier models land within a couple of percentage points of each other. The harness they run inside shifts performance by 22%. A competing lab once took a Kaggle benchmark, reran it with their own compaction settings, and published much better results. Neither number was wrong. Both were useless.
Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust
At most traditional enterprises, GenAI got handed to the ML platform team because it had AI in the name. Phil Hetzel from Braintrust argues that was the wrong move, not because data scientists lack value, but because Anthropic and OpenAI already ran the data pipeline. What is left is prompt and context engineering, dis
Bounded Autonomy: Between Free Will and Determinism — Angus J. McLean, Oliver
Angus McLean spent time building a complex agent application to generate his CV. Four letters beat it: HTML. He puts the improvement at 100x. The talk is from Oliver's AI Director, where agents generate around 4,000 creative assets a day for 200 plus brands, assets you have probably seen and had no idea were AI. The co
How Google DeepMind Runs Agents at Scale — KP Sawhney & Ian Ballantyne, Google DeepMind
Google DeepMind employees have worse token quotas than paying customers. That is not a mistake. KP Sawhney explains: customers get priority, and if an internal team spikes usage on a cluster someone monitoring 24/7 will just call and ask them to stop. This panel covers how DeepMind thinks about agents at scale from the
Scaling the Next Paradigm of Heterogeneous Intelligence — Adrian Bertagnoli, Callosum
A mixture of Qwen3-VL 8B and Kimi K2.5 beat the state of the art on Video Web Arena, outperforming the leading GPT and Gemini models by 18 and 25 percent while costing 3.7 times less and running 3 times faster. The reason it worked is that visual web navigation decomposes into subtasks that do not all need a frontier m
Your Agent Is an Infinite Canvas — RL Nabors, Dressed for Space
RL Nabors built a comic reader that renders inside Claude. Full panels, navigation, transcript mode, design matched to the original site. No browser tabs. She is reading her own web comic archive entirely through an agent, and it looks like the website. The talk is a case against chat as the permanent UI of agentic sof
The Missing Primitive for Agent Swarms — Lou Bichard, Ona
Stripe called theirs Minions. RAMP called theirs Inspect. Both are internal infrastructure for running fleets of background agents, and both teams built it from scratch. Lou Bichard's argument is that this shouldn't keep happening. The talk breaks down what agent swarm infrastructure actually needs: a runtime (largely
Prompt to Pipeline: Building with Google's Gen Media Stack — Paige & Guillaume, Google DeepMind
A public domain book, a notebook, and three gen media models. Guillaume from Google DeepMind fed The Wind in the Willows into Gemini, generated character portraits with Nano Banana, animated chapter scenes with Veo, and scored each chapter with Lyria, all live in the workshop. The full three-hour session covers more ground. Paige
Project Glasswing: An initial update
Anthropic reports early Project Glasswing results using Mythos Preview with infrastructure partners and external testers, including large-scale vulnerability discovery and a cautious disclosure posture.
How Virgin Atlantic ships faster with Codex
How Virgin Atlantic used Codex to ship its revamped mobile app on a fixed holiday travel deadline, reaching near-total unit test coverage and zero P1 defects.
Measuring LLMs' Ability to Develop Exploits
Anthropic evaluates Mythos Preview against ExploitBench, ExploitGym, and an updated smart-contract exploitation benchmark, showing a step change in models that can turn vulnerabilities into working exploit chains.
Fast Models Need Slow Developers — Sarah Chieng, Cerebras
Codex Spark, a model Cerebras built with OpenAI, generates code at 1,200 tokens per second. The Sonnet and Opus families run at 40 to 60. At that 20x difference, a context window that used to take ten minutes to fill now takes 30 seconds, and every habit built around slow generation starts producing technical debt at a
Lobster Trap: OpenClaw in Containers from Local to K8s and Back — Sally Ann O'Malley, Red Hat
Sharing a good agent setup usually means handing someone a pile of markdown, config files, and YAML and hoping they reproduce what you have. The answer in this demo is a container image: spin up a sub agent in two seconds from a Podman command, flip a flag for Kubernetes, and your personal setup becomes the team baseli
OpenAI named a Leader in enterprise coding agents by Gartner
OpenAI is named a leader in the 2026 Gartner Magic Quadrant for Enterprise AI Coding Agents, with Codex recognized for innovation and enterprise-scale deployment.
AI on Android: Ask me Anything — Florina Muntenescu & Oli Gaymond, Google DeepMind
Gemini Nano on device weighs three to four gigabytes. Shipping that per app is not realistic, which is why AICore puts it in the system once and every app shares it. Foreground apps get top priority. Background batch jobs queue and run overnight on charge. The developer never manages any of that. The tradeoff is reach
AdventHealth advances whole-person care with OpenAI
AdventHealth is using ChatGPT for Healthcare to streamline workflows, reduce administrative burden, and return more time to patient care.
Cooking with Agents in VS Code — Liam Hampton, Microsoft
One codebase, three problems, three agents running at the same time. Liam Hampton from Microsoft demos the full loop in VS Code: a local agent with Claude Opus writing and fixing unit tests with him in the loop, a background agent using a git worktree to build a front end from a GitHub issue without him touching it, a
Scaling Agents on Kubernetes with acpx and ACP — Onur Solmaz, OpenClaw
OpenClaw receives 300 to 500 pull requests per day. Most arrive AI generated, most are not mergeable, and every one of them is signal about something broken in the codebase. Onur Solmaz built acpx to process them without him in the loop. acpx is a headless CLI for the Agent Client Protocol. It replaces PTY scraping wit
Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face
An agent-written RMSNorm kernel hit 1.88x speedups on H100s. A finetuned Qwen3 0.6B hit 35% on LiveCodeBench. Neither result required a systems engineer. Just coding agents with the right skills loaded. Ben Burtenshaw from Hugging Face walks through three levels: using Claude Code interactively to write and benchmark C
An OpenAI model has disproved a central conjecture in discrete geometry
An OpenAI model solved the 80-year-old unit distance problem, disproving a major conjecture in discrete geometry and marking a milestone in AI-driven mathematics.
How Ramp engineers accelerate code review with Codex
How Ramp engineers use Codex with GPT-5.5 to review code and ship improvements, allowing them to get substantive feedback in minutes instead of hours.
Two Rival Bets on AGI: Google I/O Highlights
The biggest Google AI push of the year, but what is the bigger story? Why is Google pursuing a different fork in the road than OpenAI or Anthropic? What does Gemini 3.5 Flash mean for the near-term future of AI? Plus the highlights from a provocative new paper on AI, 8 key moments you
Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind
Draw arrows on a map and ask Gemini to generate a picture of what you see. It produces the Golden Gate Bridge. Not because it matched pixels, but because the image generation model is built on top of Gemini's world understanding and knows what those arrows are pointing at. Patrick Löber walks through the full any-to-an
The next phase of OpenAI’s Education for Countries
OpenAI advances Education for Countries, expanding AI adoption in schools with new partnerships, teacher training, and tools to improve global learning outcomes.
Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse
Without a skill, Claude Code adds Langfuse using stale pre-training context, ships broken instrumentation, then catches the failure and fetches current docs to fix it. The resulting trace captures two LLM calls with no visibility into what the agent actually did. Marc Klingen covers the six learnings from building a sk
From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google
FunctionGemma ships at 270 million parameters and processes nearly 2,000 tokens per second prefill on a Pixel 7. Out of the box, on a fixed set of app intents, it hits 46% accuracy. Fine-tuned on a synthetically generated dataset, it clears 90% on eight of ten functions. Cormac Brick covers the two options developers
Introducing OpenAI for Singapore
OpenAI for Singapore launches a multi-year AI partnership to expand deployment, build local talent, and support businesses and public services with AI.
Advancing content provenance for a safer, more transparent AI ecosystem
OpenAI advances AI content provenance with Content Credentials, SynthID, and a verification tool to help people identify and trust AI-generated media.
What Breaks When You Build AI Under Sovereignty Constraints - Bilge Yücel, deepset GmbH
If you send EU citizen data to an embedding API hosted in Virginia, you have already violated GDPR. That is one hidden assumption. Most production AI systems have dozens more, baked into the architecture long before anyone asked whether the system was sovereign. Bilge Yücel walks through the four sovereignty pillars (d
Don't Build Slop (4 Levels of AI Agent Maturity) - Ara Khan, Cline
The prompt for GPT-5.3 is one-third the size of the one written for GPT-5. Frontier models are so capable that longer system prompts cause sensory overload and degrade performance. The rule Ara Khan keeps returning to: every single thing you add to an agent risks making it worse. The talk breaks agent-building into fou
Personalization in the Era of LLMs - Shivam Verma, Spotify
Spotify represents Ariana Grande and Bruno Mars as sequences of six tokens. The first two are shared because both are pop artists. The remaining tokens diverge to capture what makes each distinct. That is a Semantic ID, and it is how Spotify teaches open-weight LLMs to reason over a catalog of 100 million tracks the sa
OpenAI and Dell partner to bring Codex to hybrid and on-premise enterprise environments
OpenAI and Dell partner to bring Codex to hybrid and on-premise environments, helping enterprises deploy AI coding agents securely across data and workflows.
Rewiring the State — Eoin Mulgrew, 10 Downing Street
The cabinet office was about to spend one and a half million pounds on an outside law firm to analyze the UK statute book. One engineer embedded with the in-house legal team for two weeks instead. The tool now lives with that team and can be run whenever they want. Eoin Mulgrew from the Number 10 data science team uses
Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind
Guillaume Vernade from Google DeepMind takes a public domain book and runs it through the full gen media stack live. Gemini reads the whole text and writes image prompts for each character and chapter. Imagen generates the portraits. Veo animates them into video clips using those images as first frames. Lyria composes
Build Agents That Run for Hours (Without Losing the Plot) — Ash Prabaker & Andrew Wilson, Anthropic
Why self-evaluation is a trap and adversarial evaluator agents work better; why context compaction doesn't cure coherence drift but structured handoffs do; how to decompose work into testable sprint contracts; how to grade subjective output with rubrics an LLM can actually apply; and how to read traces as your primary
Harnesses in AI: A Deep Dive — Tejas Kumar, IBM
The agent hit a login page, panicked, reported success anyway, and the upvote never happened. Tejas Kumar's diagnosis: not a prompt problem. A harness problem. The demo builds a browser agent on GPT-3.5 Turbo (deliberately choosing a very old model to show how much good harness engineering can improve it) against Hacker News a
Fighting AI with AI — Lawrence Jones, Incident
Incident.io's AI SRE runs hundreds of prompts per investigation across logs, metrics, traces, and code. When it produces a wrong root cause analysis, there is no tractable way for a human to read through the full trace and find where the reasoning went sideways. Lawrence Jones, founding engineer at Incident.io, describes
Why Your AI UX Is Broken (and It's Not the Model's Fault) — Mike Christensen, Ably
SSE ties a response stream to a single connection. The user refreshes the page, walks out of WiFi range, or opens a second tab and the in-progress response is gone. Abort and resume are mutually exclusive for the same reason: the only signal a client can send over a one-way pipe is closing it, so the agent cannot tell
AIE Singapore Day 2 ft. Google DeepMind, OpenClaw, Adaption, Arize, Cloudflare, Robot Company & more
May 17, 2026 - all times in SGT -- 9am - kickoff https://www.ai.engineer/singapore#schedule join us in person and on all side events https://luma.com/1eofvp02?tk=kN58jG
Beyond Code Coverage: Functionality Testing with Playwright — Marlene Mhangami, Microsoft
When an LLM writes your tests, it tends to write tests that confirm what the code does rather than tests that verify what the user experiences. Your test suite goes green. The app still breaks in ways none of those tests would catch. Marlene Mhangami from Microsoft makes the case for flipping the order: get the agent t
How to Leverage Domain Expertise — Chris Lovejoy, Notius Labs
Granola's first employee was a writer who still reviews meeting note outputs and tweaks prompts directly. Chris Lovejoy says that is not a gap in the org chart. There is no objectively perfect meeting note, so you need someone with taste doing both the assessment and the improvement. He frames this as one of three patt
OpenAI and Malta partner to bring ChatGPT Plus to all citizens
OpenAI and Malta partner to expand AI access, offering ChatGPT Plus and training to help citizens build practical AI skills and use AI responsibly.
Connecting the Dots with Context Graphs — Stephen Chin, Neo4j
Ask a vector RAG system about a patient's emphysema care plan and it returns generic advice: respiratory therapy, deep breathing. Give it a graph grounded in that patient's actual history and it knows they smoke, knows they've had an operation, and gives recommendations that reflect it. The information existed in both
How business operations teams use Codex
See how business operations teams can use Codex to create initiative briefs, strategy updates, leadership decision packets, progress updates, and more from real work inputs.
Databricks brings GPT-5.5 to enterprise agent workflows
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.
How data science teams use Codex
See how data science teams can use Codex to build root-cause briefs, impact readouts, KPI memos, scoped analyses, and dashboard specs from real work inputs.
A new personal finance experience in ChatGPT
Preview a new personal finance experience in ChatGPT for Pro users in the U.S. Securely connect your financial accounts and get AI-powered insights and guidance grounded in your financial context, goals, and priorities.
How sales teams use Codex
See how sales teams can use Codex to create pipeline briefs, meeting prep packets, forecast reviews, account plans, and stalled-deal diagnoses from real work inputs.
Agents Don't Do Standups: Building the Post-Engineer Engineering Org — Mike Spitz, PFF
PFF ran a three-month case study: two engineers against a team of ten, same codebase, same customers. The two shipped five times a day. The ten shipped once every five days. Output measured by ticket complexity came out at 10x. Customer satisfaction went up, not down. Mike Spitz, their CTO, started with one reframe: st
Combine Skills and MCP to Close the Context Gap — Pedro Rodrigues, Supabase
Agents working with Postgres will confidently create a view over a table with row-level security enabled and silently bypass that security in the process. Not because they can't reason. Because they don't know about the security_invoker flag, and nobody told them. Pedro Rodrigues from Supabase ran this exact test: same
How Building with AI Can Double the Throughput of Your Engineering Team — Brian Scanlan, Intercom
Intercom hit 2x engineering throughput in under a year. Not by prompting better. By treating Claude Code like a new hire: onboarding it to a Rails monolith built over 15 years, writing skills for every recurring task, connecting it to production systems and internal tooling, and going all in on one platform instead of
Sea's View on the Future of Agentic Software Development with Codex
Sea Limited's CPO explains why the company is deploying Codex across engineering teams to accelerate AI-native software development in Asia.
Work with Codex from anywhere
Use Codex anywhere with the ChatGPT mobile app. Monitor, steer, and approve coding tasks in real time across devices and remote environments.
Helping ChatGPT better recognize context in sensitive conversations
Learn how new ChatGPT safety updates improve context awareness in sensitive conversations, helping detect risk over time and respond more safely.
Cloud CISO Perspectives: How Google + Wiz changes multicloud strategy for CISOs
By centering developers and shifting security left, Wiz has seen a significant increase in security resolution. Here’s why this strategy matters for CISOs.
AIE Singapore Day 1 ft. Minister, NanoClaw, OpenAI, Google, Vercel, Cursor & more
May 16, 2026 - all times in SGT -- 8.30am - kickoff https://www.ai.engineer/singapore#schedule join us in person and on all side events https://luma.com/1eofvp02?tk=kN58jG
Ship Real Agents: Hands-On Evals for Agentic Applications — Laurie Voss, Arize
Most agents get tested by running a few queries and checking whether the output looks right. Laurie calls this the vibes problem: it doesn't catch regressions, doesn't run in CI, and doesn't tell you whether a prompt fix broke three other things. This workshop builds a complete eval pipeline from scratch on a financial analysis age
Mind the Gap (In your Agent Observability) — Amy Boyd & Nitya Narasimhan, Microsoft
Agents drift. Models change, prompts get tweaked, edge cases accumulate, and the gap between what your agent does and what you need it to do widens without you noticing. Amy and Nitya walk through Microsoft Foundry's observability stack: tracing built on OpenTelemetry, built-in evaluators for quality, safety, and agent
Make your own event-sourced agent harness using stream processors — Jonas Templestein, Iterate
The abstraction is three things: state, a synchronous reducer that derives state from events, and an after-append hook for side effects. The split matters: when your program restarts after 100 events, you want to catch up state without replaying LLM requests. Everything that happens (streaming chunks, tool calls, error
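The three-part abstraction in this item (state, a synchronous reducer, and an after-append hook) might look roughly like the sketch below. The `Event`, `State`, and `Stream` names and the two event kinds are invented for illustration and are not taken from the talk. The point it demonstrates is the one the item makes: replaying the log rebuilds state through the reducer alone, so side effects such as LLM requests are not repeated after a restart.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical kinds: "message" or "error"
    payload: str


@dataclass
class State:
    messages: list = field(default_factory=list)
    errors: int = 0


def reducer(state: State, event: Event) -> State:
    """Synchronous, side-effect-free: derive the next state from one event."""
    if event.kind == "message":
        return State(state.messages + [event.payload], state.errors)
    if event.kind == "error":
        return State(state.messages, state.errors + 1)
    return state


class Stream:
    def __init__(self, after_append=None):
        self.log = []
        self.state = State()
        self.after_append = after_append  # hook where side effects live

    def append(self, event: Event) -> None:
        self.log.append(event)
        self.state = reducer(self.state, event)
        if self.after_append is not None:
            self.after_append(self.state, event)  # e.g. start an LLM request

    @classmethod
    def replay(cls, log, after_append=None):
        """Catch up after a restart: run the reducer only, never the hook."""
        stream = cls(after_append)
        for event in log:
            stream.log.append(event)
            stream.state = reducer(stream.state, event)
        return stream


if __name__ == "__main__":
    calls = []
    live = Stream(after_append=lambda state, event: calls.append(event.kind))
    live.append(Event("message", "hello"))
    live.append(Event("error", "timeout"))

    restored = Stream.replay(live.log, after_append=lambda s, e: calls.append("replayed"))
    print(restored.state == live.state)  # True: same state from the same events
    print(calls)                         # ['message', 'error']: no hook calls on replay
```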
Building a safe, effective sandbox to enable Codex on Windows
Learn how OpenAI built a secure sandbox for Codex on Windows, enabling safe, efficient coding agents with controlled file access and network restrictions.
Our response to the TanStack npm supply chain attack
OpenAI describes its response to the TanStack npm supply-chain attack, including certificate rotation for macOS apps and guidance to update ChatGPT, Codex, and related desktop tooling from official channels.
The new era of SaMD: Why cloud infrastructure is the foundation for digital health in 2026
As SaMD moves from reactive diagnostics to proactive learning systems, cloud has become a superior foundation for regulated medical software.
Your Agent Can Now Train Models — Merve Noyan, Hugging Face
Open-source models have caught up. GLM 5.1 is leading the Artificial Analysis intelligence index over closed models, and the gap is closing fast with each release cycle. The practical upside beyond benchmarks: full weight access means you can quantize, fine-tune, and deploy to edge devices or browsers without data leav
Building a Chess Coach — Anant Dole and Asbjorn Steinskog, Take Take Take
LLMs can explain things clearly but can't play chess reliably. Take Take Take (Magnus Carlsen's app) solved this by separating concerns: Stockfish handles position evaluation, tactical and positional detectors extract concepts like forks, pins, and structural weaknesses, and the LLM's only job is translating those stru
CI/CD Is Dead, Agents Need Continuous Compute and Computers — Hugo Santos and Madison Faulkner
Traditional CI/CD was built for humans pushing one or two diffs a week. Scale to thousands of autonomous agents opening PRs continuously and you get runner saturation, cold Docker builds on every branch, cache thrash, and a merge queue that starts behaving like a serialized database lock where time-to-commit becomes th
Beyond source code: The files AI coding agents trust — and attackers exploit
As AI coding agents become embedded in developer workflows, defenders must rethink how to protect against malicious files. Here’s what you need to know.
How finance teams use Codex
See how finance teams can use Codex to build MBRs, reporting packs, variance bridges, model checks, and planning scenarios from real work inputs.
How NVIDIA engineers and researchers build with Codex
Teams use Codex with GPT-5.5 to ship production systems and turn research ideas into runnable experiments.
What Parameter Golf taught us about AI-assisted research
Parameter Golf brought together 1,000+ participants and 2,000+ submissions to explore AI-assisted machine learning research, coding agents, quantization, and novel model design under strict constraints.
Build & deploy AI-powered apps — Paige Bailey, Google DeepMind
Got a massive idea but stuck in the "just talking about it" phase? This session cuts the fluff and dives straight into how to build and prototype at lightning speed using AI Studio Build and Antigravity for free. It breaks down Google DeepMind's AI tech stack so viewers know exactly which tools to use, when to reach fo
Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI
A new class of small models is emerging with the ability to reliably follow instructions and call tools while running on-device under 1 GB of memory. In this talk, we'll break down how to post-train frontier small models using the LFM2.5 recipe: on-policy preference alignment, agentic reinforcement learning, and curric
Anthropic Responsible Scaling Policy v3.2
Anthropic’s current Responsible Scaling Policy page lists v3.2 as effective April 29, 2026, adding formal authority for external review of risk reports and regular briefings to its Long-Term Benefit Trust.
Building your own software factory — Eric Zakariasson, Cursor
Most of us are pair-programming with one agent and stopping there. There's a lot more on the table. This workshop is about going from one agent to many. We'll start with codebase setup, the foundational work that makes agents effective on their own. Then we'll scale up to running agents in parallel, kicking off async w
Why building eval platforms is hard — Phil Hetzel, Braintrust
An eval platform is not just a test runner. You are building shared definitions of "good," reliable data pipelines, labelling workflows, versioning, and trust in results across many teams and model changes. This session breaks down the hidden complexity, the common failure modes, and the design principles that make eva
One Login to Rule Them All: Cross-App Access for MCP — Garrett Galow, WorkOS
Connecting a coding agent to multiple services often means facing a dozen OAuth consent screens, a dozen token lifecycles, and a dozen chances for something to break. Despite having Single Sign-On, users still find themselves signing in repeatedly. This talk explores how Cross-App Access leverages a three-way trust bet
Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind
Open models are getting smaller, faster, and far more capable. In this talk, Cassidy Hardin walks through the latest advances in the Gemma family, with a focus on Gemma 4 and what it enables for developers building on-device and open-weight AI systems. She covers the architecture behind Gemma’s dense, effective, and mi
Lessons from Scaling GitHub's Remote MCP Server — Sam Morrow, GitHub
GitHub operates one of the most heavily-utilised MCP servers in the ecosystem, with over 4 million downloads of the stdio server alone. Discover the architectural decisions, technical challenges and lessons learned while building and scaling a remote MCP server on production infrastructure. The session walks through th
Bringing MCPs to the Enterprise — Karan Sampath, Anthropic
MCPs are often flaky, face multiple security vulnerabilities, and are generally hard to scale. Most enterprises struggle to use more than single digit numbers of MCPs due to issues with security, observability, and access control. In this talk, we'll explore the approaches and learnings we at Anthropic have been taking
Collaborative AI Engineering — Maggie Appleton, GitHub Next
Agentic engineering so far has been a solo story: one developer and a dozen agents moving at warp speed. But speed without thoughtful planning and team alignment is just wasting tokens. When everyone on a team is directing agents alone in their personal CLI tools with no shared context, you get duplicate work, conflict
MCP = Mega Context Problem - Matt Carey
The best MCP server is the one you didn't have to build. At Cloudflare we have a lot of products. Our REST OpenAPI spec is over 2.3 million tokens. When teams started building MCP servers, they did what everyone does: cherry-picked important endpoints for their product, wrote some tool definitions and shipped a separat
Full Walkthrough: Workflow for AI Coding from Planning to Production — Matt Pocock (@mattpocockuk)
A hands-on workshop covering the full lifecycle of AI-assisted development, from turning ambiguous requirements into agent-ready plans to running autonomous coding agents that ship production features. You'll learn to stress-test vague briefs into structured PRDs, slice work into thin "tracer bullet" vertical slices, a
What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench
AI Engineer session on What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench. It adds practical context for how teams are building and operating AI systems in production.
GPT 5.5 Arrives, DeepSeek V4 Drops, and the Compute War Intensifies
GPT 5.5 full analysis, plus DeepSeek V4 paper highlights, comparisons with Mythos, a vibe-coded game w/ GPT Image 2, and 50 data-points you wouldn’t get from just reading the headlines.
Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana)
AI Engineer session on Building Generative Image & Video models at Scale - Sander Dieleman (Veo and Nano Banana). It adds practical context for how teams are building and operating AI systems in production.
AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more!
April 21, 2026 - all times in EST -- 9:00am - Welcome to Day 2 -- 9:10am - David House, G2i Transforming Programming Mindsets: Case Studies in Agentic Coding Adoption -- 9:35am - Sarah Chieng, Cerebras Help! We're DEEP in (latency) Debt -- 10:00am - Lech Kalinowski, CallStack Ambient Generative AI: Deploying Latent Dif
Gemma, DeepMind's Family of Open Models — Omar Sanseviero, Google DeepMind
AI Engineer session on Gemma, DeepMind's Family of Open Models, presented by Omar Sanseviero, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
AI Engineer session on Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX, presented by Adrien Grondin, Locally AI. It adds practical context for how teams are building and operating AI systems in production.
AIE Miami Keynote & Talks ft. OpenCode. Google Deepmind, OpenAI, and more!
April 20, 2026 - all times in EST -- 9:00am - Welcome to AI Engineer Miami -- 9:10am - Gabe Greenberg, G2i Opening Remarks -- 9:15am - Dax Raad, OpenCode Keynote -- 9:40am - Dexter Horthy, HumanLayer Everything We got Wrong About RPI -- 10:05am - Max Stoiber, OpenAI Coming Soon -- 10:30am - Morning Break -- 11:00am - B
Frontier AI and the Future of Intelligence — Raia Hadsell, VP of Research, Google DeepMind
AI Engineer session on Frontier AI and the Future of Intelligence, presented by Raia Hadsell, VP of Research, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Building pi in a World of Slop — Mario Zechner
AI Engineer session on Building pi in a World of Slop, presented by Mario Zechner. It adds practical context for how teams are building and operating AI systems in production.
Claude Opus 4.7 - A New Frontier, in Performance … and Drama
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
$1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs — Diego Carpentero
AI Engineer session on $1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs, presented by Diego Carpentero. It adds practical context for how teams are building and operating AI systems in production.
Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci
AI Engineer session on Let LLMs Wander: Engineering RL Environments, presented by Stefano Fiorucci. It adds practical context for how teams are building and operating AI systems in production.
One Registry to Rule them All - Sonny Merla, Mauro Luchetti, & Mattia Redaelli, Quantyca
AI Engineer session on One Registry to Rule them All - Sonny Merla, Mauro Luchetti, & Mattia Redaelli, Quantyca. It adds practical context for how teams are building and operating AI systems in production.
Claude Mythos: Highlights from 244-page Release
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
NIST AI RMF and Critical Infrastructure Profile
NIST’s AI RMF hub now highlights its April 2026 concept note for a Trustworthy AI in Critical Infrastructure profile, extending the framework toward sector-specific operational risk management.
Assessing Claude Mythos Preview’s cybersecurity capabilities
Claude Mythos Preview is a new general-purpose language model that is strikingly capable at computer security tasks. This post provides technical details for researchers and practitioners who want to understand exactly how we have been testing this model, and what we have found over the past month. We hope this will sh
Two AI Models Set to “stir government urgency”, But Will This Challenge Undo Them?
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Reverse engineering Claude's CVE-2026-2796 exploit
This post dives deep into how Claude wrote an exploit for one of the vulnerabilities it found in Firefox.
What the New ChatGPT 5.4 Means for the World
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Deadline Day for Autonomous AI Weapons & Mass Surveillance
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
The Two Best AI Models/Enemies Just Got Released Simultaneously
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
LLM-discovered 0-days
AI models can now find high-severity vulnerabilities at scale. This is a moment to empower defenders. We're now using Claude to find and help fix vulnerabilities in open source software.
Claude AI Co-founder Publishes 4 Big Claims about Near Future: Breakdown
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Identity for AI Agents - Patrick Riley & Carlos Galan, Auth0
AI Engineer session on Identity for AI Agents - Patrick Riley & Carlos Galan, Auth0. It adds practical context for how teams are building and operating AI systems in production.
Welcome to AIE CODE - Jed Borovik, Google DeepMind
AI Engineer session on Welcome to AIE CODE - Jed Borovik, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
AI Models on Realistic Cyber Ranges
In a recent evaluation of AI models’ cyber capabilities, current Claude models can now succeed at multistage attacks on networks with dozens of hosts using only standard, open-source tools, instead of the custom tools needed by previous generations.
Anthropic: Our AI just created a tool that can ‘automate all white collar work’, Me:
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Finding Bugs with Claude and Property-based Testing
Ensuring that programs are bug-free is one of the most challenging aspects of software engineering. We developed an agent that can efficiently identify bugs in large software projects. Our agent infers general properties of code that should be true, and then applies property-based testing. After extensive manual valida
Experimenting with AI to Defend Critical Infrastructure
AI could help defenders of critical infrastructure identify the vulnerabilities that attackers might exploit—and close them before they are exploited. Anthropic has partnered with Pacific Northwest National Laboratory (PNNL) to explore this defensive application of AI, demonstrating both the potential of AI-accelerated
Building in the Gemini Era — Kat Kampf & Ammaar Reshi, Google DeepMind
AI Engineer session on Building in the Gemini Era, presented by Kat Kampf & Ammaar Reshi, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Code World Model: Building World Models for Computation — Jacob Kahn, FAIR Meta
AI Engineer session on Code World Model: Building World Models for Computation, presented by Jacob Kahn, FAIR Meta. It adds practical context for how teams are building and operating AI systems in production.
Defying Gravity - Kevin Hou, Google DeepMind
AI Engineer session on Defying Gravity - Kevin Hou, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Developing Taste in Coding Agents: Applied Meta Neuro-Symbolic RL — Ahmad Awais, CommandCode
AI Engineer session on Developing Taste in Coding Agents: Applied Meta Neuro-Symbolic RL, presented by Ahmad Awais, CommandCode. It adds practical context for how teams are building and operating AI systems in production.
Minimax M2: Building the #1 Open Model — Olive Song, MiniMax
AI Engineer session on Minimax M2: Building the #1 Open Model, presented by Olive Song, MiniMax. It adds practical context for how teams are building and operating AI systems in production.
RL Environments at Scale — Will Brown, Prime Intellect
AI Engineer session on RL Environments at Scale, presented by Will Brown, Prime Intellect. It adds practical context for how teams are building and operating AI systems in production.
What the Freakiness of 2025 in AI Tells Us About 2026
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini Exponential, Demis Hassabis' ‘Proto-AGI’ coming, but …
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 5.2: OpenAI Strikes Back
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
You Are Being Told Contradictory Things About AI
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Nano Banana Pro: But Did You Catch These 10 Details?
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini 3 Pro: Breakdown
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Is GPT-5.1 Really an Upgrade? But Models Can Auto-Hack Govts, so … there’s that
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Bubble or No Bubble, AI Keeps Progressing (ft. Relentless Learning + Introspection)
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Did you miss these 2 AI stories? A *Real* LLM-crafted Breakthrough + Continual Learning Blocked?
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Sora 2 - It will only get more realistic from here
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
ChatGPT Can Now Call the Cops, but 'Wait till 2100 for Full Job Impact' - Altman
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
An ‘AI Bubble’? What Altman Actually said, the Facts and Nano Banana
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Real World Development with GitHub Copilot and VS Code — Harald Kirschner, Christopher Harrison
AI Engineer session on Real World Development with GitHub Copilot and VS Code, presented by Harald Kirschner, Christopher Harrison. It adds practical context for how teams are building and operating AI systems in production.
Real-time Experiments with an AI Co-Scientist - Stefania Druga, fmr. Google Deepmind
AI Engineer session on Real-time Experiments with an AI Co-Scientist - Stefania Druga, fmr. Google Deepmind. It adds practical context for how teams are building and operating AI systems in production.
The Rise of Open Models in the Enterprise — Amir Haghighat, Baseten
AI Engineer session on The Rise of Open Models in the Enterprise, presented by Amir Haghighat, Baseten. It adds practical context for how teams are building and operating AI systems in production.
What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha
AI Engineer session on What Is a Humanoid Foundation Model? An Introduction to GR00T N1 - Annika & Aastha. It adds practical context for how teams are building and operating AI systems in production.
GPT-5 has Arrived
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Genie 3: The World Becomes Playable (DeepMind)
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
A year of Gemini progress + what comes next — Logan Kilpatrick, Google DeepMind
AI Engineer session on A year of Gemini progress + what comes next, presented by Logan Kilpatrick, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
AI Engineering with the Google Gemini 2.5 Model Family - Philipp Schmid, Google DeepMind
AI Engineer session on AI Engineering with the Google Gemini 2.5 Model Family - Philipp Schmid, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Benchmarks Are Memes: How What We Measure Shapes AI — and Us - Alex Duffy, Every.to
AI Engineer session on Benchmarks Are Memes: How What We Measure Shapes AI — and Us, presented by Alex Duffy, Every.to. It adds practical context for how teams are building and operating AI systems in production.
How fast are LLM inference engines anyway? — Charles Frye, Modal
AI Engineer session on How fast are LLM inference engines anyway?, presented by Charles Frye, Modal. It adds practical context for how teams are building and operating AI systems in production.
How to build world-class AI products — Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust)
AI Engineer session on How to build world-class AI products, presented by Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust). It adds practical context for how teams are building and operating AI systems in production.
How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe
AI Engineer session on How to Train Your Agent: Building Reliable Agents with RL, presented by Kyle Corbitt, OpenPipe. It adds practical context for how teams are building and operating AI systems in production.
Measuring AGI: Interactive Reasoning Benchmarks for ARC-AGI-3 — Greg Kamradt, ARC Prize Foundation
AI Engineer session on Measuring AGI: Interactive Reasoning Benchmarks for ARC-AGI-3, presented by Greg Kamradt, ARC Prize Foundation. It adds practical context for how teams are building and operating AI systems in production.
Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix
AI Engineer session on Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix. It adds practical context for how teams are building and operating AI systems in production.
OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs
AI Engineer session on OpenThoughts: Data Recipes for Reasoning Models, presented by Ryan Marten, Bespoke Labs. It adds practical context for how teams are building and operating AI systems in production.
Optimizing inference for voice models in production - Philip Kiely, Baseten
AI Engineer session on Optimizing inference for voice models in production - Philip Kiely, Baseten. It adds practical context for how teams are building and operating AI systems in production.
Real world MCPs in GitHub Copilot Agent Mode — Jon Peck, Microsoft
AI Engineer session on Real world MCPs in GitHub Copilot Agent Mode, presented by Jon Peck, Microsoft. It adds practical context for how teams are building and operating AI systems in production.
RL for Autonomous Coding — Aakanksha Chowdhery, Reflection.ai
AI Engineer session on RL for Autonomous Coding, presented by Aakanksha Chowdhery, Reflection.ai. It adds practical context for how teams are building and operating AI systems in production.
The Bitter Layout or: How I Learned to Love the Model Picker — Maximillian Piras, Yutori
AI Engineer session on The Bitter Layout or: How I Learned to Love the Model Picker, presented by Maximillian Piras, Yutori. It adds practical context for how teams are building and operating AI systems in production.
Thinking Deeper in Gemini — Jack Rae, Google DeepMind
AI Engineer session on Thinking Deeper in Gemini, presented by Jack Rae, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
Using OSS models to build AI apps with millions of users — Hassan El Mghari
AI Engineer session on Using OSS models to build AI apps with millions of users, presented by Hassan El Mghari. It adds practical context for how teams are building and operating AI systems in production.
Vector Search Benchmark[eting] - Philipp Krenn, Elastic
AI Engineer session on Vector Search Benchmark[eting] - Philipp Krenn, Elastic. It adds practical context for how teams are building and operating AI systems in production.
What every AI engineer needs to know about GPUs — Charles Frye, Modal
AI Engineer session on What every AI engineer needs to know about GPUs, presented by Charles Frye, Modal. It adds practical context for how teams are building and operating AI systems in production.
How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Grok 4 - 10 New Things to Know
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Analyzing 10,000 Sales Calls With AI In 2 Weeks — Charlie Guo
AI Engineer session on Analyzing 10,000 Sales Calls With AI In 2 Weeks, presented by Charlie Guo. It adds practical context for how teams are building and operating AI systems in production.
ChatGPT is poorly designed. So I fixed it
AI Engineer session titled "ChatGPT is poorly designed. So I fixed it". It adds practical context for how teams are building and operating AI systems in production.
MCP: Origins and Requests For Startups — Theodora Chu, Model Context Protocol PM, Anthropic
AI Engineer session on MCP: Origins and Requests For Startups, presented by Theodora Chu, Model Context Protocol PM, Anthropic. It adds practical context for how teams are building and operating AI systems in production.
The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani
AI Engineer session on The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani. It adds practical context for how teams are building and operating AI systems in production.
The Demo I Wish I'd Had: OpenAI's Agents SDK... serverless! - Brook Riggio
AI Engineer session on The Demo I Wish I'd Had: OpenAI's Agents SDK... serverless! - Brook Riggio. It adds practical context for how teams are building and operating AI systems in production.
The Future of Qwen: A Generalist Agent Model — Junyang Lin, Alibaba Qwen
AI Engineer session on The Future of Qwen: A Generalist Agent Model, presented by Junyang Lin, Alibaba Qwen. It adds practical context for how teams are building and operating AI systems in production.
The Voice-First AI Overlay: Designing Conversational Co-Pilots - Gregory Bruss
AI Engineer session on The Voice-First AI Overlay: Designing Conversational Co-Pilots - Gregory Bruss. It adds practical context for how teams are building and operating AI systems in production.
Veo 3 for Developers — Paige Bailey, Google DeepMind
AI Engineer session on Veo 3 for Developers, presented by Paige Bailey, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
When Will AI Models Blackmail You, and Why?
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Accelerates: New Gemini Model + AI Unemployment Stories Analysed
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Claude 4: Full 120 Page Breakdown … Is it the Best New Model?
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Google Takes No Prisoners Amid Torrent of AI Announcements
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Improves at Self-improving
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
"OpenAI is Not God” - The DeepSeek Documentary on Liang Wenfeng, R1 and What's Next
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
o3 breaks (some) records, but AI becomes pay-to-win
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
[Full Workshop from Microsoft] Github Copilot - The World's Most Widely Adopted AI Developer Tool
AI Engineer session on [Full Workshop from Microsoft] Github Copilot - The World's Most Widely Adopted AI Developer Tool. It adds practical context for how teams are building and operating AI systems in production.
Accelerate your AI journey with Azure AI model catalog: Sharmila Chokalingam
AI Engineer session on Accelerate your AI journey with Azure AI model catalog: Sharmila Chokalingam. It adds practical context for how teams are building and operating AI systems in production.
Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen
AI Engineer session on Best Practices for Evaluating Large Language Model Applications with llmeval: Niklas Nielsen. It adds practical context for how teams are building and operating AI systems in production.
Building Agents with Model Context Protocol - Full Workshop with Mahesh Murag of Anthropic
AI Engineer session on Building Agents with Model Context Protocol - Full Workshop with Mahesh Murag of Anthropic. It adds practical context for how teams are building and operating AI systems in production.
Customized, production ready inference with open source models: Dmytro (Dima) Dzhulgakov
AI Engineer session on Customized, production ready inference with open source models: Dmytro (Dima) Dzhulgakov. It adds practical context for how teams are building and operating AI systems in production.
Decoding Mistral AI's Large Language Models: Devendra Chaplot
AI Engineer session on Decoding Mistral AI's Large Language Models: Devendra Chaplot. It adds practical context for how teams are building and operating AI systems in production.
Evaluating Domain Specific LLMs for Real World Finance — Waseem Alshikh, Writer
AI Engineer session on Evaluating Domain Specific LLMs for Real World Finance, presented by Waseem Alshikh, Writer. It adds practical context for how teams are building and operating AI systems in production.
Fine tune 20 Llama Models in 5 Minutes: Santosh Radha
AI Engineer session on Fine tune 20 Llama Models in 5 Minutes: Santosh Radha. It adds practical context for how teams are building and operating AI systems in production.
Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han
AI Engineer session on Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han. It adds practical context for how teams are building and operating AI systems in production.
From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta
AI Engineer session on From model weights to API endpoint with TensorRT LLM: Philip Kiely and Pankaj Gupta. It adds practical context for how teams are building and operating AI systems in production.
Frontier Feud: Anthropic, Google DeepMind, Meta FAIR, Thinking Machines — Barr Yaron, Amplify
AI Engineer session on Frontier Feud: Anthropic, Google DeepMind, Meta FAIR, Thinking Machines, presented by Barr Yaron, Amplify. It adds practical context for how teams are building and operating AI systems in production.
GitHub Copilot: The World's Most Widely Adopted AI Developer Tool
AI Engineer session on GitHub Copilot: The World's Most Widely Adopted AI Developer Tool. It adds practical context for how teams are building and operating AI systems in production.
How Deep Research Works - Mukund Sridhar & Aarush Selvan, Google DeepMind
AI Engineer session on How Deep Research Works - Mukund Sridhar & Aarush Selvan, Google DeepMind. It adds practical context for how teams are building and operating AI systems in production.
How to build the world's fastest voice bot: Kwindla Hultman Kramer
AI Engineer session on How to build the world's fastest voice bot: Kwindla Hultman Kramer. It adds practical context for how teams are building and operating AI systems in production.
How to evaluate a model for your use case: Emmanuel Turlay
AI Engineer session on How to evaluate a model for your use case: Emmanuel Turlay. It adds practical context for how teams are building and operating AI systems in production.
Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinkaran
AI Engineer session on Lessons from the Trenches: Building LLM Evals That Work IRL: Aparna Dhinkaran. It adds practical context for how teams are building and operating AI systems in production.
Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov
AI Engineer session on Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov. It adds practical context for how teams are building and operating AI systems in production.
Moondream: how does a tiny vision model slap so hard? — Vikhyat Korrapati
AI Engineer session on Moondream: how does a tiny vision model slap so hard?, presented by Vikhyat Korrapati. It adds practical context for how teams are building and operating AI systems in production.
Multi model multimodal and multi agent innovations in Azure AI: Cedric Vidal
AI Engineer session on Multi model multimodal and multi agent innovations in Azure AI: Cedric Vidal. It adds practical context for how teams are building and operating AI systems in production.
Productionizing GenAI Models — Lessons from the world's best AI teams: Lukas Biewald
AI Engineer session on Productionizing GenAI Models: Lessons from the world's best AI teams, presented by Lukas Biewald. It adds practical context for how teams are building and operating AI systems in production.
RAG and the MongoDB Document Model: Ben Flast
AI Engineer session on RAG and the MongoDB Document Model: Ben Flast. It adds practical context for how teams are building and operating AI systems in production.
State Space Models for Realtime Multimodal Intelligence: Karan Goel
AI Engineer session on State Space Models for Realtime Multimodal Intelligence: Karan Goel. It adds practical context for how teams are building and operating AI systems in production.
Stateful Agents — Full Workshop with Charles Packer of Letta and MemGPT
AI Engineer session on Stateful Agents, a full workshop presented by Charles Packer of Letta and MemGPT. It adds practical context for how teams are building and operating AI systems in production.
System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis
AI Engineer session on System Design for Next-Gen Frontier Models, presented by Dylan Patel, SemiAnalysis. It adds practical context for how teams are building and operating AI systems in production.
The Model Isn’t Wrong — You’re Just Bad at Prompting
AI Engineer session on The Model Isn’t Wrong: You’re Just Bad at Prompting. It adds practical context for how teams are building and operating AI systems in production.
Unveiling the latest Gemma model advancements: Kathleen Kenealy
AI Engineer session on Unveiling the latest Gemma model advancements: Kathleen Kenealy. It adds practical context for how teams are building and operating AI systems in production.
WTF do people use Open Models for??
AI Engineer session on WTF do people use Open Models for?? It adds practical context for how teams are building and operating AI systems in production.
‘Speaking Dolphin’ to AI Data Dominance, 4.1 + Kling 2.0: 7 Updates Critically Analysed
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
o3 and o4-mini - they’re great, but easy to over-hype
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI CEO: ‘Stock Crash Could Stop AI Progress’, Llama 4 Anti-climax + ‘Superintelligence in 2027’ ...
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI’s New ImageGen is Unexpectedly Epic … (ft. Reve, Imagen 3, Midjourney etc)
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations
NIST finalizes AI 100-2e2025, providing a terminology and taxonomy for adversarial machine learning across predictive and generative AI systems.
Progress from our Frontier Red Team
Anthropic shares lessons from frontier red teaming and discusses where models are showing early-warning signs of higher-risk cyber and biology capabilities.
Manus AI - The Calm Before the Hypestorm … (vs Deep Research + Grok 3)
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4.5 - not so much wow
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Claude 3.7 is More Significant than its Name Implies (ft DeepSeek R2 + GPT 4.5 coming soon)
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Deep research System Card
OpenAI’s system card for deep research covers prompt injection, privacy, code execution, and external red teaming prior to release.
AGI: (gets close), Humans: ‘Who Gets to Own it?’
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
o3-mini and the “AI War”
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Nothing Much Happens in AI, Then Everything Does All At Once
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Operator System Card
The Operator system card documents red teaming and mitigation choices for a computer-using agent, with prompt injections listed as a central risk area.
Altman Expects a ‘Fast Take-off’, ‘Super-Agent’ Debuting Soon and DeepSeek R1 Out
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
o3 - wow
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Never Browse Alone? Gemini 2 Live and ChatGPT Vision
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Sora is Out, But is it a Distraction?
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
o1 Pro Mode – ChatGPT Pro Full Analysis (plus o1 paper highlights)
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Breaks Its Silence: OpenAI’s ‘Next 12 Days’, Genie 2, and a Word of Caution
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
New Google Model Ranked ‘No. 1 LLM’, But There’s a Problem
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
ChatGPT with Search, Altman AMA
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI - 2024AD: 212-page Report (from this morning) Fully Read w/ Highlights
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI: ‘We Just Reached Human-level Reasoning’.
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
‘Advanced Voice’ ChatGPT Just Happened … But There's 3 Other Stories You Probably Shouldn’t Ignore
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
o1 - What is Going On? Why o1 is a 3rd Paradigm of Model + 10 Things You Might Not Know
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
ChatGPT o1 - In-Depth Analysis and Reaction (o1-preview)
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
$125B for Superintelligence? 3 Models Coming, Sutskever's Secret SSI, & Data Centers (in space)...
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Grok-2 Actually Out, But What If It Were 10,000x the Size?
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Was GPT-5 Underwhelming, Or Not? OpenAI Co-founder Exits, Figure02 Arrives, Character.AI Gutted
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
Llama 405b: Full 92 page Analysis, and Uncontaminated SIMPLE Benchmark Results
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT-4o Mini Arrives In Global IT Outage, But How ‘Mini’ Is Its Intelligence?
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
How Far Can We Scale AI? Gen 3, Claude 3.5 Sonnet and AI Hype
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Won't Be AGI, Until It Can At Least Do This (plus 6 key ways LLMs are being upgraded)
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
‘Everything is Going to Be Robotic’ Nvidia Promises, as AI Gets More Real
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Microsoft Promises a 'Whale' for GPT-5, Anthropic Delves Inside a Model’s Mind and Altman Stumbles
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT-4o - Full Breakdown + Bonus Details
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Conquers Gravity: Robo-dog, Trained by GPT-4, Stays Balanced on Rolling, Deflating Yoga Ball
This AI Explained video reviews a major AI development through the lens of robotics, world models, and embodied AI. It is useful context for AI engineering, evaluation, governance, and operational risk.
New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD
AI Engineer session on Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD. It adds practical context for how teams are building and operating AI systems in production.
The Code AI Maturity Model and What It Means For You: Ado Kukic
AI Engineer session on The Code AI Maturity Model and What It Means For You: Ado Kukic. It adds practical context for how teams are building and operating AI systems in production.
‘Her’ AI, Almost Here? Llama 3, Vasa-1, and Altman ‘Plugging Into Everything You Want To Do’
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Udio, the Mysterious GPT Update, and Infinite Attention
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Why Does OpenAI Need a 'Stargate' Supercomputer? Ft. Perplexity CEO Aravind Srinivas
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
5 Key Quotes: Altman, Huang and 'The Most Interesting Year'
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
The New, Smartest AI: Claude 3 – Tested vs Gemini 1.5 + GPT-4
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
The AI 'Genie' is Out + Humanoid Robotics Step Closer
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Sora - Full Analysis (with new details)
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini 1.5 and The Biggest Night in AI
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini Ultra - Full Review
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT-5: Everything You Need to Know So Far
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
Alpha Everywhere: AlphaGeometry, AlphaCodium and the Future of LLMs
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI Flip-Flops and '10% Chance of Outperforming Humans in Every Task by 2027' - 3K AI Researchers
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI On An Exponential? Data, Mamba, and More
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
Midjourney v6, Altman 'Age Reversal' and Gemini 2 - Christmas Edition
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
A 100T Transformer Model Coming? Plus ByteDance Saga and the Mixtral Price Drop
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Phi-2, Imagen-2, Optimus-Gen-2: Small New Models to Change the World?
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Gemini Full Breakdown + AlphaCode 2 Bombshell
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
OpenAI Insights and Training Data Shenanigans - 7 'Complicated' Developments + Guest Star
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
Q* - Clues to the Puzzle?
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Are We Back to Before? OpenAI 2.0, Inflection-2 and a Major AI Cancer Breakthrough
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
Altman@Microsoft, Shear@OpenAI, Chaos@Everywhere: Sutskever Regret and the Weekend That Changed AI
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Altman Out: Reasons, Reactions and the Repercussions for the Industry
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4 Turbo-Charged? Plus Custom GPTS, Grok, AGI Tier List, Vision Demos, Whisper V3 and more
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Declarations and AGI Timelines – Looking More Optimistic?
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
State of AI 2023: Highlights of 163 Page Report + Eureka Self-Improvement, MEG, Suno AI and GPT F
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Not Slowing Down: GAIA-1 to GPT Vision Tips, Nvidia B100 to Bard vs LLaVA
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
RT-X and the Dawn of Large Multimodal Models: Google Breakthrough and 160-page Report Highlights
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
An Actually Big Week in AI: AutoGen, The A-Phone, Mistral 7B, GPT-Fathom and Meta Hunts CharacterAI
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
ChatGPT Fails Basic Logic but Now Has Vision, Wins at Chess and Prompts a Masterpiece
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
The New Bard and AI Images, Videos, and Translations
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
9 AI Developments: HeyGen 2.0 to AjaxGPT, Open Interpreter to NExT-GPT and Roblox AI
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
AGI Will Not Be A Chatbot - Autonomy, Acceleration, and Arguments Behind the Scenes
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
SmartGPT: Major Benchmark Broken - 89.0% on MMLU + Exam's Many Errors
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
9 New Gemini Leaks, Code Llama and A Major AI Consciousness Paper
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
AI Los Alamos? + New Realistic AI Avatars
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
11 Major AI Developments: RT-2 to '100X GPT-4'
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
Llama 2: Full Breakdown
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Bad AI Predictions: Bard Upgrade, 2 Years to AI Auto-Money, OpenAI Investigation and more
This AI Explained video reviews a major AI development through the lens of robotics, world models, and embodied AI. It is useful context for AI engineering, evaluation, governance, and operational risk.
Time Until Superintelligence: 1-2 Years, or 20? Something Doesn't Add Up
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Phi-1: A 'Textbook' Model
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Google Gemini: AlphaGo-GPT?
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
ChatGPT's Achilles' Heel
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
Sam Altman's World Tour, in 16 Moments
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
The AI News You Might Have Missed This Week - Zuckerberg to Falcon w/ SPQR
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
Orca: The Model Few Saw Coming
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
The AI News You Might Have Missed This Week
This AI Explained video reviews a major AI development through the lens of robotics, world models, and embodied AI. It is useful context for AI engineering, evaluation, governance, and operational risk.
'Show Your Working': ChatGPT Performance Doubled w/ Process Rewards (+Synthetic Data Event Horizon)
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
Hassabis, Altman and AGI Labs Unite - AI Extinction Risk Statement [ft. Sutskever, Hinton + Voyager]
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
12 New Code Interpreter Uses (Image to 3D, Book Scans, Multiple Datasets, Error Analysis ... )
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4 Got Upgraded - Code Interpreter (ft. Image Editing, MP4s, 3D Plots, Data Analytics and more!)
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
'This Could Go Quite Wrong' - Altman Testimony, GPT 5 Timeline, Self-Awareness, Drones and more
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
Enter PaLM 2 (New Bard): Full Breakdown - 92 Pages Read and Gemini Before GPT 5? Google I/O
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4 is Smarter than You Think: Introducing SmartGPT
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
What's Behind the ChatGPT History Change? How You Can Benefit + The 6 New Developments This Week
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
8 Signs It's The Future: Thought-to-Text, Nvidia Text-to-Video, Character AI, and P(Doom) @Ted
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
‘We Must Slow Down the Race’ – X AI, GPT 4 Can Now Do Science and Altman GPT 5 Statement
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 5 Will be Released 'Incrementally' - 5 Points from Brockman Statement [plus Timelines & Safety]
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
Can GPT 4 Prompt Itself? MemoryGPT, AutoGPT, Jarvis, Claude-Next [10x GPT 4!] and more...
This AI Explained video reviews a major AI development through the lens of agentic workflows and tool-use risk. It is useful context for AI engineering, evaluation, governance, and operational risk.
Do We Get the $100 Trillion AI Windfall? Sam Altman's Plans, Jobs & the Falling Cost of Intelligence
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
'Pause Giant AI Experiments' - Letter Breakdown w/ Research Papers, Altman, Sutskever and more
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
How Well Can GPT-4 See? And the 5 Upgrades That Are Next
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.
'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more]
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
Google Bard - The Full Review. Bard vs Bing [LaMDA vs GPT 4]
This AI Explained video reviews a major AI development through the lens of multimodal generation and provenance. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4: 9 Revelations (not covered elsewhere)
This AI Explained video reviews a major AI development through the lens of governance and responsible deployment. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4: Full Breakdown (14 Details You May Have Missed)
This AI Explained video reviews a major AI development through the lens of AI safety and model behavior. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 5 is All About Data
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
8 New Ways to Use Bing's Upgraded 8 [now 20] Message Limit (ft. pdfs, quizzes, tables, scenarios...)
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
9 of the Best Bing (GPT 4) Prompts
This AI Explained video reviews a major AI development through the lens of model capability and AI systems in practice. It is useful context for AI engineering, evaluation, governance, and operational risk.
8 Ways ChatGPT 4 [Is] Better Than ChatGPT
This AI Explained video reviews a major AI development through the lens of benchmarks and evaluation evidence. It is useful context for AI engineering, evaluation, governance, and operational risk.
GPT 4 - hype vs reality
This AI Explained video reviews a major AI development through the lens of scaling and compute economics. It is useful context for AI engineering, evaluation, governance, and operational risk.