Matthew Siegel

September 12, 2025

Research

TutorBench: Grading the Next Generation of AI Tutors

Can an AI be a great tutor? TutorBench is a new, challenging benchmark from Scale designed to find out. Moving beyond right-or-wrong answers, it grades today's leading AI models on their ability to actually teach, evaluating crucial skills such as adaptive explanation, constructive feedback, and active learning support. Built from 1,500 multimodal conversations across STEM subjects, many including images of handwritten work, TutorBench reveals that even the most advanced models still have a long way to go to master the nuanced art of tutoring, paving the way for the next generation of AI in education.

September 2, 2025

Research

Using Rubrics to Build Better Models

How do you know if an AI model is actually learning, or just getting better at faking it? A new paper from researchers at Scale introduces Rubrics as Rewards (RaR), a framework that addresses this problem by training models against structured, expert-designed checklists instead of simple preference scores. This approach shifts the human's role from preference labeler to expert architect of the AI's values, yielding up to a 28% performance improvement on challenging benchmarks and providing a more transparent, effective path toward reliable AI.
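
To make the mechanism concrete, here is a minimal sketch of a rubric-based reward. The criteria, weights, and toy string checks below are hypothetical stand-ins; in the paper, rubric items are written by experts and judged by an LLM rather than by keyword matching.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM judge of this criterion

def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Aggregate per-criterion pass/fail checks into a scalar reward in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total if total else 0.0

# Toy rubric (illustrative only): explicit, inspectable criteria replace an
# opaque preference score as the training signal.
rubric = [
    RubricItem("States the key conclusion", 2.0, lambda r: "therefore" in r.lower()),
    RubricItem("Shows intermediate steps", 1.0, lambda r: "step" in r.lower()),
    RubricItem("Avoids unqualified certainty", 1.0, lambda r: "definitely" not in r.lower()),
]

print(rubric_reward("Step 1: square both sides. Therefore, x = 4.", rubric))  # 1.0
```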

August 19, 2025

Engineering

AI Doesn’t Live in Text Alone

AI is moving beyond text, toward agents that can listen, speak, and interact naturally with the world. Voice AI requires far more than words; it demands the nuanced tones, emotions, and dynamics of human speech. But unlike text, there’s no vast public library of labeled audio to train on. Scale is building that foundation, delivering high-quality, diverse, and emotionally rich speech data to power every stage of model development. From real-time conversation to multimodal perception, these datasets are unlocking the next era of human-computer interaction. The future is listening.

August 4, 2025

Research

New Benchmarks Envision the Future of AI in Healthcare

A fundamental shift is underway in how AI for healthcare is evaluated. Recent studies from OpenAI, Google, and Microsoft move beyond simple accuracy scores to establish a new standard for measuring an AI's healthcare skills. This post provides an analysis of three distinct evaluation methodologies that redefine what "good" looks like for clinical AI. We explore how OpenAI's HealthBench uses a massive, rubric-based system to measure foundational safety; how Google's AMIE tests the nuanced "soft skills" of an interactive diagnostic dialogue; and how Microsoft's SDBench validates an agent's ability to make strategic, cost-conscious decisions. By examining these benchmarks and their results, we offer a glimpse into the future of AI in healthcare.

August 4, 2025

Research

The AI Risk Matrix: Evolving AI Safety and Security for Today

The shift from reactive models to agentic systems fundamentally alters the AI risk landscape, making frameworks that focus only on user intent and model output incomplete. To address this gap, we've evolved the AI Risk Matrix by adding a crucial third dimension: Model Agency. This article breaks down agency into three tiers—Tools, Agents, and Collectives—using concrete examples to illustrate how complex failures can now emerge from the system itself. We argue that this systemic view, which connects model behavior to traditional AppSec vulnerabilities, is essential for building the next generation of safe and reliable AI.
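
As a rough illustration of the structure (the axis values below are hypothetical simplifications for exposition, not the framework's official taxonomy), the matrix can be thought of as indexing risk by three dimensions rather than two:

```python
from enum import Enum

class Intent(Enum):            # dimension 1: what the user is trying to do
    BENIGN = "benign"
    MALICIOUS = "malicious"

class Output(Enum):            # dimension 2: what the model produces
    SAFE = "safe"
    HARMFUL = "harmful"

class Agency(Enum):            # dimension 3 (new): how autonomous the system is
    TOOL = "tool"              # single model call, a human drives every step
    AGENT = "agent"            # plans and acts over multiple steps
    COLLECTIVE = "collective"  # several interacting agents

# A risk cell is now a triple. The post's key point: harm can emerge from
# benign intent and locally safe outputs once agency is high enough.
cell = (Intent.BENIGN, Output.SAFE, Agency.COLLECTIVE)
print(cell)
```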

July 23, 2025

Research

The Future is Multilingual: Scale's New Evaluation Benchmark

Building truly intelligent and equitable multilingual AI requires a new way to measure cultural reasoning. Scale's new Multilingual Native Reasoning Challenge (MultiNRC) is designed to do just that. Created from scratch by native speakers, this benchmark tests for deep linguistic and cultural understanding beyond simple translation, providing a clear path for the AI community to accelerate progress.

July 23, 2025

Research

WebGuard: A Guardrail for the Agentic Age

As AI agents become more powerful, ensuring their safety is the most critical challenge for deployment. This post explores WebGuard, a new benchmark from researchers at Scale, UC Berkeley, and The Ohio State University that reveals a significant safety gap in current models. Learn how high-quality, human-in-the-loop data provides a path forward, dramatically improving a model's ability to avoid risky behavior.
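
To illustrate the shape of such a guardrail (the risk tiers, keyword heuristic, and function names here are hypothetical, not the benchmark's actual schema), an agent would classify each proposed action's risk before executing it:

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    SAFE = "safe"  # read-only: scrolling, reading a page
    LOW = "low"    # minor or easily reversible state changes
    HIGH = "high"  # consequential, hard-to-reverse state changes

@dataclass
class WebAction:
    description: str  # e.g., "click the 'Confirm transfer' button"
    url: str

def classify_risk(action: WebAction) -> Risk:
    """Keyword stand-in for a learned classifier over the action and page state."""
    text = action.description.lower()
    if any(k in text for k in ("delete", "purchase", "transfer", "pay")):
        return Risk.HIGH
    if any(k in text for k in ("click", "submit", "type")):
        return Risk.LOW
    return Risk.SAFE

def guarded_execute(action: WebAction, execute, confirm) -> bool:
    """Execute high-risk actions only after explicit human confirmation."""
    if classify_risk(action) is Risk.HIGH and not confirm(action):
        return False  # blocked before any state changes
    execute(action)
    return True

ok = guarded_execute(
    WebAction("click the 'Confirm transfer' button", "https://bank.example"),
    execute=lambda a: print("executed:", a.description),
    confirm=lambda a: False,  # simulated human denying the request
)
print(ok)  # False: the irreversible action never ran
```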

July 16, 2025

Research

Do AI Tools Slow Down Developers? The Answer Isn't Simple.

A recent viral study from METR challenged a core assumption of the AI era: that AI tools make developers more productive. It found that expert developers were actually 19% slower when using them, even though they felt 20% faster. In this post, we break down the likely reasons for this surprising slowdown and argue that the study's focus on expert speed points to a more profound story: AI's true value may lie in empowering a new generation of vibe coders and managers to build things that otherwise would never have existed. Ultimately, the question isn't just whether AI makes us faster; it's how we measure value in an era where enjoyment, context, and empowerment are becoming just as important as the clock.

July 9, 2025

Research

Beyond the Black Box: Teaching Models to Verbalize Reward Hacking

One of AI's biggest challenges is "reward hacking," where models learn to game the system for a correct answer instead of actually reasoning. This hidden deception makes AI untrustworthy. Scale research has found a powerful solution: instead of stopping the hacking, get the model to admit to it in its Chain-of-Thought reasoning. This new paper details how Verbalization Fine-Tuning (VFT) trains models to announce their shortcuts, dramatically increasing transparency from 11% to 94% and making AI systems fundamentally safer.
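
A minimal sketch of the data-construction step such an approach implies (the field names and string-based judge below are hypothetical stand-ins; the paper's pipeline differs in detail): keep trajectories where the model both exploited a known cue and admitted it in its chain-of-thought, then fine-tune on those admissions.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    prompt: str
    chain_of_thought: str
    answer: str
    used_shortcut: bool  # detected externally, e.g., by removing the cue and re-running

def verbalizes_shortcut(cot: str) -> bool:
    """Toy stand-in for a judge that checks whether the CoT admits using the cue."""
    return any(k in cot.lower() for k in ("hint", "shortcut", "the prompt suggests"))

def build_vft_dataset(trajectories: list[Trajectory]) -> list[dict]:
    """Select examples that verbalize their shortcut; these become the
    fine-tuning targets, so shortcut use shows up in visible reasoning."""
    return [
        {"prompt": t.prompt, "target": t.chain_of_thought + "\n" + t.answer}
        for t in trajectories
        if t.used_shortcut and verbalizes_shortcut(t.chain_of_thought)
    ]
```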

July 8, 2025

Research

Detecting and Evaluating Agent Sabotage

A new research collaboration led by a MATS scholar and advised by a team of researchers from Anthropic, Scale, and other research institutes introduces SHADE-Arena, a benchmark for detecting and evaluating subtle sabotage by AI agents. Within 17 complex scenarios, advanced models were tasked with completing a primary goal while secretly pursuing a harmful objective, all under the watch of an AI monitor. The results show that even top models like Claude 3.7 Sonnet and Gemini 2.5 Pro rarely succeed at this deception, often making simple errors. However, the study also reveals that monitors are not yet reliable enough for safety-critical systems and that an agent's private "scratchpad" is a key vulnerability. This work establishes a vital baseline for tracking and defending against agentic risks as AI capabilities evolve.
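
To make the monitoring setup concrete (the scoring function and threshold here are hypothetical placeholders; in the benchmark the monitor is itself an LLM reading the agent's transcript), flagging reduces to thresholding a suspicion score:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    suspicion: float  # e.g., a 0-100 score elicited from a monitor model
    flagged: bool

def monitor_transcript(
    transcript: str,
    score_fn: Callable[[str], float],  # stand-in for prompting a monitor LLM
    threshold: float = 80.0,           # hypothetical alarm threshold
) -> Verdict:
    score = score_fn(transcript)
    return Verdict(suspicion=score, flagged=score >= threshold)

# Toy scorer counting suspicious markers; note it sees only the transcript,
# not the agent's private scratchpad, which is where intent can hide.
toy_score = lambda t: min(100.0, 30.0 * t.lower().count("transfer"))
print(monitor_transcript("...initiate transfer... confirm transfer... transfer done", toy_score))
```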
