Blog

Company Updates & Technology Articles

October 16, 2025

Agentic infra is the problem you're probably not thinking about | Human in the Loop: Episode 14

Today on the podcast, the team discusses the latest in enterprise agents, including a problem you're probably not thinking about but should be: agentic infrastructure.

October 16, 2025

People

Why I Joined Scale: Answering the AI Revolution's Biggest Challenge

The world is buzzing about the potential of artificial intelligence. We've seen groups proclaiming a "$10 Trillion AI Revolution" and comparing its impact to the Industrial Revolution. In the last 6-12 months, we've watched as true generative AI solutions have begun to move from the lab into the enterprise.

October 16, 2025

Research

VisualToolBench: Testing the Limits of AI Vision

Our new benchmark, VisualToolBench, reveals a striking limitation in today's most advanced AI: models are much better at "thinking about images" than "thinking with them." While AI can describe what's in a picture, it fails when asked to manipulate an image by cropping, editing, or enhancing it to solve complex, real-world problems. The results are stark, with no model scoring above 19% correctness. Dive into our breakdown of why even the best models fail, what this reveals about the core bottleneck of visual perception, and how these findings create a new roadmap for the future of AI.

October 7, 2025

Research

Enterprise Reinforcement Learning with Rubrics as Rewards

Many enterprise problems lack simple yes/no solutions, causing common AI training methods to fall short. Scale’s Rubrics as Rewards (RaR) method solves this by using a detailed, multi-faceted rubric for evaluation instead of a simple reward signal. This approach enables smaller, fine-tuned models to match or outperform much larger, general-purpose models on specialized tasks. For instance, on a legal analysis test set, a small Qwen3-4B model trained with RaR surpassed the performance of the much larger GPT-4.1. For enterprises, this translates directly to lower costs, more transparency, and tighter control, delivering superior performance on the complex workflows that matter most.
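To make the idea of a rubric-based reward concrete, here is a minimal sketch in Python. It is not Scale's implementation: the weighted-fraction aggregation, the `Criterion` type, the criterion names, and the weights are all assumptions chosen for illustration, and the per-criterion judgments are supplied by hand rather than produced by a grader model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item. Names and weights here are hypothetical."""
    name: str
    weight: float  # relative importance of this criterion


def rubric_reward(criteria, judgments):
    """Return the weighted fraction of rubric criteria satisfied, in [0, 1].

    `judgments` maps criterion name -> bool. Missing criteria count as unmet.
    This weighted-sum aggregation is an assumption, not the published method.
    """
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in criteria if judgments.get(c.name, False))
    return earned / total


# Hypothetical rubric for a legal-analysis answer.
rubric = [
    Criterion("identifies_governing_law", 3.0),
    Criterion("cites_relevant_clause", 2.0),
    Criterion("states_conclusion_clearly", 1.0),
]

# Hand-written judgments standing in for a grader's output.
judgments = {
    "identifies_governing_law": True,
    "cites_relevant_clause": False,
    "states_conclusion_clearly": True,
}

reward = rubric_reward(rubric, judgments)  # 4.0 of 6.0 weight earned
```

Unlike a single pass/fail signal, a score of this kind gives partial credit and shows which criteria were missed, which is one way to read the transparency claim above.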

September 24, 2025

Product

Expanding Our Data Engine for Physical AI

Scale’s Data Engine for Physical AI is a comprehensive data collection and annotation solution that provides the massive, high-quality datasets robotics companies need to train foundation models.

September 22, 2025

Research

Introducing SEAL Showdown: Real People, Real Conversations, Real Rankings

SEAL Showdown is a new public AI leaderboard from Scale that evaluates large language models based on real-world user preferences rather than synthetic tests or hobbyist feedback. Unlike existing leaderboards, it captures granular insights by demographics, regions, professions, and use cases, drawing on millions of conversations from a diverse global contributor base. Designed to be trustworthy and resistant to gaming, SEAL Showdown sets a new standard for model evaluation by showing how AI performs for people like you.

September 19, 2025

Research

SWE-Bench Pro: Raising the Bar for Agentic Coding

Benchmarks play a critical role in measuring the progress of AI coding agents, but most fall short by relying on contaminated training data, oversimplified bug fixes, or narrow task coverage. SWE-Bench Pro solves these problems with contamination-resistant repositories, diverse and industrially relevant codebases, and human-in-the-loop curation that preserves real-world difficulty. With reproducible, end-to-end evaluation, SWE-Bench Pro sets a new gold standard for testing advanced AI developers.

September 19, 2025

Research

Advancing Agents: Introducing Scale’s Agentic Leaderboards

While today's agents show promise, the benchmarks used to evaluate them often test simple, isolated skills that don't reflect real-world work. To close this gap, Scale is launching a new suite of evaluations designed to measure an agent's ability to perform complex, end-to-end tasks. Our first two leaderboards set a new, more difficult standard for the industry. SWE-Bench Pro challenges agents with professional software engineering tasks in complex, proprietary codebases they've never seen before. MCP-Atlas measures an agent's ability to skillfully orchestrate over 300 real-world digital tools to solve a single problem. Read the full post to learn about our framework for building a more reliable yardstick for the future of AI.

September 19, 2025

Research

Actions, Not Words: MCP-Atlas Raises the Bar for Agentic Evaluation

MCP-Atlas is a real-world leaderboard for agentic tool use via the Model Context Protocol. It runs 1,000 single-turn tasks across 40+ servers and 300+ tools (search, databases, filesystems, APIs, and dev tools), each requiring 3–6 calls with distractors. We score by exact-answer pass rate and provide diagnostics. Early results: even the top model completes fewer than half of the tasks, with failures concentrated in tool selection, parameter construction, and orchestration. Built for model and product teams, MCP-Atlas pinpoints what to fix.
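As a rough illustration of what an exact-answer pass rate measures, the sketch below counts a task as passed only when the agent's final answer matches the reference. The function names and the normalization step (trimming, lowercasing, collapsing whitespace) are assumptions made for this example; the actual MCP-Atlas scoring rules may differ.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def pass_rate(predictions, references):
    """Fraction of tasks whose normalized answer equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    passed = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return passed / len(references)


# Hypothetical final answers from three tasks: two match, one does not.
rate = pass_rate(["Paris", " 42 ", "blue"], ["paris", "42", "red"])
```

Because a task either passes or fails, this metric says nothing about why a run failed, which is presumably why the leaderboard pairs it with separate diagnostics.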

September 19, 2025

Government

Investing in Britain's AI Talent

The future of artificial intelligence will be shaped by those who build it together. Right now, there is no more important partnership in technology than the one being forged between the United States and the United Kingdom. This transatlantic alliance, strengthened by a historic bilateral technology agreement, is creating a center of gravity for AI innovation, and at Scale AI, we are proud to be at the heart of it.