WolfBench (2026-03-10)

Wolfram Ravenwolf’s Four-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report just a single average. WolfBench shows four metrics: the rock-solid base you can always count on, the average you can expect, the best a single run achieved, and the ceiling of what’s theoretically possible. The spread between them tells you how consistent – or how unpredictable – an AI agent really is.
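As a rough illustration of how the four metrics relate, here is a minimal Python sketch. It assumes each run is recorded as the set of task IDs it solved; the function name `four_metrics` and the toy data are hypothetical and are not taken from the WolfBench code. Solid is computed as the tasks solved in every run and ceiling as the tasks solved in at least one run, following the "always solved" and "ever solved" wording of the chart legend.

```python
from typing import Dict, List, Set


def four_metrics(runs: List[Set[str]], total_tasks: int) -> Dict[str, float]:
    """Compute solid / average / best-of / ceiling as fractions of total_tasks.

    runs: one set of solved task IDs per run (hypothetical data layout).
    """
    if not runs or total_tasks <= 0:
        raise ValueError("need at least one run and a positive task count")
    solid = set.intersection(*runs)   # solved in every run
    ceiling = set.union(*runs)        # solved in at least one run
    scores = [len(solved) / total_tasks for solved in runs]
    return {
        "solid": len(solid) / total_tasks,
        "average": sum(scores) / len(scores),
        "best_of": max(scores),
        "ceiling": len(ceiling) / total_tasks,
    }


# Toy example: 3 runs over a 5-task benchmark (made-up task IDs).
toy_runs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}]
metrics = four_metrics(toy_runs, total_tasks=5)
print(metrics)
```

In the toy data only task "a" is solved every time, while four different tasks are solved at least once, so the gap between solid and ceiling is wide even though the individual run scores are close together.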

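T2 = Terminus-2 · CC = Claude Code · OC = OpenClaw
▲ Ceiling (ever solved) · ★ Best-of (peak run) · ∅ Average (mean score) · ■ Solid (always solved)

Model | Agent | Version | ■ Solid | ∅ Average | ★ Best-of | ▲ Ceiling | Runs
Claude Opus 4.6 | T2 | 2.0.0 | 55% | 71% | 75% | 84% | 5R@1h
Claude Opus 4.6 | T2 | 2.0.0 | 55% | 73% | 75% | 88% | 5R@2h
Claude Opus 4.6 | CC | 2.1.63 | 45% | 63% | 67% | 81% | 5R@1h
Claude Opus 4.6 | CC | 2.1.63 | 46% | 64% | 69% | 80% | 5R@2h
Claude Opus 4.6 | OC | 2026.2.17 | 33% | 51% | 56% | 64% | 5R@2h
Claude Opus 4.6 | OC | 2026.3.1 | 42% | 58% | 58% | 74% | 5R@1h
Claude Sonnet 4.6 | T2 | 2.0.0 | 42% | 62% | 64% | 81% | 5R@1h
Claude Sonnet 4.6 | T2 | 2.0.0 | 42% | 61% | 63% | 81% | 5R@2h
Claude Sonnet 4.6 | CC | 2.1.63 | 40% | 52% | 54% | 67% | 5R
Claude Sonnet 4.6 | CC | 2.1.63 | 40% | 58% | 63% | 75% | 5R@1h
Claude Sonnet 4.6 | CC | 2.1.66 | 43% | 57% | 60% | 70% | 3R@2h
Claude Sonnet 4.6 | CC | 2.1.68 | 47% | 54% | 55% | 61% | 2R@2h
Claude Sonnet 4.6 | OC | 2026.3.1 | 36% | 53% | 56% | 70% | 5R@1h
Claude Sonnet 4.6 | OC | 2026.3.1 | 37% | 53% | 58% | 69% | 5R@2h
Kimi K2.5 | T2 | 2.0.0 | 31% | 48% | 52% | 63% | 5R@1h
Kimi K2.5 | T2 | 2.0.0 | 28% | 49% | 52% | 65% | 5R@2h
Kimi K2.5 | OC | 2026.2.17 | 10% | 32% | 35% | 57% | 5R@2h
Kimi K2.5 | OC | 2026.3.1 | 13% | 39% | 45% | 58% | 5R@1h
gpt-5.4 | T2 | 2.0.0 | 28% | 44% | 47% | 61% | 5R@1h
gpt-5.4 | OC | 2026.3.1 | 9% | 30% | 33% | 52% | 5R@1h
gpt-5.4 | OC | 2026.3.1 | 11% | 30% | 35% | 53% | 5R@2h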
Run Details (100 runs)
Date | Agent | Provider | Vendor | Model | Think | Score | Pass | Fail | Timeout | T/O | Err | Duration | In | Out
2026-03-09 07:46 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 3600s | 7 | 0 | 1h04m | 29.5M | 288K
2026-03-09 07:00 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 69.7% | 62 | 26 | 3600s | 8 | 1 | 1h08m | 146.4M | 1.6M
2026-03-09 06:38 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 32.6% | 29 | 60 | 3600s | 11 | 0 | 1h07m | 35.7M | 345K
2026-03-09 05:53 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 71.9% | 64 | 25 | 3600s | 5 | 0 | 1h06m | 151.5M | 1.5M
2026-03-09 05:33 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 31.5% | 28 | 61 | 3600s | 7 | 0 | 1h04m | 34.4M | 326K
2026-03-09 04:27 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 30.3% | 27 | 62 | 3600s | 7 | 0 | 1h06m | 42.8M | 322K
2026-03-09 04:13 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 70.8% | 63 | 25 | 3600s | 8 | 1 | 1h40m | 175.6M | 1.7M
2026-03-09 03:21 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 29.2% | 26 | 63 | 3600s | 5 | 0 | 1h05m | 29.6M | 333K
2026-03-09 03:07 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 28 | 3600s | 4 | 0 | 1h05m | 153.0M | 1.4M
2026-03-08 19:15 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h26m | 204.5M | 1.7M
2026-03-08 17:46 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 48.3% | 43 | 46 | 3600s | 12 | 0 | 1h29m | 193.4M | 1.7M
2026-03-08 16:04 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 1h41m | 236.0M | 1.7M
2026-03-08 14:26 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 75.3% | 67 | 22 | 3600s | 3 | 0 | 1h03m | 155.2M | 1.4M
2026-03-08 13:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 49.4% | 44 | 44 | 3600s | 13 | 1 | 2h12m | 195.4M | 1.7M
2026-03-08 12:24 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 46.1% | 41 | 48 | 3600s | 15 | 0 | 1h26m | 197.7M | 1.7M
2026-03-08 12:12 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 3600s | 10 | 0 | 1h07m | 259.5M | 2.2M
2026-03-08 10:25 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h46m | 216.2M | 1.9M
2026-03-08 09:53 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 37.1% | 33 | 55 | 3600s | 13 | 1 | 2h30m | 192.2M | 1.6M
2026-03-08 09:18 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 13 | 0 | 1h06m | 192.5M | 2.0M
2026-03-08 08:31 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 39.3% | 35 | 53 | 3600s | 13 | 1 | 1h21m | 188.4M | 1.6M
2026-03-08 08:10 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 6 | 0 | 1h08m | 189.4M | 1.9M
2026-03-08 07:14 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 34.8% | 31 | 58 | 3600s | 7 | 0 | 1h16m | 228.8M | 1.6M
2026-03-08 06:21 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 64.0% | 57 | 32 | 3600s | 8 | 0 | 1h48m | 151.2M | 1.9M
2026-03-08 05:38 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 49 | 3600s | 11 | 0 | 1h35m | 176.6M | 1.4M
2026-03-08 05:12 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h09m | 97.6M | 1.4M
2026-03-08 04:09 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 38.2% | 34 | 54 | 3600s | 6 | 1 | 1h29m | 171.4M | 1.6M
2026-03-08 04:03 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 57.3% | 51 | 38 | 3600s | 7 | 0 | 1h08m | 78.6M | 1.3M
2026-03-08 02:53 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 56.2% | 50 | 39 | 3600s | 10 | 0 | 1h10m | 74.2M | 1.3M
2026-03-08 01:44 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 37 | 3600s | 8 | 0 | 1h08m | 73.8M | 1.3M
2026-03-08 00:37 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 37 | 3600s | 5 | 0 | 1h07m | 83.4M | 1.3M
2026-03-08 00:33 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 44.9% | 40 | 48 | 3600s | 12 | 1 | 2h06m | 726.8M | 1.0M
2026-03-07 23:28 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 41 | 3600s | 3 | 2 | 1h08m | 95.3M | 2.1M
2026-03-07 23:25 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 43.8% | 39 | 50 | 3600s | 14 | 0 | 1h08m | 667.0M | 905K
2026-03-07 22:16 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 42.7% | 38 | 51 | 3600s | 12 | 0 | 1h08m | 707.0M | 878K
2026-03-07 22:07 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h20m | 86.8M | 2.0M
2026-03-07 20:58 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 39 | 3600s | 3 | 0 | 1h09m | 78.6M | 2.0M
2026-03-07 20:10 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 41.6% | 37 | 51 | 3600s | 15 | 1 | 2h05m | 759.6M | 939K
2026-03-07 19:47 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 43 | 3600s | 2 | 0 | 1h10m | 71.6M | 2.0M
2026-03-07 18:57 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 47.2% | 42 | 46 | 3600s | 14 | 1 | 1h12m | 775.5M | 982K
2026-03-07 18:15 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 48.3% | 43 | 46 | 3600s | 6 | 0 | 1h31m | 115.2M | 2.3M
2026-03-07 16:53 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 67.4% | 60 | 28 | 3600s | 6 | 1 | 1h22m | 222.3M | 1.2M
2026-03-07 15:47 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 62.9% | 56 | 33 | 3600s | 4 | 0 | 1h05m | 195.9M | 1.6M
2026-03-07 14:27 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 36 | 3600s | 6 | 1 | 1h20m | 169.0M | 1.2M
2026-03-07 13:18 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h09m | 188.9M | 1.2M
2026-03-07 11:53 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 1h24m | 209.0M | 1.4M
2026-03-07 10:13 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 41 | 3600s | 12 | 0 | 1h39m | 202.1M | 2.1M
2026-03-07 09:08 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 57.3% | 51 | 38 | 3600s | 4 | 0 | 1h05m | 166.9M | 1.7M
2026-03-07 08:02 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 3 | 0 | 1h05m | 185.9M | 1.8M
2026-03-07 06:56 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 57.3% | 51 | 38 | 3600s | 6 | 0 | 1h05m | 210.7M | 2.2M
2026-03-07 04:37 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 38 | 3600s | 5 | 1 | 2h19m | 216.0M | 2.3M
2026-03-06 03:38 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 22.5% | 20 | 69 | 7200s | 3 | 0 | 2h03m | 32.2M | 293K
2026-03-06 01:22 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 34.8% | 31 | 58 | 7200s | 9 | 0 | 2h16m | 32.0M | 289K
2026-03-05 23:18 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 7200s | 7 | 0 | 2h04m | 37.2M | 299K
2026-03-05 21:12 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 34.8% | 31 | 58 | 7200s | 6 | 0 | 2h06m | 34.4M | 300K
2026-03-05 19:07 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 7200s | 9 | 0 | 2h04m | 31.5M | 301K
2026-03-04 12:30 | Claude Code (2.1.68) | anthropic | anthropic | claude-sonnet-4-6 | - | 55.1% | 49 | 40 | 7200s | 3 | 0 | 2h03m | 364.3M | 3.1M
2026-03-04 10:16 | Claude Code (2.1.68) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 40 | 7200s | 3 | 2 | 2h13m | 411.8M | 4.1M
2026-03-04 08:10 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 58.4% | 52 | 36 | 7200s | 2 | 1 | 2h05m | 428.8M | 4.1M
2026-03-04 06:06 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 59.6% | 53 | 35 | 7200s | 3 | 1 | 2h03m | 347.5M | 4.1M
2026-03-04 03:44 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 40 | 7200s | 2 | 1 | 2h21m | 361.1M | 4.0M
2026-03-04 00:57 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 66.3% | 59 | 29 | 7200s | 3 | 1 | 2h10m | 475.9M | 3.5M
2026-03-03 22:18 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 36 | 7200s | 3 | 1 | 2h04m | 314.3M | 2.5M
2026-03-03 18:30 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 62.9% | 56 | 31 | 7200s | 1 | 2 | 2h03m | 425.4M | 3.4M
2026-03-03 15:54 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 27 | 7200s | 3 | 1 | 2h05m | 404.9M | 3.1M
2026-03-03 12:25 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 61.8% | 55 | 32 | 7200s | 4 | 2 | 2h08m | 358.8M | 2.1M
2026-03-03 02:15 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 44 | 7200s | 1 | 0 | 2h06m | 78.0M | 1.7M
2026-03-02 23:43 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 42 | 7200s | 3 | 1 | 2h18m | 91.0M | 2.2M
2026-03-02 18:37 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 42 | 7200s | 2 | 0 | 2h11m | 97.2M | 2.3M
2026-03-02 15:22 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 58.4% | 52 | 37 | 7200s | 4 | 0 | 2h12m | 111.8M | 2.2M
2026-03-02 13:05 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 42 | 7200s | 3 | 0 | 2h10m | 81.5M | 2.0M
2026-03-02 12:32 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 41 | 7200s | 8 | 2 | 3h51m | 398.8M | 1.9M
2026-03-02 08:42 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 50.6% | 45 | 42 | 7200s | 3 | 2 | 3h49m | 306.0M | 1.6M
2026-03-02 07:46 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 42 | - | 21 | 2 | 1h48m | 186.8M | 2.5M
2026-03-02 06:37 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 47 | 7200s | 2 | 2 | 2h05m | 341.2M | 1.8M
2026-03-02 05:59 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 41 | - | 19 | 3 | 1h46m | 198.0M | 2.1M
2026-03-02 04:30 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 39 | - | 16 | 2 | 1h28m | 219.7M | 2.5M
2026-03-02 04:28 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 42 | 7200s | 6 | 1 | 2h08m | 314.7M | 1.7M
2026-03-02 03:01 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 39 | - | 16 | 2 | 1h28m | 232.3M | 2.2M
2026-03-02 01:57 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 42 | - | 19 | 1 | 1h03m | 188.2M | 2.5M
2026-03-02 01:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 48 | 7200s | 4 | 1 | 2h37m | 330.8M | 1.9M
2026-03-01 22:50 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 33.7% | 30 | 58 | 7200s | 4 | 1 | 2h15m | 91.5M | 1.1M
2026-03-01 20:38 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 29.2% | 26 | 63 | 7200s | 2 | 0 | 2h11m | 84.4M | 1.1M
2026-03-01 18:26 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 34.8% | 31 | 56 | 7200s | 4 | 2 | 2h12m | 83.5M | 1.2M
2026-03-01 16:11 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 29.2% | 26 | 63 | 7200s | 3 | 0 | 2h14m | 99.5M | 1.1M
2026-03-01 13:43 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 31.5% | 28 | 61 | 7200s | 3 | 0 | 2h27m | 97.2M | 1.1M
2026-02-28 07:31 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 7200s | 7 | 0 | 2h09m | 246.3M | 2.1M
2026-02-28 06:08 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 75.3% | 67 | 22 | 7200s | 2 | 0 | 2h03m | 186.1M | 1.8M
2026-02-28 04:06 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 39 | 7200s | 10 | 0 | 3h24m | 212.7M | 2.1M
2026-02-28 04:03 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 28 | 7200s | 4 | 0 | 2h05m | 219.1M | 1.8M
2026-02-28 00:41 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 74.2% | 66 | 23 | 7200s | 2 | 0 | 2h24m | 237.4M | 1.6M
2026-02-27 22:33 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 71.9% | 64 | 25 | 7200s | 1 | 0 | 2h08m | 230.0M | 1.8M
2026-02-27 20:29 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 73.0% | 65 | 24 | 7200s | 1 | 0 | 2h03m | 163.1M | 1.4M
2026-02-27 11:54 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 51.7% | 46 | 43 | 7200s | 4 | 0 | 3h33m | 81.2M | 1.1M
2026-02-27 09:47 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 47.2% | 42 | 47 | 7200s | 3 | 0 | 2h06m | 80.9M | 915K
2026-02-27 08:53 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 7200s | 7 | 0 | 2h06m | 204.6M | 2.1M
2026-02-27 07:32 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 56.2% | 50 | 39 | 7200s | 4 | 0 | 2h14m | 63.8M | 831K
2026-02-27 05:25 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 52.8% | 47 | 41 | 7200s | 1 | 1 | 2h07m | 91.4M | 983K
2026-02-27 05:19 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 7200s | 7 | 0 | 3h34m | 234.5M | 2.0M
2026-02-27 03:17 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 47.2% | 42 | 47 | 7200s | 2 | 0 | 2h07m | 88.5M | 1.1M
2026-02-27 03:12 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 7200s | 5 | 0 | 2h06m | 199.3M | 1.9M

The impact of time(outs) on agentic benchmarks

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

In this post, we’ll look at why timeouts are necessary and why I locked WolfBench at 1 hour.

When we launched WolfBench, a four-metric evaluation framework built on Terminal-Bench 2.0, we started with a fixed 2-hour timeout per task. After analyzing nearly 10,000 task results across 8 models, I’m changing that to 1 hour. Here’s why—and why it matters for anyone who cares about meaningful AI benchmarks.

Why agentic benchmarks need timeouts

Terminal-Bench 2.0 is a real-world agentic benchmark: 89 tasks spanning system administration, DevOps, security, data/ML operations, and problem-solving. Models must plan, execute shell commands, inspect results, debug failures, and iterate—like a real technical professional. No multiple-choice tricks. No toy puzzles. Just real work in sandboxed environments.

But when a model is free to act, it can get stuck in a loop, retrying the same failing approach endlessly, burning tokens and compute for nothing. Timeouts exist to prevent that.
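To make the idea concrete, here is a minimal Python sketch of a wall-clock cap around an agent loop. It is not how Terminal-Bench or Harbor enforce timeouts; `run_with_deadline`, `FakeClock`, and `failing_step` are hypothetical names, and the sketch only checks the deadline between steps rather than interrupting a step in progress.

```python
import time
from typing import Callable, Tuple


def run_with_deadline(
    step: Callable[[], bool],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, int]:
    """Call step() until it reports success or the wall-clock budget is spent.

    Returns ("pass" | "timeout", number of steps attempted).
    """
    deadline = clock() + timeout_s
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if step():
            return "pass", attempts
    return "timeout", attempts


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()


def failing_step() -> bool:
    # Simulates an agent retrying the same failing approach for 10 minutes.
    clock.advance(600)
    return False


outcome, attempts = run_with_deadline(failing_step, timeout_s=3600, clock=clock)
print(outcome, attempts)
```

Without the deadline check, the simulated agent would retry forever; with a 3600-second budget it is cut off after six ten-minute attempts.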

The problem with per-task timeouts

By default, Terminal-Bench 2.0 assigns each task its own timeout, ranging from 10 minutes to over 3 hours. The majority—45 out of 89 tasks—get just 15 minutes.

This severely penalizes slower models and endpoints. Even if a model could solve a task, a temporarily overloaded API or a slightly slower inference speed can prevent it from finishing in time. When you’re evaluating model capability, you don’t want to measure endpoint performance.

That’s why some recent benchmarks (e.g., MiniMax-M2.5, GLM-5) use fixed 2-hour timeouts for every task. Some, like Kimi K2.5, even disable reasoning/thinking—because models that invest in up-front planning over immediate execution can actually score worse in time-constrained agentic benchmarks. (This is also why you always need to read the fine print: two benchmarks with the same name can produce very different results depending on how they were configured.)

So why not just use 2 hours?

Because the data says 1 hour is the sweet spot.

I analyzed 9,636 task results across all models and configurations in my evaluation runs: every pass, every fail, every actual duration. Then I asked a simple question: At each possible timeout cap, how many successful task completions would we lose?

Timeout Cap | Passes Lost | % of Total
15 min | 846 / 4,708 | 18.0% – way too tight
30 min | 395 / 4,708 | 8.4% – still too aggressive
60 min | 136 / 4,708 | 2.9% – the sweet spot
90 min | 76 / 4,708 | 1.6% – diminishing returns
120 min | 1 / 4,708 | 0.0% – almost nothing gained

Going from 60 to 120 minutes saves just 135 additional passes out of nearly 5,000. That’s a 2.9% gain—in exchange for doubling the maximum time an agent can spend looping on a hopeless task.
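The cap analysis can be sketched in a few lines of Python. This assumes each task result is available as a `(passed, duration_minutes)` pair; the function name `passes_lost` and the toy records are hypothetical, and only the final arithmetic check uses figures from the table above.

```python
from typing import Iterable, Tuple


def passes_lost(results: Iterable[Tuple[bool, float]], cap_min: float) -> Tuple[int, int]:
    """Return (passes lost at this cap, total passes).

    A pass counts as lost if the successful attempt took longer than cap_min.
    """
    total = 0
    lost = 0
    for passed, duration_min in results:
        if passed:
            total += 1
            if duration_min > cap_min:
                lost += 1
    return lost, total


# Toy records: (passed, duration in minutes). Made-up values.
toy_results = [
    (True, 12.0), (True, 45.0), (True, 75.0),
    (False, 120.0), (True, 110.0), (False, 5.0),
]
for cap in (60, 90, 120):
    print(cap, passes_lost(toy_results, cap))

# Arithmetic check using the reported table values.
lost_at_60, lost_at_120, total_passes = 136, 1, 4708
recovered = lost_at_60 - lost_at_120
share_pct = round(100 * recovered / total_passes, 1)
print(recovered, share_pct)
```

Failed attempts are ignored on purpose: the question is only how many successes a cap would have cut off, not how long failures ran.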

A model that can solve a task typically does so well within 60 minutes. A model that’s still looping after an hour is usually not going to find the answer in hour two—it’s just going to burn more tokens failing the same way.

The cost of generous timeouts

And “looping on hopeless tasks” isn’t hypothetical. Across all evaluated runs, I measured 921.9 hours of wasted compute—time spent on attempts that ran past their default timeout and still failed. The worst offenders burned 40–60 hours each, producing nothing.
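One way such a figure could be tallied is sketched below in Python. The record layout, the name `wasted_hours`, and the toy values are hypothetical. The sketch also assumes that the full duration of a failed attempt counts as wasted once it exceeds the task's default timeout; the post does not say whether the full duration or only the excess was counted.

```python
from typing import Iterable, Tuple

# (passed, duration in minutes, default timeout in minutes) -- hypothetical layout
Record = Tuple[bool, float, float]


def wasted_hours(records: Iterable[Record]) -> float:
    """Sum the duration of failed attempts that ran past their default timeout."""
    wasted_min = 0.0
    for passed, duration_min, default_timeout_min in records:
        if not passed and duration_min > default_timeout_min:
            # Assumption: the whole attempt counts as wasted, not just the excess.
            wasted_min += duration_min
    return wasted_min / 60.0


toy_records = [
    (False, 120.0, 15.0),  # failed, ran far past its default timeout
    (True, 40.0, 15.0),    # passed after its default timeout: not wasted
    (False, 10.0, 15.0),   # failed within its default timeout: not counted
    (False, 90.0, 60.0),   # failed past its default timeout
]
print(wasted_hours(toy_records))
```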

Longer timeouts don’t just waste time. They waste tokens, inflate costs, and—in extreme cases—can fill up disk space when models generate output in tight failure loops. The 2-hour cap doubles the blast radius of every one of these failure modes compared to 1 hour.

What 1 hour actually loses

The 136 passes lost at a 60-minute cap come primarily from just a handful of tasks: mailman (22 lost passes), path-tracing (10), and rstan-to-pystan (9). These are tasks with shorter default timeouts that occasionally need over an hour to succeed. Meanwhile, the task with the longest default timeout in the entire benchmark—build-pov-ray at 200 minutes—has never needed more than 34 minutes to succeed. And sam-cell-seg, with its 120-minute default? It has a 0% pass rate across all attempts. Extra time doesn’t help what can’t be solved.

The verdict

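A fixed 1-hour timeout per task is the optimal configuration for WolfBench. It keeps all but 2.9% of successful completions (136 of 4,708 passes lost) while halving the maximum time an agent can spend looping on a hopeless task.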

For a benchmark designed to measure what models can do—not how performant the infrastructure is—that’s the right call.

WolfBench is now locked at 1 hour per task. All future evaluations will use this configuration, and I’ve already begun re-running existing models under the new timeout for consistency.

Analysis based on 9,636 task results across 6 models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Kimi K2.5, MiniMax-M2.5, GLM-5).
Inference sponsored by CoreWeave. Sandbox compute by Daytona. Built with Harbor, Terminal-Bench 2.0, and W&B Weave.