Wolfram Ravenwolf’s Four-Metric Framework · based on Terminal-Bench 2.0
Most benchmarks report just a single average. WolfBench shows four metrics: the rock-solid base you can always count on, the average you can expect, the best a single run achieved, and the ceiling of what’s theoretically possible. The spread between them tells you how consistent – or how unpredictable – an AI agent really is.
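To make the four metrics concrete, here is a minimal sketch of how they could be computed from repeated runs. The definitions are an interpretation of the description above and not WolfBench's published formulas: "base" is read as tasks passed in every run, "ceiling" as tasks passed in at least one run. The function name `four_metrics` and the toy data are hypothetical.

```python
from statistics import mean

def four_metrics(runs: list[set[str]], total_tasks: int) -> dict[str, float]:
    """Summarize repeated runs as four percentages.

    `runs` holds, for each run, the set of task names that passed.
    Assumed definitions (an interpretation, not an official spec):
      base    = tasks passed in every run
      average = mean number of passes per run
      best    = most passes in any single run
      ceiling = tasks passed in at least one run
    """
    if not runs:
        raise ValueError("need at least one run")

    def pct(n: float) -> float:
        return round(100 * n / total_tasks, 1)

    always = set.intersection(*runs)
    ever = set.union(*runs)
    return {
        "base": pct(len(always)),
        "average": pct(mean(len(r) for r in runs)),
        "best": pct(max(len(r) for r in runs)),
        "ceiling": pct(len(ever)),
    }

# Hypothetical data: three runs over a 4-task benchmark.
runs = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
print(four_metrics(runs, total_tasks=4))
```

A wide gap between base and ceiling in such a summary would indicate an agent that can solve many tasks but does not do so reliably.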
| Date | Agent | Provider | Vendor | Model | Think | Score | Pass | Fail | Timeout cap | Timed out | Errors | Duration | Tokens in | Tokens out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026-03-09 07:46 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 3600s | 7 | 0 | 1h04m | 29.5M | 288K |
| 2026-03-09 07:00 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 69.7% | 62 | 26 | 3600s | 8 | 1 | 1h08m | 146.4M | 1.6M |
| 2026-03-09 06:38 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 32.6% | 29 | 60 | 3600s | 11 | 0 | 1h07m | 35.7M | 345K |
| 2026-03-09 05:53 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 71.9% | 64 | 25 | 3600s | 5 | 0 | 1h06m | 151.5M | 1.5M |
| 2026-03-09 05:33 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 31.5% | 28 | 61 | 3600s | 7 | 0 | 1h04m | 34.4M | 326K |
| 2026-03-09 04:27 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 30.3% | 27 | 62 | 3600s | 7 | 0 | 1h06m | 42.8M | 322K |
| 2026-03-09 04:13 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 70.8% | 63 | 25 | 3600s | 8 | 1 | 1h40m | 175.6M | 1.7M |
| 2026-03-09 03:21 | OpenClaw (2026.3.1) | openai | openai | gpt-5.4 | - | 29.2% | 26 | 63 | 3600s | 5 | 0 | 1h05m | 29.6M | 333K |
| 2026-03-09 03:07 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 28 | 3600s | 4 | 0 | 1h05m | 153.0M | 1.4M |
| 2026-03-08 19:15 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h26m | 204.5M | 1.7M |
| 2026-03-08 17:46 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 48.3% | 43 | 46 | 3600s | 12 | 0 | 1h29m | 193.4M | 1.7M |
| 2026-03-08 16:04 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 1h41m | 236.0M | 1.7M |
| 2026-03-08 14:26 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 75.3% | 67 | 22 | 3600s | 3 | 0 | 1h03m | 155.2M | 1.4M |
| 2026-03-08 13:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 49.4% | 44 | 44 | 3600s | 13 | 1 | 2h12m | 195.4M | 1.7M |
| 2026-03-08 12:24 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 46.1% | 41 | 48 | 3600s | 15 | 0 | 1h26m | 197.7M | 1.7M |
| 2026-03-08 12:12 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 3600s | 10 | 0 | 1h07m | 259.5M | 2.2M |
| 2026-03-08 10:25 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h46m | 216.2M | 1.9M |
| 2026-03-08 09:53 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 37.1% | 33 | 55 | 3600s | 13 | 1 | 2h30m | 192.2M | 1.6M |
| 2026-03-08 09:18 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 13 | 0 | 1h06m | 192.5M | 2.0M |
| 2026-03-08 08:31 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 39.3% | 35 | 53 | 3600s | 13 | 1 | 1h21m | 188.4M | 1.6M |
| 2026-03-08 08:10 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 6 | 0 | 1h08m | 189.4M | 1.9M |
| 2026-03-08 07:14 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 34.8% | 31 | 58 | 3600s | 7 | 0 | 1h16m | 228.8M | 1.6M |
| 2026-03-08 06:21 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 64.0% | 57 | 32 | 3600s | 8 | 0 | 1h48m | 151.2M | 1.9M |
| 2026-03-08 05:38 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 49 | 3600s | 11 | 0 | 1h35m | 176.6M | 1.4M |
| 2026-03-08 05:12 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h09m | 97.6M | 1.4M |
| 2026-03-08 04:09 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi-K2.5 | - | 38.2% | 34 | 54 | 3600s | 6 | 1 | 1h29m | 171.4M | 1.6M |
| 2026-03-08 04:03 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 57.3% | 51 | 38 | 3600s | 7 | 0 | 1h08m | 78.6M | 1.3M |
| 2026-03-08 02:53 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 56.2% | 50 | 39 | 3600s | 10 | 0 | 1h10m | 74.2M | 1.3M |
| 2026-03-08 01:44 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 37 | 3600s | 8 | 0 | 1h08m | 73.8M | 1.3M |
| 2026-03-08 00:37 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 37 | 3600s | 5 | 0 | 1h07m | 83.4M | 1.3M |
| 2026-03-08 00:33 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 44.9% | 40 | 48 | 3600s | 12 | 1 | 2h06m | 726.8M | 1.0M |
| 2026-03-07 23:28 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 41 | 3600s | 3 | 2 | 1h08m | 95.3M | 2.1M |
| 2026-03-07 23:25 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 43.8% | 39 | 50 | 3600s | 14 | 0 | 1h08m | 667.0M | 905K |
| 2026-03-07 22:16 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 42.7% | 38 | 51 | 3600s | 12 | 0 | 1h08m | 707.0M | 878K |
| 2026-03-07 22:07 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h20m | 86.8M | 2.0M |
| 2026-03-07 20:58 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 39 | 3600s | 3 | 0 | 1h09m | 78.6M | 2.0M |
| 2026-03-07 20:10 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 41.6% | 37 | 51 | 3600s | 15 | 1 | 2h05m | 759.6M | 939K |
| 2026-03-07 19:47 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 43 | 3600s | 2 | 0 | 1h10m | 71.6M | 2.0M |
| 2026-03-07 18:57 | Terminus-2 (2.0.0) | openai | openai | gpt-5.4 | - | 47.2% | 42 | 46 | 3600s | 14 | 1 | 1h12m | 775.5M | 982K |
| 2026-03-07 18:15 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 48.3% | 43 | 46 | 3600s | 6 | 0 | 1h31m | 115.2M | 2.3M |
| 2026-03-07 16:53 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 67.4% | 60 | 28 | 3600s | 6 | 1 | 1h22m | 222.3M | 1.2M |
| 2026-03-07 15:47 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 62.9% | 56 | 33 | 3600s | 4 | 0 | 1h05m | 195.9M | 1.6M |
| 2026-03-07 14:27 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 36 | 3600s | 6 | 1 | 1h20m | 169.0M | 1.2M |
| 2026-03-07 13:18 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h09m | 188.9M | 1.2M |
| 2026-03-07 11:53 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 1h24m | 209.0M | 1.4M |
| 2026-03-07 10:13 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 41 | 3600s | 12 | 0 | 1h39m | 202.1M | 2.1M |
| 2026-03-07 09:08 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 57.3% | 51 | 38 | 3600s | 4 | 0 | 1h05m | 166.9M | 1.7M |
| 2026-03-07 08:02 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 3600s | 3 | 0 | 1h05m | 185.9M | 1.8M |
| 2026-03-07 06:56 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 57.3% | 51 | 38 | 3600s | 6 | 0 | 1h05m | 210.7M | 2.2M |
| 2026-03-07 04:37 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 38 | 3600s | 5 | 1 | 2h19m | 216.0M | 2.3M |
| 2026-03-06 03:38 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 22.5% | 20 | 69 | 7200s | 3 | 0 | 2h03m | 32.2M | 293K |
| 2026-03-06 01:22 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 34.8% | 31 | 58 | 7200s | 9 | 0 | 2h16m | 32.0M | 289K |
| 2026-03-05 23:18 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 7200s | 7 | 0 | 2h04m | 37.2M | 299K |
| 2026-03-05 21:12 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 34.8% | 31 | 58 | 7200s | 6 | 0 | 2h06m | 34.4M | 300K |
| 2026-03-05 19:07 | OpenClaw (2026.3.1) | openrouter | openai | gpt-5.4 | - | 28.1% | 25 | 64 | 7200s | 9 | 0 | 2h04m | 31.5M | 301K |
| 2026-03-04 12:30 | Claude Code (2.1.68) | anthropic | anthropic | claude-sonnet-4-6 | - | 55.1% | 49 | 40 | 7200s | 3 | 0 | 2h03m | 364.3M | 3.1M |
| 2026-03-04 10:16 | Claude Code (2.1.68) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 40 | 7200s | 3 | 2 | 2h13m | 411.8M | 4.1M |
| 2026-03-04 08:10 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 58.4% | 52 | 36 | 7200s | 2 | 1 | 2h05m | 428.8M | 4.1M |
| 2026-03-04 06:06 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 59.6% | 53 | 35 | 7200s | 3 | 1 | 2h03m | 347.5M | 4.1M |
| 2026-03-04 03:44 | Claude Code (2.1.66) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 40 | 7200s | 2 | 1 | 2h21m | 361.1M | 4.0M |
| 2026-03-04 00:57 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 66.3% | 59 | 29 | 7200s | 3 | 1 | 2h10m | 475.9M | 3.5M |
| 2026-03-03 22:18 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 58.4% | 52 | 36 | 7200s | 3 | 1 | 2h04m | 314.3M | 2.5M |
| 2026-03-03 18:30 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 62.9% | 56 | 31 | 7200s | 1 | 2 | 2h03m | 425.4M | 3.4M |
| 2026-03-03 15:54 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 27 | 7200s | 3 | 1 | 2h05m | 404.9M | 3.1M |
| 2026-03-03 12:25 | Claude Code (2.1.63) | anthropic | anthropic | claude-opus-4-6 | - | 61.8% | 55 | 32 | 7200s | 4 | 2 | 2h08m | 358.8M | 2.1M |
| 2026-03-03 02:15 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 44 | 7200s | 1 | 0 | 2h06m | 78.0M | 1.7M |
| 2026-03-02 23:43 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 42 | 7200s | 3 | 1 | 2h18m | 91.0M | 2.2M |
| 2026-03-02 18:37 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 42 | 7200s | 2 | 0 | 2h11m | 97.2M | 2.3M |
| 2026-03-02 15:22 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 58.4% | 52 | 37 | 7200s | 4 | 0 | 2h12m | 111.8M | 2.2M |
| 2026-03-02 13:05 | OpenClaw (2026.3.1) | anthropic | anthropic | claude-sonnet-4-6 | - | 52.8% | 47 | 42 | 7200s | 3 | 0 | 2h10m | 81.5M | 2.0M |
| 2026-03-02 12:32 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 41 | 7200s | 8 | 2 | 3h51m | 398.8M | 1.9M |
| 2026-03-02 08:42 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 50.6% | 45 | 42 | 7200s | 3 | 2 | 3h49m | 306.0M | 1.6M |
| 2026-03-02 07:46 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 42 | - | 21 | 2 | 1h48m | 186.8M | 2.5M |
| 2026-03-02 06:37 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 47 | 7200s | 2 | 2 | 2h05m | 341.2M | 1.8M |
| 2026-03-02 05:59 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 50.6% | 45 | 41 | - | 19 | 3 | 1h46m | 198.0M | 2.1M |
| 2026-03-02 04:30 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 39 | - | 16 | 2 | 1h28m | 219.7M | 2.5M |
| 2026-03-02 04:28 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 51.7% | 46 | 42 | 7200s | 6 | 1 | 2h08m | 314.7M | 1.7M |
| 2026-03-02 03:01 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 53.9% | 48 | 39 | - | 16 | 2 | 1h28m | 232.3M | 2.2M |
| 2026-03-02 01:57 | Claude Code (2.1.63) | anthropic | anthropic | claude-sonnet-4-6 | - | 51.7% | 46 | 42 | - | 19 | 1 | 1h03m | 188.2M | 2.5M |
| 2026-03-02 01:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi-K2.5 | - | 44.9% | 40 | 48 | 7200s | 4 | 1 | 2h37m | 330.8M | 1.9M |
| 2026-03-01 22:50 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 33.7% | 30 | 58 | 7200s | 4 | 1 | 2h15m | 91.5M | 1.1M |
| 2026-03-01 20:38 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 29.2% | 26 | 63 | 7200s | 2 | 0 | 2h11m | 84.4M | 1.1M |
| 2026-03-01 18:26 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 34.8% | 31 | 56 | 7200s | 4 | 2 | 2h12m | 83.5M | 1.2M |
| 2026-03-01 16:11 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 29.2% | 26 | 63 | 7200s | 3 | 0 | 2h14m | 99.5M | 1.1M |
| 2026-03-01 13:43 | OpenClaw (2026.2.17) | wandb | moonshotai | Kimi-K2.5 | - | 31.5% | 28 | 61 | 7200s | 3 | 0 | 2h27m | 97.2M | 1.1M |
| 2026-02-28 07:31 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 7200s | 7 | 0 | 2h09m | 246.3M | 2.1M |
| 2026-02-28 06:08 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 75.3% | 67 | 22 | 7200s | 2 | 0 | 2h03m | 186.1M | 1.8M |
| 2026-02-28 04:06 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 56.2% | 50 | 39 | 7200s | 10 | 0 | 3h24m | 212.7M | 2.1M |
| 2026-02-28 04:03 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 68.5% | 61 | 28 | 7200s | 4 | 0 | 2h05m | 219.1M | 1.8M |
| 2026-02-28 00:41 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 74.2% | 66 | 23 | 7200s | 2 | 0 | 2h24m | 237.4M | 1.6M |
| 2026-02-27 22:33 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 71.9% | 64 | 25 | 7200s | 1 | 0 | 2h08m | 230.0M | 1.8M |
| 2026-02-27 20:29 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-opus-4-6 | - | 73.0% | 65 | 24 | 7200s | 1 | 0 | 2h03m | 163.1M | 1.4M |
| 2026-02-27 11:54 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 51.7% | 46 | 43 | 7200s | 4 | 0 | 3h33m | 81.2M | 1.1M |
| 2026-02-27 09:47 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 47.2% | 42 | 47 | 7200s | 3 | 0 | 2h06m | 80.9M | 915K |
| 2026-02-27 08:53 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 7200s | 7 | 0 | 2h06m | 204.6M | 2.1M |
| 2026-02-27 07:32 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 56.2% | 50 | 39 | 7200s | 4 | 0 | 2h14m | 63.8M | 831K |
| 2026-02-27 05:25 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 52.8% | 47 | 41 | 7200s | 1 | 1 | 2h07m | 91.4M | 983K |
| 2026-02-27 05:19 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 61.8% | 55 | 34 | 7200s | 7 | 0 | 3h34m | 234.5M | 2.0M |
| 2026-02-27 03:17 | OpenClaw (2026.2.17) | anthropic | anthropic | claude-opus-4-6 | - | 47.2% | 42 | 47 | 7200s | 2 | 0 | 2h07m | 88.5M | 1.1M |
| 2026-02-27 03:12 | Terminus-2 (2.0.0) | anthropic | anthropic | claude-sonnet-4-6 | - | 62.9% | 56 | 33 | 7200s | 5 | 0 | 2h06m | 199.3M | 1.9M |
by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.
In this post, we’ll look at why timeouts are necessary and why I locked WolfBench at 1 hour.
When we launched WolfBench, a four-metric evaluation framework built on Terminal-Bench 2.0, we started with a fixed 2-hour timeout per task. After analyzing nearly 10,000 task results across 6 models, I’m changing that to 1 hour. Here’s why, and why it matters for anyone who cares about meaningful AI benchmarks.
Terminal-Bench 2.0 is a real-world agentic benchmark: 89 tasks spanning system administration, DevOps, security, data/ML operations, and problem-solving. Models must plan, execute shell commands, inspect results, debug failures, and iterate—like a real technical professional. No multiple-choice tricks. No toy puzzles. Just real work in sandboxed environments.
But when a model is free to act, it can get stuck in a loop, retrying the same failing approach endlessly, burning tokens and compute for nothing. Timeouts exist to prevent that.
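As a rough illustration of what a timeout does, the sketch below caps a child process with Python's standard `subprocess.run(..., timeout=...)`. It is not how Harbor or Terminal-Bench enforce their limits; the helper `run_with_timeout` and the stand-in "agent" commands are hypothetical.

```python
import subprocess
import sys

def run_with_timeout(cmd: list[str], timeout_s: float) -> str:
    """Run `cmd` and classify the outcome as 'pass', 'fail', or 'timeout'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"

# Stand-in for an agent stuck in a loop: it would sleep for 30 seconds.
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
# Stand-in for an agent that finishes quickly.
quick = [sys.executable, "-c", "print('done')"]

print(run_with_timeout(stuck, timeout_s=0.5))
print(run_with_timeout(quick, timeout_s=30))
```

The cap bounds the cost of the stuck case without affecting the quick one, which is the trade-off the rest of this post tries to quantify.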
By default, Terminal-Bench 2.0 assigns each task its own timeout, ranging from 10 minutes to over 3 hours. The majority—45 out of 89 tasks—get just 15 minutes.
This severely penalizes slower models and endpoints. Even if a model could solve a task, a temporarily overloaded API or a slightly slower inference speed can prevent it from finishing in time. When you’re evaluating model capability, you don’t want to measure endpoint performance.
That’s why some recent benchmark reports (e.g., for MiniMax-M2.5 and GLM-5) use a fixed 2-hour timeout for every task. Some, like the one for Kimi K2.5, even disable reasoning/thinking, because models that invest in up-front planning over immediate execution can actually score worse in time-constrained agentic benchmarks. (This is also why you always need to read the fine print: two benchmarks with the same name can produce very different results depending on how they were configured.)
So why 1 hour rather than 2? Because the data says 1 hour is the sweet spot.
I analyzed 9,636 task results across all models and configurations in my evaluation runs: every pass, every fail, every actual duration. Then I asked a simple question: At each possible timeout cap, how many successful task completions would we lose?
| Timeout Cap | Passes Lost | % of Total |
|---|---|---|
| 15 min | 846 / 4,708 | 18.0% – way too tight |
| 30 min | 395 / 4,708 | 8.4% – still too aggressive |
| 60 min | 136 / 4,708 | 2.9% – the sweet spot |
| 90 min | 76 / 4,708 | 1.6% – diminishing returns |
| 120 min | 1 / 4,708 | 0.0% – almost nothing gained |
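The question behind this table can be sketched in a few lines: for each candidate cap, count the passing attempts whose actual duration exceeded it. This is a minimal reconstruction under assumptions, not the original analysis script; the `TaskResult` record, the `passes_lost` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str
    passed: bool
    minutes: float  # actual wall-clock duration of the attempt

def passes_lost(results: list[TaskResult], cap_minutes: float) -> tuple[int, int]:
    """Return (passes that took longer than the cap, total passes)."""
    passes = [r for r in results if r.passed]
    lost = sum(1 for r in passes if r.minutes > cap_minutes)
    return lost, len(passes)

# Hypothetical results, not taken from the WolfBench dataset.
results = [
    TaskResult("t1", True, 4.0),
    TaskResult("t2", True, 22.5),
    TaskResult("t3", True, 71.0),
    TaskResult("t4", False, 120.0),  # failures never count as lost passes
    TaskResult("t5", True, 48.0),
]

for cap in (15, 30, 60, 90, 120):
    lost, total = passes_lost(results, cap)
    print(f"{cap:>3} min: {lost}/{total} passes lost ({100 * lost / total:.1f}%)")
```

Only passing attempts enter the count, so a failed attempt that ran for two hours does not change the result at any cap.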
Going from 60 to 120 minutes recovers just 135 additional passes out of nearly 5,000. That’s a 2.9% gain, in exchange for doubling the maximum time an agent can spend looping on a hopeless task.
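The percentages follow directly from the counts in the table. A quick check, using only the published numbers (the variable names are just for illustration):

```python
TOTAL_PASSES = 4708
# Passes lost at each timeout cap in minutes, copied from the table above.
lost_at = {15: 846, 30: 395, 60: 136, 90: 76, 120: 1}

for cap, lost in lost_at.items():
    print(f"{cap:>3} min: {100 * lost / TOTAL_PASSES:.1f}% of passes lost")

# Passes recovered by raising the cap from 60 to 120 minutes.
recovered = lost_at[60] - lost_at[120]
recovered_pct = f"{100 * recovered / TOTAL_PASSES:.1f}"
print(f"60 -> 120 min recovers {recovered} passes ({recovered_pct}%)")
```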
A model that can solve a task typically does so well within 60 minutes. A model that’s still looping after an hour is usually not going to find the answer in hour two—it’s just going to burn more tokens failing the same way.
And “looping on hopeless tasks” isn’t hypothetical. Across all evaluated runs, I measured 921.9 hours of wasted compute—time spent on attempts that ran past their default timeout and still failed. The worst offenders burned 40–60 hours each, producing nothing.
Longer timeouts don’t just waste time. They waste tokens, inflate costs, and—in extreme cases—can fill up disk space when models generate output in tight failure loops. The 2-hour cap doubles the blast radius of every one of these failure modes compared to 1 hour.
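The wasted-compute measurement described above could be computed along these lines. This is a sketch under a stated assumption: the full duration of an attempt counts as wasted if it ran past the task's default timeout and still failed. The original analysis may instead count only the excess time. The `Attempt` record, `wasted_hours`, and the sample data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    task: str
    passed: bool
    minutes: float              # actual duration of the attempt
    default_timeout_min: float  # the task's default Terminal-Bench timeout

def wasted_hours(attempts: list[Attempt]) -> float:
    """Hours spent on attempts that outran their default timeout and failed.

    Assumption: the whole attempt duration is counted, not only the part
    beyond the default timeout.
    """
    wasted_minutes = sum(
        a.minutes
        for a in attempts
        if not a.passed and a.minutes > a.default_timeout_min
    )
    return wasted_minutes / 60

# Hypothetical attempts for illustration only.
attempts = [
    Attempt("t1", False, 120.0, 15.0),  # counted: overran and failed
    Attempt("t2", True, 90.0, 15.0),    # not counted: it passed
    Attempt("t3", False, 10.0, 15.0),   # not counted: failed within default
    Attempt("t4", False, 60.0, 30.0),   # counted: overran and failed
]
print(wasted_hours(attempts))
```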
The 136 passes lost at a 60-minute cap come primarily from just a handful of tasks: mailman (22 lost passes), path-tracing (10), and rstan-to-pystan (9). These are tasks with shorter default timeouts that occasionally need over an hour to succeed. Meanwhile, the task with the longest default timeout in the entire benchmark—build-pov-ray at 200 minutes—has never needed more than 34 minutes to succeed. And sam-cell-seg, with its 120-minute default? It has a 0% pass rate across all attempts. Extra time doesn’t help what can’t be solved.
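A fixed 1-hour timeout per task is the optimal configuration for WolfBench. It keeps 97.1% of the passes in this dataset while halving the maximum time an agent can spend looping on a hopeless task, compared with the 2-hour cap.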
For a benchmark designed to measure what models can do—not how performant the infrastructure is—that’s the right call.
WolfBench is now locked at 1 hour per task. All future evaluations will use this configuration, and I’ve already begun re-running existing models under the new timeout for consistency.
Analysis based on 9,636 task results across 6 models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Kimi K2.5, MiniMax-M2.5, GLM-5).
Inference sponsored by CoreWeave. Sandbox compute by Daytona. Built with Harbor, Terminal-Bench 2.0, and W&B Weave.