Wolfram Ravenwolf’s Five-Metric Framework · based on Terminal-Bench 2.0
Most benchmarks report a single average. WolfBench shows five metrics that tell the full story – from the rock-solid base of tasks solved every time, through the average, up to the ceiling of everything ever solved – plus the best and worst single runs that frame the spread. Together, they reveal what no single number can: how consistent an AI agent truly is.
Across these runs, 88 (99%) of the 89 tasks were solved at least once, 1 (1%) was solved every time, and 1 (1%) was never solved.
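
The five metrics described above can be illustrated with a small sketch. It is not WolfBench's actual implementation: the function name `five_metrics`, the dictionary keys, and the toy task IDs are all assumptions made for illustration. It takes one set of solved task IDs per run and derives the base (solved every time), the worst and best single runs, the average, and the ceiling (solved at least once).

```python
from statistics import mean

def five_metrics(runs, total_tasks):
    """Hypothetical helper: `runs` is a list of sets, each holding the
    task IDs solved in one run of the same agent/model configuration."""
    if not runs:
        raise ValueError("need at least one run")
    always = set.intersection(*runs)  # tasks solved in every run
    ever = set.union(*runs)           # tasks solved in at least one run
    per_run = [len(solved) / total_tasks for solved in runs]
    return {
        "base": len(always) / total_tasks,
        "worst": min(per_run),
        "average": mean(per_run),
        "best": max(per_run),
        "ceiling": len(ever) / total_tasks,
    }

# Toy data (not real results): three runs over four hypothetical tasks.
runs = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
metrics = five_metrics(runs, total_tasks=4)
```

With this toy input, only task `a` is solved in every run, so the base is one task out of four, while the ceiling counts `a`, `b`, and `c`. The gap between base and ceiling is what a single average would hide.
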
| Date | Agent | Provider | Vendor | Model | Think | Score | Pass | Fail | Timeout limit | Timeouts | Errors | Duration | In | Out |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2026-04-06 07:52 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 3 | 1 | 1h40m | 172.0M | 2.2M |
| 2026-04-06 06:11 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 28.1% | 25 | 63 | 3600s | 2 | 1 | 1h40m | 174.2M | 1.5M |
| 2026-04-06 05:06 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h05m | 96.6M | 2.0M |
| 2026-04-06 03:25 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 2 | 1 | 1h40m | 156.2M | 1.9M |
| 2026-04-06 02:22 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h02m | 223.3M | 2.4M |
| 2026-04-06 01:18 | OpenClaw (2026.3.11) | | | Gemini 3.1 Flash Lite Preview | - | 20.2% | 18 | 71 | 3600s | 5 | 0 | 1h04m | 239.4M | 837K |
| 2026-04-05 23:37 | OpenClaw (2026.3.11) | | | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 7 | 1 | 1h40m | 177.6M | 745K |
| 2026-04-05 21:45 | OpenClaw (2026.3.11) | | | Gemini 3.1 Flash Lite Preview | - | 22.5% | 20 | 69 | 3600s | 4 | 0 | 1h51m | 162.3M | 741K |
| 2026-04-05 20:04 | OpenClaw (2026.3.11) | | | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 5 | 1 | 1h40m | 255.9M | 830K |
| 2026-04-05 17:06 | OpenClaw (2026.3.11) | | | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 6 | 0 | 2h57m | 123.1M | 697K |
| 2026-04-05 11:05 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 51 | 3600s | 4 | 1 | 1h40m | 284.7M | 1.2M |
| 2026-04-05 09:59 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 52 | 3600s | 5 | 0 | 1h05m | 270.6M | 1.2M |
| 2026-04-05 08:19 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 3 | 1 | 1h40m | 310.4M | 1.3M |
| 2026-04-05 07:15 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 48.3% | 43 | 45 | 3600s | 4 | 1 | 1h03m | 490.5M | 1.6M |
| 2026-04-05 06:09 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 43.8% | 39 | 50 | 3600s | 5 | 0 | 1h05m | 252.4M | 1.2M |
| 2026-04-05 04:28 | OpenClaw (2026.3.11) | | | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 210.1M | 472K |
| 2026-04-05 02:47 | OpenClaw (2026.3.11) | | | Gemini 3 Flash Preview | - | 36.0% | 32 | 56 | 3600s | 7 | 1 | 1h40m | 377.6M | 653K |
| 2026-04-05 01:06 | OpenClaw (2026.3.11) | | | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 9 | 1 | 1h40m | 265.0M | 753K |
| 2026-04-04 23:25 | OpenClaw (2026.3.11) | | | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 7 | 1 | 1h40m | 210.4M | 670K |
| 2026-04-04 21:44 | OpenClaw (2026.3.11) | | | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 920.6M | 1.2M |
| 2026-04-04 08:43 | Terminus-2 (2.0.0) | openrouter | | gemma-4-31b-it | - | 32.6% | 29 | 58 | 3600s | 14 | 2 | 2h54m | 24.6M | 770K |
| 2026-04-04 05:40 | Terminus-2 (2.0.0) | openrouter | | gemma-4-31b-it | - | 30.3% | 27 | 56 | 3600s | 16 | 6 | 3h03m | 16.1M | 602K |
| 2026-04-04 00:26 | Terminus-2 (2.0.0) | openrouter | | gemma-4-31b-it | - | 29.5% | 26 | 62 | 3600s | 8 | 1 | 14h50m | 25.8M | 888K |
| 2026-04-03 22:03 | Terminus-2 (2.0.0) | openrouter | | gemma-4-31b-it | - | 37.1% | 33 | 53 | 3600s | 16 | 3 | 2h23m | 25.5M | 753K |
| 2026-04-03 20:12 | Terminus-2 (2.0.0) | openrouter | | gemma-4-31b-it | - | 27.0% | 24 | 64 | 3600s | 20 | 1 | 1h50m | 34.5M | 930K |
| 2026-04-03 11:44 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 44 | 3600s | 0 | 0 | 0h31m | 15.4M | 653K |
| 2026-04-03 11:08 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 0 | 0 | 0h35m | 15.6M | 568K |
| 2026-04-03 10:41 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 52.8% | 47 | 42 | 3600s | 0 | 0 | 0h26m | 23.7M | 700K |
| 2026-04-03 09:01 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 43 | 3600s | 1 | 1 | 1h40m | 13.1M | 634K |
| 2026-04-03 07:58 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 48.3% | 43 | 45 | 3600s | 1 | 1 | 1h02m | 19.4M | 670K |
| 2026-04-03 06:53 | OpenClaw (2026.3.11) | | | Gemini 3.1 Pro Preview | - | 59.6% | 53 | 36 | 3600s | 6 | 0 | 1h05m | 228.7M | 638K |
| 2026-04-03 05:46 | OpenClaw (2026.3.11) | | | Gemini 3.1 Pro Preview | - | 57.3% | 51 | 38 | 3600s | 5 | 0 | 1h06m | 226.5M | 748K |
| 2026-04-03 04:05 | OpenClaw (2026.3.11) | | | Gemini 3.1 Pro Preview | - | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h40m | 131.0M | 652K |
| 2026-04-03 02:24 | OpenClaw (2026.3.11) | | | Gemini 3.1 Pro Preview | - | 62.9% | 56 | 32 | 3600s | 8 | 1 | 1h40m | 239.2M | 696K |
| 2026-04-03 01:18 | OpenClaw (2026.3.11) | | | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 6 | 0 | 1h06m | 102.8M | 485K |
| 2026-04-02 12:00 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 61.8% | 55 | 31 | 3600s | 5 | 3 | 6h17m | 68.4M | 1.0M |
| 2026-04-02 07:39 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 39.3% | 35 | 52 | 3600s | 6 | 2 | 5h23m | 75.8M | 1.6M |
| 2026-04-02 07:08 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 64.0% | 57 | 32 | 3600s | 5 | 0 | 4h51m | 75.6M | 1.0M |
| 2026-04-02 05:24 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 70.8% | 63 | 26 | 3600s | 2 | 0 | 2h38m | 80.2M | 996K |
| 2026-04-02 03:25 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 40.4% | 36 | 52 | 3600s | 6 | 1 | 4h13m | 82.0M | 1.7M |
| 2026-04-02 02:55 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 65.2% | 58 | 31 | 3600s | 2 | 0 | 2h28m | 70.8M | 960K |
| 2026-04-02 00:41 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 6h25m | 76.1M | 1.2M |
| 2026-04-01 23:30 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 39.3% | 35 | 53 | 3600s | 5 | 1 | 3h54m | 65.4M | 1.3M |
| 2026-04-01 20:09 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 66.3% | 59 | 30 | 3600s | 3 | 0 | 2h22m | 86.4M | 1.0M |
| 2026-04-01 19:52 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 44.9% | 40 | 49 | 3600s | 4 | 0 | 3h38m | 77.7M | 1.6M |
| 2026-04-01 19:49 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 61.8% | 55 | 34 | 3600s | 4 | 0 | 4h51m | 69.4M | 1.1M |
| 2026-04-01 17:47 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 65.2% | 58 | 30 | 3600s | 3 | 1 | 2h21m | 67.2M | 900K |
| 2026-04-01 14:45 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 42.7% | 38 | 50 | 3600s | 7 | 1 | 5h06m | 88.1M | 1.6M |
| 2026-04-01 14:44 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 64.0% | 57 | 31 | 3600s | 1 | 1 | 3h02m | 64.5M | 847K |
| 2026-04-01 14:44 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 64.0% | 57 | 31 | 3600s | 3 | 1 | 5h04m | 69.5M | 1.1M |
| 2026-03-29 07:01 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 39.3% | 35 | 45 | 3600s | 6 | 9 | 2h57m | 133.6M | 1.1M |
| 2026-03-29 04:05 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 37.1% | 33 | 51 | 3600s | 1 | 5 | 2h55m | 91.3M | 923K |
| 2026-03-29 01:00 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 38.2% | 34 | 50 | 3600s | 3 | 5 | 3h04m | 104.7M | 861K |
| 2026-03-27 19:54 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 31.5% | 28 | 53 | 3600s | 5 | 8 | 3h07m | 102.7M | 923K |
| 2026-03-27 16:16 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 37.1% | 33 | 47 | 3600s | 2 | 9 | 3h37m | 90.2M | 797K |
| 2026-03-27 13:20 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 42.7% | 38 | 47 | 3600s | 0 | 4 | 2h55m | 69.4M | 984K |
| 2026-03-27 11:08 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 37.1% | 33 | 49 | 3600s | 1 | 7 | 2h12m | 72.7M | 1.0M |
| 2026-03-27 08:24 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 37.1% | 33 | 50 | 3600s | 0 | 6 | 2h42m | 66.5M | 885K |
| 2026-03-27 06:53 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 50.6% | 45 | 42 | 3600s | 24 | 2 | 1h31m | 74.9M | 1.4M |
| 2026-03-27 04:47 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 43.8% | 39 | 45 | 3600s | 25 | 5 | 2h05m | 114.6M | 1.6M |
| 2026-03-27 02:58 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 49.4% | 44 | 43 | 3600s | 17 | 2 | 1h48m | 84.9M | 1.5M |
| 2026-03-26 12:31 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 47.2% | 42 | 45 | 3600s | 7 | 2 | 1h47m | 306.4M | 2.0M |
| 2026-03-26 09:37 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 33.7% | 30 | 53 | 3600s | 3 | 6 | 2h51m | 72.7M | 969K |
| 2026-03-26 07:14 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 32.6% | 29 | 52 | 3600s | 1 | 8 | 2h22m | 68.1M | 996K |
| 2026-03-26 06:07 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 41.6% | 37 | 50 | 3600s | 31 | 2 | 1h06m | 64.1M | 1.5M |
| 2026-03-26 04:26 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 49.4% | 44 | 42 | 3600s | 22 | 3 | 1h41m | 74.7M | 1.4M |
| 2026-03-20 22:45 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 | - | 52.8% | 47 | 42 | 7200s | 20 | 0 | 3h15m | 147.7M | 2.1M |
| 2026-03-20 16:16 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 | - | 47.2% | 42 | 45 | 7200s | 16 | 2 | 3h39m | 127.8M | 2.1M |
| 2026-03-20 06:43 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 18 | 0 | 1h33m | 245.0M | 2.4M |
| 2026-03-20 03:45 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 39 | 3600s | 16 | 1 | 2h57m | 337.1M | 2.5M |
| 2026-03-20 02:30 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 7 | 0 | 1h14m | 135.9M | 2.4M |
| 2026-03-20 00:17 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 48.3% | 43 | 45 | 3600s | 6 | 1 | 2h12m | 104.3M | 2.2M |
| 2026-03-19 13:39 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 20.2% | 18 | 71 | 3600s | 32 | 0 | 2h01m | 1321.4M | 2.6M |
| 2026-03-19 12:31 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 52.8% | 47 | 42 | 3600s | 15 | 0 | 1h31m | 317.7M | 2.7M |
| 2026-03-19 12:08 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 23.6% | 21 | 68 | 3600s | 28 | 0 | 1h31m | 1076.7M | 2.2M |
| 2026-03-19 10:50 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 47.2% | 42 | 46 | 3600s | 18 | 1 | 1h40m | 249.8M | 2.5M |
| 2026-03-19 10:43 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 23.6% | 21 | 68 | 3600s | 21 | 0 | 1h24m | 1085.3M | 2.0M |
| 2026-03-19 09:29 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 25.8% | 23 | 59 | 3600s | 1 | 7 | 1h54m | 147.8M | 1.1M |
| 2026-03-19 09:19 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 40 | 3600s | 19 | 0 | 1h30m | 192.0M | 2.4M |
| 2026-03-19 09:18 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 27.0% | 24 | 65 | 3600s | 24 | 0 | 1h24m | 845.5M | 1.3M |
| 2026-03-19 08:13 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 21.3% | 19 | 70 | 3600s | 4 | 0 | 1h15m | 455.3M | 1.6M |
| 2026-03-19 07:38 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 46.1% | 41 | 47 | 3600s | 4 | 1 | 1h40m | 100.8M | 2.3M |
| 2026-03-19 07:37 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 25.8% | 23 | 65 | 3600s | 21 | 1 | 1h40m | 810.4M | 1.3M |
| 2026-03-19 06:59 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 23.6% | 21 | 68 | 3600s | 4 | 0 | 1h13m | 232.2M | 1.5M |
| 2026-03-19 05:57 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 42.7% | 38 | 50 | 3600s | 6 | 1 | 1h40m | 113.1M | 2.5M |
| 2026-03-19 05:57 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 25.8% | 23 | 66 | 3600s | 17 | 0 | 1h39m | 847.8M | 1.3M |
| 2026-03-19 05:53 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 18.0% | 16 | 72 | 3600s | 4 | 1 | 1h05m | 110.5M | 772K |
| 2026-03-19 04:56 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 12.4% | 11 | 78 | 3600s | 1 | 0 | 1h00m | 25.2M | 156K |
| 2026-03-19 04:47 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 16.9% | 15 | 74 | 3600s | 6 | 0 | 1h05m | 120.9M | 842K |
| 2026-03-19 04:11 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 41.6% | 37 | 51 | 3600s | 3 | 1 | 1h45m | 126.5M | 2.3M |
| 2026-03-19 03:54 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 19.4M | 143K |
| 2026-03-19 03:24 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 15.7% | 14 | 75 | 3600s | 7 | 0 | 1h23m | 115.9M | 758K |
| 2026-03-19 02:53 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 16.9% | 15 | 74 | 3600s | 1 | 0 | 1h00m | 13.4M | 123K |
| 2026-03-18 08:05 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 10.1% | 9 | 80 | 3600s | 2 | 0 | 1h02m | 20.3M | 170K |
| 2026-03-18 07:03 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 14.6% | 13 | 76 | 3600s | 3 | 0 | 1h02m | 16.7M | 159K |
| 2026-03-18 06:01 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 14.6% | 13 | 76 | 3600s | 2 | 0 | 1h02m | 21.0M | 164K |
| 2026-03-18 04:58 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 18.0% | 16 | 73 | 3600s | 1 | 0 | 1h02m | 19.4M | 162K |
| 2026-03-18 03:56 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 16.9M | 155K |
| 2026-03-16 13:58 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 43 | 3600s | 13 | 2 | 2h15m | 361.5M | 2.7M |
| 2026-03-16 11:53 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 2h03m | 285.8M | 2.5M |
| 2026-03-16 10:27 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 7 | 0 | 1h25m | 117.6M | 3.4M |
| 2026-03-16 09:17 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 6 | 0 | 1h10m | 65.6M | 2.5M |
| 2026-03-16 07:45 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 10 | 0 | 1h31m | 72.0M | 2.8M |
| 2026-03-16 06:04 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 44 | 3600s | 9 | 1 | 1h40m | 87.2M | 2.7M |
| 2026-03-16 04:23 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 43.8% | 39 | 49 | 3600s | 6 | 1 | 1h40m | 116.1M | 3.4M |
| 2026-03-16 01:49 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 31.5% | 28 | 61 | 3600s | 21 | 0 | 2h01m | 153.9M | 4.0M |
| 2026-03-15 23:55 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h54m | 150.4M | 3.9M |
| 2026-03-15 22:13 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 31.5% | 28 | 61 | 3600s | 16 | 0 | 1h41m | 132.0M | 3.9M |
| 2026-03-15 20:17 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h55m | 151.3M | 4.0M |
| 2026-03-15 18:06 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 39.3% | 35 | 54 | 3600s | 22 | 0 | 2h10m | 177.4M | 4.3M |
| 2026-03-15 01:29 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 56.2% | 50 | 38 | 3600s | 7 | 1 | 1h40m | 75.5M | 1.4M |
| 2026-03-14 23:48 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 52.8% | 47 | 41 | 3600s | 10 | 1 | 1h40m | 87.4M | 1.7M |
| 2026-03-14 22:09 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 36 | 3600s | 8 | 0 | 1h39m | 76.4M | 1.7M |
| 2026-03-14 20:31 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 37 | 3600s | 7 | 0 | 1h37m | 90.1M | 1.6M |
| 2026-03-14 19:40 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 19.1% | 17 | 71 | 3600s | 4 | 1 | 1h40m | 72.6M | 765K |
| 2026-03-14 18:50 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 35 | 3600s | 5 | 1 | 1h40m | 100.0M | 1.9M |
| 2026-03-14 18:18 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 23.6% | 21 | 68 | 3600s | 7 | 0 | 1h22m | 83.9M | 1.0M |
| 2026-03-14 17:45 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 5 | 1 | 1h04m | 146.6M | 1.5M |
| 2026-03-14 17:09 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 16.9% | 15 | 74 | 3600s | 4 | 0 | 1h08m | 54.7M | 773K |
| 2026-03-14 16:38 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 57.3% | 51 | 37 | 3600s | 9 | 1 | 1h07m | 135.0M | 1.5M |
| 2026-03-14 15:33 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 21.3% | 19 | 69 | 3600s | 8 | 1 | 1h35m | 75.5M | 961K |
| 2026-03-14 15:20 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 6 | 1 | 1h17m | 132.5M | 1.8M |
| 2026-03-14 14:20 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 20.2% | 18 | 70 | 3600s | 4 | 1 | 1h12m | 62.0M | 967K |
| 2026-03-14 14:05 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h15m | 146.6M | 1.6M |
| 2026-03-14 12:32 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 9 | 1 | 1h32m | 176.0M | 1.4M |
| 2026-03-14 10:34 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 55.1% | 49 | 39 | 3600s | 21 | 1 | 1h57m | 77.8M | 2.6M |
| 2026-03-14 08:48 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 16 | 1 | 1h45m | 61.7M | 2.5M |
| 2026-03-14 07:02 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 16 | 0 | 1h45m | 82.0M | 2.3M |
| 2026-03-14 04:57 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 61.8% | 55 | 34 | 3600s | 15 | 0 | 2h04m | 75.0M | 2.3M |
| 2026-03-14 04:40 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 39.3% | 35 | 54 | 3600s | 7 | 0 | 1h04m | - | - |
| 2026-03-14 03:22 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 44.9% | 40 | 48 | 3600s | 10 | 1 | 1h18m | - | - |
| 2026-03-14 02:54 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 19 | 0 | 2h02m | 73.6M | 2.3M |
| 2026-03-14 02:06 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 44.9% | 40 | 49 | 3600s | 10 | 0 | 1h15m | - | - |
| 2026-03-14 01:02 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 50.6% | 45 | 44 | 3600s | 5 | 0 | 1h04m | - | - |
| 2026-03-13 23:48 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 49.4% | 44 | 44 | 3600s | 12 | 1 | 1h13m | - | - |
| 2026-03-12 21:45 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 59.6% | 53 | 35 | 3600s | 12 | 1 | 1h40m | 57.9M | 577K |
| 2026-03-12 20:33 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 61.8% | 55 | 34 | 3600s | 11 | 0 | 1h10m | 81.0M | 613K |
| 2026-03-12 19:23 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h10m | 70.8M | 602K |
| 2026-03-12 18:18 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 59.6% | 53 | 36 | 3600s | 10 | 0 | 1h05m | 67.8M | 618K |
| 2026-03-12 17:08 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 66.3% | 59 | 30 | 3600s | 10 | 0 | 1h09m | 79.2M | 603K |
| 2026-03-12 12:49 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h12m | 141.0M | 1.7M |
| 2026-03-12 11:27 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 11 | 0 | 1h21m | 135.4M | 1.7M |
| 2026-03-12 10:27 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 67.4% | 60 | 28 | 3600s | 11 | 1 | 2h15m | 14.7M | 6.9M |
| 2026-03-12 10:16 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h10m | 147.5M | 1.8M |
| 2026-03-12 09:03 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 12 | 0 | 1h12m | 156.4M | 1.9M |
| 2026-03-12 08:39 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 73.0% | 65 | 24 | 3600s | 11 | 0 | 1h47m | 13.5M | 6.3M |
| 2026-03-12 07:38 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 10 | 0 | 1h25m | 145.6M | 1.6M |
| 2026-03-12 06:17 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 64.0% | 57 | 31 | 3600s | 10 | 1 | 2h21m | 12.4M | 6.1M |
| 2026-03-12 04:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 8 | 0 | 1h19m | 10.3M | 5.5M |
| 2026-03-12 03:25 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 11 | 0 | 1h31m | 13.3M | 5.7M |
| 2026-03-10 12:31 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 47.2% | 42 | 47 | 3600s | 12 | 0 | 1h27m | 114.4M | 1.8M |
| 2026-03-10 11:30 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 5 | 0 | 1h04m | 33.1M | 360K |
| 2026-03-10 10:15 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 49.4% | 44 | 44 | 3600s | 11 | 1 | 2h15m | 140.6M | 2.2M |
| 2026-03-10 09:49 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 55.1% | 49 | 39 | 3600s | 8 | 1 | 1h40m | 31.8M | 339K |
| 2026-03-10 08:43 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 56.2% | 50 | 38 | 3600s | 7 | 1 | 1h05m | 35.6M | 356K |
| 2026-03-10 08:42 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 46.1% | 41 | 48 | 3600s | 13 | 0 | 1h32m | 138.5M | 2.0M |
| 2026-03-10 07:38 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 56.2% | 50 | 39 | 3600s | 5 | 0 | 1h04m | 31.6M | 371K |
| 2026-03-10 06:29 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 8 | 0 | 1h09m | 30.3M | 351K |
| 2026-03-10 06:25 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 46.1% | 41 | 47 | 3600s | 13 | 1 | 2h17m | 116.0M | 2.1M |
| 2026-03-10 05:24 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 50.6% | 45 | 44 | 3600s | 8 | 0 | 1h04m | - | - |
| 2026-03-10 04:58 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 37.1% | 33 | 56 | 3600s | 14 | 0 | 1h26m | 181.2M | 1.5M |
| 2026-03-10 04:18 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 6 | 0 | 1h05m | - | - |
| 2026-03-10 03:33 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 38.2% | 34 | 55 | 3600s | 10 | 0 | 1h25m | 144.8M | 1.3M |
| 2026-03-10 03:14 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 51.7% | 46 | 43 | 3600s | 6 | 0 | 1h04m | - | - |
| 2026-03-10 02:10 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h04m | - | - |
| 2026-03-10 01:37 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 33.7% | 30 | 59 | 3600s | 9 | 0 | 1h55m | 167.6M | 1.4M |
| 2026-03-10 01:06 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 49.4% | 44 | 45 | 3600s | 6 | 0 | 1h03m | - | - |
| 2026-03-10 00:20 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 38.2% | 34 | 55 | 3600s | 10 | 0 | 1h16m | 237.4M | 1.6M |
| 2026-03-09 23:12 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 37.1% | 33 | 55 | 3600s | 10 | 1 | 1h07m | 92.9M | 1.1M |
| 2026-03-09 22:57 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 46.1% | 41 | 48 | 3600s | 5 | 0 | 1h03m | - | - |
| 2026-03-09 21:38 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 51.7% | 46 | 42 | 3600s | 9 | 1 | 1h18m | - | - |
| 2026-03-09 20:27 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 48.3% | 43 | 46 | 3600s | 7 | 0 | 1h10m | - | - |
| 2026-03-09 19:22 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 51.7% | 46 | 42 | 3600s | 5 | 1 | 1h05m | - | - |
| 2026-03-09 18:12 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 43.8% | 39 | 50 | 3600s | 5 | 0 | 1h09m | - | - |
| 2026-03-09 14:05 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 13 | 0 | 1h14m | 457.8M | 622K |
| 2026-03-09 12:59 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 41.6% | 37 | 52 | 3600s | 13 | 0 | 1h05m | 478.0M | 570K |
| 2026-03-09 11:33 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 17 | 0 | 1h25m | 628.9M | 671K |
| 2026-03-09 09:42 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 39.3% | 35 | 53 | 3600s | 11 | 1 | 1h50m | 515.0M | 646K |
| 2026-03-09 08:09 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 13 | 0 | 1h32m | 480.1M | 687K |
| 2026-03-09 07:46 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 28.1% | 25 | 64 | 3600s | 7 | 0 | 1h04m | 29.5M | 288K |
| 2026-03-09 07:00 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 69.7% | 62 | 26 | 3600s | 8 | 1 | 1h08m | 146.4M | 1.6M |
| 2026-03-09 06:38 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 32.6% | 29 | 60 | 3600s | 11 | 0 | 1h07m | 35.7M | 345K |
| 2026-03-09 05:53 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 71.9% | 64 | 25 | 3600s | 5 | 0 | 1h06m | 151.5M | 1.5M |
| 2026-03-09 05:33 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 31.5% | 28 | 61 | 3600s | 7 | 0 | 1h04m | 34.4M | 326K |
| 2026-03-09 04:27 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 30.3% | 27 | 62 | 3600s | 7 | 0 | 1h06m | 42.8M | 322K |
| 2026-03-09 04:13 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 70.8% | 63 | 25 | 3600s | 8 | 1 | 1h40m | 175.6M | 1.7M |
| 2026-03-09 03:21 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 29.2% | 26 | 63 | 3600s | 5 | 0 | 1h05m | 29.6M | 333K |
| 2026-03-09 03:07 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 68.5% | 61 | 28 | 3600s | 4 | 0 | 1h05m | 153.0M | 1.4M |
| 2026-03-08 19:15 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h26m | 204.5M | 1.7M |
| 2026-03-08 17:46 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 48.3% | 43 | 46 | 3600s | 12 | 0 | 1h29m | 193.4M | 1.7M |
| 2026-03-08 16:04 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 1h41m | 236.0M | 1.7M |
| 2026-03-08 14:26 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 75.3% | 67 | 22 | 3600s | 3 | 0 | 1h03m | 155.2M | 1.4M |
| 2026-03-08 13:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 49.4% | 44 | 44 | 3600s | 13 | 1 | 2h12m | 195.4M | 1.7M |
| 2026-03-08 12:24 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 46.1% | 41 | 48 | 3600s | 15 | 0 | 1h26m | 197.7M | 1.7M |
| 2026-03-08 12:12 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 61.8% | 55 | 34 | 3600s | 10 | 0 | 1h07m | 259.5M | 2.2M |
| 2026-03-08 10:25 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h46m | 216.2M | 1.9M |
| 2026-03-08 09:53 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 37.1% | 33 | 55 | 3600s | 13 | 1 | 2h30m | 192.2M | 1.6M |
| 2026-03-08 09:18 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 13 | 0 | 1h06m | 192.5M | 2.0M |
| 2026-03-08 08:31 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 39.3% | 35 | 53 | 3600s | 13 | 1 | 1h21m | 188.4M | 1.6M |
| 2026-03-08 08:10 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 6 | 0 | 1h08m | 189.4M | 1.9M |
| 2026-03-08 07:14 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 34.8% | 31 | 58 | 3600s | 7 | 0 | 1h16m | 228.8M | 1.6M |
| 2026-03-08 06:21 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 64.0% | 57 | 32 | 3600s | 8 | 0 | 1h48m | 151.2M | 1.9M |
| 2026-03-08 05:38 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 44.9% | 40 | 49 | 3600s | 11 | 0 | 1h35m | 176.6M | 1.4M |
| 2026-03-08 05:12 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h09m | 97.6M | 1.4M |
| 2026-03-08 04:09 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 38.2% | 34 | 54 | 3600s | 6 | 1 | 1h29m | 171.4M | 1.6M |
| 2026-03-08 04:03 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 57.3% | 51 | 38 | 3600s | 7 | 0 | 1h08m | 78.6M | 1.3M |
| 2026-03-08 02:53 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 56.2% | 50 | 39 | 3600s | 10 | 0 | 1h10m | 74.2M | 1.3M |
| 2026-03-08 01:44 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 37 | 3600s | 8 | 0 | 1h08m | 73.8M | 1.3M |
| 2026-03-08 00:37 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 37 | 3600s | 5 | 0 | 1h07m | 83.4M | 1.3M |
| 2026-03-08 00:33 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 44.9% | 40 | 48 | 3600s | 12 | 1 | 2h06m | 726.8M | 1.0M |
| 2026-03-07 23:28 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 51.7% | 46 | 41 | 3600s | 3 | 2 | 1h08m | 95.3M | 2.1M |
| 2026-03-07 23:25 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 43.8% | 39 | 50 | 3600s | 14 | 0 | 1h08m | 667.0M | 905K |
| 2026-03-07 22:16 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 42.7% | 38 | 51 | 3600s | 12 | 0 | 1h08m | 707.0M | 878K |
| 2026-03-07 22:07 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h20m | 86.8M | 2.0M |
| 2026-03-07 20:58 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 56.2% | 50 | 39 | 3600s | 3 | 0 | 1h09m | 78.6M | 2.0M |
| 2026-03-07 20:10 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 41.6% | 37 | 51 | 3600s | 15 | 1 | 2h05m | 759.6M | 939K |
| 2026-03-07 19:47 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 51.7% | 46 | 43 | 3600s | 2 | 0 | 1h10m | 71.6M | 2.0M |
| 2026-03-07 18:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 47.2% | 42 | 46 | 3600s | 14 | 1 | 1h12m | 775.5M | 982K |
| 2026-03-07 18:15 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 48.3% | 43 | 46 | 3600s | 6 | 0 | 1h31m | 115.2M | 2.3M |
| 2026-03-07 16:53 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 28 | 3600s | 6 | 1 | 1h22m | 222.3M | 1.2M |
| 2026-03-07 15:47 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 62.9% | 56 | 33 | 3600s | 4 | 0 | 1h05m | 195.9M | 1.6M |
| 2026-03-07 14:27 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 36 | 3600s | 6 | 1 | 1h20m | 169.0M | 1.2M |
| 2026-03-07 13:18 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h09m | 188.9M | 1.2M |
| 2026-03-07 11:53 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 1h24m | 209.0M | 1.4M |
| 2026-03-07 10:13 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 53.9% | 48 | 41 | 3600s | 12 | 0 | 1h39m | 202.1M | 2.1M |
| 2026-03-07 09:08 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 57.3% | 51 | 38 | 3600s | 4 | 0 | 1h05m | 166.9M | 1.7M |
| 2026-03-07 08:02 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 3 | 0 | 1h05m | 185.9M | 1.8M |
| 2026-03-07 06:56 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 57.3% | 51 | 38 | 3600s | 6 | 0 | 1h05m | 210.7M | 2.2M |
| 2026-03-07 04:37 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 56.2% | 50 | 38 | 3600s | 5 | 1 | 2h19m | 216.0M | 2.3M |
| 2026-02-18 01:33 | Terminus-2 (2.0.0) | openai | zai-org | GLM-5-FP8 | - | 50.6% | 45 | 43 | 7200s | 13 | 1 | 4h20m | 211.5M | 1.4M |
by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.
Welcome to WolfBench – we’re just getting started. What you see here is an early preview with only a handful of models and agents tested so far. We’re continuously expanding the lineup, running fresh evals, and sharing interesting findings and insights along the way. Watch this space.
AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?
A single score tells you almost nothing.
Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?
WolfBench exists because we got tired of meaningless leaderboards. We wanted to know which model, which agent, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.
WolfBench is an evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles. They span the kind of work you’d actually ask an AI agent to do.
The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.
Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Five numbers get a lot closer.
The ceiling is the union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.
It tells you the theoretical maximum performance this model is capable of with a given agent – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.
Best-of is the highest score from any individual run.
This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.
The average is the mean score across all valid runs.
This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.
Worst-of is the lowest score from any individual run.
This is the opposite of best-of – the floor, the worst case. The gap between worst-of and best-of defines the full score range across all runs. A narrow range means predictable performance; a wide range means you’re rolling dice. Dashed lines on the chart mark this range visually, connecting the worst-of floor to the best-of peak.
The solid base is the set of tasks that the model solves in every single run – the rock-solid base with zero variance.
The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and moderate average is often more reliable in practice than one with a high average but low solid base – because you know what you’re getting.
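The five definitions above can be sketched in a few lines of Python. This is a minimal illustration, not WolfBench's actual tooling: it assumes each run is represented as a set of solved task IDs, and the function name `five_metrics` and the toy task IDs are hypothetical. Scores are expressed as a percentage of all tasks, which matches the results table (for example, 51 passes out of 89 tasks gives 57.3%).

```python
from statistics import mean

TOTAL_TASKS = 89  # number of Terminal-Bench 2.0 tasks, as stated in the text


def five_metrics(runs: list[set[str]], total_tasks: int = TOTAL_TASKS) -> dict[str, float]:
    """Return the five metrics as percentages of `total_tasks`.

    `runs` holds one set of solved task IDs per run (an assumed data layout).
    """
    if not runs:
        raise ValueError("need at least one run")

    # Per-run score: share of all tasks solved in that run.
    scores = [100 * len(solved) / total_tasks for solved in runs]

    ceiling_tasks = set().union(*runs)          # solved in at least one run
    solid_base_tasks = set.intersection(*runs)  # solved in every run

    return {
        "ceiling": 100 * len(ceiling_tasks) / total_tasks,
        "best_of": max(scores),
        "average": mean(scores),
        "worst_of": min(scores),
        "solid_base": 100 * len(solid_base_tasks) / total_tasks,
    }


# Toy example with 5 tasks and 3 runs (made-up task IDs).
toy_runs = [{"a", "b"}, {"a", "c"}, {"a", "b", "d"}]
metrics = five_metrics(toy_runs, total_tasks=5)
for name, value in metrics.items():
    print(f"{name:>10}: {value:.1f}%")
```

In the toy example the ceiling ends up above the best single run, because tasks `c` and `d` are never solved in the same run. Only task `a` is solved every time, so it alone forms the solid base.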
The five metrics are shown for each model/configuration: four stacked bar segments plus the worst-of marker with dashed range lines. The spread between them tells you as much as the numbers themselves.
Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you five angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.
Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.
We will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.
This benchmark offers enormous potential for discovery. For instance: Why does xhigh reasoning improve GPT-5.4’s performance while max effort degrades Opus 4.6’s results? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect performance of local models in agentic tasks?
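The cost-effectiveness question can be made concrete with a small calculation. The sketch below is only an illustration: the function name is hypothetical, the token and pass counts are borrowed from two rows of the results table purely as realistic magnitudes, and the per-million-token prices are invented placeholders, not any vendor's real pricing. It also ignores prompt-caching discounts, which can change the picture considerably.

```python
def cost_per_solved_task(
    tokens_in_m: float,      # input tokens for the run, in millions
    tokens_out_m: float,     # output tokens for the run, in millions
    passed: int,             # number of tasks solved in the run
    price_in_per_m: float,   # HYPOTHETICAL price per million input tokens
    price_out_per_m: float,  # HYPOTHETICAL price per million output tokens
) -> float:
    """Total token cost of a run divided by the number of tasks it solved."""
    if passed <= 0:
        raise ValueError("run solved no tasks; cost per solved task is undefined")
    total_cost = tokens_in_m * price_in_per_m + tokens_out_m * price_out_per_m
    return total_cost / passed


# Run A: token-hungry, with a low (invented) per-token price.
run_a = cost_per_solved_task(726.8, 1.0, passed=40, price_in_per_m=1.0, price_out_per_m=4.0)

# Run B: far fewer tokens, with an (invented) price four times higher.
run_b = cost_per_solved_task(78.6, 1.3, passed=51, price_in_per_m=4.0, price_out_per_m=16.0)

print(f"Run A: {run_a:.2f} per solved task")
print(f"Run B: {run_b:.2f} per solved task")
```

Under these made-up prices, the run with the cheaper per-token rate still costs more per solved task, because it consumes roughly nine times as many input tokens.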
So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.
Inference sponsored by CoreWeave: The Essential Cloud for AI.
Sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.
Explore the complete data and tooling suite on our WolfBench GitHub.