WolfBench (2026-04-07)

Wolfram Ravenwolf’s Five-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report a single average. WolfBench shows five metrics that tell the full story – from the rock-solid base of tasks solved every time, through the average, up to the ceiling of everything ever solved – plus the best and worst single runs that frame the spread. Together, they reveal what no single number can: how consistent an AI agent truly is.

[Chart: WolfBench scores per model and agent, axis from 0% to 100%]
Legend: ★ Ceiling (ever solved) · ▲ Best-of (peak run) · ∅ Average (mean score) · ▼ Worst-of (lowest run) · ■ Solid (always solved)
Models: Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro Preview, GPT-5.3-Codex, MiniMax M2.7, GLM-5 FP8, Kimi K2.5 (int4), GLM-5-Turbo, Kimi K2.5 (nvfp4), MiniMax M2.5, Gemini 3 Flash Preview, NVIDIA-Nemotron-3-Super-120B-A12B-FP8, gemma-4-31b-it, GPT‑5.4 mini, Gemini 3.1 Flash Lite Preview, Mistral Small 4 119B A6B, GPT‑5.4 nano
Agents: T2 = Terminus-2 · CC = Claude Code · HA = Hermes Agent · OC = OpenClaw
Run Details (230 runs)

Across these runs, 88 (99%) of the 89 tasks were solved at least once, 1 (1%) was solved every time, and 1 (1%) was never solved.

Date | Agent | Provider | Vendor | Model | Think | Score | Pass | Fail | Timeout | Timeouts | Err | Duration | In | Out
2026-04-06 07:52 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 3 | 1 | 1h40m | 172.0M | 2.2M
2026-04-06 06:11 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 28.1% | 25 | 63 | 3600s | 2 | 1 | 1h40m | 174.2M | 1.5M
2026-04-06 05:06 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h05m | 96.6M | 2.0M
2026-04-06 03:25 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 2 | 1 | 1h40m | 156.2M | 1.9M
2026-04-06 02:22 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h02m | 223.3M | 2.4M
2026-04-06 01:18 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 20.2% | 18 | 71 | 3600s | 5 | 0 | 1h04m | 239.4M | 837K
2026-04-05 23:37 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 7 | 1 | 1h40m | 177.6M | 745K
2026-04-05 21:45 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 22.5% | 20 | 69 | 3600s | 4 | 0 | 1h51m | 162.3M | 741K
2026-04-05 20:04 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 5 | 1 | 1h40m | 255.9M | 830K
2026-04-05 17:06 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 6 | 0 | 2h57m | 123.1M | 697K
2026-04-05 11:05 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 51 | 3600s | 4 | 1 | 1h40m | 284.7M | 1.2M
2026-04-05 09:59 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 52 | 3600s | 5 | 0 | 1h05m | 270.6M | 1.2M
2026-04-05 08:19 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 3 | 1 | 1h40m | 310.4M | 1.3M
2026-04-05 07:15 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 48.3% | 43 | 45 | 3600s | 4 | 1 | 1h03m | 490.5M | 1.6M
2026-04-05 06:09 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 43.8% | 39 | 50 | 3600s | 5 | 0 | 1h05m | 252.4M | 1.2M
2026-04-05 04:28 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 210.1M | 472K
2026-04-05 02:47 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 36.0% | 32 | 56 | 3600s | 7 | 1 | 1h40m | 377.6M | 653K
2026-04-05 01:06 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 9 | 1 | 1h40m | 265.0M | 753K
2026-04-04 23:25 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 7 | 1 | 1h40m | 210.4M | 670K
2026-04-04 21:44 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 920.6M | 1.2M
2026-04-04 08:43 | Terminus-2 (2.0.0) | openrouter | google | gemma-4-31b-it | - | 32.6% | 29 | 58 | 3600s | 14 | 2 | 2h54m | 24.6M | 770K
2026-04-04 05:40 | Terminus-2 (2.0.0) | openrouter | google | gemma-4-31b-it | - | 30.3% | 27 | 56 | 3600s | 16 | 6 | 3h03m | 16.1M | 602K
2026-04-04 00:26 | Terminus-2 (2.0.0) | openrouter | google | gemma-4-31b-it | - | 29.5% | 26 | 62 | 3600s | 8 | 1 | 14h50m | 25.8M | 888K
2026-04-03 22:03 | Terminus-2 (2.0.0) | openrouter | google | gemma-4-31b-it | - | 37.1% | 33 | 53 | 3600s | 16 | 3 | 2h23m | 25.5M | 753K
2026-04-03 20:12 | Terminus-2 (2.0.0) | openrouter | google | gemma-4-31b-it | - | 27.0% | 24 | 64 | 3600s | 20 | 1 | 1h50m | 34.5M | 930K
2026-04-03 11:44 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 44 | 3600s | 0 | 0 | 0h31m | 15.4M | 653K
2026-04-03 11:08 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 0 | 0 | 0h35m | 15.6M | 568K
2026-04-03 10:41 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 52.8% | 47 | 42 | 3600s | 0 | 0 | 0h26m | 23.7M | 700K
2026-04-03 09:01 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 43 | 3600s | 1 | 1 | 1h40m | 13.1M | 634K
2026-04-03 07:58 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 48.3% | 43 | 45 | 3600s | 1 | 1 | 1h02m | 19.4M | 670K
2026-04-03 06:53 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 59.6% | 53 | 36 | 3600s | 6 | 0 | 1h05m | 228.7M | 638K
2026-04-03 05:46 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 57.3% | 51 | 38 | 3600s | 5 | 0 | 1h06m | 226.5M | 748K
2026-04-03 04:05 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h40m | 131.0M | 652K
2026-04-03 02:24 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 62.9% | 56 | 32 | 3600s | 8 | 1 | 1h40m | 239.2M | 696K
2026-04-03 01:18 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 6 | 0 | 1h06m | 102.8M | 485K
2026-04-02 12:00 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 61.8% | 55 | 31 | 3600s | 5 | 3 | 6h17m | 68.4M | 1.0M
2026-04-02 07:39 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 39.3% | 35 | 52 | 3600s | 6 | 2 | 5h23m | 75.8M | 1.6M
2026-04-02 07:08 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 64.0% | 57 | 32 | 3600s | 5 | 0 | 4h51m | 75.6M | 1.0M
2026-04-02 05:24 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 70.8% | 63 | 26 | 3600s | 2 | 0 | 2h38m | 80.2M | 996K
2026-04-02 03:25 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 40.4% | 36 | 52 | 3600s | 6 | 1 | 4h13m | 82.0M | 1.7M
2026-04-02 02:55 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 65.2% | 58 | 31 | 3600s | 2 | 0 | 2h28m | 70.8M | 960K
2026-04-02 00:41 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 6h25m | 76.1M | 1.2M
2026-04-01 23:30 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 39.3% | 35 | 53 | 3600s | 5 | 1 | 3h54m | 65.4M | 1.3M
2026-04-01 20:09 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 66.3% | 59 | 30 | 3600s | 3 | 0 | 2h22m | 86.4M | 1.0M
2026-04-01 19:52 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 44.9% | 40 | 49 | 3600s | 4 | 0 | 3h38m | 77.7M | 1.6M
2026-04-01 19:49 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 61.8% | 55 | 34 | 3600s | 4 | 0 | 4h51m | 69.4M | 1.1M
2026-04-01 17:47 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 65.2% | 58 | 30 | 3600s | 3 | 1 | 2h21m | 67.2M | 900K
2026-04-01 14:45 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 42.7% | 38 | 50 | 3600s | 7 | 1 | 5h06m | 88.1M | 1.6M
2026-04-01 14:44 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | - | 64.0% | 57 | 31 | 3600s | 1 | 1 | 3h02m | 64.5M | 847K
2026-04-01 14:44 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | - | 64.0% | 57 | 31 | 3600s | 3 | 1 | 5h04m | 69.5M | 1.1M
2026-03-29 07:01 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 39.3% | 35 | 45 | 3600s | 6 | 9 | 2h57m | 133.6M | 1.1M
2026-03-29 04:05 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 37.1% | 33 | 51 | 3600s | 1 | 5 | 2h55m | 91.3M | 923K
2026-03-29 01:00 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 38.2% | 34 | 50 | 3600s | 3 | 5 | 3h04m | 104.7M | 861K
2026-03-27 19:54 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 31.5% | 28 | 53 | 3600s | 5 | 8 | 3h07m | 102.7M | 923K
2026-03-27 16:16 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 | - | 37.1% | 33 | 47 | 3600s | 2 | 9 | 3h37m | 90.2M | 797K
2026-03-27 13:20 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 42.7% | 38 | 47 | 3600s | 0 | 4 | 2h55m | 69.4M | 984K
2026-03-27 11:08 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 37.1% | 33 | 49 | 3600s | 1 | 7 | 2h12m | 72.7M | 1.0M
2026-03-27 08:24 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 37.1% | 33 | 50 | 3600s | 0 | 6 | 2h42m | 66.5M | 885K
2026-03-27 06:53 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 50.6% | 45 | 42 | 3600s | 24 | 2 | 1h31m | 74.9M | 1.4M
2026-03-27 04:47 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 43.8% | 39 | 45 | 3600s | 25 | 5 | 2h05m | 114.6M | 1.6M
2026-03-27 02:58 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 49.4% | 44 | 43 | 3600s | 17 | 2 | 1h48m | 84.9M | 1.5M
2026-03-26 12:31 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (nvfp4) | - | 47.2% | 42 | 45 | 3600s | 7 | 2 | 1h47m | 306.4M | 2.0M
2026-03-26 09:37 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 33.7% | 30 | 53 | 3600s | 3 | 6 | 2h51m | 72.7M | 969K
2026-03-26 07:14 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 | - | 32.6% | 29 | 52 | 3600s | 1 | 8 | 2h22m | 68.1M | 996K
2026-03-26 06:07 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 41.6% | 37 | 50 | 3600s | 31 | 2 | 1h06m | 64.1M | 1.5M
2026-03-26 04:26 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 | - | 49.4% | 44 | 42 | 3600s | 22 | 3 | 1h41m | 74.7M | 1.4M
2026-03-20 22:45 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 | - | 52.8% | 47 | 42 | 7200s | 20 | 0 | 3h15m | 147.7M | 2.1M
2026-03-20 16:16 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 | - | 47.2% | 42 | 45 | 7200s | 16 | 2 | 3h39m | 127.8M | 2.1M
2026-03-20 06:43 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 18 | 0 | 1h33m | 245.0M | 2.4M
2026-03-20 03:45 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 39 | 3600s | 16 | 1 | 2h57m | 337.1M | 2.5M
2026-03-20 02:30 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 7 | 0 | 1h14m | 135.9M | 2.4M
2026-03-20 00:17 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 48.3% | 43 | 45 | 3600s | 6 | 1 | 2h12m | 104.3M | 2.2M
2026-03-19 13:39 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 20.2% | 18 | 71 | 3600s | 32 | 0 | 2h01m | 1321.4M | 2.6M
2026-03-19 12:31 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 52.8% | 47 | 42 | 3600s | 15 | 0 | 1h31m | 317.7M | 2.7M
2026-03-19 12:08 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 23.6% | 21 | 68 | 3600s | 28 | 0 | 1h31m | 1076.7M | 2.2M
2026-03-19 10:50 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 47.2% | 42 | 46 | 3600s | 18 | 1 | 1h40m | 249.8M | 2.5M
2026-03-19 10:43 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 nano | - | 23.6% | 21 | 68 | 3600s | 21 | 0 | 1h24m | 1085.3M | 2.0M
2026-03-19 09:29 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 25.8% | 23 | 59 | 3600s | 1 | 7 | 1h54m | 147.8M | 1.1M
2026-03-19 09:19 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 40 | 3600s | 19 | 0 | 1h30m | 192.0M | 2.4M
2026-03-19 09:18 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 27.0% | 24 | 65 | 3600s | 24 | 0 | 1h24m | 845.5M | 1.3M
2026-03-19 08:13 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 21.3% | 19 | 70 | 3600s | 4 | 0 | 1h15m | 455.3M | 1.6M
2026-03-19 07:38 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 46.1% | 41 | 47 | 3600s | 4 | 1 | 1h40m | 100.8M | 2.3M
2026-03-19 07:37 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 25.8% | 23 | 65 | 3600s | 21 | 1 | 1h40m | 810.4M | 1.3M
2026-03-19 06:59 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 23.6% | 21 | 68 | 3600s | 4 | 0 | 1h13m | 232.2M | 1.5M
2026-03-19 05:57 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 42.7% | 38 | 50 | 3600s | 6 | 1 | 1h40m | 113.1M | 2.5M
2026-03-19 05:57 | Terminus-2 (2.0.0) | openai | openai | GPT‑5.4 mini | - | 25.8% | 23 | 66 | 3600s | 17 | 0 | 1h39m | 847.8M | 1.3M
2026-03-19 05:53 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 18.0% | 16 | 72 | 3600s | 4 | 1 | 1h05m | 110.5M | 772K
2026-03-19 04:56 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 12.4% | 11 | 78 | 3600s | 1 | 0 | 1h00m | 25.2M | 156K
2026-03-19 04:47 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 16.9% | 15 | 74 | 3600s | 6 | 0 | 1h05m | 120.9M | 842K
2026-03-19 04:11 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 41.6% | 37 | 51 | 3600s | 3 | 1 | 1h45m | 126.5M | 2.3M
2026-03-19 03:54 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 19.4M | 143K
2026-03-19 03:24 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 15.7% | 14 | 75 | 3600s | 7 | 0 | 1h23m | 115.9M | 758K
2026-03-19 02:53 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 nano | - | 16.9% | 15 | 74 | 3600s | 1 | 0 | 1h00m | 13.4M | 123K
2026-03-18 08:05 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 10.1% | 9 | 80 | 3600s | 2 | 0 | 1h02m | 20.3M | 170K
2026-03-18 07:03 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 14.6% | 13 | 76 | 3600s | 3 | 0 | 1h02m | 16.7M | 159K
2026-03-18 06:01 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 14.6% | 13 | 76 | 3600s | 2 | 0 | 1h02m | 21.0M | 164K
2026-03-18 04:58 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 18.0% | 16 | 73 | 3600s | 1 | 0 | 1h02m | 19.4M | 162K
2026-03-18 03:56 | OpenClaw (2026.3.11) | openai | openai | GPT‑5.4 mini | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 16.9M | 155K
2026-03-16 13:58 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 43 | 3600s | 13 | 2 | 2h15m | 361.5M | 2.7M
2026-03-16 11:53 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 2h03m | 285.8M | 2.5M
2026-03-16 10:27 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 7 | 0 | 1h25m | 117.6M | 3.4M
2026-03-16 09:17 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 6 | 0 | 1h10m | 65.6M | 2.5M
2026-03-16 07:45 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 10 | 0 | 1h31m | 72.0M | 2.8M
2026-03-16 06:04 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 44 | 3600s | 9 | 1 | 1h40m | 87.2M | 2.7M
2026-03-16 04:23 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 43.8% | 39 | 49 | 3600s | 6 | 1 | 1h40m | 116.1M | 3.4M
2026-03-16 01:49 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 31.5% | 28 | 61 | 3600s | 21 | 0 | 2h01m | 153.9M | 4.0M
2026-03-15 23:55 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h54m | 150.4M | 3.9M
2026-03-15 22:13 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 31.5% | 28 | 61 | 3600s | 16 | 0 | 1h41m | 132.0M | 3.9M
2026-03-15 20:17 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h55m | 151.3M | 4.0M
2026-03-15 18:06 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 39.3% | 35 | 54 | 3600s | 22 | 0 | 2h10m | 177.4M | 4.3M
2026-03-15 01:29 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 56.2% | 50 | 38 | 3600s | 7 | 1 | 1h40m | 75.5M | 1.4M
2026-03-14 23:48 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 52.8% | 47 | 41 | 3600s | 10 | 1 | 1h40m | 87.4M | 1.7M
2026-03-14 22:09 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 36 | 3600s | 8 | 0 | 1h39m | 76.4M | 1.7M
2026-03-14 20:31 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 37 | 3600s | 7 | 0 | 1h37m | 90.1M | 1.6M
2026-03-14 19:40 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 19.1% | 17 | 71 | 3600s | 4 | 1 | 1h40m | 72.6M | 765K
2026-03-14 18:50 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 35 | 3600s | 5 | 1 | 1h40m | 100.0M | 1.9M
2026-03-14 18:18 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 23.6% | 21 | 68 | 3600s | 7 | 0 | 1h22m | 83.9M | 1.0M
2026-03-14 17:45 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 5 | 1 | 1h04m | 146.6M | 1.5M
2026-03-14 17:09 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 16.9% | 15 | 74 | 3600s | 4 | 0 | 1h08m | 54.7M | 773K
2026-03-14 16:38 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 57.3% | 51 | 37 | 3600s | 9 | 1 | 1h07m | 135.0M | 1.5M
2026-03-14 15:33 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 21.3% | 19 | 69 | 3600s | 8 | 1 | 1h35m | 75.5M | 961K
2026-03-14 15:20 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 6 | 1 | 1h17m | 132.5M | 1.8M
2026-03-14 14:20 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 | - | 20.2% | 18 | 70 | 3600s | 4 | 1 | 1h12m | 62.0M | 967K
2026-03-14 14:05 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h15m | 146.6M | 1.6M
2026-03-14 12:32 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 9 | 1 | 1h32m | 176.0M | 1.4M
2026-03-14 10:34 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 55.1% | 49 | 39 | 3600s | 21 | 1 | 1h57m | 77.8M | 2.6M
2026-03-14 08:48 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 16 | 1 | 1h45m | 61.7M | 2.5M
2026-03-14 07:02 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 16 | 0 | 1h45m | 82.0M | 2.3M
2026-03-14 04:57 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 61.8% | 55 | 34 | 3600s | 15 | 0 | 2h04m | 75.0M | 2.3M
2026-03-14 04:40 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 39.3% | 35 | 54 | 3600s | 7 | 0 | 1h04m | - | -
2026-03-14 03:22 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 44.9% | 40 | 48 | 3600s | 10 | 1 | 1h18m | - | -
2026-03-14 02:54 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 19 | 0 | 2h02m | 73.6M | 2.3M
2026-03-14 02:06 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 44.9% | 40 | 49 | 3600s | 10 | 0 | 1h15m | - | -
2026-03-14 01:02 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 50.6% | 45 | 44 | 3600s | 5 | 0 | 1h04m | - | -
2026-03-13 23:48 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | xhigh | 49.4% | 44 | 44 | 3600s | 12 | 1 | 1h13m | - | -
2026-03-12 21:45 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 59.6% | 53 | 35 | 3600s | 12 | 1 | 1h40m | 57.9M | 577K
2026-03-12 20:33 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 61.8% | 55 | 34 | 3600s | 11 | 0 | 1h10m | 81.0M | 613K
2026-03-12 19:23 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h10m | 70.8M | 602K
2026-03-12 18:18 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 59.6% | 53 | 36 | 3600s | 10 | 0 | 1h05m | 67.8M | 618K
2026-03-12 17:08 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | - | 66.3% | 59 | 30 | 3600s | 10 | 0 | 1h09m | 79.2M | 603K
2026-03-12 12:49 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h12m | 141.0M | 1.7M
2026-03-12 11:27 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 11 | 0 | 1h21m | 135.4M | 1.7M
2026-03-12 10:27 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 67.4% | 60 | 28 | 3600s | 11 | 1 | 2h15m | 14.7M | 6.9M
2026-03-12 10:16 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h10m | 147.5M | 1.8M
2026-03-12 09:03 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 12 | 0 | 1h12m | 156.4M | 1.9M
2026-03-12 08:39 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 73.0% | 65 | 24 | 3600s | 11 | 0 | 1h47m | 13.5M | 6.3M
2026-03-12 07:38 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 10 | 0 | 1h25m | 145.6M | 1.6M
2026-03-12 06:17 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 64.0% | 57 | 31 | 3600s | 10 | 1 | 2h21m | 12.4M | 6.1M
2026-03-12 04:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 8 | 0 | 1h19m | 10.3M | 5.5M
2026-03-12 03:25 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 11 | 0 | 1h31m | 13.3M | 5.7M
2026-03-10 12:31 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 47.2% | 42 | 47 | 3600s | 12 | 0 | 1h27m | 114.4M | 1.8M
2026-03-10 11:30 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 5 | 0 | 1h04m | 33.1M | 360K
2026-03-10 10:15 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 49.4% | 44 | 44 | 3600s | 11 | 1 | 2h15m | 140.6M | 2.2M
2026-03-10 09:49 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 55.1% | 49 | 39 | 3600s | 8 | 1 | 1h40m | 31.8M | 339K
2026-03-10 08:43 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 56.2% | 50 | 38 | 3600s | 7 | 1 | 1h05m | 35.6M | 356K
2026-03-10 08:42 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 46.1% | 41 | 48 | 3600s | 13 | 0 | 1h32m | 138.5M | 2.0M
2026-03-10 07:38 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 56.2% | 50 | 39 | 3600s | 5 | 0 | 1h04m | 31.6M | 371K
2026-03-10 06:29 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 8 | 0 | 1h09m | 30.3M | 351K
2026-03-10 06:25 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) | - | 46.1% | 41 | 47 | 3600s | 13 | 1 | 2h17m | 116.0M | 2.1M
2026-03-10 05:24 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 50.6% | 45 | 44 | 3600s | 8 | 0 | 1h04m | - | -
2026-03-10 04:58 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 37.1% | 33 | 56 | 3600s | 14 | 0 | 1h26m | 181.2M | 1.5M
2026-03-10 04:18 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 6 | 0 | 1h05m | - | -
2026-03-10 03:33 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 38.2% | 34 | 55 | 3600s | 10 | 0 | 1h25m | 144.8M | 1.3M
2026-03-10 03:14 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 51.7% | 46 | 43 | 3600s | 6 | 0 | 1h04m | - | -
2026-03-10 02:10 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h04m | - | -
2026-03-10 01:37 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 33.7% | 30 | 59 | 3600s | 9 | 0 | 1h55m | 167.6M | 1.4M
2026-03-10 01:06 | Claude Code (2.1.63) | openrouter | openai | GPT-5.3-Codex | - | 49.4% | 44 | 45 | 3600s | 6 | 0 | 1h03m | - | -
2026-03-10 00:20 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 38.2% | 34 | 55 | 3600s | 10 | 0 | 1h16m | 237.4M | 1.6M
2026-03-09 23:12 | OpenClaw (2026.3.1) | custom | custom | Kimi K2.5 (nvfp4) | - | 37.1% | 33 | 55 | 3600s | 10 | 1 | 1h07m | 92.9M | 1.1M
2026-03-09 22:57 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 46.1% | 41 | 48 | 3600s | 5 | 0 | 1h03m | - | -
2026-03-09 21:38 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 51.7% | 46 | 42 | 3600s | 9 | 1 | 1h18m | - | -
2026-03-09 20:27 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 48.3% | 43 | 46 | 3600s | 7 | 0 | 1h10m | - | -
2026-03-09 19:22 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 51.7% | 46 | 42 | 3600s | 5 | 1 | 1h05m | - | -
2026-03-09 18:12 | Claude Code (2.1.63) | openrouter | openai | GPT-5.4 | - | 43.8% | 39 | 50 | 3600s | 5 | 0 | 1h09m | - | -
2026-03-09 14:05 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 13 | 0 | 1h14m | 457.8M | 622K
2026-03-09 12:59 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 41.6% | 37 | 52 | 3600s | 13 | 0 | 1h05m | 478.0M | 570K
2026-03-09 11:33 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 17 | 0 | 1h25m | 628.9M | 671K
2026-03-09 09:42 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 39.3% | 35 | 53 | 3600s | 11 | 1 | 1h50m | 515.0M | 646K
2026-03-09 08:09 | Terminus-2 (2.0.0) | openai | openai | GPT-5.3-Codex | - | 38.2% | 34 | 55 | 3600s | 13 | 0 | 1h32m | 480.1M | 687K
2026-03-09 07:46 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 28.1% | 25 | 64 | 3600s | 7 | 0 | 1h04m | 29.5M | 288K
2026-03-09 07:00 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 69.7% | 62 | 26 | 3600s | 8 | 1 | 1h08m | 146.4M | 1.6M
2026-03-09 06:38 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 32.6% | 29 | 60 | 3600s | 11 | 0 | 1h07m | 35.7M | 345K
2026-03-09 05:53 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 71.9% | 64 | 25 | 3600s | 5 | 0 | 1h06m | 151.5M | 1.5M
2026-03-09 05:33 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 31.5% | 28 | 61 | 3600s | 7 | 0 | 1h04m | 34.4M | 326K
2026-03-09 04:27 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 30.3% | 27 | 62 | 3600s | 7 | 0 | 1h06m | 42.8M | 322K
2026-03-09 04:13 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 70.8% | 63 | 25 | 3600s | 8 | 1 | 1h40m | 175.6M | 1.7M
2026-03-09 03:21 | OpenClaw (2026.3.1) | openai | openai | GPT-5.4 | - | 29.2% | 26 | 63 | 3600s | 5 | 0 | 1h05m | 29.6M | 333K
2026-03-09 03:07 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 68.5% | 61 | 28 | 3600s | 4 | 0 | 1h05m | 153.0M | 1.4M
2026-03-08 19:15 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h26m | 204.5M | 1.7M
2026-03-08 17:46 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 48.3% | 43 | 46 | 3600s | 12 | 0 | 1h29m | 193.4M | 1.7M
2026-03-08 16:04 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 1h41m | 236.0M | 1.7M
2026-03-08 14:26 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | - | 75.3% | 67 | 22 | 3600s | 3 | 0 | 1h03m | 155.2M | 1.4M
2026-03-08 13:51 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 49.4% | 44 | 44 | 3600s | 13 | 1 | 2h12m | 195.4M | 1.7M
2026-03-08 12:24 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (int4) | - | 46.1% | 41 | 48 | 3600s | 15 | 0 | 1h26m | 197.7M | 1.7M
2026-03-08 12:12 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 61.8% | 55 | 34 | 3600s | 10 | 0 | 1h07m | 259.5M | 2.2M
2026-03-08 10:25 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h46m | 216.2M | 1.9M
2026-03-08 09:53 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 37.1% | 33 | 55 | 3600s | 13 | 1 | 2h30m | 192.2M | 1.6M
2026-03-08 09:18 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 13 | 0 | 1h06m | 192.5M | 2.0M
2026-03-08 08:31 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 39.3% | 35 | 53 | 3600s | 13 | 1 | 1h21m | 188.4M | 1.6M
2026-03-08 08:10 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 6 | 0 | 1h08m | 189.4M | 1.9M
2026-03-08 07:14 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 34.8% | 31 | 58 | 3600s | 7 | 0 | 1h16m | 228.8M | 1.6M
2026-03-08 06:21 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Sonnet 4.6 | - | 64.0% | 57 | 32 | 3600s | 8 | 0 | 1h48m | 151.2M | 1.9M
2026-03-08 05:38 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 44.9% | 40 | 49 | 3600s | 11 | 0 | 1h35m | 176.6M | 1.4M
2026-03-08 05:12 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h09m | 97.6M | 1.4M
2026-03-08 04:09 | OpenClaw (2026.3.1) | wandb | moonshotai | Kimi K2.5 (int4) | - | 38.2% | 34 | 54 | 3600s | 6 | 1 | 1h29m | 171.4M | 1.6M
2026-03-08 04:03 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 57.3% | 51 | 38 | 3600s | 7 | 0 | 1h08m | 78.6M | 1.3M
2026-03-08 02:53 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 56.2% | 50 | 39 | 3600s | 10 | 0 | 1h10m | 74.2M | 1.3M
2026-03-08 01:44 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 37 | 3600s | 8 | 0 | 1h08m | 73.8M | 1.3M
2026-03-08 00:37 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 37 | 3600s | 5 | 0 | 1h07m | 83.4M | 1.3M
2026-03-08 00:33 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 44.9% | 40 | 48 | 3600s | 12 | 1 | 2h06m | 726.8M | 1.0M
2026-03-07 23:28 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 51.7% | 46 | 41 | 3600s | 3 | 2 | 1h08m | 95.3M | 2.1M
2026-03-07 23:25 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 43.8% | 39 | 50 | 3600s | 14 | 0 | 1h08m | 667.0M | 905K
2026-03-07 22:16 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 42.7% | 38 | 51 | 3600s | 12 | 0 | 1h08m | 707.0M | 878K
2026-03-07 22:07 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 55.1% | 49 | 40 | 3600s | 5 | 0 | 1h20m | 86.8M | 2.0M
2026-03-07 20:58 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 56.2% | 50 | 39 | 3600s | 3 | 0 | 1h09m | 78.6M | 2.0M
2026-03-07 20:10 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 41.6% | 37 | 51 | 3600s | 15 | 1 | 2h05m | 759.6M | 939K
2026-03-07 19:47 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 51.7% | 46 | 43 | 3600s | 2 | 0 | 1h10m | 71.6M | 2.0M
2026-03-07 18:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | - | 47.2% | 42 | 46 | 3600s | 14 | 1 | 1h12m | 775.5M | 982K
2026-03-07 18:15 | OpenClaw (2026.3.1) | anthropic | anthropic | Claude Sonnet 4.6 | - | 48.3% | 43 | 46 | 3600s | 6 | 0 | 1h31m | 115.2M | 2.3M
2026-03-07 16:53 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 28 | 3600s | 6 | 1 | 1h22m | 222.3M | 1.2M
2026-03-07 15:47 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 62.9% | 56 | 33 | 3600s | 4 | 0 | 1h05m | 195.9M | 1.6M
2026-03-07 14:27 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 58.4% | 52 | 36 | 3600s | 6 | 1 | 1h20m | 169.0M | 1.2M
2026-03-07 13:18 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 59.6% | 53 | 36 | 3600s | 7 | 0 | 1h09m | 188.9M | 1.2M
2026-03-07 11:53 | Claude Code (2.1.63) | anthropic | anthropic | Claude Opus 4.6 | - | 67.4% | 60 | 29 | 3600s | 5 | 0 | 1h24m | 209.0M | 1.4M
2026-03-07 10:13 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 53.9% | 48 | 41 | 3600s | 12 | 0 | 1h39m | 202.1M | 2.1M
2026-03-07 09:08 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 57.3% | 51 | 38 | 3600s | 4 | 0 | 1h05m | 166.9M | 1.7M
2026-03-07 08:02 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 62.9% | 56 | 33 | 3600s | 3 | 0 | 1h05m | 185.9M | 1.8M
2026-03-07 06:56 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 57.3% | 51 | 38 | 3600s | 6 | 0 | 1h05m | 210.7M | 2.2M
2026-03-07 04:37 | Claude Code (2.1.63) | anthropic | anthropic | Claude Sonnet 4.6 | - | 56.2% | 50 | 38 | 3600s | 5 | 1 | 2h19m | 216.0M | 2.3M
2026-02-18 01:33 | Terminus-2 (2.0.0) | openai | zai-org | GLM-5-FP8 | - | 50.6% | 45 | 43 | 7200s | 13 | 1 | 4h20m | 211.5M | 1.4M
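
In most rows of the run table, the Score column appears to equal Pass divided by the 89 Terminal-Bench 2.0 tasks. This is an inference from the numbers, not a documented formula, and the function name below is hypothetical. A minimal sketch of that arithmetic in Python:

```python
TOTAL_TASKS = 89  # Terminal-Bench 2.0 task count stated in the text


def score_percent(passed, total=TOTAL_TASKS):
    """Pass rate in percent, rounded to one decimal place.

    Assumption: this mirrors how the Score column is derived.
    """
    return round(100 * passed / total, 1)


# (Pass, reported Score) pairs copied from three rows of the run table.
rows = [(22, 24.7), (45, 50.6), (67, 75.3)]

for passed, reported in rows:
    print(f"pass={passed}  computed={score_percent(passed)}%  reported={reported}%")
```

A row where the computed and reported values differ would suggest that the run was scored against a different task count.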

About WolfBench

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

Welcome to WolfBench – we’re just getting started. What you see here is an early preview with only a handful of models and agents tested so far. We’re continuously expanding the lineup, running fresh evals, and sharing interesting findings and insights along the way. Watch this space.

AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?

A single score tells you almost nothing.

Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?

WolfBench exists because we got tired of meaningless leaderboards. We wanted to know which model, which agent, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.

What is it?

WolfBench is an evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles. They span the kind of work you’d actually ask an AI agent to do.

The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.

Why WolfBench is different

The Five-Metric Framework

Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Five numbers get a lot closer.

★ Ceiling: What’s theoretically possible?

The union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.

It tells you the theoretical maximum performance this model is capable of with a given agent – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.
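
As a minimal sketch of the ceiling, assume each run is recorded as the set of task IDs it solved. The `runs` data and task names below are invented for illustration and are not WolfBench output. Under that assumption, the ceiling is the union of those sets:

```python
# Hypothetical per-run results: each set holds the task IDs solved in that run.
runs = [
    {"task-a", "task-c"},  # run 1
    {"task-b", "task-c"},  # run 2
    {"task-c"},            # run 3
]


def ceiling(runs):
    """Return every task solved in at least one run (the union of all runs)."""
    solved = set()
    for run in runs:
        solved |= run
    return solved


ever_solved = ceiling(runs)
best_single_run = max(len(run) for run in runs)

print(sorted(ever_solved))  # all tasks that were ever solved
print(len(ever_solved))     # size of the ceiling
print(best_single_run)      # largest number of tasks solved in one run
```

Here task-a and task-b are never solved in the same run, so the ceiling (3 tasks) is higher than any single run (2 tasks).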

▲ Best-of: What’s the peak in a single run?

The highest score from any individual run.

This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.

∅ Average: What can you normally expect?

The mean score across all valid runs.

This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.
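
One rough way to judge whether an average is stable is the standard error of the mean. The sketch below uses the five scores of the GPT-5.4 (xhigh) / OpenClaw runs from the run table. Treating those five runs as one configuration is an assumption, and the standard error is a swapped-in statistic, not one of the five WolfBench metrics.

```python
from math import sqrt
from statistics import mean, stdev

# Scores (in %) of the five GPT-5.4 (xhigh) / OpenClaw runs in the run table.
scores = [70.8, 71.9, 70.8, 69.7, 71.9]


def standard_error(values):
    """Sample standard deviation divided by sqrt(n); needs at least two runs."""
    return stdev(values) / sqrt(len(values))


# Show how the estimate looks as more runs are included.
for n in range(2, len(scores) + 1):
    subset = scores[:n]
    print(f"{n} runs: mean {mean(subset):.2f}%, standard error {standard_error(subset):.2f}")
```

With a single run the sample standard deviation is undefined (`stdev` raises `StatisticsError`), which is the statistical version of the coin flip described above.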

▼ Worst-of: How bad can a single run get?

The lowest score from any individual run.

This is the opposite of best-of – the floor, the worst case. The gap between worst-of and best-of defines the full score range across all runs. A narrow range means predictable performance; a wide range means you’re rolling dice. Dashed lines on the chart mark this range visually, connecting the worst-of floor to the best-of peak.
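
The three run-level metrics and the range between them can be sketched from a plain list of run scores. The example uses the five Claude Opus 4.6 / Terminus-2 runs dated 2026-03-08 and 2026-03-09 in the run table. Grouping exactly these runs into one configuration is an assumption made for illustration.

```python
from statistics import mean

# Scores (in %) of five Claude Opus 4.6 / Terminus-2 runs from the run table.
# Assumption: these five runs form one model/agent configuration.
scores = [69.7, 71.9, 70.8, 68.5, 75.3]

best_of = max(scores)             # peak single run
worst_of = min(scores)            # lowest single run
average = mean(scores)            # mean over all runs
score_range = best_of - worst_of  # full spread between floor and peak

print(f"best-of  {best_of:.1f}%")
print(f"average  {average:.1f}%")
print(f"worst-of {worst_of:.1f}%")
print(f"range    {score_range:.1f} points")
```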

■ Solid: What does it always get right?

Tasks that the model solves across all runs – the rock-solid base with zero variance.

The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and a moderate average is often more reliable in practice than one with a high average but a low solid base – because you know what you’re getting.
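
A minimal sketch of the solid base, again assuming each run is stored as a set of solved task IDs (the data and function names are hypothetical). The solid base is the intersection of all runs, and the tasks solved at least once but not every time are the variance-limited ones:

```python
# Hypothetical per-run results: each set holds the task IDs solved in that run.
runs = [
    {"task-a", "task-c", "task-d"},
    {"task-b", "task-c", "task-d"},
    {"task-c", "task-d"},
]


def solid(runs):
    """Return the tasks solved in every run (the intersection of all runs)."""
    return set.intersection(*runs) if runs else set()


def variance_limited(runs):
    """Return tasks solved at least once but not in every run."""
    return set.union(*runs) - solid(runs) if runs else set()


print(sorted(solid(runs)))             # tasks you can rely on
print(sorted(variance_limited(runs)))  # solvable, but not reliably
```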

Reading the Chart

The five metrics are shown for each model/configuration: four stacked bar segments plus the worst-of marker with dashed range lines. The spread between them tells you as much as the numbers themselves.

The Bottom Line

Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you five angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.

Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.

What’s Next

We will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.

This benchmark offers enormous potential for discovery. For instance: Why does xhigh reasoning improve GPT-5.4’s performance while max effort degrades Opus 4.6’s results? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect the performance of local models in agentic tasks?

So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.

Inference sponsored by CoreWeave: The Essential Cloud for AI.
Sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.
Explore the complete data and tooling suite on our WolfBench GitHub.