WolfBench (2026-06-02)

Wolfram Ravenwolf’s Five-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report a single average. WolfBench shows five metrics that tell the full story – from the rock-solid base of tasks solved every time, through the average, up to the ceiling of everything ever solved – plus the best and worst single runs that frame the spread. Together, they reveal what no single number can: how consistent an AI agent truly is.
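The five metrics can be stated precisely. The formalization below is a sketch under stated assumptions, not WolfBench's published definition. It assumes each metric is computed over the $R$ runs of one model/agent configuration on the same $T$ tasks. It also introduces $s_{r,t} = 1$ if run $r$ solved task $t$ and $0$ otherwise; that notation is chosen here for illustration.

```latex
\begin{aligned}
S_r &= \frac{1}{T}\sum_{t=1}^{T} s_{r,t} && \text{score of run } r \\
\text{Ceiling} &= \frac{1}{T}\sum_{t=1}^{T} \max_{r} s_{r,t} && \text{share of tasks solved in at least one run} \\
\text{Best-of} &= \max_{r} S_r && \text{peak run} \\
\text{Average} &= \frac{1}{R}\sum_{r=1}^{R} S_r && \text{mean run score} \\
\text{Worst-of} &= \min_{r} S_r && \text{lowest run} \\
\text{Solid} &= \frac{1}{T}\sum_{t=1}^{T} \min_{r} s_{r,t} && \text{share of tasks solved in every run}
\end{aligned}
```

A task solved in every run is also solved in the worst run, and the best run can only solve tasks that were solved at least once. The five values are therefore always ordered: Solid ≤ Worst-of ≤ Average ≤ Best-of ≤ Ceiling.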

[Chart: score (0 to 100%) per model, marking five values for each: ★ Ceiling (ever solved), ▲ Best-of (peak run), ∅ Average (mean score), ▼ Worst-of (lowest run), ■ Solid (always solved).]
Models shown: GPT-5.5, Gemini 3.5 Flash, Claude Opus 4.7, Claude Opus 4.6, GPT-5.4, Claude Sonnet 4.6, Kimi K2.6 [W&B], Kimi K2.6 [Moonshot AI], DeepSeek-V4-Pro [W&B], MiniMax M2.7, Gemini 3.1 Pro Preview, DeepSeek-V4-Flash [W&B], Kimi K2.5 (int4) [W&B], GLM-5-Turbo, Kimi K2.5 (nvfp4) [W&B], GLM-5-FP8 [W&B], MiniMax M2.5 [W&B], Gemini 3 Flash Preview, GLM-5.1 [W&B], GPT-5.3-Codex, NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B], Gemma 4 31B [W&B], GPT-5.4 mini, Gemini 3.1 Flash Lite Preview, Mistral Small 4 119B A6B, GPT-5.4 nano.
Agent abbreviations: T2 = Terminus-2, CC = Claude Code, HA = Hermes Agent, OC = OpenClaw, CA = Cursor.
Run Details (375 runs)

Across these runs, 88 (99%) of the 89 tasks were solved at least once, 0 (0%) were solved every time, and 1 (1%) was never solved.
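The three counts in the sentence above partition the task set. Every task is either solved at least once or never solved, and the always-solved tasks are a subset of the ever-solved ones. Below is a minimal Python sketch of that bookkeeping. It assumes per-run results are available as a runs × tasks grid of booleans. The data layout and the names `classify_tasks` and `as_percent` are hypothetical, not WolfBench's own code.

```python
from typing import Sequence


def classify_tasks(results: Sequence[Sequence[bool]]) -> tuple[int, int, int]:
    """Count tasks solved at least once, solved in every run, and never solved.

    results[r][t] is True when run r solved task t (hypothetical layout).
    """
    if not results:
        raise ValueError("need at least one run")
    num_tasks = len(results[0])
    ever = always = 0
    for t in range(num_tasks):
        outcomes = [run[t] for run in results]  # task t across all runs
        if any(outcomes):
            ever += 1
        if all(outcomes):
            always += 1
    never = num_tasks - ever  # not solved at least once means never solved
    return ever, always, never


def as_percent(count: int, total: int) -> str:
    """Whole-number percentage, as the summary sentence reports it."""
    return f"{round(100 * count / total)}%"


if __name__ == "__main__":
    # Toy data: 3 runs over 4 tasks.
    runs = [
        [True, True, False, False],
        [True, False, True, False],
        [True, True, False, False],
    ]
    print(classify_tasks(runs))  # (3, 1, 1)
    # The reported figures: 88, 0 and 1 out of 89 tasks.
    print(as_percent(88, 89), as_percent(0, 89), as_percent(1, 89))  # 99% 0% 1%
```

With 89 tasks, 100 × k / 89 never lands exactly on .5 for 0 < k < 89. As a result, Python's round-half-to-even behavior cannot change any of these percentages.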

Date | Agent (version) | Provider | Vendor | Model | Thinking | Score | Pass | Fail | Timeout limit | Timeouts | Errors | Duration | Tokens in | Tokens out | Tokens total
2026-05-28 01:28 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | low | 73.0% | 65 | 23 | 3600s | 6 | 1 | 2h12m | 204.3M | 1.8M | 206.1M
2026-05-27 22:50 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | low | 66.3% | 59 | 30 | 3600s | 4 | 0 | 2h37m | 175.5M | 1.8M | 177.3M
2026-05-27 19:09 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | low | 68.5% | 61 | 28 | 3600s | 5 | 0 | 3h40m | 165.0M | 1.7M | 166.7M
2026-05-27 16:15 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | low | 71.9% | 64 | 24 | 3600s | 7 | 1 | 2h54m | 233.7M | 1.8M | 235.4M
2026-05-27 13:11 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | low | 67.4% | 60 | 29 | 3600s | 7 | 0 | 3h03m | 355.3M | 2.4M | 357.7M
2026-05-27 06:24 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | low | 65.2% | 58 | 31 | 3600s | 0 | 0 | 1h25m | 132.7M | 625K | 133.3M
2026-05-27 05:16 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | low | 58.4% | 52 | 37 | 3600s | 0 | 0 | 1h08m | 119.6M | 555K | 120.2M
2026-05-27 03:56 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | low | 67.4% | 60 | 29 | 3600s | 0 | 0 | 1h19m | 136.5M | 504K | 137.0M
2026-05-27 02:44 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | low | 62.9% | 56 | 33 | 3600s | 0 | 0 | 1h12m | 148.3M | 597K | 148.9M
2026-05-27 01:35 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | low | 66.3% | 59 | 30 | 3600s | 0 | 0 | 1h08m | 122.1M | 553K | 122.7M
2026-05-26 13:31 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | high | 70.8% | 63 | 26 | 3600s | 4 | 0 | 2h28m | 221.9M | 1.9M | 223.8M
2026-05-26 10:25 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | high | 73.0% | 65 | 23 | 3600s | 6 | 1 | 3h05m | 276.3M | 1.8M | 278.1M
2026-05-26 06:45 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | high | 80.9% | 72 | 17 | 3600s | 8 | 0 | 3h40m | 257.8M | 1.9M | 259.7M
2026-05-26 04:22 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | high | 71.9% | 64 | 25 | 3600s | 5 | 0 | 2h22m | 175.5M | 1.8M | 177.3M
2026-05-26 00:39 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | high | 65.2% | 58 | 31 | 3600s | 5 | 0 | 3h42m | 233.7M | 1.8M | 235.5M
2026-05-25 09:27 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | high | 68.5% | 61 | 28 | 3600s | 1 | 0 | 1h06m | 541.4M | 2.9M | 544.4M
2026-05-25 08:22 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | high | 69.7% | 62 | 27 | 3600s | 4 | 0 | 1h05m | 801.8M | 3.9M | 805.7M
2026-05-25 07:15 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | high | 69.7% | 62 | 27 | 3600s | 2 | 0 | 1h06m | 485.4M | 3.1M | 488.5M
2026-05-25 05:40 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | high | 69.7% | 62 | 27 | 3600s | 0 | 0 | 1h35m | 176.5M | 751K | 177.3M
2026-05-25 04:13 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | high | 70.8% | 63 | 26 | 3600s | 0 | 0 | 1h27m | 190.3M | 728K | 191.0M
2026-05-25 02:55 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | high | 70.8% | 63 | 26 | 3600s | 0 | 0 | 1h17m | 178.1M | 774K | 178.8M
2026-05-25 01:27 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | high | 68.5% | 61 | 28 | 3600s | 0 | 0 | 1h27m | 181.0M | 682K | 181.6M
2026-05-25 00:56 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | high | 74.2% | 66 | 23 | 3600s | 0 | 0 | 0h31m | 400.8M | 2.8M | 403.6M
2026-05-24 16:15 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | high | 71.9% | 64 | 25 | 3600s | 0 | 0 | 1h33m | 186.1M | 724K | 186.8M
2026-05-24 15:09 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | high | 75.3% | 67 | 22 | 3600s | 3 | 0 | 1h06m | 861.9M | 3.0M | 864.9M
2026-05-24 01:13 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | medium | 66.3% | 59 | 30 | 3600s | 5 | 0 | 2h13m | 180.2M | 1.5M | 181.7M
2026-05-23 22:30 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | medium | 62.9% | 56 | 32 | 3600s | 7 | 1 | 2h43m | 285.6M | 2.0M | 287.5M
2026-05-23 19:46 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | medium | 73.0% | 65 | 24 | 3600s | 7 | 0 | 2h44m | 160.7M | 1.7M | 162.3M
2026-05-23 17:25 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | medium | 71.9% | 64 | 25 | 3600s | 6 | 0 | 2h21m | 228.9M | 1.9M | 230.8M
2026-05-23 15:04 | OpenClaw (2026.4.23) | google | google | Gemini 3.5 Flash | medium | 66.3% | 59 | 30 | 3600s | 5 | 0 | 2h20m | 154.4M | 1.5M | 155.9M
2026-05-21 10:01 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | medium | 75.3% | 67 | 21 | 3600s | 0 | 1 | 2h08m | 167.6M | 764K | 168.3M
2026-05-21 07:56 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | medium | 67.4% | 60 | 29 | 3600s | 0 | 0 | 2h04m | 169.9M | 742K | 170.6M
2026-05-21 05:55 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | medium | 70.8% | 63 | 26 | 3600s | 0 | 0 | 2h00m | 198.7M | 716K | 199.5M
2026-05-21 03:55 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | medium | 66.3% | 59 | 28 | 3600s | 0 | 2 | 1h59m | 169.4M | 629K | 170.0M
2026-05-21 01:55 | Hermes Agent (v2026.5.16) | google | google | Gemini 3.5 Flash | medium | 69.7% | 62 | 27 | 3600s | 0 | 0 | 1h59m | 168.0M | 715K | 168.7M
2026-05-20 13:37 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | medium | 59.6% | 53 | 34 | 3600s | 1 | 2 | 1h10m | 541.6M | 2.4M | 544.0M
2026-05-20 03:05 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | medium | 70.8% | 63 | 23 | 3600s | 2 | 3 | 1h11m | 774.4M | 3.4M | 777.7M
2026-05-20 02:04 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | medium | 71.9% | 64 | 23 | 3600s | 0 | 2 | 1h00m | 315.4M | 2.3M | 317.7M
2026-05-19 22:53 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | medium | 71.9% | 64 | 23 | 3600s | 2 | 2 | 2h00m | 682.9M | 2.9M | 685.8M
2026-05-19 19:27 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.5 Flash | medium | 78.7% | 70 | 18 | 3600s | 2 | 1 | 1h05m | 495.9M | 2.5M | 498.4M
2026-05-19 00:29 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 60.7% | 54 | 34 | 3600s | 9 | 1 | 1h30m | 212.0M | 4.6M | 216.6M
2026-05-16 12:03 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 57.3% | 51 | 37 | 3600s | 8 | 1 | 1h01m | 243.5M | 5.9M | 249.4M
2026-05-16 09:02 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 59.6% | 53 | 36 | 3600s | 5 | 0 | 3h00m | 77.8M | 5.1M | 82.9M
2026-05-16 08:01 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 62.9% | 56 | 32 | 3600s | 8 | 1 | 1h01m | 193.1M | 5.9M | 198.9M
2026-05-16 04:31 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 48.3% | 43 | 46 | 3600s | 5 | 0 | 3h29m | 77.7M | 5.7M | 83.4M
2026-05-16 03:29 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 53.9% | 48 | 41 | 3600s | 10 | 0 | 1h01m | 228.5M | 5.1M | 233.7M
2026-05-16 00:40 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 60.7% | 54 | 35 | 3600s | 6 | 0 | 2h49m | 76.6M | 4.9M | 81.5M
2026-05-15 23:39 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 66.3% | 59 | 29 | 3600s | 7 | 1 | 1h01m | 191.8M | 5.4M | 197.2M
2026-05-15 20:38 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 49.4% | 44 | 45 | 3600s | 4 | 0 | 3h00m | 78.8M | 4.8M | 83.6M
2026-05-15 13:10 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.6 [W&B] | - | 61.8% | 55 | 34 | 3600s | 5 | 0 | 3h08m | 84.0M | 5.7M | 89.8M
2026-05-14 12:49 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 59.6% | 53 | 34 | 3600s | 19 | 2 | 1h42m | 154.8M | 2.0M | 156.8M
2026-05-14 09:14 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 32.6% | 29 | 59 | 3600s | 9 | 1 | 3h34m | 65.8M | 798K | 66.6M
2026-05-14 07:01 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 56.2% | 50 | 38 | 3600s | 22 | 1 | 2h12m | 143.3M | 1.9M | 145.2M
2026-05-14 03:32 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 41.6% | 37 | 52 | 3600s | 15 | 0 | 3h29m | 63.7M | 761K | 64.5M
2026-05-14 01:42 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 55.1% | 49 | 40 | 3600s | 21 | 0 | 1h50m | 135.4M | 1.9M | 137.3M
2026-05-13 22:37 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 34.8% | 31 | 58 | 3600s | 6 | 0 | 3h04m | 57.0M | 603K | 57.6M
2026-05-13 20:17 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 55.1% | 49 | 38 | 3600s | 30 | 2 | 2h19m | 125.9M | 1.7M | 127.6M
2026-05-13 16:48 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 29.2% | 26 | 56 | 3600s | 6 | 7 | 3h29m | 35.3M | 422K | 35.7M
2026-05-13 14:25 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 59.6% | 53 | 32 | 3600s | 20 | 4 | 2h14m | 139.1M | 2.0M | 141.2M
2026-05-13 11:19 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Pro [W&B] | - | 34.8% | 31 | 51 | 3600s | 4 | 7 | 2h49m | 66.0M | 793K | 66.8M
2026-05-12 16:25 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 41.6% | 37 | 52 | 3600s | 3 | 0 | 4h15m | 70.6M | 973K | 71.6M
2026-05-12 00:12 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 52.8% | 47 | 38 | 3600s | 14 | 4 | 1h38m | 223.0M | 2.7M | 225.6M
2026-05-11 11:10 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 48.3% | 43 | 45 | 3600s | 18 | 1 | 1h56m | 227.0M | 2.8M | 229.9M
2026-05-11 07:04 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 42.7% | 38 | 50 | 3600s | 7 | 1 | 4h05m | 92.8M | 1.2M | 94.0M
2026-05-11 05:15 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 48.3% | 43 | 45 | 3600s | 13 | 1 | 1h49m | 204.9M | 2.6M | 207.5M
2026-05-11 01:37 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 42.7% | 38 | 50 | 3600s | 3 | 1 | 3h37m | 81.6M | 1.1M | 82.7M
2026-05-10 23:22 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 51.7% | 46 | 42 | 3600s | 11 | 1 | 2h15m | 269.6M | 2.8M | 272.5M
2026-05-10 19:55 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 43.8% | 39 | 49 | 3600s | 6 | 1 | 3h27m | 88.2M | 1.2M | 89.4M
2026-05-10 18:16 | Terminus-2 (2.0.0) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h38m | 221.5M | 2.5M | 224.0M
2026-05-10 13:47 | Hermes Agent (v2026.3.30) | wandb | deepseek-ai | DeepSeek-V4-Flash [W&B] | - | 46.1% | 41 | 46 | 3600s | 3 | 2 | 4h28m | 88.5M | 1.1M | 89.6M
2026-04-27 06:24 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | xhigh | 73.0% | 65 | 24 | 3600s | 1 | 0 | 1h07m | - | - | -
2026-04-27 05:15 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | xhigh | 77.5% | 69 | 20 | 3600s | 1 | 0 | 1h08m | - | - | -
2026-04-27 04:09 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | xhigh | 71.9% | 64 | 25 | 3600s | 1 | 0 | 1h06m | - | - | -
2026-04-27 03:00 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | xhigh | 74.2% | 66 | 23 | 3600s | 1 | 0 | 1h08m | - | - | -
2026-04-27 01:53 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | xhigh | 74.2% | 66 | 23 | 3600s | 1 | 0 | 1h07m | - | - | -
2026-04-26 11:01 | Cursor (2026.04.17) | cursor | cursor | GPT-5.5 | high | 79.8% | 71 | 18 | 3600s | 1 | 0 | 1h06m | 97.5M | 705K | 98.2M
2026-04-26 10:17 | Cursor (2026.04.17) | cursor | cursor | GPT-5.5 | high | 74.2% | 66 | 23 | 3600s | 0 | 0 | 0h44m | 100.1M | 755K | 100.9M
2026-04-26 08:17 | Cursor (2026.04.17) | cursor | cursor | GPT-5.5 | high | 79.8% | 71 | 17 | 3600s | 2 | 1 | 1h59m | 92.1M | 659K | 92.8M
2026-04-26 06:59 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | medium | 70.8% | 63 | 26 | 3600s | 1 | 0 | 1h05m | - | - | -
2026-04-26 06:21 | Cursor (2026.04.17) | cursor | cursor | GPT-5.5 | high | 79.8% | 71 | 18 | 3600s | 1 | 0 | 1h56m | 85.7M | 665K | 86.3M
2026-04-26 05:54 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | medium | 70.8% | 63 | 26 | 3600s | 1 | 0 | 1h04m | - | - | -
2026-04-26 04:58 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | medium | 75.3% | 67 | 22 | 3600s | 0 | 0 | 0h55m | - | - | -
2026-04-26 04:05 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | medium | 71.9% | 64 | 25 | 3600s | 0 | 0 | 0h53m | - | - | -
2026-04-26 03:34 | Cursor (2026.04.17) | cursor | cursor | GPT-5.5 | high | 73.0% | 65 | 23 | 3600s | 1 | 1 | 2h46m | 84.3M | 617K | 85.0M
2026-04-26 03:09 | Hermes Agent (v2026.4.23) | openai | openai | GPT-5.5 | medium | 75.3% | 67 | 22 | 3600s | 0 | 0 | 0h56m | - | - | -
2026-04-25 21:26 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | xhigh | 75.3% | 67 | 22 | 3600s | 5 | 0 | 1h15m | 21.9M | 2.3M | 24.2M
2026-04-25 20:08 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | xhigh | 79.8% | 71 | 17 | 3600s | 7 | 1 | 1h17m | 29.7M | 2.7M | 32.4M
2026-04-25 18:28 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | xhigh | 74.2% | 66 | 22 | 3600s | 7 | 1 | 1h40m | 24.5M | 2.3M | 26.8M
2026-04-25 17:15 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | xhigh | 78.7% | 70 | 19 | 3600s | 5 | 0 | 1h12m | 27.8M | 2.4M | 30.2M
2026-04-25 15:58 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | xhigh | 76.4% | 68 | 21 | 3600s | 4 | 0 | 1h16m | 20.3M | 2.4M | 22.7M
2026-04-25 12:44 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | medium | 68.5% | 61 | 28 | 3600s | 2 | 0 | 1h02m | 33.7M | 888K | 34.6M
2026-04-25 11:43 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | medium | 71.9% | 64 | 25 | 3600s | 1 | 0 | 1h00m | 30.7M | 834K | 31.5M
2026-04-25 10:02 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | medium | 68.5% | 61 | 27 | 3600s | 3 | 1 | 1h40m | 122.3M | 902K | 123.2M
2026-04-25 08:56 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | medium | 67.4% | 60 | 29 | 3600s | 2 | 0 | 1h06m | 25.6M | 818K | 26.4M
2026-04-25 07:46 | Terminus-2 (2.0.0) | openai | openai | GPT-5.5 | medium | 69.7% | 62 | 27 | 3600s | 4 | 0 | 1h09m | 58.3M | 1.0M | 59.3M
2026-04-25 06:39 | OpenClaw (2026.4.23) | openai | openai | GPT-5.5 | off | 65.2% | 58 | 31 | 3600s | 6 | 0 | 1h06m | 16.3M | 189K | 16.5M
2026-04-25 05:32 | OpenClaw (2026.4.23) | openai | openai | GPT-5.5 | off | 70.8% | 63 | 26 | 3600s | 5 | 0 | 1h06m | 15.2M | 150K | 15.3M
2026-04-25 04:25 | OpenClaw (2026.4.23) | openai | openai | GPT-5.5 | off | 69.7% | 62 | 27 | 3600s | 6 | 0 | 1h07m | 18.7M | 198K | 18.9M
2026-04-25 02:45 | OpenClaw (2026.4.23) | openai | openai | GPT-5.5 | off | 74.2% | 66 | 22 | 3600s | 7 | 1 | 1h40m | 21.2M | 183K | 21.4M
2026-04-25 01:50 | Hermes Agent (v2026.3.30) | wandbqa | moonshotai | Kimi K2.6 [Moonshot AI] | - | 20.2% | 18 | 70 | 3600s | 55 | 1 | 3h09m | - | - | -
2026-04-25 01:38 | OpenClaw (2026.4.23) | openai | openai | GPT-5.5 | off | 71.9% | 64 | 25 | 3600s | 6 | 0 | 1h06m | 24.8M | 176K | 25.0M
2026-04-24 21:17 | Hermes Agent (v2026.3.30) | wandbqa | moonshotai | Kimi K2.6 [Moonshot AI] | - | 37.1% | 33 | 56 | 3600s | 49 | 0 | 4h32m | - | - | -
2026-04-23 10:40 | Hermes Agent (v2026.3.30) | moonshotai | moonshotai | Kimi K2.6 [Moonshot AI] | - | 57.3% | 51 | 38 | 3600s | 13 | 0 | 3h50m | - | - | -
2026-04-23 06:22 | Hermes Agent (v2026.3.30) | moonshotai | moonshotai | Kimi K2.6 [Moonshot AI] | - | 64.0% | 57 | 32 | 3600s | 14 | 0 | 4h17m | - | - | -
2026-04-23 02:01 | Hermes Agent (v2026.3.30) | wandb | zai-org | GLM-5.1 [W&B] | - | 41.6% | 37 | 47 | 3600s | 6 | 5 | 5h48m | 72.9M | 768K | 73.6M
2026-04-23 01:45 | Hermes Agent (v2026.3.30) | moonshotai | moonshotai | Kimi K2.6 [Moonshot AI] | - | 57.3% | 51 | 35 | 3600s | 13 | 3 | 4h36m | - | - | -
2026-04-22 20:52 | Hermes Agent (v2026.3.30) | wandb | zai-org | GLM-5.1 [W&B] | - | 47.2% | 42 | 44 | 3600s | 4 | 3 | 5h08m | 63.4M | 779K | 64.1M
2026-04-22 14:28 | Hermes Agent (v2026.3.30) | wandb | zai-org | GLM-5.1 [W&B] | - | 42.7% | 38 | 45 | 3600s | 4 | 6 | 6h23m | 60.4M | 708K | 61.1M
2026-04-22 08:12 | Hermes Agent (v2026.3.30) | wandb | zai-org | GLM-5.1 [W&B] | - | 42.7% | 38 | 44 | 3600s | 4 | 7 | 6h16m | 70.9M | 789K | 71.7M
2026-04-21 18:57 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.7 | off | 65.2% | 58 | 30 | 3600s | 3 | 1 | 4h41m | - | - | -
2026-04-21 18:27 | Terminus-2 (2.0.0) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 60.7% | 54 | 35 | 3600s | 15 | 0 | 1h55m | 117.5M | 3.2M | 120.7M
2026-04-21 15:55 | Terminus-2 (2.0.0) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 55.1% | 49 | 39 | 3600s | 12 | 1 | 2h32m | 108.5M | 2.8M | 111.3M
2026-04-21 14:42 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.7 | off | 65.2% | 58 | 30 | 3600s | 6 | 1 | 4h15m | - | - | -
2026-04-21 14:07 | Terminus-2 (2.0.0) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 62.9% | 56 | 33 | 3600s | 14 | 0 | 1h47m | 112.2M | 2.8M | 115.0M
2026-04-21 11:47 | Terminus-2 (2.0.0) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 58.4% | 52 | 37 | 3600s | 13 | 0 | 2h20m | 100.0M | 2.8M | 102.9M
2026-04-21 10:08 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.7 | off | 62.9% | 56 | 30 | 3600s | 7 | 3 | 4h33m | - | - | -
2026-04-21 09:59 | Terminus-2 (2.0.0) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 57.3% | 51 | 38 | 3600s | 14 | 0 | 1h47m | 108.4M | 3.0M | 111.4M
2026-04-21 06:45 | OpenClaw (2026.3.11) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 53.9% | 48 | 40 | 3600s | 13 | 1 | 3h13m | 168.9M | 3.8M | 172.7M
2026-04-21 05:29 | OpenClaw (2026.3.11) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 56.2% | 50 | 39 | 3600s | 11 | 0 | 1h15m | 171.8M | 3.4M | 175.2M
2026-04-21 05:00 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.7 | off | 67.4% | 60 | 28 | 3600s | 5 | 1 | 5h07m | - | - | -
2026-04-21 03:49 | OpenClaw (2026.3.11) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 62.9% | 56 | 32 | 3600s | 12 | 1 | 1h40m | 203.6M | 4.2M | 207.8M
2026-04-21 02:32 | OpenClaw (2026.3.11) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 60.7% | 54 | 35 | 3600s | 11 | 0 | 1h16m | 161.6M | 3.5M | 165.1M
2026-04-21 00:51 | OpenClaw (2026.3.11) | openrouter | moonshotai | Kimi K2.6 [Moonshot AI] | - | 59.6% | 53 | 35 | 3600s | 14 | 1 | 1h41m | 212.0M | 4.3M | 216.3M
2026-04-21 00:32 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.7 | off | 70.8% | 63 | 25 | 3600s | 3 | 1 | 4h28m | - | - | -
2026-04-18 01:01 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.7 | off | 76.4% | 68 | 20 | 3600s | 7 | 1 | 1h40m | 122.7M | 1.3M | 124.0M
2026-04-17 23:30 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.7 | off | 70.8% | 63 | 26 | 3600s | 2 | 0 | 1h03m | 112.0M | 998K | 113.0M
2026-04-17 22:38 | Cursor (2026.04.16) | cursor | cursor | Claude Opus 4.6 | high | 61.8% | 55 | 33 | 3600s | 10 | 1 | 1h18m | 80.4M | 863K | 81.3M
2026-04-17 21:52 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.7 | off | 77.5% | 69 | 20 | 3600s | 7 | 0 | 1h38m | 215.6M | 1.7M | 217.3M
2026-04-17 21:15 | Cursor (2026.04.16) | cursor | cursor | Claude Opus 4.6 | high | 60.7% | 54 | 34 | 3600s | 8 | 1 | 1h22m | 94.3M | 918K | 95.2M
2026-04-17 20:47 | Claude Code (2.1.112) | anthropic | anthropic | Claude Opus 4.7 | xhigh | 73.0% | 65 | 24 | 3600s | 3 | 0 | 1h05m | 219.0M | 2.5M | 221.5M
2026-04-17 19:42 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.7 | off | 73.0% | 65 | 24 | 3600s | 1 | 0 | 1h03m | 144.3M | 1.0M | 145.4M
2026-04-17 18:31 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.7 | off | 67.4% | 60 | 29 | 3600s | 4 | 0 | 1h10m | 154.1M | 1.4M | 155.5M
2026-04-17 18:04 | Cursor (2026.04.16) | cursor | cursor | Claude Opus 4.6 | high | 57.3% | 51 | 37 | 3600s | 9 | 1 | 1h16m | 74.6M | 1.2M | 75.9M
2026-04-17 16:57 | Cursor (2026.04.16) | cursor | cursor | Claude Opus 4.6 | high | 66.3% | 59 | 29 | 3600s | 6 | 1 | 1h06m | 119.3M | 1.5M | 120.8M
2026-04-17 16:52 | Claude Code (2.1.112) | anthropic | anthropic | Claude Opus 4.7 | xhigh | 73.0% | 65 | 24 | 3600s | 6 | 0 | 1h39m | 214.0M | 2.3M | 216.3M
2026-04-17 12:52 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.7 | off | 70.8% | 63 | 25 | 3600s | 4 | 1 | 1h40m | 193.8M | 1.3M | 195.1M
2026-04-17 11:48 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.7 | off | 74.2% | 66 | 23 | 3600s | 3 | 0 | 1h04m | 143.9M | 1.3M | 145.2M
2026-04-17 11:37 | Cursor (2026.04.16) | cursor | cursor | Claude Opus 4.6 | high | 67.4% | 60 | 27 | 3600s | 6 | 2 | 1h15m | 137.7M | 1.6M | 139.4M
2026-04-17 10:09 | Claude Code (2.1.112) | anthropic | anthropic | Claude Opus 4.7 | xhigh | 73.0% | 65 | 24 | 3600s | 7 | 0 | 1h39m | 219.8M | 2.3M | 222.1M
2026-04-17 08:08 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.7 | off | 71.9% | 64 | 24 | 3600s | 5 | 1 | 2h01m | 156.6M | 1.2M | 157.7M
2026-04-17 05:31 | Claude Code (2.1.112) | anthropic | anthropic | Claude Opus 4.7 | xhigh | 74.2% | 66 | 18 | 3600s | 5 | 5 | 1h19m | 269.4M | 2.5M | 271.9M
2026-04-17 04:04 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.7 | off | 70.8% | 63 | 25 | 3600s | 3 | 1 | 1h27m | 139.4M | 1.2M | 140.5M
2026-04-17 02:46 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.7 | off | 80.9% | 72 | 17 | 3600s | 5 | 0 | 1h17m | 129.8M | 1.2M | 131.0M
2026-04-17 01:18 | Claude Code (2.1.112) | anthropic | anthropic | Claude Opus 4.7 | xhigh | 74.2% | 66 | 23 | 3600s | 5 | 0 | 1h28m | 237.8M | 2.4M | 240.1M
2026-04-15 13:59 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 44.9% | 40 | 47 | 3600s | 17 | 2 | 2h18m | 89.9M | 1.6M | 91.5M
2026-04-15 07:56 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 48.3% | 43 | 44 | 3600s | 20 | 2 | 2h35m | 101.4M | 1.7M | 103.2M
2026-04-15 06:01 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 47.2% | 42 | 46 | 3600s | 20 | 1 | 1h54m | 88.3M | 1.7M | 90.0M
2026-04-15 04:01 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 51.7% | 46 | 42 | 3600s | 14 | 1 | 1h59m | 72.8M | 1.6M | 74.4M
2026-04-15 01:29 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 43.8% | 39 | 48 | 3600s | 19 | 2 | 2h32m | 94.1M | 1.7M | 95.8M
2026-04-15 00:23 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5.1 [W&B] | - | 31.5% | 28 | 61 | 3600s | 2 | 0 | 1h04m | 24.0M | 461K | 24.5M
2026-04-14 21:17 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5.1 [W&B] | - | 43.8% | 39 | 45 | 3600s | 29 | 5 | 2h45m | 25.5M | 937K | 26.5M
2026-04-14 18:51 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5.1 [W&B] | - | 41.6% | 37 | 52 | 3600s | 38 | 0 | 2h25m | 24.0M | 925K | 24.9M
2026-04-14 17:10 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5.1 [W&B] | - | 36.0% | 32 | 56 | 3600s | 7 | 1 | 1h40m | 27.0M | 502K | 27.5M
2026-04-14 07:52 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5.1 [W&B] | - | 39.3% | 35 | 54 | 3600s | 44 | 0 | 3h21m | 13.1M | 642K | 13.7M
2026-04-14 03:58 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5.1 [W&B] | - | 40.4% | 36 | 50 | 3600s | 37 | 3 | 3h54m | 21.3M | 868K | 22.2M
2026-04-14 01:20 | Terminus-2 (2.0.0) | wandb | zai-org | GLM-5.1 [W&B] | - | 47.2% | 42 | 46 | 3600s | 40 | 1 | 2h38m | 22.8M | 856K | 23.7M
2026-04-13 23:48 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5.1 [W&B] | - | 25.8% | 23 | 66 | 3600s | 4 | 0 | 1h31m | 22.3M | 395K | 22.7M
2026-04-13 22:07 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5.1 [W&B] | - | 30.3% | 27 | 61 | 3600s | 5 | 1 | 1h40m | 18.5M | 414K | 18.9M
2026-04-13 20:26 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5.1 [W&B] | - | 39.3% | 35 | 53 | 3600s | 10 | 1 | 1h40m | 51.5M | 658K | 52.2M
2026-04-09 11:12 | Terminus-2 (2.0.0) | wandb | google | Gemma 4 31B [W&B] | - | 31.5% | 28 | 59 | 3600s | 8 | 2 | 1h41m | 163.6M | 1.3M | 164.8M
2026-04-09 09:31 | Terminus-2 (2.0.0) | wandb | google | Gemma 4 31B [W&B] | - | 31.5% | 28 | 60 | 3600s | 13 | 1 | 1h40m | 217.1M | 1.5M | 218.6M
2026-04-09 08:03 | Terminus-2 (2.0.0) | wandb | google | Gemma 4 31B [W&B] | - | 30.3% | 27 | 62 | 3600s | 11 | 0 | 1h27m | 222.8M | 1.2M | 224.0M
2026-04-09 06:49 | Terminus-2 (2.0.0) | wandb | google | Gemma 4 31B [W&B] | - | 32.6% | 29 | 59 | 3600s | 12 | 1 | 1h13m | 188.2M | 1.5M | 189.7M
2026-04-09 05:08 | OpenClaw (2026.3.11) | wandb | google | Gemma 4 31B [W&B] | - | 19.1% | 17 | 71 | 3600s | 5 | 1 | 1h40m | 147.0M | 1.3M | 148.3M
2026-04-09 03:57 | OpenClaw (2026.3.11) | wandb | google | Gemma 4 31B [W&B] | - | 18.0% | 16 | 73 | 3600s | 2 | 0 | 1h11m | 203.5M | 1.5M | 205.0M
2026-04-09 01:52 | OpenClaw (2026.3.11) | wandb | google | Gemma 4 31B [W&B] | - | 19.1% | 17 | 71 | 3600s | 8 | 1 | 2h04m | 200.5M | 1.4M | 201.9M
2026-04-09 00:25 | OpenClaw (2026.3.11) | wandb | google | Gemma 4 31B [W&B] | - | 16.9% | 15 | 74 | 3600s | 3 | 0 | 1h26m | 179.6M | 1.6M | 181.2M
2026-04-08 23:18 | OpenClaw (2026.3.11) | wandb | google | Gemma 4 31B [W&B] | - | 18.0% | 16 | 73 | 3600s | 7 | 0 | 1h07m | 124.2M | 1.5M | 125.7M
2026-04-06 07:52 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 3 | 1 | 1h40m | 172.0M | 2.2M | 174.2M
2026-04-06 06:11 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 28.1% | 25 | 63 | 3600s | 2 | 1 | 1h40m | 174.2M | 1.5M | 175.7M
2026-04-06 05:06 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h05m | 96.6M | 2.0M | 98.5M
2026-04-06 03:25 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 2 | 1 | 1h40m | 156.2M | 1.9M | 158.1M
2026-04-06 02:22 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 2 | 0 | 1h02m | 223.3M | 2.4M | 225.7M
2026-04-06 01:18 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 20.2% | 18 | 71 | 3600s | 5 | 0 | 1h04m | 239.4M | 837K | 240.3M
2026-04-05 23:37 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 21.3% | 19 | 69 | 3600s | 7 | 1 | 1h40m | 177.6M | 745K | 178.4M
2026-04-05 21:45 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 22.5% | 20 | 69 | 3600s | 4 | 0 | 1h51m | 162.3M | 741K | 163.0M
2026-04-05 20:04 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 24.7% | 22 | 66 | 3600s | 5 | 1 | 1h40m | 255.9M | 830K | 256.7M
2026-04-05 17:06 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Flash Lite Preview | - | 25.8% | 23 | 66 | 3600s | 6 | 0 | 2h57m | 123.1M | 697K | 123.8M
2026-04-05 11:05 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 51 | 3600s | 4 | 1 | 1h40m | 284.7M | 1.2M | 285.9M
2026-04-05 09:59 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 41.6% | 37 | 52 | 3600s | 5 | 0 | 1h05m | 270.6M | 1.2M | 271.8M
2026-04-05 08:19 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 3 | 1 | 1h40m | 310.4M | 1.3M | 311.8M
2026-04-05 07:15 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 48.3% | 43 | 45 | 3600s | 4 | 1 | 1h03m | 490.5M | 1.6M | 492.1M
2026-04-05 06:09 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3 Flash Preview | - | 43.8% | 39 | 50 | 3600s | 5 | 0 | 1h05m | 252.4M | 1.2M | 253.6M
2026-04-05 04:28 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 210.1M | 472K | 210.6M
2026-04-05 02:47 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 36.0% | 32 | 56 | 3600s | 7 | 1 | 1h40m | 377.6M | 653K | 378.3M
2026-04-05 01:06 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 46.1% | 41 | 47 | 3600s | 9 | 1 | 1h40m | 265.0M | 753K | 265.8M
2026-04-04 23:25 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 7 | 1 | 1h40m | 210.4M | 670K | 211.1M
2026-04-04 21:44 | OpenClaw (2026.3.11) | google | google | Gemini 3 Flash Preview | - | 40.4% | 36 | 52 | 3600s | 9 | 1 | 1h40m | 920.6M | 1.2M | 921.8M
2026-04-03 11:44 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 44 | 3600s | 0 | 0 | 0h31m | 15.4M | 653K | 16.1M
2026-04-03 11:08 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 0 | 0 | 0h35m | 15.6M | 568K | 16.2M
2026-04-03 10:41 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 52.8% | 47 | 42 | 3600s | 0 | 0 | 0h26m | 23.7M | 700K | 24.4M
2026-04-03 09:01 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 50.6% | 45 | 43 | 3600s | 1 | 1 | 1h40m | 13.1M | 634K | 13.7M
2026-04-03 07:58 | Terminus-2 (2.0.0) | gemini | gemini | Gemini 3.1 Pro Preview | - | 48.3% | 43 | 45 | 3600s | 1 | 1 | 1h02m | 19.4M | 670K | 20.1M
2026-04-03 06:53 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 59.6% | 53 | 36 | 3600s | 6 | 0 | 1h05m | 228.7M | 638K | 229.3M
2026-04-03 05:46 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 57.3% | 51 | 38 | 3600s | 5 | 0 | 1h06m | 226.5M | 748K | 227.2M
2026-04-03 04:05 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h40m | 131.0M | 652K | 131.7M
2026-04-03 02:24 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 62.9% | 56 | 32 | 3600s | 8 | 1 | 1h40m | 239.2M | 696K | 239.9M
2026-04-03 01:18 | OpenClaw (2026.3.11) | google | google | Gemini 3.1 Pro Preview | - | 56.2% | 50 | 39 | 3600s | 6 | 0 | 1h06m | 102.8M | 485K | 103.3M
2026-04-02 12:00 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | off | 61.8% | 55 | 31 | 3600s | 5 | 3 | 6h17m | 68.4M | 1.0M | 69.4M
2026-04-02 07:39 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 39.3% | 35 | 52 | 3600s | 6 | 2 | 5h23m | 75.8M | 1.6M | 77.4M
2026-04-02 07:08 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | off | 64.0% | 57 | 32 | 3600s | 5 | 0 | 4h51m | 75.6M | 1.0M | 76.7M
2026-04-02 05:24 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | medium | 70.8% | 63 | 26 | 3600s | 2 | 0 | 2h38m | 80.2M | 996K | 81.2M
2026-04-02 03:25 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 40.4% | 36 | 52 | 3600s | 6 | 1 | 4h13m | 82.0M | 1.7M | 83.7M
2026-04-02 02:55 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | medium | 65.2% | 58 | 31 | 3600s | 2 | 0 | 2h28m | 70.8M | 960K | 71.8M
2026-04-02 00:41 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | off | 67.4% | 60 | 29 | 3600s | 5 | 0 | 6h25m | 76.1M | 1.2M | 77.3M
2026-04-01 23:30 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 39.3% | 35 | 53 | 3600s | 5 | 1 | 3h54m | 65.4M | 1.3M | 66.6M
2026-04-01 20:09 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | medium | 66.3% | 59 | 30 | 3600s | 3 | 0 | 2h22m | 86.4M | 1.0M | 87.4M
2026-04-01 19:52 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 44.9% | 40 | 49 | 3600s | 4 | 0 | 3h38m | 77.7M | 1.6M | 79.3M
2026-04-01 19:49 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | off | 61.8% | 55 | 34 | 3600s | 4 | 0 | 4h51m | 69.4M | 1.1M | 70.5M
2026-04-01 17:47 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | medium | 65.2% | 58 | 30 | 3600s | 3 | 1 | 2h21m | 67.2M | 900K | 68.1M
2026-04-01 14:45 | Hermes Agent (v2026.3.30) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 42.7% | 38 | 50 | 3600s | 7 | 1 | 5h06m | 88.1M | 1.6M | 89.7M
2026-04-01 14:44 | Hermes Agent (v2026.3.30) | openai | openai | GPT-5.4 | medium | 64.0% | 57 | 31 | 3600s | 1 | 1 | 3h02m | 64.5M | 847K | 65.3M
2026-04-01 14:44 | Hermes Agent (v2026.3.30) | anthropic | anthropic | Claude Opus 4.6 | off | 64.0% | 57 | 31 | 3600s | 3 | 1 | 5h04m | 69.5M | 1.1M | 70.5M
2026-03-29 07:01 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 39.3% | 35 | 45 | 3600s | 6 | 9 | 2h57m | 133.6M | 1.1M | 134.7M
2026-03-29 04:05 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 37.1% | 33 | 51 | 3600s | 1 | 5 | 2h55m | 91.3M | 923K | 92.3M
2026-03-29 01:00 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 38.2% | 34 | 50 | 3600s | 3 | 5 | 3h04m | 104.7M | 861K | 105.6M
2026-03-27 19:54 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 31.5% | 28 | 53 | 3600s | 5 | 8 | 3h07m | 102.7M | 923K | 103.6M
2026-03-27 16:16 | OpenClaw (2026.3.11) | wandb | zai-org | GLM-5-FP8 [W&B] | - | 37.1% | 33 | 47 | 3600s | 2 | 9 | 3h37m | 90.2M | 797K | 91.0M
2026-03-27 13:20 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 42.7% | 38 | 47 | 3600s | 0 | 4 | 2h55m | 69.4M | 984K | 70.4M
2026-03-27 11:08 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 37.1% | 33 | 49 | 3600s | 1 | 7 | 2h12m | 72.7M | 1.0M | 73.7M
2026-03-27 08:24 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 37.1% | 33 | 50 | 3600s | 0 | 6 | 2h42m | 66.5M | 885K | 67.4M
2026-03-27 06:53 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 50.6% | 45 | 42 | 3600s | 24 | 2 | 1h31m | 74.9M | 1.4M | 76.4M
2026-03-27 04:47 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 43.8% | 39 | 45 | 3600s | 25 | 5 | 2h05m | 114.6M | 1.6M | 116.2M
2026-03-27 02:58 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 49.4% | 44 | 43 | 3600s | 17 | 2 | 1h48m | 84.9M | 1.5M | 86.4M
2026-03-26 12:31 | Terminus-2 (2.0.0) | wandb | moonshotai | Kimi K2.5 (nvfp4) [W&B] | - | 47.2% | 42 | 45 | 3600s | 7 | 2 | 1h47m | 306.4M | 2.0M | 308.4M
2026-03-26 09:37 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 33.7% | 30 | 53 | 3600s | 3 | 6 | 2h51m | 72.7M | 969K | 73.7M
2026-03-26 07:14 | OpenClaw (2026.3.11) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 32.6% | 29 | 52 | 3600s | 1 | 8 | 2h22m | 68.1M | 996K | 69.1M
2026-03-26 06:07 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 41.6% | 37 | 50 | 3600s | 31 | 2 | 1h06m | 64.1M | 1.5M | 65.5M
2026-03-26 04:26 | Terminus-2 (2.0.0) | wandb | MiniMaxAI | MiniMax M2.5 [W&B] | - | 49.4% | 44 | 42 | 3600s | 22 | 3 | 1h41m | 74.7M | 1.4M | 76.1M
2026-03-20 06:43 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 18 | 0 | 1h33m | 245.0M | 2.4M | 247.4M
2026-03-20 03:45 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 39 | 3600s | 16 | 1 | 2h57m | 337.1M | 2.5M | 339.7M
2026-03-20 02:30 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 49.4% | 44 | 45 | 3600s | 7 | 0 | 1h14m | 135.9M | 2.4M | 138.3M
2026-03-20 00:17 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 48.3% | 43 | 45 | 3600s | 6 | 1 | 2h12m | 104.3M | 2.2M | 106.5M
2026-03-19 13:39 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 nano | - | 20.2% | 18 | 71 | 3600s | 32 | 0 | 2h01m | 1.32B | 2.6M | 1.32B
2026-03-19 12:31 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 52.8% | 47 | 42 | 3600s | 15 | 0 | 1h31m | 317.7M | 2.7M | 320.4M
2026-03-19 12:08 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 nano | - | 23.6% | 21 | 68 | 3600s | 28 | 0 | 1h31m | 1.08B | 2.2M | 1.08B
2026-03-19 10:50 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 47.2% | 42 | 46 | 3600s | 18 | 1 | 1h40m | 249.8M | 2.5M | 252.2M
2026-03-19 10:43 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 nano | - | 23.6% | 21 | 68 | 3600s | 21 | 0 | 1h24m | 1.09B | 2.0M | 1.09B
2026-03-19 09:29 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 25.8% | 23 | 59 | 3600s | 1 | 7 | 1h54m | 147.8M | 1.1M | 149.0M
2026-03-19 09:19 | Terminus-2 (2.0.0) | openrouter | minimax | MiniMax M2.7 | - | 55.1% | 49 | 40 | 3600s | 19 | 0 | 1h30m | 192.0M | 2.4M | 194.5M
2026-03-19 09:18 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 mini | - | 27.0% | 24 | 65 | 3600s | 24 | 0 | 1h24m | 845.5M | 1.3M | 846.7M
2026-03-19 08:13 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 21.3% | 19 | 70 | 3600s | 4 | 0 | 1h15m | 455.3M | 1.6M | 456.9M
2026-03-19 07:38 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 46.1% | 41 | 47 | 3600s | 4 | 1 | 1h40m | 100.8M | 2.3M | 103.1M
2026-03-19 07:37 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 mini | - | 25.8% | 23 | 65 | 3600s | 21 | 1 | 1h40m | 810.4M | 1.3M | 811.7M
2026-03-19 06:59 | Terminus-2 (2.0.0) | mistral | mistral | Mistral Small 4 119B A6B | - | 23.6% | 21 | 68 | 3600s | 4 | 0 | 1h13m | 232.2M | 1.5M | 233.7M
2026-03-19 05:57 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 42.7% | 38 | 50 | 3600s | 6 | 1 | 1h40m | 113.1M | 2.5M | 115.5M
2026-03-19 05:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 mini | - | 25.8% | 23 | 66 | 3600s | 17 | 0 | 1h39m | 847.8M | 1.3M | 849.1M
2026-03-19 05:53 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 18.0% | 16 | 72 | 3600s | 4 | 1 | 1h05m | 110.5M | 772K | 111.3M
2026-03-19 04:56 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 nano | - | 12.4% | 11 | 78 | 3600s | 1 | 0 | 1h00m | 25.2M | 156K | 25.4M
2026-03-19 04:47 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 16.9% | 15 | 74 | 3600s | 6 | 0 | 1h05m | 120.9M | 842K | 121.8M
2026-03-19 04:11 | OpenClaw (2026.3.11) | openrouter | minimax | MiniMax M2.7 | - | 41.6% | 37 | 51 | 3600s | 3 | 1 | 1h45m | 126.5M | 2.3M | 128.8M
2026-03-19 03:54 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 nano | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 19.4M | 143K | 19.6M
2026-03-19 03:24 | OpenClaw (2026.3.11) | mistral | mistral | Mistral Small 4 119B A6B | - | 15.7% | 14 | 75 | 3600s | 7 | 0 | 1h23m | 115.9M | 758K | 116.7M
2026-03-19 02:53 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 nano | - | 16.9% | 15 | 74 | 3600s | 1 | 0 | 1h00m | 13.4M | 123K | 13.5M
2026-03-18 08:05 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 mini | - | 10.1% | 9 | 80 | 3600s | 2 | 0 | 1h02m | 20.3M | 170K | 20.5M
2026-03-18 07:03 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 mini | - | 14.6% | 13 | 76 | 3600s | 3 | 0 | 1h02m | 16.7M | 159K | 16.8M
2026-03-18 06:01 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 mini | - | 14.6% | 13 | 76 | 3600s | 2 | 0 | 1h02m | 21.0M | 164K | 21.2M
2026-03-18 04:58 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 mini | - | 18.0% | 16 | 73 | 3600s | 1 | 0 | 1h02m | 19.4M | 162K | 19.6M
2026-03-18 03:56 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 mini | - | 13.5% | 12 | 77 | 3600s | 1 | 0 | 1h02m | 16.9M | 155K | 17.1M
2026-03-16 13:58 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 43 | 3600s | 13 | 2 | 2h15m | 361.5M | 2.7M | 364.2M
2026-03-16 11:53 | Terminus-2 (2.0.0) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 14 | 0 | 2h03m | 285.8M | 2.5M | 288.3M
2026-03-16 10:27 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 7 | 0 | 1h25m | 117.6M | 3.4M | 121.0M
2026-03-16 09:17 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 46.1% | 41 | 48 | 3600s | 6 | 0 | 1h10m | 65.6M | 2.5M | 68.1M
2026-03-16 07:45 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 47.2% | 42 | 47 | 3600s | 10 | 0 | 1h31m | 72.0M | 2.8M | 74.8M
2026-03-16 06:04 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 49.4% | 44 | 44 | 3600s | 9 | 1 | 1h40m | 87.2M | 2.7M | 89.9M
2026-03-16 04:23 | OpenClaw (2026.3.11) | openrouter | z-ai | GLM-5-Turbo | - | 43.8% | 39 | 49 | 3600s | 6 | 1 | 1h40m | 116.1M | 3.4M | 119.5M
2026-03-16 01:49 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 31.5% | 28 | 61 | 3600s | 21 | 0 | 2h01m | 153.9M | 4.0M | 157.9M
2026-03-15 23:55 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h54m | 150.4M | 3.9M | 154.3M
2026-03-15 22:13 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 31.5% | 28 | 61 | 3600s | 16 | 0 | 1h41m | 132.0M | 3.9M | 135.9M
2026-03-15 20:17 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 38.2% | 34 | 55 | 3600s | 19 | 0 | 1h55m | 151.3M | 4.0M | 155.3M
2026-03-15 18:06 | Terminus-2 (2.0.0) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 39.3% | 35 | 54 | 3600s | 22 | 0 | 2h10m | 177.4M | 4.3M | 181.7M
2026-03-15 01:29 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 56.2% | 50 | 38 | 3600s | 7 | 1 | 1h40m | 75.5M | 1.4M | 76.9M
2026-03-14 23:48 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 52.8% | 47 | 41 | 3600s | 10 | 1 | 1h40m | 87.4M | 1.7M | 89.1M
2026-03-14 22:09 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 36 | 3600s | 8 | 0 | 1h39m | 76.4M | 1.7M | 78.1M
2026-03-14 20:31 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 37 | 3600s | 7 | 0 | 1h37m | 90.1M | 1.6M | 91.7M
2026-03-14 19:40 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 19.1% | 17 | 71 | 3600s | 4 | 1 | 1h40m | 72.6M | 765K | 73.3M
2026-03-14 18:50 | OpenClaw (2026.3.11) | anthropic | anthropic | Claude Opus 4.6 | max | 59.6% | 53 | 35 | 3600s | 5 | 1 | 1h40m | 100.0M | 1.9M | 101.8M
2026-03-14 18:18 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 23.6% | 21 | 68 | 3600s | 7 | 0 | 1h22m | 83.9M | 1.0M | 84.9M
2026-03-14 17:45 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 5 | 1 | 1h04m | 146.6M | 1.5M | 148.1M
2026-03-14 17:09 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 16.9% | 15 | 74 | 3600s | 4 | 0 | 1h08m | 54.7M | 773K | 55.4M
2026-03-14 16:38 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 57.3% | 51 | 37 | 3600s | 9 | 1 | 1h07m | 135.0M | 1.5M | 136.5M
2026-03-14 15:33 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 21.3% | 19 | 69 | 3600s | 8 | 1 | 1h35m | 75.5M | 961K | 76.5M
2026-03-14 15:20 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 6 | 1 | 1h17m | 132.5M | 1.8M | 134.3M
2026-03-14 14:20 | OpenClaw (2026.3.1) | wandb | nvidia | NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B] | - | 20.2% | 18 | 70 | 3600s | 4 | 1 | 1h12m | 62.0M | 967K | 63.0M
2026-03-14 14:05 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 34 | 3600s | 7 | 1 | 1h15m | 146.6M | 1.6M | 148.2M
2026-03-14 12:32 | Claude Code (2.1.75) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 9 | 1 | 1h32m | 176.0M | 1.4M | 177.4M
2026-03-14 10:34 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 55.1% | 49 | 39 | 3600s | 21 | 1 | 1h57m | 77.8M | 2.6M | 80.4M
2026-03-14 08:48 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 58.4% | 52 | 36 | 3600s | 16 | 1 | 1h45m | 61.7M | 2.5M | 64.2M
2026-03-14 07:02 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 16 | 0 | 1h45m | 82.0M | 2.3M | 84.3M
2026-03-14 04:57 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 61.8% | 55 | 34 | 3600s | 15 | 0 | 2h04m | 75.0M | 2.3M | 77.4M
2026-03-14 02:54 | Terminus-2 (2.0.0) | anthropic | anthropic | Claude Opus 4.6 | max | 60.7% | 54 | 35 | 3600s | 19 | 0 | 2h02m | 73.6M | 2.3M | 75.9M
2026-03-12 21:45 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | low | 59.6% | 53 | 35 | 3600s | 12 | 1 | 1h40m | 57.9M | 577K | 58.5M
2026-03-12 20:33 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | low | 61.8% | 55 | 34 | 3600s | 11 | 0 | 1h10m | 81.0M | 613K | 81.6M
2026-03-12 19:23 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | low | 57.3% | 51 | 38 | 3600s | 10 | 0 | 1h10m | 70.8M | 602K | 71.4M
2026-03-12 18:18 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | low | 59.6% | 53 | 36 | 3600s | 10 | 0 | 1h05m | 67.8M | 618K | 68.4M
2026-03-12 17:08 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | low | 66.3% | 59 | 30 | 3600s | 10 | 0 | 1h09m | 79.2M | 603K | 79.8M
2026-03-12 12:49 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h12m | 141.0M | 1.7M | 142.7M
2026-03-12 11:27 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 11 | 0 | 1h21m | 135.4M | 1.7M | 137.1M
2026-03-12 10:27 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 67.4% | 60 | 28 | 3600s | 11 | 1 | 2h15m | 14.7M | 6.9M | 21.6M
2026-03-12 10:16 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 13 | 0 | 1h10m | 147.5M | 1.8M | 149.3M
2026-03-12 09:03 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 12 | 0 | 1h12m | 156.4M | 1.9M | 158.3M
2026-03-12 08:39 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 73.0% | 65 | 24 | 3600s | 11 | 0 | 1h47m | 13.5M | 6.3M | 19.9M
2026-03-12 07:38 | OpenClaw (2026.3.11) | openai | openai | GPT-5.4 | xhigh | 71.9% | 64 | 25 | 3600s | 10 | 0 | 1h25m | 145.6M | 1.6M | 147.2M
2026-03-12 06:17 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 64.0% | 57 | 31 | 3600s | 10 | 1 | 2h21m | 12.4M | 6.1M | 18.5M
2026-03-12 04:57 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 70.8% | 63 | 26 | 3600s | 8 | 0 | 1h19m | 10.3M | 5.5M | 15.8M
2026-03-12 03:25 | Terminus-2 (2.0.0) | openai | openai | GPT-5.4 | xhigh | 69.7% | 62 | 27 | 3600s | 11 | 0 | 1h31m | 13.3M | 5.7M | 19.0M
2026-03-10 12:31 | Terminus-2 (2.0.0) | openai | openai | Kimi K2.5 (nvfp4) [W&B] | - | 47.2% | 42 | 47 | 3600s | 12 | 0 | 1h27m | 114.4M | 1.8M | 116.2M
2026-03-10 11:30 | OpenClaw (2026.3.1) | openai | openai | GPT-5.3-Codex | - | 53.9% | 48 | 41 | 3600s | 5 | 0 | 1h04m | 33.1M | 360K | 33.5M
2026-03-10 10:15Terminus-2 (2.0.0)openaiopenaiKimi K2.5 (nvfp4) [W&B]-49.4%44443600s1112h15m140.6M2.2M142.8M
2026-03-10 09:49OpenClaw (2026.3.1)openaiopenaiGPT-5.3-Codex-55.1%49393600s811h40m31.8M339K32.1M
2026-03-10 08:43OpenClaw (2026.3.1)openaiopenaiGPT-5.3-Codex-56.2%50383600s711h05m35.6M356K35.9M
2026-03-10 08:42Terminus-2 (2.0.0)openaiopenaiKimi K2.5 (nvfp4) [W&B]-46.1%41483600s1301h32m138.5M2.0M140.5M
2026-03-10 07:38OpenClaw (2026.3.1)openaiopenaiGPT-5.3-Codex-56.2%50393600s501h04m31.6M371K32.0M
2026-03-10 06:29OpenClaw (2026.3.1)openaiopenaiGPT-5.3-Codex-53.9%48413600s801h09m30.3M351K30.7M
2026-03-10 06:25Terminus-2 (2.0.0)openaiopenaiKimi K2.5 (nvfp4) [W&B]-46.1%41473600s1312h17m116.0M2.1M118.1M
2026-03-10 04:58OpenClaw (2026.3.1)customcustomKimi K2.5 (nvfp4) [W&B]-37.1%33563600s1401h26m181.2M1.5M182.7M
2026-03-10 03:33OpenClaw (2026.3.1)customcustomKimi K2.5 (nvfp4) [W&B]-38.2%34553600s1001h25m144.8M1.3M146.1M
2026-03-10 01:37OpenClaw (2026.3.1)customcustomKimi K2.5 (nvfp4) [W&B]-33.7%30593600s901h55m167.6M1.4M169.0M
2026-03-10 00:20OpenClaw (2026.3.1)customcustomKimi K2.5 (nvfp4) [W&B]-38.2%34553600s1001h16m237.4M1.6M239.1M
2026-03-09 23:12OpenClaw (2026.3.1)customcustomKimi K2.5 (nvfp4) [W&B]-37.1%33553600s1011h07m92.9M1.1M94.0M
2026-03-09 14:05Terminus-2 (2.0.0)openaiopenaiGPT-5.3-Codex-38.2%34553600s1301h14m457.8M622K458.5M
2026-03-09 12:59Terminus-2 (2.0.0)openaiopenaiGPT-5.3-Codex-41.6%37523600s1301h05m478.0M570K478.6M
2026-03-09 11:33Terminus-2 (2.0.0)openaiopenaiGPT-5.3-Codex-38.2%34553600s1701h25m628.9M671K629.5M
2026-03-09 09:42Terminus-2 (2.0.0)openaiopenaiGPT-5.3-Codex-39.3%35533600s1111h50m515.0M646K515.6M
2026-03-09 08:09Terminus-2 (2.0.0)openaiopenaiGPT-5.3-Codex-38.2%34553600s1301h32m480.1M687K480.8M
2026-03-09 07:46OpenClaw (2026.3.1)openaiopenaiGPT-5.4off28.1%25643600s701h04m29.5M288K29.8M
2026-03-09 07:00Terminus-2 (2.0.0)anthropicanthropicClaude Opus 4.6off69.7%62263600s811h08m146.4M1.6M148.0M
2026-03-09 06:38OpenClaw (2026.3.1)openaiopenaiGPT-5.4off32.6%29603600s1101h07m35.7M345K36.1M
2026-03-09 05:53Terminus-2 (2.0.0)anthropicanthropicClaude Opus 4.6off71.9%64253600s501h06m151.5M1.5M153.0M
2026-03-09 05:33OpenClaw (2026.3.1)openaiopenaiGPT-5.4off31.5%28613600s701h04m34.4M326K34.8M
2026-03-09 04:27OpenClaw (2026.3.1)openaiopenaiGPT-5.4off30.3%27623600s701h06m42.8M322K43.1M
2026-03-09 04:13Terminus-2 (2.0.0)anthropicanthropicClaude Opus 4.6off70.8%63253600s811h40m175.6M1.7M177.3M
2026-03-09 03:21OpenClaw (2026.3.1)openaiopenaiGPT-5.4off29.2%26633600s501h05m29.6M333K30.0M
2026-03-09 03:07Terminus-2 (2.0.0)anthropicanthropicClaude Opus 4.6off68.5%61283600s401h05m153.0M1.4M154.4M
2026-03-08 19:15Terminus-2 (2.0.0)wandbmoonshotaiKimi K2.5 (int4) [W&B]-51.7%46423600s1411h26m204.5M1.7M206.1M
2026-03-08 17:46Terminus-2 (2.0.0)wandbmoonshotaiKimi K2.5 (int4) [W&B]-48.3%43463600s1201h29m193.4M1.7M195.1M
2026-03-08 16:04Terminus-2 (2.0.0)wandbmoonshotaiKimi K2.5 (int4) [W&B]-46.1%41483600s1401h41m236.0M1.7M237.7M
2026-03-08 14:26Terminus-2 (2.0.0)anthropicanthropicClaude Opus 4.6off75.3%67223600s301h03m155.2M1.4M156.6M
2026-03-08 13:51Terminus-2 (2.0.0)wandbmoonshotaiKimi K2.5 (int4) [W&B]-49.4%44443600s1312h12m195.4M1.7M197.1M
2026-03-08 12:24Terminus-2 (2.0.0)wandbmoonshotaiKimi K2.5 (int4) [W&B]-46.1%41483600s1501h26m197.7M1.7M199.4M
2026-03-08 12:12Terminus-2 (2.0.0)anthropicanthropicClaude Sonnet 4.6-61.8%55343600s1001h07m259.5M2.2M261.8M
2026-03-08 10:25Terminus-2 (2.0.0)anthropicanthropicClaude Sonnet 4.6-59.6%53363600s701h46m216.2M1.9M218.1M
2026-03-08 09:53OpenClaw (2026.3.1)wandbmoonshotaiKimi K2.5 (int4) [W&B]-37.1%33553600s1312h30m192.2M1.6M193.8M
2026-03-08 09:18Terminus-2 (2.0.0)anthropicanthropicClaude Sonnet 4.6-62.9%56333600s1301h06m192.5M2.0M194.5M
2026-03-08 08:31OpenClaw (2026.3.1)wandbmoonshotaiKimi K2.5 (int4) [W&B]-39.3%35533600s1311h21m188.4M1.6M190.0M
2026-03-08 08:10Terminus-2 (2.0.0)anthropicanthropicClaude Sonnet 4.6-62.9%56333600s601h08m189.4M1.9M191.3M
2026-03-08 07:14OpenClaw (2026.3.1)wandbmoonshotaiKimi K2.5 (int4) [W&B]-34.8%31583600s701h16m228.8M1.6M230.3M
2026-03-08 06:21Terminus-2 (2.0.0)anthropicanthropicClaude Sonnet 4.6-64.0%57323600s801h48m151.2M1.9M153.1M
2026-03-08 05:38OpenClaw (2026.3.1)wandbmoonshotaiKimi K2.5 (int4) [W&B]-44.9%40493600s1101h35m176.6M1.4M178.0M
2026-03-08 05:12OpenClaw (2026.3.1)anthropicanthropicClaude Opus 4.6medium57.3%51383600s1001h09m97.6M1.4M99.0M
2026-03-08 04:09OpenClaw (2026.3.1)wandbmoonshotaiKimi K2.5 (int4) [W&B]-38.2%34543600s611h29m171.4M1.6M173.1M
2026-03-08 04:03OpenClaw (2026.3.1)anthropicanthropicClaude Opus 4.6medium57.3%51383600s701h08m78.6M1.3M79.9M
2026-03-08 02:53OpenClaw (2026.3.1)anthropicanthropicClaude Opus 4.6medium56.2%50393600s1001h10m74.2M1.3M75.6M
2026-03-08 01:44OpenClaw (2026.3.1)anthropicanthropicClaude Opus 4.6medium58.4%52373600s801h08m73.8M1.3M75.0M
2026-03-08 00:37OpenClaw (2026.3.1)anthropicanthropicClaude Opus 4.6medium58.4%52373600s501h07m83.4M1.3M84.7M
2026-03-08 00:33Terminus-2 (2.0.0)openaiopenaiGPT-5.4off44.9%40483600s1212h06m726.8M1.0M727.8M
2026-03-07 23:28OpenClaw (2026.3.1)anthropicanthropicClaude Sonnet 4.6-51.7%46413600s321h08m95.3M2.1M97.5M
2026-03-07 23:25Terminus-2 (2.0.0)openaiopenaiGPT-5.4off43.8%39503600s1401h08m667.0M905K667.9M
2026-03-07 22:16Terminus-2 (2.0.0)openaiopenaiGPT-5.4off42.7%38513600s1201h08m707.0M878K707.9M
2026-03-07 22:07OpenClaw (2026.3.1)anthropicanthropicClaude Sonnet 4.6-55.1%49403600s501h20m86.8M2.0M88.8M
2026-03-07 20:58OpenClaw (2026.3.1)anthropicanthropicClaude Sonnet 4.6-56.2%50393600s301h09m78.6M2.0M80.6M
2026-03-07 20:10Terminus-2 (2.0.0)openaiopenaiGPT-5.4off41.6%37513600s1512h05m759.6M939K760.6M
2026-03-07 19:47OpenClaw (2026.3.1)anthropicanthropicClaude Sonnet 4.6-51.7%46433600s201h10m71.6M2.0M73.5M
2026-03-07 18:57Terminus-2 (2.0.0)openaiopenaiGPT-5.4off47.2%42463600s1411h12m775.5M982K776.5M
2026-03-07 18:15OpenClaw (2026.3.1)anthropicanthropicClaude Sonnet 4.6-48.3%43463600s601h31m115.2M2.3M117.6M
2026-03-07 16:53Claude Code (2.1.63)anthropicanthropicClaude Opus 4.6high67.4%60283600s611h22m222.3M1.2M223.5M
2026-03-07 15:47Claude Code (2.1.63)anthropicanthropicClaude Opus 4.6high62.9%56333600s401h05m195.9M1.6M197.5M
2026-03-07 14:27Claude Code (2.1.63)anthropicanthropicClaude Opus 4.6high58.4%52363600s611h20m169.0M1.2M170.2M
2026-03-07 13:18Claude Code (2.1.63)anthropicanthropicClaude Opus 4.6high59.6%53363600s701h09m188.9M1.2M190.0M
2026-03-07 11:53Claude Code (2.1.63)anthropicanthropicClaude Opus 4.6high67.4%60293600s501h24m209.0M1.4M210.5M
2026-03-07 10:13Claude Code (2.1.63)anthropicanthropicClaude Sonnet 4.6-53.9%48413600s1201h39m202.1M2.1M204.2M
2026-03-07 09:08Claude Code (2.1.63)anthropicanthropicClaude Sonnet 4.6-57.3%51383600s401h05m166.9M1.7M168.5M
2026-03-07 08:02Claude Code (2.1.63)anthropicanthropicClaude Sonnet 4.6-62.9%56333600s301h05m185.9M1.8M187.7M
2026-03-07 06:56Claude Code (2.1.63)anthropicanthropicClaude Sonnet 4.6-57.3%51383600s601h05m210.7M2.2M213.0M
2026-03-07 04:37Claude Code (2.1.63)anthropicanthropicClaude Sonnet 4.6-56.2%50383600s512h19m216.0M2.3M218.3M

About WolfBench

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

Welcome to WolfBench – we’re just getting started. What you see here is an early preview with only a handful of models and agents tested so far. We’re continuously expanding the lineup, running fresh evals, and sharing interesting findings and insights along the way. Watch this space.

AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?

A single score tells you almost nothing.

Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?

WolfBench exists because we got tired of meaningless leaderboards. We wanted to know which model, which agent, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.

What is it?

WolfBench is an evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles. They span the kind of work you’d actually ask an AI agent to do.

The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.

Why WolfBench is different

The Five-Metric Framework

Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Five numbers get a lot closer.

★ Ceiling: What’s theoretically possible?

The union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.

It tells you the theoretical maximum performance this model is capable of with a given agent – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.
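In code, the ceiling is a plain set union. A minimal Python sketch, assuming each run is recorded as a set of solved task IDs; the task names, the three runs, and the six-task total below are hypothetical illustration data, not WolfBench results:

```python
# Hypothetical results: each set holds the IDs of the tasks one run solved.
RUNS = [
    {"A", "C", "D"},  # run 1
    {"A", "B", "D"},  # run 2
    {"A", "D", "E"},  # run 3
]
TOTAL_TASKS = 6  # toy suite of tasks A-F (Terminal-Bench 2.0 has 89)


def ceiling(runs: list[set[str]]) -> set[str]:
    """Return every task solved in at least one run (the union of all runs)."""
    return set().union(*runs)


ever_solved = ceiling(RUNS)
print(sorted(ever_solved), f"{len(ever_solved) / TOTAL_TASKS:.1%}")
# ['A', 'B', 'C', 'D', 'E'] 83.3%
```

Tasks B, C and E each appear in only one run, yet all three count toward the ceiling; task F was never solved, so it stays out.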

▲ Best-of: What’s the peak in a single run?

The highest score from any individual run.

This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.
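As a sketch, best-of is the maximum per-run score, and the consistency check is its distance from the mean. The pass counts below are taken from the five OpenClaw (2026.3.11) / Claude Opus 4.6 / max rows in the run table (scores are passed tasks out of 89); the function names are illustrative, not part of WolfBench's tooling:

```python
from statistics import mean

TOTAL_TASKS = 89  # Terminal-Bench 2.0

# Passed-task counts from five OpenClaw (2026.3.11) / Claude Opus 4.6 / max runs.
passes = [50, 47, 53, 52, 53]


def best_of(passes: list[int]) -> float:
    """Highest single-run score, as a fraction of all tasks."""
    return max(passes) / TOTAL_TASKS


def best_avg_gap(passes: list[int]) -> float:
    """Distance between the peak run and the mean run; bigger means less consistent."""
    return best_of(passes) - mean(passes) / TOTAL_TASKS


print(f"best-of {best_of(passes):.1%}, gap to average {best_avg_gap(passes):.1%}")
```

Here the peak run solves only two tasks more than the average run, which is the kind of small gap the text describes as consistent.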

∅ Average: What can you normally expect?

The mean score across all valid runs.

This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.
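The average itself is just a mean. One common way to judge whether it has enough runs behind it (an assumption here, not necessarily how WolfBench reports stability) is the standard error of the mean, which shrinks with the square root of the number of runs and cannot be estimated from a single run at all. The pass counts are from the five OpenClaw (2026.3.11) / GPT-5.4 / xhigh rows in the run table:

```python
import math
from statistics import mean, stdev
from typing import Optional

TOTAL_TASKS = 89  # Terminal-Bench 2.0


def average_score(passes: list[int]) -> tuple[float, Optional[float]]:
    """Return the mean score and its standard error (None for a single run)."""
    scores = [p / TOTAL_TASKS for p in passes]
    avg = mean(scores)
    if len(scores) < 2:
        return avg, None  # one run says nothing about run-to-run spread
    return avg, stdev(scores) / math.sqrt(len(scores))


# Passed-task counts from five OpenClaw (2026.3.11) / GPT-5.4 / xhigh runs.
avg, se = average_score([63, 64, 63, 62, 64])
print(f"average {avg:.1%} ± {se:.1%} (standard error)")
print(average_score([63]))  # second element is None: no spread estimate
```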

▼ Worst-of: How bad can a single run get?

The lowest score from any individual run.

This is the counterpart to best-of: the worst case for a single run. The gap between worst-of and best-of defines the full score range across all runs. A narrow range means predictable performance; a wide range means you’re rolling dice.
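A minimal sketch of worst-of and the score range, using only the GPT-5.4 runs with thinking off that appear in this part of the run table (five per agent); the helper name is hypothetical:

```python
TOTAL_TASKS = 89  # Terminal-Bench 2.0


def score_range(passes: list[int]) -> tuple[float, float, float]:
    """Return (worst-of, best-of, spread) as fractions of all tasks."""
    worst, best = min(passes), max(passes)
    return worst / TOTAL_TASKS, best / TOTAL_TASKS, (best - worst) / TOTAL_TASKS


# Passed-task counts for GPT-5.4 with thinking off, from the run table.
configs = {
    "OpenClaw (2026.3.1)": [25, 29, 28, 27, 26],
    "Terminus-2 (2.0.0)": [40, 39, 38, 37, 42],
}
for agent, passes in configs.items():
    worst, best, spread = score_range(passes)
    print(f"{agent}: worst-of {worst:.1%}, best-of {best:.1%}, range {spread:.1%}")
```

Both ranges are only a handful of tasks wide, but the same model lands at very different levels depending on the agent, which a single number per model would hide.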

■ Solid: What does it always get right?

Tasks that the model solves in every single run – the rock-solid base with zero variance.

The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and moderate average is often more reliable in practice than one with a high average but low solid base – because you know what you’re getting.
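Where the ceiling is a union, the solid base is an intersection. A minimal Python sketch, again assuming each run is stored as a set of solved task IDs (the data is hypothetical):

```python
# Hypothetical results: each set holds the IDs of the tasks one run solved.
RUNS = [
    {"A", "C", "D"},
    {"A", "B", "D"},
    {"A", "D", "E"},
]


def solid(runs: list[set[str]]) -> set[str]:
    """Return the tasks solved in every run (the intersection of all runs)."""
    if not runs:
        return set()  # no runs yet, so nothing has proven reliable
    return set.intersection(*runs)


print(sorted(solid(RUNS)))  # ['A', 'D']
```

Note that with a single run, every solved task counts as solid, so this metric only becomes meaningful once several runs have been collected.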

Reading the Chart

The five metrics are shown for each model/configuration as stacked bar segments from the rock-solid base up to the ceiling. Optional 3D mode adds token volume as depth: input tokens in front, output tokens behind. The spread between the segments tells you as much as the numbers themselves.
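Stacking works because the five metrics are always ordered: every always-solved task is also solved in the weakest run, and the union of all runs is at least as large as any single run, so solid ≤ worst-of ≤ average ≤ best-of ≤ ceiling. A minimal sketch, assuming each bar segment is drawn as the difference between neighboring metrics (an assumption about the chart, not its actual source code); the task sets are hypothetical:

```python
from statistics import mean


def stacked_segments(runs: list[set[str]], total_tasks: int) -> dict[str, float]:
    """Compute the five metrics, then the height of each stacked bar segment."""
    if not runs:
        raise ValueError("need at least one run")
    counts = [len(run) for run in runs]
    metrics = {
        "solid": len(set.intersection(*runs)) / total_tasks,  # solved every time
        "worst": min(counts) / total_tasks,
        "average": mean(counts) / total_tasks,
        "best": max(counts) / total_tasks,
        "ceiling": len(set().union(*runs)) / total_tasks,  # solved at least once
    }
    values = list(metrics.values())
    # Sanity check: the metrics never decrease from solid up to ceiling.
    assert all(a <= b for a, b in zip(values, values[1:]))
    # Each segment spans from the previous metric (0 for the first) up to its own value.
    segments, previous = {}, 0.0
    for name, value in metrics.items():
        segments[name] = value - previous
        previous = value
    return segments


# Hypothetical task sets for three runs of one configuration, out of 10 tasks.
example_runs = [{"A", "C", "D"}, {"A", "B", "D", "F"}, {"A", "D"}]
for name, height in stacked_segments(example_runs, total_tasks=10).items():
    print(f"{name:>8}: {height:.0%}")
```

A thin "worst" segment here means the weakest run solved little beyond the solid base; tall upper segments signal tasks that are solvable but unreliable.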

The Bottom Line

Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you five angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.

Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.

What’s Next

We will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.

This benchmark offers enormous potential for discovery. For instance: Why does xhigh reasoning improve GPT-5.4’s performance while max effort degrades Opus 4.6’s results? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect the performance of local models in agentic tasks?
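The cost question comes down to simple arithmetic: token spend divided by tasks actually solved. A minimal sketch; the per-million-token prices are HYPOTHETICAL placeholders rather than real list prices, and prompt caching and other discounts are ignored. The token and pass counts come from two GPT-5.3-Codex rows in the run table (one Terminus-2 run, one OpenClaw run), showing how strongly token use can vary; the same formula applies when comparing a cheap model with an expensive one:

```python
from dataclasses import dataclass


@dataclass
class Run:
    input_tokens: float
    output_tokens: float
    passed: int


def cost_per_solved_task(run: Run, usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Token spend divided by tasks solved (ignores caching and other discounts)."""
    cost = (run.input_tokens * usd_per_m_in + run.output_tokens * usd_per_m_out) / 1e6
    return cost / run.passed if run.passed else float("inf")


# HYPOTHETICAL prices in USD per million tokens, for illustration only.
PRICE_IN, PRICE_OUT = 1.00, 10.00

# GPT-5.3-Codex rows from the run table (2026-03-09 14:05 and 2026-03-10 11:30).
terminus = Run(input_tokens=457.8e6, output_tokens=622e3, passed=34)
openclaw = Run(input_tokens=33.1e6, output_tokens=360e3, passed=48)
for name, run in [("Terminus-2", terminus), ("OpenClaw", openclaw)]:
    print(f"{name}: ${cost_per_solved_task(run, PRICE_IN, PRICE_OUT):.2f} per solved task")
```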

So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.

Inference and sandbox compute sponsored by CoreWeave: The Essential Cloud for AI.
Additional sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.
Explore the complete data and tooling suite on our WolfBench GitHub.