WolfBench

Wolfram Ravenwolf’s Five-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report a single average. WolfBench shows five metrics that tell the full story – from the rock-solid base of tasks solved every time, through the average, up to the ceiling of everything ever solved – plus the best and worst single runs that frame the spread. Together, they reveal what no single number can: how consistent an AI agent truly is.

★ Ceiling (ever solved) · ▲ Best-of (peak run) · ∅ Average (mean score) · ▼ Worst-of (lowest run) · ■ Solid (always solved)
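As a rough illustration of how the five metrics relate to one another, here is a minimal Python sketch. It assumes each run is recorded as the set of task IDs solved in that run and that every metric is normalized by the total number of tasks. The function name `five_metrics` and the toy data are hypothetical and not taken from WolfBench itself.

```python
from statistics import mean


def five_metrics(runs: list[set[str]], all_tasks: set[str]) -> dict[str, float]:
    """Compute the five metrics from per-run sets of solved task IDs.

    Assumes at least one run and normalizes by the number of tasks.
    """
    n = len(all_tasks)
    per_run = [len(run & all_tasks) / n for run in runs]
    solid = set.intersection(*runs) & all_tasks   # solved in every run
    ceiling = set.union(*runs) & all_tasks        # solved in at least one run
    return {
        "solid": len(solid) / n,       # ■ always solved
        "worst": min(per_run),         # ▼ lowest single run
        "average": mean(per_run),      # ∅ mean over runs
        "best": max(per_run),          # ▲ peak single run
        "ceiling": len(ceiling) / n,   # ★ ever solved
    }


# Hypothetical toy data: 4 tasks, 3 runs.
tasks = {"a", "b", "c", "d"}
runs = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
m = five_metrics(runs, tasks)
```

Under these definitions the ordering solid ≤ worst ≤ average ≤ best ≤ ceiling always holds. The always-solved set is a subset of every run, and every run is a subset of the ever-solved set.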




T2 = Terminus-2 · CC = Claude Code · HA = Hermes Agent · OC = OpenClaw · CA = Cursor


GPT-5.5

T2
🧠 xhigh

■60% ▼74% ∅77% ▲80% ★92%

T2
🧠 medium

■49% ▼67% ∅69% ▲72% ★87%

HA
v2026.4.23
🧠 xhigh

■61% ▼72% ∅74% ▲78% ★85%

HA
v2026.4.23
🧠 medium

■55% ▼71% ∅73% ▲75% ★85%

OC
2026.4.23
🧠 off

■52% ▼65% ∅70% ▲74% ★88%

CA
2026.04.17
🧠 high

■64% ▼73% ∅77% ▲80% ★90%

Gemini 3.5 Flash

T2
🧠 high

■53% ▼69% ∅71% ▲75% ★89%

T2
🧠 medium

■51% ▼60% ∅71% ▲79% ★87%

HA
v2026.5.16
🧠 high

■49% ▼69% ∅70% ▲72% ★87%

HA
v2026.5.16
🧠 medium

■51% ▼66% ∅70% ▲75% ★87%

HA
v2026.5.16
🧠 low

■45% ▼58% ∅64% ▲67% ★82%

OC
2026.4.23
🧠 high

■55% ▼65% ∅72% ▲81% ★92%

OC
2026.4.23
🧠 low

■49% ▼66% ∅69% ▲73% ★85%

OC
2026.4.23
🧠 medium

■49% ▼63% ∅68% ▲73% ★83%

Claude Opus 4.7

T2
🧠 off

■57% ▼71% ∅71% ▲73% ★80%

CC
2.1.112
🧠 xhigh

■56% ▼73% ∅73% ▲74% ★87%

HA
v2026.3.30
🧠 off

■44% ▼63% ∅66% ▲71% ★83%

OC
2026.3.11
🧠 off

■54% ▼67% ∅75% ▲81% ★91%

Claude Opus 4.6

T2
🧠 off

■55% ▼69% ∅71% ▲75% ★84%

T2
🧠 max

■46% ▼55% ∅59% ▲62% ★74%

CC
2.1.63
🧠 high

■45% ▼58% ∅63% ▲67% ★81%

CC
2.1.75
🧠 max

■44% ▼57% ∅60% ▲61% ★72%

HA
v2026.3.30
🧠 off

■49% ▼62% ∅64% ▲67% ★83%

OC
2026.3.1
🧠 medium

■42% ▼56% ∅58% ▲58% ★74%

OC
2026.3.11
🧠 max

■39% ▼53% ∅57% ▲60% ★74%

CA
2026.04.16
🧠 high

■44% ▼57% ∅63% ▲67% ★82%

GPT-5.4

T2
🧠 xhigh

■48% ▼64% ∅69% ▲73% ★83%

T2
🧠 off

■28% ▼42% ∅44% ▲47% ★61%

HA
v2026.3.30
🧠 medium

■47% ▼64% ∅66% ▲71% ★83%

OC
2026.3.11
🧠 xhigh

■52% ▼70% ∅71% ▲72% ★85%

OC
2026.3.11
🧠 low

■45% ▼57% ∅61% ▲66% ★76%

OC
2026.3.1
🧠 off

■9% ▼28% ∅30% ▲33% ★52%

Claude Sonnet 4.6

■42% ▼60% ∅62% ▲64% ★81%

CC
2.1.63

■40% ▼54% ∅58% ▲63% ★75%

OC
2026.3.1

■36% ▼48% ∅53% ▲56% ★70%

Kimi K2.6 [W&B]

■46% ▼54% ∅60% ▲66% ★71%

HA
v2026.3.30

■33% ▼48% ∅56% ▲62% ★73%

Kimi K2.6 [Moonshot AI]

■39% ▼55% ∅59% ▲63% ★73%

HA
v2026.3.30

■13% ▼20% ∅47% ▲64% ★72%

OC
2026.3.11

■42% ▼54% ∅59% ▲63% ★72%

DeepSeek-V4-Pro [W&B]

■45% ▼55% ∅57% ▲60% ★70%

HA
v2026.3.30

■17% ▼29% ∅35% ▲42% ★54%

MiniMax M2.7

■31% ▼47% ∅52% ▲55% ★66%

OC
2026.3.11

■27% ▼42% ∅46% ▲49% ★65%

Gemini 3.1 Pro Preview

■30% ▼48% ∅52% ▲56% ★69%

OC
2026.3.11

■39% ▼56% ∅59% ▲63% ★74%

DeepSeek-V4-Flash [W&B]

■31% ▼48% ∅51% ▲53% ★66%

HA
v2026.3.30

■20% ▼42% ∅43% ▲46% ★65%

Kimi K2.5 (int4) [W&B]

■31% ▼46% ∅48% ▲52% ★63%

OC
2026.3.1

■13% ▼35% ∅39% ▲45% ★58%

GLM-5-Turbo

■42% ▼46% ∅48% ▲49% ★54%

OC
2026.3.11

■26% ▼44% ∅47% ▲49% ★70%

Kimi K2.5 (nvfp4) [W&B]

■29% ▼46% ∅47% ▲49% ★64%

HA
v2026.3.30

■22% ▼39% ∅41% ▲45% ★58%

OC
2026.3.1

■15% ▼34% ∅37% ▲38% ★61%

GLM-5-FP8 [W&B]

■28% ▼44% ∅47% ▲52% ★63%

OC
2026.3.11

■17% ▼31% ∅37% ▲39% ★53%

MiniMax M2.5 [W&B]

■27% ▼42% ∅47% ▲51% ★60%

OC
2026.3.11

■20% ▼33% ∅37% ▲43% ★54%

Gemini 3 Flash Preview

■24% ▼42% ∅44% ▲48% ★64%

OC
2026.3.11

■22% ▼36% ∅41% ▲46% ★60%

GLM-5.1 [W&B]

■28% ▼39% ∅42% ▲47% ★58%

HA
v2026.3.30

■29% ▼42% ∅44% ▲47% ★60%

OC
2026.3.11

■12% ▼26% ∅33% ▲39% ★55%

GPT-5.3-Codex

■22% ▼38% ∅39% ▲42% ★57%

OC
2026.3.1

■39% ▼54% ∅55% ▲56% ★73%

NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]

■18% ▼31% ∅36% ▲39% ★53%

OC
2026.3.1

■8% ▼17% ∅20% ▲24% ★38%

Gemma 4 31B [W&B]

■19% ▼30% ∅31% ▲33% ★45%

OC
2026.3.11

■8% ▼17% ∅18% ▲19% ★27%

GPT‑5.4 mini

■17% ▼26% ∅26% ▲27% ★36%

OC
2026.3.11

■4% ▼10% ∅14% ▲18% ★28%

Gemini 3.1 Flash Lite Preview

■10% ▼21% ∅25% ▲28% ★42%

OC
2026.3.11

■11% ▼20% ∅23% ▲26% ★38%

Mistral Small 4 119B A6B

■16% ▼21% ∅24% ▲26% ★33%

OC
2026.3.11

■10% ▼16% ∅17% ▲18% ★25%

GPT‑5.4 nano

■9% ▼20% ∅22% ▲24% ★37%

OC
2026.3.11

■7% ▼12% ∅14% ▲17% ★24%

Run Details (375 runs)

Across these runs, 88 (99%) of the 89 tasks were solved at least once, 0 (0%) were solved in every run, and 1 (1%) was never solved.
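The per-run rows in the table below can be rolled up into the worst, average, and best figures shown in the chart. The following Python sketch assumes that a run's score is its Pass count divided by the 89 tasks, and that runs are grouped by model, agent, and thinking level. The helper name `spread` is hypothetical. The five pass counts are copied from the OpenClaw (2026.4.23) / Gemini 3.5 Flash / low rows.

```python
from collections import defaultdict
from statistics import mean

TOTAL_TASKS = 89  # task count stated in the run summary

# (agent, model, think, pass count), copied from five rows of the run table
rows = [
    ("OpenClaw (2026.4.23)", "Gemini 3.5 Flash", "low", 65),
    ("OpenClaw (2026.4.23)", "Gemini 3.5 Flash", "low", 59),
    ("OpenClaw (2026.4.23)", "Gemini 3.5 Flash", "low", 61),
    ("OpenClaw (2026.4.23)", "Gemini 3.5 Flash", "low", 64),
    ("OpenClaw (2026.4.23)", "Gemini 3.5 Flash", "low", 60),
]


def spread(rows):
    """Group runs by (model, agent, think) and summarize their scores in percent."""
    groups = defaultdict(list)
    for agent, model, think, passed in rows:
        groups[(model, agent, think)].append(100 * passed / TOTAL_TASKS)
    return {
        key: {"worst": min(v), "average": mean(v), "best": max(v)}
        for key, v in groups.items()
    }


result = spread(rows)
for key, stats in result.items():
    print(key, {name: round(value, 1) for name, value in stats.items()})
```

For this group the sketch gives roughly 66.3 (worst), 69.4 (average), and 73.0 (best), which appears consistent with the ▼66% ∅69% ▲73% chart entry for Gemini 3.5 Flash on OC with low thinking. The ■ and ★ values cannot be derived this way, because they need per-task results that the table does not list.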

Date	Agent	Provider	Vendor	Model	Think	Score	Pass	Fail	Timeout limit	Timeouts	Errors	Duration	Tokens in	Tokens out	Tokens total
2026-05-28 01:28	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	low	73.0%	65	23	3600s	6	1	2h12m	204.3M	1.8M	206.1M
2026-05-27 22:50	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	low	66.3%	59	30	3600s	4	0	2h37m	175.5M	1.8M	177.3M
2026-05-27 19:09	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	low	68.5%	61	28	3600s	5	0	3h40m	165.0M	1.7M	166.7M
2026-05-27 16:15	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	low	71.9%	64	24	3600s	7	1	2h54m	233.7M	1.8M	235.4M
2026-05-27 13:11	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	low	67.4%	60	29	3600s	7	0	3h03m	355.3M	2.4M	357.7M
2026-05-27 06:24	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	low	65.2%	58	31	3600s	0	0	1h25m	132.7M	625K	133.3M
2026-05-27 05:16	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	low	58.4%	52	37	3600s	0	0	1h08m	119.6M	555K	120.2M
2026-05-27 03:56	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	low	67.4%	60	29	3600s	0	0	1h19m	136.5M	504K	137.0M
2026-05-27 02:44	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	low	62.9%	56	33	3600s	0	0	1h12m	148.3M	597K	148.9M
2026-05-27 01:35	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	low	66.3%	59	30	3600s	0	0	1h08m	122.1M	553K	122.7M
2026-05-26 13:31	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	high	70.8%	63	26	3600s	4	0	2h28m	221.9M	1.9M	223.8M
2026-05-26 10:25	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	high	73.0%	65	23	3600s	6	1	3h05m	276.3M	1.8M	278.1M
2026-05-26 06:45	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	high	80.9%	72	17	3600s	8	0	3h40m	257.8M	1.9M	259.7M
2026-05-26 04:22	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	high	71.9%	64	25	3600s	5	0	2h22m	175.5M	1.8M	177.3M
2026-05-26 00:39	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	high	65.2%	58	31	3600s	5	0	3h42m	233.7M	1.8M	235.5M
2026-05-25 09:27	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	high	68.5%	61	28	3600s	1	0	1h06m	541.4M	2.9M	544.4M
2026-05-25 08:22	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	high	69.7%	62	27	3600s	4	0	1h05m	801.8M	3.9M	805.7M
2026-05-25 07:15	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	high	69.7%	62	27	3600s	2	0	1h06m	485.4M	3.1M	488.5M
2026-05-25 05:40	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	high	69.7%	62	27	3600s	0	0	1h35m	176.5M	751K	177.3M
2026-05-25 04:13	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	high	70.8%	63	26	3600s	0	0	1h27m	190.3M	728K	191.0M
2026-05-25 02:55	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	high	70.8%	63	26	3600s	0	0	1h17m	178.1M	774K	178.8M
2026-05-25 01:27	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	high	68.5%	61	28	3600s	0	0	1h27m	181.0M	682K	181.6M
2026-05-25 00:56	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	high	74.2%	66	23	3600s	0	0	0h31m	400.8M	2.8M	403.6M
2026-05-24 16:15	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	high	71.9%	64	25	3600s	0	0	1h33m	186.1M	724K	186.8M
2026-05-24 15:09	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	high	75.3%	67	22	3600s	3	0	1h06m	861.9M	3.0M	864.9M
2026-05-24 01:13	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	medium	66.3%	59	30	3600s	5	0	2h13m	180.2M	1.5M	181.7M
2026-05-23 22:30	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	medium	62.9%	56	32	3600s	7	1	2h43m	285.6M	2.0M	287.5M
2026-05-23 19:46	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	medium	73.0%	65	24	3600s	7	0	2h44m	160.7M	1.7M	162.3M
2026-05-23 17:25	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	medium	71.9%	64	25	3600s	6	0	2h21m	228.9M	1.9M	230.8M
2026-05-23 15:04	OpenClaw (2026.4.23)	google	google	Gemini 3.5 Flash	medium	66.3%	59	30	3600s	5	0	2h20m	154.4M	1.5M	155.9M
2026-05-21 10:01	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	medium	75.3%	67	21	3600s	0	1	2h08m	167.6M	764K	168.3M
2026-05-21 07:56	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	medium	67.4%	60	29	3600s	0	0	2h04m	169.9M	742K	170.6M
2026-05-21 05:55	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	medium	70.8%	63	26	3600s	0	0	2h00m	198.7M	716K	199.5M
2026-05-21 03:55	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	medium	66.3%	59	28	3600s	0	2	1h59m	169.4M	629K	170.0M
2026-05-21 01:55	Hermes Agent (v2026.5.16)	google	google	Gemini 3.5 Flash	medium	69.7%	62	27	3600s	0	0	1h59m	168.0M	715K	168.7M
2026-05-20 13:37	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	medium	59.6%	53	34	3600s	1	2	1h10m	541.6M	2.4M	544.0M
2026-05-20 03:05	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	medium	70.8%	63	23	3600s	2	3	1h11m	774.4M	3.4M	777.7M
2026-05-20 02:04	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	medium	71.9%	64	23	3600s	0	2	1h00m	315.4M	2.3M	317.7M
2026-05-19 22:53	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	medium	71.9%	64	23	3600s	2	2	2h00m	682.9M	2.9M	685.8M
2026-05-19 19:27	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.5 Flash	medium	78.7%	70	18	3600s	2	1	1h05m	495.9M	2.5M	498.4M
2026-05-19 00:29	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.6 [W&B]	-	60.7%	54	34	3600s	9	1	1h30m	212.0M	4.6M	216.6M
2026-05-16 12:03	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.6 [W&B]	-	57.3%	51	37	3600s	8	1	1h01m	243.5M	5.9M	249.4M
2026-05-16 09:02	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.6 [W&B]	-	59.6%	53	36	3600s	5	0	3h00m	77.8M	5.1M	82.9M
2026-05-16 08:01	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.6 [W&B]	-	62.9%	56	32	3600s	8	1	1h01m	193.1M	5.9M	198.9M
2026-05-16 04:31	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.6 [W&B]	-	48.3%	43	46	3600s	5	0	3h29m	77.7M	5.7M	83.4M
2026-05-16 03:29	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.6 [W&B]	-	53.9%	48	41	3600s	10	0	1h01m	228.5M	5.1M	233.7M
2026-05-16 00:40	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.6 [W&B]	-	60.7%	54	35	3600s	6	0	2h49m	76.6M	4.9M	81.5M
2026-05-15 23:39	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.6 [W&B]	-	66.3%	59	29	3600s	7	1	1h01m	191.8M	5.4M	197.2M
2026-05-15 20:38	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.6 [W&B]	-	49.4%	44	45	3600s	4	0	3h00m	78.8M	4.8M	83.6M
2026-05-15 13:10	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.6 [W&B]	-	61.8%	55	34	3600s	5	0	3h08m	84.0M	5.7M	89.8M
2026-05-14 12:49	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	59.6%	53	34	3600s	19	2	1h42m	154.8M	2.0M	156.8M
2026-05-14 09:14	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	32.6%	29	59	3600s	9	1	3h34m	65.8M	798K	66.6M
2026-05-14 07:01	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	56.2%	50	38	3600s	22	1	2h12m	143.3M	1.9M	145.2M
2026-05-14 03:32	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	41.6%	37	52	3600s	15	0	3h29m	63.7M	761K	64.5M
2026-05-14 01:42	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	55.1%	49	40	3600s	21	0	1h50m	135.4M	1.9M	137.3M
2026-05-13 22:37	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	34.8%	31	58	3600s	6	0	3h04m	57.0M	603K	57.6M
2026-05-13 20:17	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	55.1%	49	38	3600s	30	2	2h19m	125.9M	1.7M	127.6M
2026-05-13 16:48	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	29.2%	26	56	3600s	6	7	3h29m	35.3M	422K	35.7M
2026-05-13 14:25	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	59.6%	53	32	3600s	20	4	2h14m	139.1M	2.0M	141.2M
2026-05-13 11:19	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Pro [W&B]	-	34.8%	31	51	3600s	4	7	2h49m	66.0M	793K	66.8M
2026-05-12 16:25	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	41.6%	37	52	3600s	3	0	4h15m	70.6M	973K	71.6M
2026-05-12 00:12	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	52.8%	47	38	3600s	14	4	1h38m	223.0M	2.7M	225.6M
2026-05-11 11:10	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	48.3%	43	45	3600s	18	1	1h56m	227.0M	2.8M	229.9M
2026-05-11 07:04	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	42.7%	38	50	3600s	7	1	4h05m	92.8M	1.2M	94.0M
2026-05-11 05:15	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	48.3%	43	45	3600s	13	1	1h49m	204.9M	2.6M	207.5M
2026-05-11 01:37	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	42.7%	38	50	3600s	3	1	3h37m	81.6M	1.1M	82.7M
2026-05-10 23:22	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	51.7%	46	42	3600s	11	1	2h15m	269.6M	2.8M	272.5M
2026-05-10 19:55	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	43.8%	39	49	3600s	6	1	3h27m	88.2M	1.2M	89.4M
2026-05-10 18:16	Terminus-2 (2.0.0)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	51.7%	46	42	3600s	14	1	1h38m	221.5M	2.5M	224.0M
2026-05-10 13:47	Hermes Agent (v2026.3.30)	wandb	deepseek-ai	DeepSeek-V4-Flash [W&B]	-	46.1%	41	46	3600s	3	2	4h28m	88.5M	1.1M	89.6M
2026-04-27 06:24	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	xhigh	73.0%	65	24	3600s	1	0	1h07m	-	-	-
2026-04-27 05:15	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	xhigh	77.5%	69	20	3600s	1	0	1h08m	-	-	-
2026-04-27 04:09	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	xhigh	71.9%	64	25	3600s	1	0	1h06m	-	-	-
2026-04-27 03:00	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	xhigh	74.2%	66	23	3600s	1	0	1h08m	-	-	-
2026-04-27 01:53	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	xhigh	74.2%	66	23	3600s	1	0	1h07m	-	-	-
2026-04-26 11:01	Cursor (2026.04.17)	cursor	cursor	GPT-5.5	high	79.8%	71	18	3600s	1	0	1h06m	97.5M	705K	98.2M
2026-04-26 10:17	Cursor (2026.04.17)	cursor	cursor	GPT-5.5	high	74.2%	66	23	3600s	0	0	0h44m	100.1M	755K	100.9M
2026-04-26 08:17	Cursor (2026.04.17)	cursor	cursor	GPT-5.5	high	79.8%	71	17	3600s	2	1	1h59m	92.1M	659K	92.8M
2026-04-26 06:59	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	medium	70.8%	63	26	3600s	1	0	1h05m	-	-	-
2026-04-26 06:21	Cursor (2026.04.17)	cursor	cursor	GPT-5.5	high	79.8%	71	18	3600s	1	0	1h56m	85.7M	665K	86.3M
2026-04-26 05:54	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	medium	70.8%	63	26	3600s	1	0	1h04m	-	-	-
2026-04-26 04:58	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	medium	75.3%	67	22	3600s	0	0	0h55m	-	-	-
2026-04-26 04:05	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	medium	71.9%	64	25	3600s	0	0	0h53m	-	-	-
2026-04-26 03:34	Cursor (2026.04.17)	cursor	cursor	GPT-5.5	high	73.0%	65	23	3600s	1	1	2h46m	84.3M	617K	85.0M
2026-04-26 03:09	Hermes Agent (v2026.4.23)	openai	openai	GPT-5.5	medium	75.3%	67	22	3600s	0	0	0h56m	-	-	-
2026-04-25 21:26	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	xhigh	75.3%	67	22	3600s	5	0	1h15m	21.9M	2.3M	24.2M
2026-04-25 20:08	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	xhigh	79.8%	71	17	3600s	7	1	1h17m	29.7M	2.7M	32.4M
2026-04-25 18:28	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	xhigh	74.2%	66	22	3600s	7	1	1h40m	24.5M	2.3M	26.8M
2026-04-25 17:15	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	xhigh	78.7%	70	19	3600s	5	0	1h12m	27.8M	2.4M	30.2M
2026-04-25 15:58	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	xhigh	76.4%	68	21	3600s	4	0	1h16m	20.3M	2.4M	22.7M
2026-04-25 12:44	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	medium	68.5%	61	28	3600s	2	0	1h02m	33.7M	888K	34.6M
2026-04-25 11:43	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	medium	71.9%	64	25	3600s	1	0	1h00m	30.7M	834K	31.5M
2026-04-25 10:02	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	medium	68.5%	61	27	3600s	3	1	1h40m	122.3M	902K	123.2M
2026-04-25 08:56	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	medium	67.4%	60	29	3600s	2	0	1h06m	25.6M	818K	26.4M
2026-04-25 07:46	Terminus-2 (2.0.0)	openai	openai	GPT-5.5	medium	69.7%	62	27	3600s	4	0	1h09m	58.3M	1.0M	59.3M
2026-04-25 06:39	OpenClaw (2026.4.23)	openai	openai	GPT-5.5	off	65.2%	58	31	3600s	6	0	1h06m	16.3M	189K	16.5M
2026-04-25 05:32	OpenClaw (2026.4.23)	openai	openai	GPT-5.5	off	70.8%	63	26	3600s	5	0	1h06m	15.2M	150K	15.3M
2026-04-25 04:25	OpenClaw (2026.4.23)	openai	openai	GPT-5.5	off	69.7%	62	27	3600s	6	0	1h07m	18.7M	198K	18.9M
2026-04-25 02:45	OpenClaw (2026.4.23)	openai	openai	GPT-5.5	off	74.2%	66	22	3600s	7	1	1h40m	21.2M	183K	21.4M
2026-04-25 01:50	Hermes Agent (v2026.3.30)	wandbqa	moonshotai	Kimi K2.6 [Moonshot AI]	-	20.2%	18	70	3600s	55	1	3h09m	-	-	-
2026-04-25 01:38	OpenClaw (2026.4.23)	openai	openai	GPT-5.5	off	71.9%	64	25	3600s	6	0	1h06m	24.8M	176K	25.0M
2026-04-24 21:17	Hermes Agent (v2026.3.30)	wandbqa	moonshotai	Kimi K2.6 [Moonshot AI]	-	37.1%	33	56	3600s	49	0	4h32m	-	-	-
2026-04-23 10:40	Hermes Agent (v2026.3.30)	moonshotai	moonshotai	Kimi K2.6 [Moonshot AI]	-	57.3%	51	38	3600s	13	0	3h50m	-	-	-
2026-04-23 06:22	Hermes Agent (v2026.3.30)	moonshotai	moonshotai	Kimi K2.6 [Moonshot AI]	-	64.0%	57	32	3600s	14	0	4h17m	-	-	-
2026-04-23 02:01	Hermes Agent (v2026.3.30)	wandb	zai-org	GLM-5.1 [W&B]	-	41.6%	37	47	3600s	6	5	5h48m	72.9M	768K	73.6M
2026-04-23 01:45	Hermes Agent (v2026.3.30)	moonshotai	moonshotai	Kimi K2.6 [Moonshot AI]	-	57.3%	51	35	3600s	13	3	4h36m	-	-	-
2026-04-22 20:52	Hermes Agent (v2026.3.30)	wandb	zai-org	GLM-5.1 [W&B]	-	47.2%	42	44	3600s	4	3	5h08m	63.4M	779K	64.1M
2026-04-22 14:28	Hermes Agent (v2026.3.30)	wandb	zai-org	GLM-5.1 [W&B]	-	42.7%	38	45	3600s	4	6	6h23m	60.4M	708K	61.1M
2026-04-22 08:12	Hermes Agent (v2026.3.30)	wandb	zai-org	GLM-5.1 [W&B]	-	42.7%	38	44	3600s	4	7	6h16m	70.9M	789K	71.7M
2026-04-21 18:57	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.7	off	65.2%	58	30	3600s	3	1	4h41m	-	-	-
2026-04-21 18:27	Terminus-2 (2.0.0)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	60.7%	54	35	3600s	15	0	1h55m	117.5M	3.2M	120.7M
2026-04-21 15:55	Terminus-2 (2.0.0)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	55.1%	49	39	3600s	12	1	2h32m	108.5M	2.8M	111.3M
2026-04-21 14:42	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.7	off	65.2%	58	30	3600s	6	1	4h15m	-	-	-
2026-04-21 14:07	Terminus-2 (2.0.0)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	62.9%	56	33	3600s	14	0	1h47m	112.2M	2.8M	115.0M
2026-04-21 11:47	Terminus-2 (2.0.0)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	58.4%	52	37	3600s	13	0	2h20m	100.0M	2.8M	102.9M
2026-04-21 10:08	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.7	off	62.9%	56	30	3600s	7	3	4h33m	-	-	-
2026-04-21 09:59	Terminus-2 (2.0.0)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	57.3%	51	38	3600s	14	0	1h47m	108.4M	3.0M	111.4M
2026-04-21 06:45	OpenClaw (2026.3.11)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	53.9%	48	40	3600s	13	1	3h13m	168.9M	3.8M	172.7M
2026-04-21 05:29	OpenClaw (2026.3.11)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	56.2%	50	39	3600s	11	0	1h15m	171.8M	3.4M	175.2M
2026-04-21 05:00	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.7	off	67.4%	60	28	3600s	5	1	5h07m	-	-	-
2026-04-21 03:49	OpenClaw (2026.3.11)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	62.9%	56	32	3600s	12	1	1h40m	203.6M	4.2M	207.8M
2026-04-21 02:32	OpenClaw (2026.3.11)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	60.7%	54	35	3600s	11	0	1h16m	161.6M	3.5M	165.1M
2026-04-21 00:51	OpenClaw (2026.3.11)	openrouter	moonshotai	Kimi K2.6 [Moonshot AI]	-	59.6%	53	35	3600s	14	1	1h41m	212.0M	4.3M	216.3M
2026-04-21 00:32	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.7	off	70.8%	63	25	3600s	3	1	4h28m	-	-	-
2026-04-18 01:01	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.7	off	76.4%	68	20	3600s	7	1	1h40m	122.7M	1.3M	124.0M
2026-04-17 23:30	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.7	off	70.8%	63	26	3600s	2	0	1h03m	112.0M	998K	113.0M
2026-04-17 22:38	Cursor (2026.04.16)	cursor	cursor	Claude Opus 4.6	high	61.8%	55	33	3600s	10	1	1h18m	80.4M	863K	81.3M
2026-04-17 21:52	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.7	off	77.5%	69	20	3600s	7	0	1h38m	215.6M	1.7M	217.3M
2026-04-17 21:15	Cursor (2026.04.16)	cursor	cursor	Claude Opus 4.6	high	60.7%	54	34	3600s	8	1	1h22m	94.3M	918K	95.2M
2026-04-17 20:47	Claude Code (2.1.112)	anthropic	anthropic	Claude Opus 4.7	xhigh	73.0%	65	24	3600s	3	0	1h05m	219.0M	2.5M	221.5M
2026-04-17 19:42	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.7	off	73.0%	65	24	3600s	1	0	1h03m	144.3M	1.0M	145.4M
2026-04-17 18:31	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.7	off	67.4%	60	29	3600s	4	0	1h10m	154.1M	1.4M	155.5M
2026-04-17 18:04	Cursor (2026.04.16)	cursor	cursor	Claude Opus 4.6	high	57.3%	51	37	3600s	9	1	1h16m	74.6M	1.2M	75.9M
2026-04-17 16:57	Cursor (2026.04.16)	cursor	cursor	Claude Opus 4.6	high	66.3%	59	29	3600s	6	1	1h06m	119.3M	1.5M	120.8M
2026-04-17 16:52	Claude Code (2.1.112)	anthropic	anthropic	Claude Opus 4.7	xhigh	73.0%	65	24	3600s	6	0	1h39m	214.0M	2.3M	216.3M
2026-04-17 12:52	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.7	off	70.8%	63	25	3600s	4	1	1h40m	193.8M	1.3M	195.1M
2026-04-17 11:48	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.7	off	74.2%	66	23	3600s	3	0	1h04m	143.9M	1.3M	145.2M
2026-04-17 11:37	Cursor (2026.04.16)	cursor	cursor	Claude Opus 4.6	high	67.4%	60	27	3600s	6	2	1h15m	137.7M	1.6M	139.4M
2026-04-17 10:09	Claude Code (2.1.112)	anthropic	anthropic	Claude Opus 4.7	xhigh	73.0%	65	24	3600s	7	0	1h39m	219.8M	2.3M	222.1M
2026-04-17 08:08	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.7	off	71.9%	64	24	3600s	5	1	2h01m	156.6M	1.2M	157.7M
2026-04-17 05:31	Claude Code (2.1.112)	anthropic	anthropic	Claude Opus 4.7	xhigh	74.2%	66	18	3600s	5	5	1h19m	269.4M	2.5M	271.9M
2026-04-17 04:04	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.7	off	70.8%	63	25	3600s	3	1	1h27m	139.4M	1.2M	140.5M
2026-04-17 02:46	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.7	off	80.9%	72	17	3600s	5	0	1h17m	129.8M	1.2M	131.0M
2026-04-17 01:18	Claude Code (2.1.112)	anthropic	anthropic	Claude Opus 4.7	xhigh	74.2%	66	23	3600s	5	0	1h28m	237.8M	2.4M	240.1M
2026-04-15 13:59	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5-FP8 [W&B]	-	44.9%	40	47	3600s	17	2	2h18m	89.9M	1.6M	91.5M
2026-04-15 07:56	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5-FP8 [W&B]	-	48.3%	43	44	3600s	20	2	2h35m	101.4M	1.7M	103.2M
2026-04-15 06:01	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5-FP8 [W&B]	-	47.2%	42	46	3600s	20	1	1h54m	88.3M	1.7M	90.0M
2026-04-15 04:01	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5-FP8 [W&B]	-	51.7%	46	42	3600s	14	1	1h59m	72.8M	1.6M	74.4M
2026-04-15 01:29	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5-FP8 [W&B]	-	43.8%	39	48	3600s	19	2	2h32m	94.1M	1.7M	95.8M
2026-04-15 00:23	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5.1 [W&B]	-	31.5%	28	61	3600s	2	0	1h04m	24.0M	461K	24.5M
2026-04-14 21:17	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5.1 [W&B]	-	43.8%	39	45	3600s	29	5	2h45m	25.5M	937K	26.5M
2026-04-14 18:51	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5.1 [W&B]	-	41.6%	37	52	3600s	38	0	2h25m	24.0M	925K	24.9M
2026-04-14 17:10	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5.1 [W&B]	-	36.0%	32	56	3600s	7	1	1h40m	27.0M	502K	27.5M
2026-04-14 07:52	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5.1 [W&B]	-	39.3%	35	54	3600s	44	0	3h21m	13.1M	642K	13.7M
2026-04-14 03:58	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5.1 [W&B]	-	40.4%	36	50	3600s	37	3	3h54m	21.3M	868K	22.2M
2026-04-14 01:20	Terminus-2 (2.0.0)	wandb	zai-org	GLM-5.1 [W&B]	-	47.2%	42	46	3600s	40	1	2h38m	22.8M	856K	23.7M
2026-04-13 23:48	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5.1 [W&B]	-	25.8%	23	66	3600s	4	0	1h31m	22.3M	395K	22.7M
2026-04-13 22:07	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5.1 [W&B]	-	30.3%	27	61	3600s	5	1	1h40m	18.5M	414K	18.9M
2026-04-13 20:26	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5.1 [W&B]	-	39.3%	35	53	3600s	10	1	1h40m	51.5M	658K	52.2M
2026-04-09 11:12	Terminus-2 (2.0.0)	wandb	google	Gemma 4 31B [W&B]	-	31.5%	28	59	3600s	8	2	1h41m	163.6M	1.3M	164.8M
2026-04-09 09:31	Terminus-2 (2.0.0)	wandb	google	Gemma 4 31B [W&B]	-	31.5%	28	60	3600s	13	1	1h40m	217.1M	1.5M	218.6M
2026-04-09 08:03	Terminus-2 (2.0.0)	wandb	google	Gemma 4 31B [W&B]	-	30.3%	27	62	3600s	11	0	1h27m	222.8M	1.2M	224.0M
2026-04-09 06:49	Terminus-2 (2.0.0)	wandb	google	Gemma 4 31B [W&B]	-	32.6%	29	59	3600s	12	1	1h13m	188.2M	1.5M	189.7M
2026-04-09 05:08	OpenClaw (2026.3.11)	wandb	google	Gemma 4 31B [W&B]	-	19.1%	17	71	3600s	5	1	1h40m	147.0M	1.3M	148.3M
2026-04-09 03:57	OpenClaw (2026.3.11)	wandb	google	Gemma 4 31B [W&B]	-	18.0%	16	73	3600s	2	0	1h11m	203.5M	1.5M	205.0M
2026-04-09 01:52	OpenClaw (2026.3.11)	wandb	google	Gemma 4 31B [W&B]	-	19.1%	17	71	3600s	8	1	2h04m	200.5M	1.4M	201.9M
2026-04-09 00:25	OpenClaw (2026.3.11)	wandb	google	Gemma 4 31B [W&B]	-	16.9%	15	74	3600s	3	0	1h26m	179.6M	1.6M	181.2M
2026-04-08 23:18	OpenClaw (2026.3.11)	wandb	google	Gemma 4 31B [W&B]	-	18.0%	16	73	3600s	7	0	1h07m	124.2M	1.5M	125.7M
2026-04-06 07:52	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Flash Lite Preview	-	24.7%	22	66	3600s	3	1	1h40m	172.0M	2.2M	174.2M
2026-04-06 06:11	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Flash Lite Preview	-	28.1%	25	63	3600s	2	1	1h40m	174.2M	1.5M	175.7M
2026-04-06 05:06	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Flash Lite Preview	-	25.8%	23	66	3600s	2	0	1h05m	96.6M	2.0M	98.5M
2026-04-06 03:25	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Flash Lite Preview	-	21.3%	19	69	3600s	2	1	1h40m	156.2M	1.9M	158.1M
2026-04-06 02:22	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Flash Lite Preview	-	25.8%	23	66	3600s	2	0	1h02m	223.3M	2.4M	225.7M
2026-04-06 01:18	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Flash Lite Preview	-	20.2%	18	71	3600s	5	0	1h04m	239.4M	837K	240.3M
2026-04-05 23:37	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Flash Lite Preview	-	21.3%	19	69	3600s	7	1	1h40m	177.6M	745K	178.4M
2026-04-05 21:45	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Flash Lite Preview	-	22.5%	20	69	3600s	4	0	1h51m	162.3M	741K	163.0M
2026-04-05 20:04	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Flash Lite Preview	-	24.7%	22	66	3600s	5	1	1h40m	255.9M	830K	256.7M
2026-04-05 17:06	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Flash Lite Preview	-	25.8%	23	66	3600s	6	0	2h57m	123.1M	697K	123.8M
2026-04-05 11:05	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3 Flash Preview	-	41.6%	37	51	3600s	4	1	1h40m	284.7M	1.2M	285.9M
2026-04-05 09:59	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3 Flash Preview	-	41.6%	37	52	3600s	5	0	1h05m	270.6M	1.2M	271.8M
2026-04-05 08:19	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3 Flash Preview	-	46.1%	41	47	3600s	3	1	1h40m	310.4M	1.3M	311.8M
2026-04-05 07:15	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3 Flash Preview	-	48.3%	43	45	3600s	4	1	1h03m	490.5M	1.6M	492.1M
2026-04-05 06:09	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3 Flash Preview	-	43.8%	39	50	3600s	5	0	1h05m	252.4M	1.2M	253.6M
2026-04-05 04:28	OpenClaw (2026.3.11)	google	google	Gemini 3 Flash Preview	-	40.4%	36	52	3600s	9	1	1h40m	210.1M	472K	210.6M
2026-04-05 02:47	OpenClaw (2026.3.11)	google	google	Gemini 3 Flash Preview	-	36.0%	32	56	3600s	7	1	1h40m	377.6M	653K	378.3M
2026-04-05 01:06	OpenClaw (2026.3.11)	google	google	Gemini 3 Flash Preview	-	46.1%	41	47	3600s	9	1	1h40m	265.0M	753K	265.8M
2026-04-04 23:25	OpenClaw (2026.3.11)	google	google	Gemini 3 Flash Preview	-	40.4%	36	52	3600s	7	1	1h40m	210.4M	670K	211.1M
2026-04-04 21:44	OpenClaw (2026.3.11)	google	google	Gemini 3 Flash Preview	-	40.4%	36	52	3600s	9	1	1h40m	920.6M	1.2M	921.8M
2026-04-03 11:44	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Pro Preview	-	50.6%	45	44	3600s	0	0	0h31m	15.4M	653K	16.1M
2026-04-03 11:08	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Pro Preview	-	56.2%	50	39	3600s	0	0	0h35m	15.6M	568K	16.2M
2026-04-03 10:41	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Pro Preview	-	52.8%	47	42	3600s	0	0	0h26m	23.7M	700K	24.4M
2026-04-03 09:01	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Pro Preview	-	50.6%	45	43	3600s	1	1	1h40m	13.1M	634K	13.7M
2026-04-03 07:58	Terminus-2 (2.0.0)	gemini	gemini	Gemini 3.1 Pro Preview	-	48.3%	43	45	3600s	1	1	1h02m	19.4M	670K	20.1M
2026-04-03 06:53	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Pro Preview	-	59.6%	53	36	3600s	6	0	1h05m	228.7M	638K	229.3M
2026-04-03 05:46	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Pro Preview	-	57.3%	51	38	3600s	5	0	1h06m	226.5M	748K	227.2M
2026-04-03 04:05	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Pro Preview	-	60.7%	54	34	3600s	7	1	1h40m	131.0M	652K	131.7M
2026-04-03 02:24	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Pro Preview	-	62.9%	56	32	3600s	8	1	1h40m	239.2M	696K	239.9M
2026-04-03 01:18	OpenClaw (2026.3.11)	google	google	Gemini 3.1 Pro Preview	-	56.2%	50	39	3600s	6	0	1h06m	102.8M	485K	103.3M
2026-04-02 12:00	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.6	off	61.8%	55	31	3600s	5	3	6h17m	68.4M	1.0M	69.4M
2026-04-02 07:39	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	39.3%	35	52	3600s	6	2	5h23m	75.8M	1.6M	77.4M
2026-04-02 07:08	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.6	off	64.0%	57	32	3600s	5	0	4h51m	75.6M	1.0M	76.7M
2026-04-02 05:24	Hermes Agent (v2026.3.30)	openai	openai	GPT-5.4	medium	70.8%	63	26	3600s	2	0	2h38m	80.2M	996K	81.2M
2026-04-02 03:25	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	40.4%	36	52	3600s	6	1	4h13m	82.0M	1.7M	83.7M
2026-04-02 02:55	Hermes Agent (v2026.3.30)	openai	openai	GPT-5.4	medium	65.2%	58	31	3600s	2	0	2h28m	70.8M	960K	71.8M
2026-04-02 00:41	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.6	off	67.4%	60	29	3600s	5	0	6h25m	76.1M	1.2M	77.3M
2026-04-01 23:30	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	39.3%	35	53	3600s	5	1	3h54m	65.4M	1.3M	66.6M
2026-04-01 20:09	Hermes Agent (v2026.3.30)	openai	openai	GPT-5.4	medium	66.3%	59	30	3600s	3	0	2h22m	86.4M	1.0M	87.4M
2026-04-01 19:52	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	44.9%	40	49	3600s	4	0	3h38m	77.7M	1.6M	79.3M
2026-04-01 19:49	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.6	off	61.8%	55	34	3600s	4	0	4h51m	69.4M	1.1M	70.5M
2026-04-01 17:47	Hermes Agent (v2026.3.30)	openai	openai	GPT-5.4	medium	65.2%	58	30	3600s	3	1	2h21m	67.2M	900K	68.1M
2026-04-01 14:45	Hermes Agent (v2026.3.30)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	42.7%	38	50	3600s	7	1	5h06m	88.1M	1.6M	89.7M
2026-04-01 14:44	Hermes Agent (v2026.3.30)	openai	openai	GPT-5.4	medium	64.0%	57	31	3600s	1	1	3h02m	64.5M	847K	65.3M
2026-04-01 14:44	Hermes Agent (v2026.3.30)	anthropic	anthropic	Claude Opus 4.6	off	64.0%	57	31	3600s	3	1	5h04m	69.5M	1.1M	70.5M
2026-03-29 07:01	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5-FP8 [W&B]	-	39.3%	35	45	3600s	6	9	2h57m	133.6M	1.1M	134.7M
2026-03-29 04:05	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5-FP8 [W&B]	-	37.1%	33	51	3600s	1	5	2h55m	91.3M	923K	92.3M
2026-03-29 01:00	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5-FP8 [W&B]	-	38.2%	34	50	3600s	3	5	3h04m	104.7M	861K	105.6M
2026-03-27 19:54	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5-FP8 [W&B]	-	31.5%	28	53	3600s	5	8	3h07m	102.7M	923K	103.6M
2026-03-27 16:16	OpenClaw (2026.3.11)	wandb	zai-org	GLM-5-FP8 [W&B]	-	37.1%	33	47	3600s	2	9	3h37m	90.2M	797K	91.0M
2026-03-27 13:20	OpenClaw (2026.3.11)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	42.7%	38	47	3600s	0	4	2h55m	69.4M	984K	70.4M
2026-03-27 11:08	OpenClaw (2026.3.11)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	37.1%	33	49	3600s	1	7	2h12m	72.7M	1.0M	73.7M
2026-03-27 08:24	OpenClaw (2026.3.11)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	37.1%	33	50	3600s	0	6	2h42m	66.5M	885K	67.4M
2026-03-27 06:53	Terminus-2 (2.0.0)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	50.6%	45	42	3600s	24	2	1h31m	74.9M	1.4M	76.4M
2026-03-27 04:47	Terminus-2 (2.0.0)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	43.8%	39	45	3600s	25	5	2h05m	114.6M	1.6M	116.2M
2026-03-27 02:58	Terminus-2 (2.0.0)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	49.4%	44	43	3600s	17	2	1h48m	84.9M	1.5M	86.4M
2026-03-26 12:31	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (nvfp4) [W&B]	-	47.2%	42	45	3600s	7	2	1h47m	306.4M	2.0M	308.4M
2026-03-26 09:37	OpenClaw (2026.3.11)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	33.7%	30	53	3600s	3	6	2h51m	72.7M	969K	73.7M
2026-03-26 07:14	OpenClaw (2026.3.11)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	32.6%	29	52	3600s	1	8	2h22m	68.1M	996K	69.1M
2026-03-26 06:07	Terminus-2 (2.0.0)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	41.6%	37	50	3600s	31	2	1h06m	64.1M	1.5M	65.5M
2026-03-26 04:26	Terminus-2 (2.0.0)	wandb	MiniMaxAI	MiniMax M2.5 [W&B]	-	49.4%	44	42	3600s	22	3	1h41m	74.7M	1.4M	76.1M
2026-03-20 06:43	Terminus-2 (2.0.0)	openrouter	minimax	MiniMax M2.7	-	49.4%	44	45	3600s	18	0	1h33m	245.0M	2.4M	247.4M
2026-03-20 03:45	Terminus-2 (2.0.0)	openrouter	minimax	MiniMax M2.7	-	55.1%	49	39	3600s	16	1	2h57m	337.1M	2.5M	339.7M
2026-03-20 02:30	OpenClaw (2026.3.11)	openrouter	minimax	MiniMax M2.7	-	49.4%	44	45	3600s	7	0	1h14m	135.9M	2.4M	138.3M
2026-03-20 00:17	OpenClaw (2026.3.11)	openrouter	minimax	MiniMax M2.7	-	48.3%	43	45	3600s	6	1	2h12m	104.3M	2.2M	106.5M
2026-03-19 13:39	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 nano	-	20.2%	18	71	3600s	32	0	2h01m	1.32B	2.6M	1.32B
2026-03-19 12:31	Terminus-2 (2.0.0)	openrouter	minimax	MiniMax M2.7	-	52.8%	47	42	3600s	15	0	1h31m	317.7M	2.7M	320.4M
2026-03-19 12:08	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 nano	-	23.6%	21	68	3600s	28	0	1h31m	1.08B	2.2M	1.08B
2026-03-19 10:50	Terminus-2 (2.0.0)	openrouter	minimax	MiniMax M2.7	-	47.2%	42	46	3600s	18	1	1h40m	249.8M	2.5M	252.2M
2026-03-19 10:43	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 nano	-	23.6%	21	68	3600s	21	0	1h24m	1.09B	2.0M	1.09B
2026-03-19 09:29	Terminus-2 (2.0.0)	mistral	mistral	Mistral Small 4 119B A6B	-	25.8%	23	59	3600s	1	7	1h54m	147.8M	1.1M	149.0M
2026-03-19 09:19	Terminus-2 (2.0.0)	openrouter	minimax	MiniMax M2.7	-	55.1%	49	40	3600s	19	0	1h30m	192.0M	2.4M	194.5M
2026-03-19 09:18	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 mini	-	27.0%	24	65	3600s	24	0	1h24m	845.5M	1.3M	846.7M
2026-03-19 08:13	Terminus-2 (2.0.0)	mistral	mistral	Mistral Small 4 119B A6B	-	21.3%	19	70	3600s	4	0	1h15m	455.3M	1.6M	456.9M
2026-03-19 07:38	OpenClaw (2026.3.11)	openrouter	minimax	MiniMax M2.7	-	46.1%	41	47	3600s	4	1	1h40m	100.8M	2.3M	103.1M
2026-03-19 07:37	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 mini	-	25.8%	23	65	3600s	21	1	1h40m	810.4M	1.3M	811.7M
2026-03-19 06:59	Terminus-2 (2.0.0)	mistral	mistral	Mistral Small 4 119B A6B	-	23.6%	21	68	3600s	4	0	1h13m	232.2M	1.5M	233.7M
2026-03-19 05:57	OpenClaw (2026.3.11)	openrouter	minimax	MiniMax M2.7	-	42.7%	38	50	3600s	6	1	1h40m	113.1M	2.5M	115.5M
2026-03-19 05:57	Terminus-2 (2.0.0)	openai	openai	GPT‑5.4 mini	-	25.8%	23	66	3600s	17	0	1h39m	847.8M	1.3M	849.1M
2026-03-19 05:53	OpenClaw (2026.3.11)	mistral	mistral	Mistral Small 4 119B A6B	-	18.0%	16	72	3600s	4	1	1h05m	110.5M	772K	111.3M
2026-03-19 04:56	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 nano	-	12.4%	11	78	3600s	1	0	1h00m	25.2M	156K	25.4M
2026-03-19 04:47	OpenClaw (2026.3.11)	mistral	mistral	Mistral Small 4 119B A6B	-	16.9%	15	74	3600s	6	0	1h05m	120.9M	842K	121.8M
2026-03-19 04:11	OpenClaw (2026.3.11)	openrouter	minimax	MiniMax M2.7	-	41.6%	37	51	3600s	3	1	1h45m	126.5M	2.3M	128.8M
2026-03-19 03:54	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 nano	-	13.5%	12	77	3600s	1	0	1h02m	19.4M	143K	19.6M
2026-03-19 03:24	OpenClaw (2026.3.11)	mistral	mistral	Mistral Small 4 119B A6B	-	15.7%	14	75	3600s	7	0	1h23m	115.9M	758K	116.7M
2026-03-19 02:53	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 nano	-	16.9%	15	74	3600s	1	0	1h00m	13.4M	123K	13.5M
2026-03-18 08:05	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 mini	-	10.1%	9	80	3600s	2	0	1h02m	20.3M	170K	20.5M
2026-03-18 07:03	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 mini	-	14.6%	13	76	3600s	3	0	1h02m	16.7M	159K	16.8M
2026-03-18 06:01	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 mini	-	14.6%	13	76	3600s	2	0	1h02m	21.0M	164K	21.2M
2026-03-18 04:58	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 mini	-	18.0%	16	73	3600s	1	0	1h02m	19.4M	162K	19.6M
2026-03-18 03:56	OpenClaw (2026.3.11)	openai	openai	GPT‑5.4 mini	-	13.5%	12	77	3600s	1	0	1h02m	16.9M	155K	17.1M
2026-03-16 13:58	Terminus-2 (2.0.0)	openrouter	z-ai	GLM-5-Turbo	-	49.4%	44	43	3600s	13	2	2h15m	361.5M	2.7M	364.2M
2026-03-16 11:53	Terminus-2 (2.0.0)	openrouter	z-ai	GLM-5-Turbo	-	46.1%	41	48	3600s	14	0	2h03m	285.8M	2.5M	288.3M
2026-03-16 10:27	OpenClaw (2026.3.11)	openrouter	z-ai	GLM-5-Turbo	-	47.2%	42	47	3600s	7	0	1h25m	117.6M	3.4M	121.0M
2026-03-16 09:17	OpenClaw (2026.3.11)	openrouter	z-ai	GLM-5-Turbo	-	46.1%	41	48	3600s	6	0	1h10m	65.6M	2.5M	68.1M
2026-03-16 07:45	OpenClaw (2026.3.11)	openrouter	z-ai	GLM-5-Turbo	-	47.2%	42	47	3600s	10	0	1h31m	72.0M	2.8M	74.8M
2026-03-16 06:04	OpenClaw (2026.3.11)	openrouter	z-ai	GLM-5-Turbo	-	49.4%	44	44	3600s	9	1	1h40m	87.2M	2.7M	89.9M
2026-03-16 04:23	OpenClaw (2026.3.11)	openrouter	z-ai	GLM-5-Turbo	-	43.8%	39	49	3600s	6	1	1h40m	116.1M	3.4M	119.5M
2026-03-16 01:49	Terminus-2 (2.0.0)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	31.5%	28	61	3600s	21	0	2h01m	153.9M	4.0M	157.9M
2026-03-15 23:55	Terminus-2 (2.0.0)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	38.2%	34	55	3600s	19	0	1h54m	150.4M	3.9M	154.3M
2026-03-15 22:13	Terminus-2 (2.0.0)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	31.5%	28	61	3600s	16	0	1h41m	132.0M	3.9M	135.9M
2026-03-15 20:17	Terminus-2 (2.0.0)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	38.2%	34	55	3600s	19	0	1h55m	151.3M	4.0M	155.3M
2026-03-15 18:06	Terminus-2 (2.0.0)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	39.3%	35	54	3600s	22	0	2h10m	177.4M	4.3M	181.7M
2026-03-15 01:29	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.6	max	56.2%	50	38	3600s	7	1	1h40m	75.5M	1.4M	76.9M
2026-03-14 23:48	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.6	max	52.8%	47	41	3600s	10	1	1h40m	87.4M	1.7M	89.1M
2026-03-14 22:09	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.6	max	59.6%	53	36	3600s	8	0	1h39m	76.4M	1.7M	78.1M
2026-03-14 20:31	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.6	max	58.4%	52	37	3600s	7	0	1h37m	90.1M	1.6M	91.7M
2026-03-14 19:40	OpenClaw (2026.3.1)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	19.1%	17	71	3600s	4	1	1h40m	72.6M	765K	73.3M
2026-03-14 18:50	OpenClaw (2026.3.11)	anthropic	anthropic	Claude Opus 4.6	max	59.6%	53	35	3600s	5	1	1h40m	100.0M	1.9M	101.8M
2026-03-14 18:18	OpenClaw (2026.3.1)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	23.6%	21	68	3600s	7	0	1h22m	83.9M	1.0M	84.9M
2026-03-14 17:45	Claude Code (2.1.75)	anthropic	anthropic	Claude Opus 4.6	max	60.7%	54	34	3600s	5	1	1h04m	146.6M	1.5M	148.1M
2026-03-14 17:09	OpenClaw (2026.3.1)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	16.9%	15	74	3600s	4	0	1h08m	54.7M	773K	55.4M
2026-03-14 16:38	Claude Code (2.1.75)	anthropic	anthropic	Claude Opus 4.6	max	57.3%	51	37	3600s	9	1	1h07m	135.0M	1.5M	136.5M
2026-03-14 15:33	OpenClaw (2026.3.1)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	21.3%	19	69	3600s	8	1	1h35m	75.5M	961K	76.5M
2026-03-14 15:20	Claude Code (2.1.75)	anthropic	anthropic	Claude Opus 4.6	max	60.7%	54	34	3600s	6	1	1h17m	132.5M	1.8M	134.3M
2026-03-14 14:20	OpenClaw (2026.3.1)	wandb	nvidia	NVIDIA-Nemotron-3-Super-120B-A12B-FP8 [W&B]	-	20.2%	18	70	3600s	4	1	1h12m	62.0M	967K	63.0M
2026-03-14 14:05	Claude Code (2.1.75)	anthropic	anthropic	Claude Opus 4.6	max	60.7%	54	34	3600s	7	1	1h15m	146.6M	1.6M	148.2M
2026-03-14 12:32	Claude Code (2.1.75)	anthropic	anthropic	Claude Opus 4.6	max	58.4%	52	36	3600s	9	1	1h32m	176.0M	1.4M	177.4M
2026-03-14 10:34	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	max	55.1%	49	39	3600s	21	1	1h57m	77.8M	2.6M	80.4M
2026-03-14 08:48	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	max	58.4%	52	36	3600s	16	1	1h45m	61.7M	2.5M	64.2M
2026-03-14 07:02	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	max	60.7%	54	35	3600s	16	0	1h45m	82.0M	2.3M	84.3M
2026-03-14 04:57	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	max	61.8%	55	34	3600s	15	0	2h04m	75.0M	2.3M	77.4M
2026-03-14 02:54	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	max	60.7%	54	35	3600s	19	0	2h02m	73.6M	2.3M	75.9M
2026-03-12 21:45	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	low	59.6%	53	35	3600s	12	1	1h40m	57.9M	577K	58.5M
2026-03-12 20:33	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	low	61.8%	55	34	3600s	11	0	1h10m	81.0M	613K	81.6M
2026-03-12 19:23	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	low	57.3%	51	38	3600s	10	0	1h10m	70.8M	602K	71.4M
2026-03-12 18:18	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	low	59.6%	53	36	3600s	10	0	1h05m	67.8M	618K	68.4M
2026-03-12 17:08	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	low	66.3%	59	30	3600s	10	0	1h09m	79.2M	603K	79.8M
2026-03-12 12:49	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	xhigh	70.8%	63	26	3600s	13	0	1h12m	141.0M	1.7M	142.7M
2026-03-12 11:27	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	xhigh	71.9%	64	25	3600s	11	0	1h21m	135.4M	1.7M	137.1M
2026-03-12 10:27	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	xhigh	67.4%	60	28	3600s	11	1	2h15m	14.7M	6.9M	21.6M
2026-03-12 10:16	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	xhigh	70.8%	63	26	3600s	13	0	1h10m	147.5M	1.8M	149.3M
2026-03-12 09:03	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	xhigh	69.7%	62	27	3600s	12	0	1h12m	156.4M	1.9M	158.3M
2026-03-12 08:39	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	xhigh	73.0%	65	24	3600s	11	0	1h47m	13.5M	6.3M	19.9M
2026-03-12 07:38	OpenClaw (2026.3.11)	openai	openai	GPT-5.4	xhigh	71.9%	64	25	3600s	10	0	1h25m	145.6M	1.6M	147.2M
2026-03-12 06:17	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	xhigh	64.0%	57	31	3600s	10	1	2h21m	12.4M	6.1M	18.5M
2026-03-12 04:57	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	xhigh	70.8%	63	26	3600s	8	0	1h19m	10.3M	5.5M	15.8M
2026-03-12 03:25	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	xhigh	69.7%	62	27	3600s	11	0	1h31m	13.3M	5.7M	19.0M
2026-03-10 12:31	Terminus-2 (2.0.0)	openai	openai	Kimi K2.5 (nvfp4) [W&B]	-	47.2%	42	47	3600s	12	0	1h27m	114.4M	1.8M	116.2M
2026-03-10 11:30	OpenClaw (2026.3.1)	openai	openai	GPT-5.3-Codex	-	53.9%	48	41	3600s	5	0	1h04m	33.1M	360K	33.5M
2026-03-10 10:15	Terminus-2 (2.0.0)	openai	openai	Kimi K2.5 (nvfp4) [W&B]	-	49.4%	44	44	3600s	11	1	2h15m	140.6M	2.2M	142.8M
2026-03-10 09:49	OpenClaw (2026.3.1)	openai	openai	GPT-5.3-Codex	-	55.1%	49	39	3600s	8	1	1h40m	31.8M	339K	32.1M
2026-03-10 08:43	OpenClaw (2026.3.1)	openai	openai	GPT-5.3-Codex	-	56.2%	50	38	3600s	7	1	1h05m	35.6M	356K	35.9M
2026-03-10 08:42	Terminus-2 (2.0.0)	openai	openai	Kimi K2.5 (nvfp4) [W&B]	-	46.1%	41	48	3600s	13	0	1h32m	138.5M	2.0M	140.5M
2026-03-10 07:38	OpenClaw (2026.3.1)	openai	openai	GPT-5.3-Codex	-	56.2%	50	39	3600s	5	0	1h04m	31.6M	371K	32.0M
2026-03-10 06:29	OpenClaw (2026.3.1)	openai	openai	GPT-5.3-Codex	-	53.9%	48	41	3600s	8	0	1h09m	30.3M	351K	30.7M
2026-03-10 06:25	Terminus-2 (2.0.0)	openai	openai	Kimi K2.5 (nvfp4) [W&B]	-	46.1%	41	47	3600s	13	1	2h17m	116.0M	2.1M	118.1M
2026-03-10 04:58	OpenClaw (2026.3.1)	custom	custom	Kimi K2.5 (nvfp4) [W&B]	-	37.1%	33	56	3600s	14	0	1h26m	181.2M	1.5M	182.7M
2026-03-10 03:33	OpenClaw (2026.3.1)	custom	custom	Kimi K2.5 (nvfp4) [W&B]	-	38.2%	34	55	3600s	10	0	1h25m	144.8M	1.3M	146.1M
2026-03-10 01:37	OpenClaw (2026.3.1)	custom	custom	Kimi K2.5 (nvfp4) [W&B]	-	33.7%	30	59	3600s	9	0	1h55m	167.6M	1.4M	169.0M
2026-03-10 00:20	OpenClaw (2026.3.1)	custom	custom	Kimi K2.5 (nvfp4) [W&B]	-	38.2%	34	55	3600s	10	0	1h16m	237.4M	1.6M	239.1M
2026-03-09 23:12	OpenClaw (2026.3.1)	custom	custom	Kimi K2.5 (nvfp4) [W&B]	-	37.1%	33	55	3600s	10	1	1h07m	92.9M	1.1M	94.0M
2026-03-09 14:05	Terminus-2 (2.0.0)	openai	openai	GPT-5.3-Codex	-	38.2%	34	55	3600s	13	0	1h14m	457.8M	622K	458.5M
2026-03-09 12:59	Terminus-2 (2.0.0)	openai	openai	GPT-5.3-Codex	-	41.6%	37	52	3600s	13	0	1h05m	478.0M	570K	478.6M
2026-03-09 11:33	Terminus-2 (2.0.0)	openai	openai	GPT-5.3-Codex	-	38.2%	34	55	3600s	17	0	1h25m	628.9M	671K	629.5M
2026-03-09 09:42	Terminus-2 (2.0.0)	openai	openai	GPT-5.3-Codex	-	39.3%	35	53	3600s	11	1	1h50m	515.0M	646K	515.6M
2026-03-09 08:09	Terminus-2 (2.0.0)	openai	openai	GPT-5.3-Codex	-	38.2%	34	55	3600s	13	0	1h32m	480.1M	687K	480.8M
2026-03-09 07:46	OpenClaw (2026.3.1)	openai	openai	GPT-5.4	off	28.1%	25	64	3600s	7	0	1h04m	29.5M	288K	29.8M
2026-03-09 07:00	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	off	69.7%	62	26	3600s	8	1	1h08m	146.4M	1.6M	148.0M
2026-03-09 06:38	OpenClaw (2026.3.1)	openai	openai	GPT-5.4	off	32.6%	29	60	3600s	11	0	1h07m	35.7M	345K	36.1M
2026-03-09 05:53	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	off	71.9%	64	25	3600s	5	0	1h06m	151.5M	1.5M	153.0M
2026-03-09 05:33	OpenClaw (2026.3.1)	openai	openai	GPT-5.4	off	31.5%	28	61	3600s	7	0	1h04m	34.4M	326K	34.8M
2026-03-09 04:27	OpenClaw (2026.3.1)	openai	openai	GPT-5.4	off	30.3%	27	62	3600s	7	0	1h06m	42.8M	322K	43.1M
2026-03-09 04:13	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	off	70.8%	63	25	3600s	8	1	1h40m	175.6M	1.7M	177.3M
2026-03-09 03:21	OpenClaw (2026.3.1)	openai	openai	GPT-5.4	off	29.2%	26	63	3600s	5	0	1h05m	29.6M	333K	30.0M
2026-03-09 03:07	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	off	68.5%	61	28	3600s	4	0	1h05m	153.0M	1.4M	154.4M
2026-03-08 19:15	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	51.7%	46	42	3600s	14	1	1h26m	204.5M	1.7M	206.1M
2026-03-08 17:46	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	48.3%	43	46	3600s	12	0	1h29m	193.4M	1.7M	195.1M
2026-03-08 16:04	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	46.1%	41	48	3600s	14	0	1h41m	236.0M	1.7M	237.7M
2026-03-08 14:26	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Opus 4.6	off	75.3%	67	22	3600s	3	0	1h03m	155.2M	1.4M	156.6M
2026-03-08 13:51	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	49.4%	44	44	3600s	13	1	2h12m	195.4M	1.7M	197.1M
2026-03-08 12:24	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	46.1%	41	48	3600s	15	0	1h26m	197.7M	1.7M	199.4M
2026-03-08 12:12	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Sonnet 4.6	-	61.8%	55	34	3600s	10	0	1h07m	259.5M	2.2M	261.8M
2026-03-08 10:25	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Sonnet 4.6	-	59.6%	53	36	3600s	7	0	1h46m	216.2M	1.9M	218.1M
2026-03-08 09:53	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	37.1%	33	55	3600s	13	1	2h30m	192.2M	1.6M	193.8M
2026-03-08 09:18	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Sonnet 4.6	-	62.9%	56	33	3600s	13	0	1h06m	192.5M	2.0M	194.5M
2026-03-08 08:31	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	39.3%	35	53	3600s	13	1	1h21m	188.4M	1.6M	190.0M
2026-03-08 08:10	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Sonnet 4.6	-	62.9%	56	33	3600s	6	0	1h08m	189.4M	1.9M	191.3M
2026-03-08 07:14	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	34.8%	31	58	3600s	7	0	1h16m	228.8M	1.6M	230.3M
2026-03-08 06:21	Terminus-2 (2.0.0)	anthropic	anthropic	Claude Sonnet 4.6	-	64.0%	57	32	3600s	8	0	1h48m	151.2M	1.9M	153.1M
2026-03-08 05:38	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	44.9%	40	49	3600s	11	0	1h35m	176.6M	1.4M	178.0M
2026-03-08 05:12	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Opus 4.6	medium	57.3%	51	38	3600s	10	0	1h09m	97.6M	1.4M	99.0M
2026-03-08 04:09	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi K2.5 (int4) [W&B]	-	38.2%	34	54	3600s	6	1	1h29m	171.4M	1.6M	173.1M
2026-03-08 04:03	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Opus 4.6	medium	57.3%	51	38	3600s	7	0	1h08m	78.6M	1.3M	79.9M
2026-03-08 02:53	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Opus 4.6	medium	56.2%	50	39	3600s	10	0	1h10m	74.2M	1.3M	75.6M
2026-03-08 01:44	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Opus 4.6	medium	58.4%	52	37	3600s	8	0	1h08m	73.8M	1.3M	75.0M
2026-03-08 00:37	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Opus 4.6	medium	58.4%	52	37	3600s	5	0	1h07m	83.4M	1.3M	84.7M
2026-03-08 00:33	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	off	44.9%	40	48	3600s	12	1	2h06m	726.8M	1.0M	727.8M
2026-03-07 23:28	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Sonnet 4.6	-	51.7%	46	41	3600s	3	2	1h08m	95.3M	2.1M	97.5M
2026-03-07 23:25	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	off	43.8%	39	50	3600s	14	0	1h08m	667.0M	905K	667.9M
2026-03-07 22:16	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	off	42.7%	38	51	3600s	12	0	1h08m	707.0M	878K	707.9M
2026-03-07 22:07	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Sonnet 4.6	-	55.1%	49	40	3600s	5	0	1h20m	86.8M	2.0M	88.8M
2026-03-07 20:58	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Sonnet 4.6	-	56.2%	50	39	3600s	3	0	1h09m	78.6M	2.0M	80.6M
2026-03-07 20:10	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	off	41.6%	37	51	3600s	15	1	2h05m	759.6M	939K	760.6M
2026-03-07 19:47	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Sonnet 4.6	-	51.7%	46	43	3600s	2	0	1h10m	71.6M	2.0M	73.5M
2026-03-07 18:57	Terminus-2 (2.0.0)	openai	openai	GPT-5.4	off	47.2%	42	46	3600s	14	1	1h12m	775.5M	982K	776.5M
2026-03-07 18:15	OpenClaw (2026.3.1)	anthropic	anthropic	Claude Sonnet 4.6	-	48.3%	43	46	3600s	6	0	1h31m	115.2M	2.3M	117.6M
2026-03-07 16:53	Claude Code (2.1.63)	anthropic	anthropic	Claude Opus 4.6	high	67.4%	60	28	3600s	6	1	1h22m	222.3M	1.2M	223.5M
2026-03-07 15:47	Claude Code (2.1.63)	anthropic	anthropic	Claude Opus 4.6	high	62.9%	56	33	3600s	4	0	1h05m	195.9M	1.6M	197.5M
2026-03-07 14:27	Claude Code (2.1.63)	anthropic	anthropic	Claude Opus 4.6	high	58.4%	52	36	3600s	6	1	1h20m	169.0M	1.2M	170.2M
2026-03-07 13:18	Claude Code (2.1.63)	anthropic	anthropic	Claude Opus 4.6	high	59.6%	53	36	3600s	7	0	1h09m	188.9M	1.2M	190.0M
2026-03-07 11:53	Claude Code (2.1.63)	anthropic	anthropic	Claude Opus 4.6	high	67.4%	60	29	3600s	5	0	1h24m	209.0M	1.4M	210.5M
2026-03-07 10:13	Claude Code (2.1.63)	anthropic	anthropic	Claude Sonnet 4.6	-	53.9%	48	41	3600s	12	0	1h39m	202.1M	2.1M	204.2M
2026-03-07 09:08	Claude Code (2.1.63)	anthropic	anthropic	Claude Sonnet 4.6	-	57.3%	51	38	3600s	4	0	1h05m	166.9M	1.7M	168.5M
2026-03-07 08:02	Claude Code (2.1.63)	anthropic	anthropic	Claude Sonnet 4.6	-	62.9%	56	33	3600s	3	0	1h05m	185.9M	1.8M	187.7M
2026-03-07 06:56	Claude Code (2.1.63)	anthropic	anthropic	Claude Sonnet 4.6	-	57.3%	51	38	3600s	6	0	1h05m	210.7M	2.2M	213.0M
2026-03-07 04:37	Claude Code (2.1.63)	anthropic	anthropic	Claude Sonnet 4.6	-	56.2%	50	38	3600s	5	1	2h19m	216.0M	2.3M	218.3M

About WolfBench

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

Welcome to WolfBench – we’re just getting started. What you see here is an early preview with only a handful of models and agents tested so far. We’re continuously expanding the lineup, running fresh evals, and sharing interesting findings and insights along the way. Watch this space.

AI agents are becoming essential tools. Every week, a new model comes out and claims to be “the best at coding” or “SOTA on agentic tasks.” But what does that actually mean for you – the person who’s going to throw real work at these things?

A single score tells you almost nothing.

Most benchmarks give you one number: “Model X scored 42% on Benchmark Y.” Great. But can you rely on it? Was that a lucky run? Would it score the same tomorrow? What’s the floor – the tasks it always nails? What’s the ceiling – what it could do if the stars align?

WolfBench exists because we got tired of meaningless leaderboards. We wanted to know which model, which agent, and which settings actually deliver the best results on real agentic tasks – not just on paper, but in practice, consistently, across multiple runs.

What is it?

WolfBench is an evaluation framework built on top of Terminal-Bench 2.0, a popular agentic benchmark consisting of 89 diverse real-world tasks. These aren’t just coding puzzles. They span the kind of work you’d actually ask an AI agent to do:

System administration: headless terminal interaction, Git server configuration, Nginx request logging
DevOps & infrastructure: package distribution search, database WAL recovery, PyPI server setup
Security: code vulnerability fixes, 7z hash cracking, ELF binary extraction, Git leak recovery
Data & ML ops: financial document processing, HuggingFace model inference, scientific stack modernization
Problem solving: constraint scheduling, adaptive rejection sampling, concurrent task cancellation

The key word is agentic: these tasks require the model to plan, execute shell commands, inspect results, debug failures, and iterate – just like a human developer or sysadmin would. No multiple-choice shortcuts. No toy puzzles. Real work in real sandboxed environments.

Why WolfBench is different

Five-metric framework: Instead of a single average score, we report five complementary metrics that together paint a far more complete picture of what an AI agent can actually do – from the always-solved base to the theoretical ceiling.
Uniform conditions: Instead of Terminal-Bench 2.0’s default task-specific timeouts and varying sandbox resources, every task in a run gets the same timeout and identical sandbox resources. This ensures scores reflect model and agent capability – not whether an inference endpoint was temporarily overloaded or a sandbox ran out of memory.
Multi-agent comparison: Same model, different agents. Same agent, different models. Different timeouts, concurrency levels, thinking modes. The goal is to understand what matters – not just what scored highest in one particular instance.
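Multi-run methodology: A single run is weak evidence – variance can swing results by several points. In the results table, for example, MiniMax M2.5 [W&B] on Terminus-2 scores between 41.6% and 50.6% across five runs with the same agent, model, and timeout. We run multiple replicates per configuration to get stable, trustworthy numbers.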
Transparency: Every run is collected, classified, and curated with full metadata: tokens consumed, cache hit rates, duration, timeout, concurrency, agent version, thinking mode, etc. Nothing is hidden.
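
To make the case for replicates concrete, here is a minimal sketch that summarizes five runs of one configuration from the results table (GLM-5-FP8 [W&B] on OpenClaw 2026.3.11, which solved 35, 33, 34, 28 and 33 of the 89 tasks). The function name `summarize_runs` and the use of the sample standard deviation are illustrative assumptions, not WolfBench's actual tooling.

```python
from statistics import mean, stdev

TOTAL_TASKS = 89  # number of tasks in Terminal-Bench 2.0


def summarize_runs(solved_counts, total=TOTAL_TASKS):
    """Summarize per-run scores (in percent) for one configuration.

    `solved_counts` holds the number of solved tasks per run.
    Hypothetical helper for illustration only.
    """
    scores = [100 * n / total for n in solved_counts]
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": stdev(scores),  # sample standard deviation; needs >= 2 runs
    }


# Solved-task counts taken from the results table.
glm5_openclaw = [35, 33, 34, 28, 33]
summary = summarize_runs(glm5_openclaw)
print({name: round(value, 1) for name, value in summary.items()})
```

Any one of these five runs reported alone would have placed the same configuration anywhere between 31.5% and 39.3%, which is why the mean of 36.6% is only meaningful alongside the spread.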

The Five-Metric Framework

Performance is a distribution, not a point. One number can’t capture what an AI agent is truly capable of. Five numbers get a lot closer.

★ Ceiling: What’s theoretically possible?

The union of all tasks ever solved across all runs. If the model solved task A in run 3 and task B in run 5 (but never both in the same run), both count toward the ceiling.

It tells you the best score this model could reach with a given agent if every task it has ever solved came together in one run – even if no single run achieves it. It reveals variance-limited tasks: solvable, but not reliably.

▲ Best-of: What’s the peak in a single run?

The highest score from any individual run.

This is the “marketing number” – but with context. The closer the best-of is to the average, the more consistently the model performs. A large gap between best-of and average means you’re rolling dice every time you run it.

∅ Average: What can you normally expect?

The mean score across all valid runs.

This is the most commonly reported metric – and it is useful, but only with enough runs to be stable. With a single run? It’s a coin flip.

▼ Worst-of: How bad can a single run get?

The lowest score from any individual run.

This is the opposite of best-of – the worst case you should plan for in a single run. The gap between worst-of and best-of defines the full score range across all runs. A narrow range means predictable performance; a wide range means you’re rolling dice.

■ Solid: What does it always get right?

Tasks that the model solves across all runs – the rock-solid base with zero variance.

The higher the solid base, the more dependable the agent is. These are the tasks you can confidently delegate and expect success every time. A model with a high solid base and moderate average is often more reliable in practice than one with a high average but low solid base – because you know what you’re getting.
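
The five definitions above can be sketched in a few lines, assuming each run is represented as the set of task IDs it solved. This is a hedged illustration: the function name `five_metrics`, the data layout, and the toy task IDs are hypothetical and not taken from the WolfBench code base.

```python
def five_metrics(runs, total_tasks):
    """Compute the five metrics (in percent) from per-run sets of solved task IDs.

    `runs` is a non-empty list of sets; `total_tasks` is the benchmark size.
    """
    if not runs:
        raise ValueError("need at least one run")

    def pct(count):
        return 100 * count / total_tasks

    per_run = [len(solved) for solved in runs]
    return {
        "ceiling": pct(len(set.union(*runs))),       # solved in at least one run
        "best_of": pct(max(per_run)),                # highest single run
        "average": pct(sum(per_run) / len(per_run)), # mean over runs
        "worst_of": pct(min(per_run)),               # lowest single run
        "solid": pct(len(set.intersection(*runs))),  # solved in every run
    }


# Hypothetical toy data: three runs over a five-task benchmark.
runs = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c"},
]
metrics = five_metrics(runs, total_tasks=5)
print(metrics)
```

By construction the ordering solid ≤ worst-of ≤ average ≤ best-of ≤ ceiling always holds: the intersection of all runs cannot exceed the smallest run, and the union cannot be smaller than the largest.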

Reading the Chart

The five metrics are shown for each model/configuration as stacked bar segments from the rock-solid base up to the ceiling. Optional 3D mode adds token volume as depth: input tokens in front, output tokens behind. The spread between the segments tells you as much as the numbers themselves:

Tight spread (metrics close together) = consistent, predictable AI agent
Wide spread (big gap between solid and ceiling) = high variance, unreliable
High ceiling, low average = the model can do it, but usually doesn’t – needs more runs or better settings
High solid, close to average = rock-solid workhorse you can count on
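
As a rough sketch of how to read those spreads numerically, the snippet below computes the gaps for one chart entry (GPT-5.5 on Terminus-2 with xhigh thinking: ■60% ▼74% ∅77% ▲80% ★92%). The class `Metrics` and the gap names are hypothetical labels chosen for this example, and no thresholds for "tight" or "wide" are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """One chart entry, all values in percent."""
    solid: float
    worst_of: float
    average: float
    best_of: float
    ceiling: float

    def spreads(self):
        """Return the gaps (in percentage points) discussed above."""
        return {
            "run_range": self.best_of - self.worst_of,        # single-run variability
            "solid_to_ceiling": self.ceiling - self.solid,    # overall spread
            "headroom": self.ceiling - self.average,          # solvable but not usual
            "average_above_solid": self.average - self.solid, # share that is not guaranteed
        }


gpt55_t2_xhigh = Metrics(solid=60, worst_of=74, average=77, best_of=80, ceiling=92)
print(gpt55_t2_xhigh.spreads())
```

Here the individual runs sit within 6 points of each other, while the 32-point gap between solid and ceiling shows that many tasks are solved in some runs but not in all of them.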

The Bottom Line

Performance is more complex than a single average score – and the decisions you make based on benchmarks deserve better data than that. WolfBench gives you five angles on every model and configuration, so you can form a more complete and realistic judgement of what an AI agent will actually deliver when you put it to work.

Because at the end of the day, you don’t just want to know which model scored the highest. You want to know which one you can trust.

What’s Next

We will continuously add models and agents to the chart, publish the traces and evals on W&B Weave, and release regular blog posts detailing interesting and insightful findings.

This benchmark offers enormous potential for discovery. For instance: Why does xhigh reasoning improve GPT-5.4’s performance while max effort degrades Opus 4.6’s results? How does Claude Code fare when running a GPT or Gemini model compared to running directly with Opus or Sonnet – or Codex with Claude or Gemini? Is a “cheap” model actually cost-effective if it consumes far more tokens than a more expensive alternative? How does quantization affect the performance of local models in agentic tasks?
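
The cost-effectiveness question can be approached with a simple tokens-per-solved-task comparison. The sketch below uses two Terminus-2 rows from the results table: GPT‑5.4 nano (18 tasks solved, 1.32B tokens) and Claude Opus 4.6 at max effort (55 tasks solved, 77.4M tokens). It assumes the last column of each row is the total token count and treats all tokens as equally priced, which real pricing (separate input, cached, and output rates) does not do; the helper names are hypothetical.

```python
def parse_tokens(text):
    """Convert an abbreviated count such as '900K', '77.4M' or '1.32B' to an int."""
    multipliers = {"K": 10**3, "M": 10**6, "B": 10**9}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return round(float(text[:-1]) * multipliers[suffix])
    return round(float(text))


def tokens_per_solved(total_tokens, solved):
    """Total tokens spent in a run divided by the number of tasks it solved."""
    return parse_tokens(total_tokens) / solved


# Values copied from the results table (assumed: last column = total tokens).
nano = tokens_per_solved("1.32B", 18)
opus = tokens_per_solved("77.4M", 55)

# Under a flat per-token price, the smaller model is only cheaper per solved
# task if its price per token is lower by more than this factor.
break_even_ratio = nano / opus
print(f"{break_even_ratio:.1f}x")
```

In this pair of runs the smaller model spends roughly 52 times as many tokens per solved task, so a lower list price alone does not settle the question.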

So many possibilities for analysis – and for posting about it! Stay tuned – and if you want to be the first to know when new results come in, follow me on X and LinkedIn.

Inference and sandbox compute sponsored by CoreWeave: The Essential Cloud for AI.
Additional sandbox compute by Daytona – Secure Infrastructure for Running AI-Generated Code.
Built with Harbor for orchestration, Terminal-Bench 2.0 for tasks, and W&B Weave for tracking.
Charts and dashboards generated with marimo notebooks.
Explore the complete data and tooling suite on our WolfBench GitHub.

WolfBench (2026-06-02)
