WolfBench

Wolfram Ravenwolf’s Four-Metric Framework · based on Terminal-Bench 2.0

One score is not enough.
Because performance is a distribution, not a point.

Most benchmarks report just a single average. WolfBench shows four metrics: the rock-solid base you can always count on, the average you can expect, the best a single run achieved, and the ceiling of what’s theoretically possible. The spread between them tells you how consistent – or how unpredictable – an AI agent really is.

T2 = Terminus-2 · CC = Claude Code · OC = OpenClaw

▲ Ceiling (ever solved) · ★ Best-of (peak run) · ∅ Average (mean score) · ■ Solid (always solved)
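
To make the four metrics concrete, here is a minimal Python sketch, assuming each run is recorded as the set of task names it solved. The function name `four_metrics` and the toy task names are hypothetical, and the exact aggregation WolfBench uses may differ in detail.

```python
from typing import Dict, List, Set


def four_metrics(runs: List[Set[str]], total_tasks: int) -> Dict[str, float]:
    """Compute the four metrics from per-run sets of solved task names.

    Assumption: "solid" = tasks solved in every run, "ceiling" = tasks
    solved in at least one run, both expressed as a share of all tasks.
    """
    if not runs or total_tasks <= 0:
        raise ValueError("need at least one run and a positive task count")

    solid = set.intersection(*runs)   # solved in every run
    ceiling = set.union(*runs)        # solved in at least one run
    per_run = [len(solved) / total_tasks for solved in runs]

    return {
        "solid": len(solid) / total_tasks,
        "average": sum(per_run) / len(per_run),
        "best_of": max(per_run),
        "ceiling": len(ceiling) / total_tasks,
    }


# Hypothetical toy data: three runs over a five-task benchmark.
toy_runs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c"}]
metrics = four_metrics(toy_runs, total_tasks=5)
print(metrics)
```

In this toy example only task `a` is solved every time, while four different tasks are solved at least once, so the gap between solid and ceiling is large even though the average sits in between.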


Model	Agent	Version	■ Solid	∅ Average	★ Best-of	▲ Ceiling	Runs
Claude Opus 4.6	T2	2.0.0	55%	71%	75%	84%	5R@1h
Claude Opus 4.6	T2	2.0.0	55%	73%	75%	88%	5R@2h
Claude Opus 4.6	CC	2.1.63	45%	63%	67%	81%	5R@1h
Claude Opus 4.6	CC	2.1.63	46%	64%	69%	80%	5R@2h
Claude Opus 4.6	OC	2026.2.17	33%	51%	56%	64%	5R@2h
Claude Opus 4.6	OC	2026.3.1	42%	58%	58%	74%	5R@1h
Claude Sonnet 4.6	T2	2.0.0	42%	62%	64%	81%	5R@1h
Claude Sonnet 4.6	T2	2.0.0	42%	61%	63%	81%	5R@2h
Claude Sonnet 4.6	CC	2.1.63	40%	52%	54%	67%	-
Claude Sonnet 4.6	CC	2.1.63	40%	58%	63%	75%	5R@1h
Claude Sonnet 4.6	CC	2.1.66	43%	57%	60%	70%	3R@2h
Claude Sonnet 4.6	CC	2.1.68	47%	54%	55%	61%	2R@2h
Claude Sonnet 4.6	OC	2026.3.1	36%	53%	56%	70%	5R@1h
Claude Sonnet 4.6	OC	2026.3.1	37%	53%	58%	69%	5R@2h
Kimi K2.5	T2	2.0.0	31%	48%	52%	63%	5R@1h
Kimi K2.5	T2	2.0.0	28%	49%	52%	65%	5R@2h
Kimi K2.5	OC	2026.2.17	10%	32%	35%	57%	5R@2h
Kimi K2.5	OC	2026.3.1	13%	39%	45%	58%	5R@1h
gpt-5.4	T2	2.0.0	28%	44%	47%	61%	5R@1h
gpt-5.4	OC	2026.3.1	9%	30%	33%	52%	5R@1h
gpt-5.4	OC	2026.3.1	11%	30%	35%	53%	5R@2h

Run Details (100 runs)

Date	Agent	Provider	Vendor	Model	Thinking	Score	Pass	Fail	Timeout cap	Timed out	Errors	Duration	Tokens in	Tokens out
2026-03-09 07:46	OpenClaw (2026.3.1)	openai	openai	gpt-5.4	-	28.1%	25	64	3600s	7	0	1h04m	29.5M	288K
2026-03-09 07:00	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	69.7%	62	26	3600s	8	1	1h08m	146.4M	1.6M
2026-03-09 06:38	OpenClaw (2026.3.1)	openai	openai	gpt-5.4	-	32.6%	29	60	3600s	11	0	1h07m	35.7M	345K
2026-03-09 05:53	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	71.9%	64	25	3600s	5	0	1h06m	151.5M	1.5M
2026-03-09 05:33	OpenClaw (2026.3.1)	openai	openai	gpt-5.4	-	31.5%	28	61	3600s	7	0	1h04m	34.4M	326K
2026-03-09 04:27	OpenClaw (2026.3.1)	openai	openai	gpt-5.4	-	30.3%	27	62	3600s	7	0	1h06m	42.8M	322K
2026-03-09 04:13	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	70.8%	63	25	3600s	8	1	1h40m	175.6M	1.7M
2026-03-09 03:21	OpenClaw (2026.3.1)	openai	openai	gpt-5.4	-	29.2%	26	63	3600s	5	0	1h05m	29.6M	333K
2026-03-09 03:07	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	68.5%	61	28	3600s	4	0	1h05m	153.0M	1.4M
2026-03-08 19:15	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	51.7%	46	42	3600s	14	1	1h26m	204.5M	1.7M
2026-03-08 17:46	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	48.3%	43	46	3600s	12	0	1h29m	193.4M	1.7M
2026-03-08 16:04	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	46.1%	41	48	3600s	14	0	1h41m	236.0M	1.7M
2026-03-08 14:26	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	75.3%	67	22	3600s	3	0	1h03m	155.2M	1.4M
2026-03-08 13:51	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	49.4%	44	44	3600s	13	1	2h12m	195.4M	1.7M
2026-03-08 12:24	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	46.1%	41	48	3600s	15	0	1h26m	197.7M	1.7M
2026-03-08 12:12	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	61.8%	55	34	3600s	10	0	1h07m	259.5M	2.2M
2026-03-08 10:25	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	59.6%	53	36	3600s	7	0	1h46m	216.2M	1.9M
2026-03-08 09:53	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi-K2.5	-	37.1%	33	55	3600s	13	1	2h30m	192.2M	1.6M
2026-03-08 09:18	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	62.9%	56	33	3600s	13	0	1h06m	192.5M	2.0M
2026-03-08 08:31	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi-K2.5	-	39.3%	35	53	3600s	13	1	1h21m	188.4M	1.6M
2026-03-08 08:10	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	62.9%	56	33	3600s	6	0	1h08m	189.4M	1.9M
2026-03-08 07:14	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi-K2.5	-	34.8%	31	58	3600s	7	0	1h16m	228.8M	1.6M
2026-03-08 06:21	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	64.0%	57	32	3600s	8	0	1h48m	151.2M	1.9M
2026-03-08 05:38	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi-K2.5	-	44.9%	40	49	3600s	11	0	1h35m	176.6M	1.4M
2026-03-08 05:12	OpenClaw (2026.3.1)	anthropic	anthropic	claude-opus-4-6	-	57.3%	51	38	3600s	10	0	1h09m	97.6M	1.4M
2026-03-08 04:09	OpenClaw (2026.3.1)	wandb	moonshotai	Kimi-K2.5	-	38.2%	34	54	3600s	6	1	1h29m	171.4M	1.6M
2026-03-08 04:03	OpenClaw (2026.3.1)	anthropic	anthropic	claude-opus-4-6	-	57.3%	51	38	3600s	7	0	1h08m	78.6M	1.3M
2026-03-08 02:53	OpenClaw (2026.3.1)	anthropic	anthropic	claude-opus-4-6	-	56.2%	50	39	3600s	10	0	1h10m	74.2M	1.3M
2026-03-08 01:44	OpenClaw (2026.3.1)	anthropic	anthropic	claude-opus-4-6	-	58.4%	52	37	3600s	8	0	1h08m	73.8M	1.3M
2026-03-08 00:37	OpenClaw (2026.3.1)	anthropic	anthropic	claude-opus-4-6	-	58.4%	52	37	3600s	5	0	1h07m	83.4M	1.3M
2026-03-08 00:33	Terminus-2 (2.0.0)	openai	openai	gpt-5.4	-	44.9%	40	48	3600s	12	1	2h06m	726.8M	1.0M
2026-03-07 23:28	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	51.7%	46	41	3600s	3	2	1h08m	95.3M	2.1M
2026-03-07 23:25	Terminus-2 (2.0.0)	openai	openai	gpt-5.4	-	43.8%	39	50	3600s	14	0	1h08m	667.0M	905K
2026-03-07 22:16	Terminus-2 (2.0.0)	openai	openai	gpt-5.4	-	42.7%	38	51	3600s	12	0	1h08m	707.0M	878K
2026-03-07 22:07	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	55.1%	49	40	3600s	5	0	1h20m	86.8M	2.0M
2026-03-07 20:58	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	56.2%	50	39	3600s	3	0	1h09m	78.6M	2.0M
2026-03-07 20:10	Terminus-2 (2.0.0)	openai	openai	gpt-5.4	-	41.6%	37	51	3600s	15	1	2h05m	759.6M	939K
2026-03-07 19:47	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	51.7%	46	43	3600s	2	0	1h10m	71.6M	2.0M
2026-03-07 18:57	Terminus-2 (2.0.0)	openai	openai	gpt-5.4	-	47.2%	42	46	3600s	14	1	1h12m	775.5M	982K
2026-03-07 18:15	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	48.3%	43	46	3600s	6	0	1h31m	115.2M	2.3M
2026-03-07 16:53	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	67.4%	60	28	3600s	6	1	1h22m	222.3M	1.2M
2026-03-07 15:47	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	62.9%	56	33	3600s	4	0	1h05m	195.9M	1.6M
2026-03-07 14:27	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	58.4%	52	36	3600s	6	1	1h20m	169.0M	1.2M
2026-03-07 13:18	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	59.6%	53	36	3600s	7	0	1h09m	188.9M	1.2M
2026-03-07 11:53	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	67.4%	60	29	3600s	5	0	1h24m	209.0M	1.4M
2026-03-07 10:13	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	53.9%	48	41	3600s	12	0	1h39m	202.1M	2.1M
2026-03-07 09:08	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	57.3%	51	38	3600s	4	0	1h05m	166.9M	1.7M
2026-03-07 08:02	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	62.9%	56	33	3600s	3	0	1h05m	185.9M	1.8M
2026-03-07 06:56	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	57.3%	51	38	3600s	6	0	1h05m	210.7M	2.2M
2026-03-07 04:37	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	56.2%	50	38	3600s	5	1	2h19m	216.0M	2.3M
2026-03-06 03:38	OpenClaw (2026.3.1)	openrouter	openai	gpt-5.4	-	22.5%	20	69	7200s	3	0	2h03m	32.2M	293K
2026-03-06 01:22	OpenClaw (2026.3.1)	openrouter	openai	gpt-5.4	-	34.8%	31	58	7200s	9	0	2h16m	32.0M	289K
2026-03-05 23:18	OpenClaw (2026.3.1)	openrouter	openai	gpt-5.4	-	28.1%	25	64	7200s	7	0	2h04m	37.2M	299K
2026-03-05 21:12	OpenClaw (2026.3.1)	openrouter	openai	gpt-5.4	-	34.8%	31	58	7200s	6	0	2h06m	34.4M	300K
2026-03-05 19:07	OpenClaw (2026.3.1)	openrouter	openai	gpt-5.4	-	28.1%	25	64	7200s	9	0	2h04m	31.5M	301K
2026-03-04 12:30	Claude Code (2.1.68)	anthropic	anthropic	claude-sonnet-4-6	-	55.1%	49	40	7200s	3	0	2h03m	364.3M	3.1M
2026-03-04 10:16	Claude Code (2.1.68)	anthropic	anthropic	claude-sonnet-4-6	-	52.8%	47	40	7200s	3	2	2h13m	411.8M	4.1M
2026-03-04 08:10	Claude Code (2.1.66)	anthropic	anthropic	claude-sonnet-4-6	-	58.4%	52	36	7200s	2	1	2h05m	428.8M	4.1M
2026-03-04 06:06	Claude Code (2.1.66)	anthropic	anthropic	claude-sonnet-4-6	-	59.6%	53	35	7200s	3	1	2h03m	347.5M	4.1M
2026-03-04 03:44	Claude Code (2.1.66)	anthropic	anthropic	claude-sonnet-4-6	-	53.9%	48	40	7200s	2	1	2h21m	361.1M	4.0M
2026-03-04 00:57	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	66.3%	59	29	7200s	3	1	2h10m	475.9M	3.5M
2026-03-03 22:18	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	58.4%	52	36	7200s	3	1	2h04m	314.3M	2.5M
2026-03-03 18:30	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	62.9%	56	31	7200s	1	2	2h03m	425.4M	3.4M
2026-03-03 15:54	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	68.5%	61	27	7200s	3	1	2h05m	404.9M	3.1M
2026-03-03 12:25	Claude Code (2.1.63)	anthropic	anthropic	claude-opus-4-6	-	61.8%	55	32	7200s	4	2	2h08m	358.8M	2.1M
2026-03-03 02:15	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	50.6%	45	44	7200s	1	0	2h06m	78.0M	1.7M
2026-03-02 23:43	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	51.7%	46	42	7200s	3	1	2h18m	91.0M	2.2M
2026-03-02 18:37	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	52.8%	47	42	7200s	2	0	2h11m	97.2M	2.3M
2026-03-02 15:22	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	58.4%	52	37	7200s	4	0	2h12m	111.8M	2.2M
2026-03-02 13:05	OpenClaw (2026.3.1)	anthropic	anthropic	claude-sonnet-4-6	-	52.8%	47	42	7200s	3	0	2h10m	81.5M	2.0M
2026-03-02 12:32	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	51.7%	46	41	7200s	8	2	3h51m	398.8M	1.9M
2026-03-02 08:42	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	50.6%	45	42	7200s	3	2	3h49m	306.0M	1.6M
2026-03-02 07:46	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	50.6%	45	42	-	21	2	1h48m	186.8M	2.5M
2026-03-02 06:37	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	44.9%	40	47	7200s	2	2	2h05m	341.2M	1.8M
2026-03-02 05:59	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	50.6%	45	41	-	19	3	1h46m	198.0M	2.1M
2026-03-02 04:30	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	53.9%	48	39	-	16	2	1h28m	219.7M	2.5M
2026-03-02 04:28	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	51.7%	46	42	7200s	6	1	2h08m	314.7M	1.7M
2026-03-02 03:01	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	53.9%	48	39	-	16	2	1h28m	232.3M	2.2M
2026-03-02 01:57	Claude Code (2.1.63)	anthropic	anthropic	claude-sonnet-4-6	-	51.7%	46	42	-	19	1	1h03m	188.2M	2.5M
2026-03-02 01:51	Terminus-2 (2.0.0)	wandb	moonshotai	Kimi-K2.5	-	44.9%	40	48	7200s	4	1	2h37m	330.8M	1.9M
2026-03-01 22:50	OpenClaw (2026.2.17)	wandb	moonshotai	Kimi-K2.5	-	33.7%	30	58	7200s	4	1	2h15m	91.5M	1.1M
2026-03-01 20:38	OpenClaw (2026.2.17)	wandb	moonshotai	Kimi-K2.5	-	29.2%	26	63	7200s	2	0	2h11m	84.4M	1.1M
2026-03-01 18:26	OpenClaw (2026.2.17)	wandb	moonshotai	Kimi-K2.5	-	34.8%	31	56	7200s	4	2	2h12m	83.5M	1.2M
2026-03-01 16:11	OpenClaw (2026.2.17)	wandb	moonshotai	Kimi-K2.5	-	29.2%	26	63	7200s	3	0	2h14m	99.5M	1.1M
2026-03-01 13:43	OpenClaw (2026.2.17)	wandb	moonshotai	Kimi-K2.5	-	31.5%	28	61	7200s	3	0	2h27m	97.2M	1.1M
2026-02-28 07:31	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	62.9%	56	33	7200s	7	0	2h09m	246.3M	2.1M
2026-02-28 06:08	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	75.3%	67	22	7200s	2	0	2h03m	186.1M	1.8M
2026-02-28 04:06	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	56.2%	50	39	7200s	10	0	3h24m	212.7M	2.1M
2026-02-28 04:03	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	68.5%	61	28	7200s	4	0	2h05m	219.1M	1.8M
2026-02-28 00:41	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	74.2%	66	23	7200s	2	0	2h24m	237.4M	1.6M
2026-02-27 22:33	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	71.9%	64	25	7200s	1	0	2h08m	230.0M	1.8M
2026-02-27 20:29	Terminus-2 (2.0.0)	anthropic	anthropic	claude-opus-4-6	-	73.0%	65	24	7200s	1	0	2h03m	163.1M	1.4M
2026-02-27 11:54	OpenClaw (2026.2.17)	anthropic	anthropic	claude-opus-4-6	-	51.7%	46	43	7200s	4	0	3h33m	81.2M	1.1M
2026-02-27 09:47	OpenClaw (2026.2.17)	anthropic	anthropic	claude-opus-4-6	-	47.2%	42	47	7200s	3	0	2h06m	80.9M	915K
2026-02-27 08:53	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	61.8%	55	34	7200s	7	0	2h06m	204.6M	2.1M
2026-02-27 07:32	OpenClaw (2026.2.17)	anthropic	anthropic	claude-opus-4-6	-	56.2%	50	39	7200s	4	0	2h14m	63.8M	831K
2026-02-27 05:25	OpenClaw (2026.2.17)	anthropic	anthropic	claude-opus-4-6	-	52.8%	47	41	7200s	1	1	2h07m	91.4M	983K
2026-02-27 05:19	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	61.8%	55	34	7200s	7	0	3h34m	234.5M	2.0M
2026-02-27 03:17	OpenClaw (2026.2.17)	anthropic	anthropic	claude-opus-4-6	-	47.2%	42	47	7200s	2	0	2h07m	88.5M	1.1M
2026-02-27 03:12	Terminus-2 (2.0.0)	anthropic	anthropic	claude-sonnet-4-6	-	62.9%	56	33	7200s	5	0	2h06m	199.3M	1.9M
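
The Score column and the chart's ∅ and ★ values can be cross-checked from the Pass counts. The sketch below assumes that Score is simply Pass divided by the 89 tasks and that ∅ is the arithmetic mean of the run scores; both assumptions agree with the rows shown but are not stated explicitly on the page. The pass counts are the five Terminus-2 / claude-opus-4-6 runs with a 3600s cap from the table above.

```python
TOTAL_TASKS = 89  # Terminal-Bench 2.0 task count, as stated in the post

# Pass counts of the five Terminus-2 (2.0.0) / claude-opus-4-6 runs at 3600s
opus_t2_1h_passes = [62, 64, 63, 61, 67]

# Assumption: Score = Pass / total tasks, expressed in percent
scores = [100 * passed / TOTAL_TASKS for passed in opus_t2_1h_passes]

average = sum(scores) / len(scores)  # assumed definition of the ∅ metric
best_of = max(scores)                # assumed definition of the ★ metric

print([round(s, 1) for s in scores])
print(round(average, 1), round(best_of, 1))
```

Rounded to whole percentages, the mean and the maximum line up with the ∅71% and ★75% shown for this configuration. The ■ and ▲ values cannot be recomputed this way, because they depend on which tasks were solved in each run, not just how many.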

The impact of time(outs) on agentic benchmarks

by Wolfram Ravenwolf – who evaluates models for breakfast, builds agents at night, and preaches AI usefulness all day long.

In this post, we’ll look at why timeouts are necessary and why I locked WolfBench at 1 hour.

When we launched WolfBench, a four-metric evaluation framework built on Terminal-Bench 2.0, we started with a fixed 2-hour timeout per task. After analyzing nearly 10,000 task results across 6 models, I’m changing that to 1 hour. Here’s why, and why it matters for anyone who cares about meaningful AI benchmarks.

Why agentic benchmarks need timeouts

Terminal-Bench 2.0 is a real-world agentic benchmark: 89 tasks spanning system administration, DevOps, security, data/ML operations, and problem-solving. Models must plan, execute shell commands, inspect results, debug failures, and iterate—like a real technical professional. No multiple-choice tricks. No toy puzzles. Just real work in sandboxed environments.

But when a model is free to act, it can get stuck in a loop, retrying the same failing approach endlessly, burning tokens and compute for nothing. Timeouts exist to prevent that.

The problem with per-task timeouts

By default, Terminal-Bench 2.0 assigns each task its own timeout, ranging from 10 minutes to over 3 hours. The majority—45 out of 89 tasks—get just 15 minutes.

This severely penalizes slower models and endpoints. Even if a model could solve a task, a temporarily overloaded API or a slightly slower inference speed can prevent it from finishing in time. When you’re evaluating model capability, you don’t want to measure endpoint performance.

That’s why some recently published benchmark results (e.g., those for MiniMax-M2.5 and GLM-5) use fixed 2-hour timeouts for every task. Some, like those for Kimi K2.5, even disable reasoning/thinking, because models that invest in up-front planning over immediate execution can actually score worse in time-constrained agentic benchmarks. (This is also why you always need to read the fine print: two benchmarks with the same name can produce very different results depending on how they were configured.)

So why not just use 2 hours?

Because the data says 1 hour is the sweet spot.

I analyzed 9,636 task results across all models and configurations in my evaluation runs: every pass, every fail, every actual duration. Then I asked a simple question: At each possible timeout cap, how many successful task completions would we lose?

Timeout Cap	Passes Lost	% of Total	Assessment
15 min	846 / 4,708	18.0%	way too tight
30 min	395 / 4,708	8.4%	still too aggressive
60 min	136 / 4,708	2.9%	the sweet spot
90 min	76 / 4,708	1.6%	diminishing returns
120 min	1 / 4,708	0.0%	almost nothing gained

Going from 60 to 120 minutes saves just 135 additional passes out of nearly 5,000. That’s a 2.9% gain—in exchange for doubling the maximum time an agent can spend looping on a hopeless task.
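
The "passes lost" question can be sketched in a few lines of Python. This is only an illustration, assuming a pass counts as lost when its actual duration exceeds the cap; the function name `passes_lost` and the toy durations are hypothetical, while the `published_lost` figures are the ones reported in the table above.

```python
from typing import Iterable


def passes_lost(pass_durations_min: Iterable[float], cap_min: float) -> int:
    """Count successful attempts that would have been cut off by the cap.

    Assumption: a pass is lost if it took strictly longer than the cap.
    """
    return sum(1 for duration in pass_durations_min if duration > cap_min)


# Hypothetical durations (in minutes) of successful attempts.
toy_durations = [5, 12, 20, 45, 70, 95, 130]
for cap in (15, 30, 60, 90, 120):
    print(cap, passes_lost(toy_durations, cap))

# Arithmetic behind the 60 vs. 120 minute comparison, using the published counts.
TOTAL_PASSES = 4708
published_lost = {15: 846, 30: 395, 60: 136, 90: 76, 120: 1}
extra_passes = published_lost[60] - published_lost[120]
extra_share = 100 * extra_passes / TOTAL_PASSES
print(extra_passes, round(extra_share, 1))
```

Running the same function over the real per-task durations, one cap at a time, is all it takes to rebuild the table for any other candidate timeout.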

A model that can solve a task typically does so well within 60 minutes. A model that’s still looping after an hour is usually not going to find the answer in hour two—it’s just going to burn more tokens failing the same way.

The cost of generous timeouts

And “looping on hopeless tasks” isn’t hypothetical. Across all evaluated runs, I measured 921.9 hours of wasted compute—time spent on attempts that ran past their default timeout and still failed. The worst offenders burned 40–60 hours each, producing nothing.
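
One way such a wasted-compute figure could be tallied is sketched below. The post does not say whether the whole attempt or only the time beyond the default timeout is counted, so the sketch supports both readings through a flag; the `Attempt` record, the `wasted_hours` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Attempt:
    task: str
    passed: bool
    duration_min: float
    default_timeout_min: float


def wasted_hours(attempts: Iterable[Attempt], only_overrun: bool = True) -> float:
    """Sum time spent on attempts that ran past their default timeout and failed.

    Assumption: with only_overrun=True just the minutes beyond the default
    timeout count; with only_overrun=False the whole attempt counts.
    """
    total_min = 0.0
    for attempt in attempts:
        if attempt.passed or attempt.duration_min <= attempt.default_timeout_min:
            continue  # either succeeded or stayed within its default timeout
        if only_overrun:
            total_min += attempt.duration_min - attempt.default_timeout_min
        else:
            total_min += attempt.duration_min
    return total_min / 60


# Hypothetical sample attempts.
sample = [
    Attempt("task-a", passed=False, duration_min=120, default_timeout_min=15),
    Attempt("task-b", passed=True, duration_min=40, default_timeout_min=15),
    Attempt("task-c", passed=False, duration_min=10, default_timeout_min=15),
]
print(wasted_hours(sample), wasted_hours(sample, only_overrun=False))
```

Only `task-a` contributes here: it failed after running well past its default timeout, whereas `task-b` succeeded and `task-c` failed quickly.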

Longer timeouts don’t just waste time. They waste tokens, inflate costs, and—in extreme cases—can fill up disk space when models generate output in tight failure loops. The 2-hour cap doubles the blast radius of every one of these failure modes compared to 1 hour.

What 1 hour actually loses

The 136 passes lost at a 60-minute cap are concentrated in a handful of tasks, led by mailman (22 lost passes), path-tracing (10), and rstan-to-pystan (9). These are tasks with shorter default timeouts that occasionally need over an hour to succeed. Meanwhile, the task with the longest default timeout in the entire benchmark, build-pov-ray at 200 minutes, has never needed more than 34 minutes to succeed. And sam-cell-seg, with its 120-minute default? It has a 0% pass rate across all attempts. Extra time doesn’t help what can’t be solved.

The verdict

A fixed 1-hour timeout per task is the optimal configuration for WolfBench. It:

Preserves 97.1% of all successful completions
Eliminates hundreds of hours of wasted compute
Still gives every task 4 times as much time as the benchmark’s most common default (15 min)
Gives only 2 tasks less time than their default, and neither is affected
Ensures scores reflect model capability, not endpoint speed

For a benchmark designed to measure what models can do—not how performant the infrastructure is—that’s the right call.

WolfBench is now locked at 1 hour per task. All future evaluations will use this configuration, and I’ve already begun re-running existing models under the new timeout for consistency.

Analysis based on 9,636 task results across 6 models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Kimi K2.5, MiniMax-M2.5, GLM-5).
Inference sponsored by CoreWeave. Sandbox compute by Daytona. Built with Harbor, Terminal-Bench 2.0, and W&B Weave.

WolfBench (2026-03-10)
