ClawBench / app.py

AgPerry

UI: default to V2 Hermes (single harness) + add Reset button to mirror website headline view

74fbf87 verified 10 days ago


	"""ClawBench leaderboard Space — reads results.csv from the TIGER-Lab/ClawBench dataset.

	Two-stage scoring per https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md:
	- Intercepted (Stage 1) = fraction of runs whose final HTTP request hit the per-task URL/method schema.
	- Reward (Stage 2) = fraction that also passed an LLM judge on the intercepted payload. Headline metric.
	"""

	import io
	import time
	import urllib.request

	import gradio as gr
	import pandas as pd

	RESULTS_URL = (
	    "https://huggingface.co/datasets/TIGER-Lab/ClawBench/resolve/main/leaderboard/results.csv"
	)


	CITATION = """@misc{zhang2026clawbench,
	title = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},
	author = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},
	year = {2026},
	eprint = {2604.08523},
	archivePrefix = {arXiv},
	primaryClass = {cs.AI},
	url = {https://arxiv.org/abs/2604.08523}
	}"""

	INTRO = """# 🏆 ClawBench — Web Agent Benchmark

	Can AI agents complete everyday online tasks? ClawBench scores agents on real, live websites (booking flights, ordering groceries, submitting job applications). Two corpora: V1 (153 tasks across 144 websites) and V2 (130 newer tasks across 63 platforms). Every run is graded twice: first a deterministic HTTP-request interception check (Stage 1 = `Intercepted`, the tiebreak), then an LLM judge on the intercepted payload (Stage 2 = `Reward`, the headline metric and default sort key).

	[📖 Paper](https://arxiv.org/abs/2604.08523) · [💻 GitHub](https://github.com/reacher-z/ClawBench) · [🗂 Dataset](https://huggingface.co/datasets/TIGER-Lab/ClawBench) · [🎞 Traces V1](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [🎞 Traces V2](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) · [🌐 Site](https://claw-bench.com)
	"""

	TABLE_INTRO = """Intercepted (Stage 1) = agent's final HTTP request matched the task's URL/method schema — deterministic, no judge. Reward (lenient) (Stage 2, headline metric, default sort key) = judge confirms the intercepted payload fulfilled the instruction under the default rubric (no explicit contradiction → match). Reward (strict) = same judge (default `deepseek/deepseek-v4-pro`) under the stricter rubric (ambiguous → mismatch), shown for ablation. Rows are ranked by Reward (lenient) DESC, then Intercepted DESC as tiebreak. V2 is Hermes-only; alternative harnesses are evaluated separately. Partial = batch attempted fewer than the full corpus (mid-run abort / queue cap); rates are over attempted, not over corpus."""

	ABOUT = """## About ClawBench

	### Why a new benchmark?
	Existing browser-agent benchmarks either run on synthetic / sandboxed websites (WebArena, VisualWebArena) or only check whether the agent reached the endpoint (WebVoyager). ClawBench runs on live, real-world websites and verifies the payload the agent submitted — so an agent that types the wrong delivery address into Uber Eats fails, even if its last HTTP request hit the correct endpoint.

	### Two corpora

	- V1 — 153 tasks across 144 real websites (the paper).
	- V2 — 130 newer everyday tasks across 63 platforms, expanded coverage of e-commerce / form-filling / authentication-walled flows.

	### Two-stage scoring

	| Stage | What it checks | Output |
	|---|---|---|
	| 1. Interception | Did the final HTTP request match the task's URL + method + canonical body schema? | `intercepted ∈ {true, false}` |
	| 2. Judge | Given the natural-language instruction and the intercepted payload, did the agent submit the right thing? | `match ∈ {true, false, null}` |

	`Reward = Intercepted ∧ Match`. Full prompt + judge model details: [eval/scoring.md ↗](https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md)
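
A minimal sketch of how the two stage outputs could be combined into the leaderboard rates. The helper names `reward` and `rates` are hypothetical, and treating a `null` judge verdict as "no reward" is an assumption, not something the scoring doc is quoted on here:

```python
from typing import List, Optional, Tuple


def reward(intercepted: bool, match: Optional[bool]) -> bool:
    # Reward = Intercepted AND Match. A null (None) verdict is counted as
    # no reward in this sketch (assumption).
    return intercepted and match is True


def rates(runs: List[Tuple[bool, Optional[bool]]]) -> Tuple[float, float]:
    # runs holds one (intercepted, match) pair per attempted task.
    # Both rates are percentages over attempted runs, not over the corpus.
    attempted = len(runs)
    if attempted == 0:
        return 0.0, 0.0
    n_intercepted = sum(1 for i, _ in runs if i)
    n_reward = sum(1 for i, m in runs if reward(i, m))
    return 100.0 * n_intercepted / attempted, 100.0 * n_reward / attempted
```

For example, four attempted runs with three interceptions and one confirmed match give an Intercepted rate of 75% and a Reward rate of 25%.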

	### What ships with every run

	A 5-layer trace bundle plus run metadata (downloadable from the Traces datasets above):

	- `recording.mp4` — full browser session video
	- `actions.jsonl` — every click / type / scroll
	- `agent-messages.jsonl` — model inputs & outputs (incl. reasoning)
	- `requests.jsonl` — every HTTP request the page made
	- `interception.json` — graded final request
	- `run-meta.json` — model, harness, scores, timing

	### Reproducing

	```bash
	pip install clawbench-eval
	clawbench run --model <your-model> --harness hermes --corpus v2
	python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>
	```
	"""

	SUBMIT = """## 🚀 Submit your model

	Submissions are accepted as PRs to the leaderboard CSV in the dataset repo:

	[Open the CSV in the dataset repo ↗](https://huggingface.co/datasets/TIGER-Lab/ClawBench/blob/main/leaderboard/results.csv)

	### Required steps

	1. Run the benchmark: install with `pip install clawbench-eval`, then run `clawbench run --model <your-model> --harness hermes --corpus v2` (or `v1`). Use the included harnesses (hermes / openclaw) so traces follow the standard 5-layer format.
	2. Score: `python scripts/clawbench_rescore.py --judge-model deepseek-v4-pro --only-batch <your-batch-dir>` produces `rescore-summary.json` with the cells you'll need.
	3. Upload traces (recommended): push the 5-layer run bundles to `TIGER-Lab/ClawBenchV2Trace` (or `NAIL-Group/ClawBenchV1Trace`) so others can audit.
	4. Open a PR: add one row per `(model, harness, corpus)` to `leaderboard/results.csv` with columns: `model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours`. Link the trace bundle in the PR description.
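
As a rough illustration of step 4, the sketch below assembles one CSV row in the column order listed above. The helper `leaderboard_row` and the `rewarded` argument are hypothetical. It also assumes that `passed` counts Stage 1 interceptions, that rates are stored as percentages (the leaderboard appends `%` when displaying them), and that `total` is the number of attempted runs:

```python
def leaderboard_row(model, harness, dataset, passed, rewarded, total, wall_hours):
    # Rates are percentages over attempted runs (assumed equal to total).
    pass_rate = round(100.0 * passed / total, 2)
    reward_rate = round(100.0 * rewarded / total, 2)
    fields = [model, harness, dataset, passed, total, pass_rate, reward_rate, wall_hours]
    # Plain comma join: this sketch assumes no field contains a comma.
    return ','.join(str(f) for f in fields)


print(leaderboard_row('my-model', 'hermes', 'v2', 52, 39, 130, 6.5))
# prints my-model,hermes,v2,52,130,40.0,30.0,6.5
```

Prefer the values in `rescore-summary.json` over hand-computed ones if the two disagree.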

	Before merging, we re-score a sample of your submitted traces with our judge to keep the table honest.

	For step-by-step instructions, see [`eval/scoring.md`](https://github.com/reacher-z/ClawBench/blob/main/eval/scoring.md).
	"""


	def _format_pct(v) -> str:
	    return "—" if pd.isna(v) else f"{v:.2f}%"


	def _format_wall(v) -> str:
	    return "—" if pd.isna(v) else f"{v:.2f}"


	CORPUS_SIZE = {"v1": 153, "v2": 130}


	def load_results() -> pd.DataFrame:
	    # Cache-busting query param + no-cache headers, so a warm Space container
	    # does not serve a CDN-cached snapshot of leaderboard/results.csv. Combined
	    # with demo.load() this means every page view fetches the latest numbers
	    # from the dataset CSV without a manual refresh.
	    url = f"{RESULTS_URL}?t={int(time.time())}"
	    req = urllib.request.Request(url, headers={"Cache-Control": "no-cache", "Pragma": "no-cache"})
	    # Use a context manager so the HTTP response is always closed.
	    with urllib.request.urlopen(req, timeout=30) as resp:
	        raw = resp.read()
	    df = pd.read_csv(io.BytesIO(raw))
	    # Defensive dedup: in case the CSV ever contains accidental duplicates of
	    # (model, harness, dataset), keep only the most-recently-listed row.
	    df = df.drop_duplicates(subset=["model", "harness", "dataset"], keep="last")
	    # Older CSVs may lack the Stage 2 columns; fill them with NA so they
	    # render as a dash instead of raising a KeyError.
	    for optional in ("reward_rate", "reward_rate_strict"):
	        if optional not in df.columns:
	            df[optional] = pd.NA
	    # Within each corpus, rank by Reward (lenient) descending: it is the
	    # headline metric. Intercepted is the tiebreak. Partial batches keep
	    # their rate over attempted tasks.
	    df = df.sort_values(
	        ["dataset", "reward_rate", "pass_rate"],
	        ascending=[True, False, False],
	        na_position="last",
	    ).reset_index(drop=True)
	    df.insert(0, "rank", df.groupby("dataset").cumcount() + 1)
	    # Format numeric cells for display only after sorting on the raw values.
	    df["pass_rate"] = df["pass_rate"].map(_format_pct)
	    df["reward_rate"] = df["reward_rate"].map(_format_pct)
	    df["reward_rate_strict"] = df["reward_rate_strict"].map(_format_pct)
	    df["wall_hours"] = df["wall_hours"].map(_format_wall)
	    df.rename(
	        columns={
	            "model": "Model",
	            "harness": "Harness",
	            "dataset": "Corpus",
	            "passed": "Pass",
	            "total": "Total",
	            "pass_rate": "Intercepted",
	            "reward_rate": "Reward (lenient)",
	            "reward_rate_strict": "Reward (strict)",
	            "wall_hours": "Wall (h)",
	            "rank": "Rank",
	        },
	        inplace=True,
	    )
	    return df[
	        [
	            "Rank", "Model", "Harness", "Corpus", "Intercepted",
	            "Reward (lenient)", "Reward (strict)", "Pass", "Total", "Wall (h)",
	        ]
	    ]


	def filter_df(query: str, corpus: str, harness_filter: list[str]):
	    df = load_results()
	    if corpus and corpus != "all":
	        df = df[df["Corpus"].str.lower() == corpus.lower()]
	    if harness_filter:
	        df = df[df["Harness"].isin(harness_filter)]
	    if query:
	        q = query.strip().lower()
	        # Literal substring match: regex=False keeps characters such as "(" or
	        # "+" in a search from being parsed as a regular expression.
	        df = df[df["Model"].str.lower().str.contains(q, na=False, regex=False)]
	    return df.reset_index(drop=True)


	def all_harnesses() -> list[str]:
	    try:
	        df = load_results()
	        return sorted(df["Harness"].dropna().unique().tolist())
	    except Exception:
	        # Fall back to the known harnesses if the dataset CSV is unreachable.
	        return ["hermes", "openclaw"]


	# Default headline view: V2 + Hermes — matches the website's "V2 (Hermes)" main tab.
	# Mirrors https://claw-bench.com/leaderboard where V2 Hermes is the primary corpus
	# and other harness/corpus combos are tucked behind an "Others" toggle.
	DEFAULT_CORPUS = "v2"
	DEFAULT_HARNESSES = ["hermes"]


	def reset_defaults():
	    """Restore the headline V2-Hermes view (search cleared, corpus=v2, harness=[hermes])."""
	    return "", DEFAULT_CORPUS, DEFAULT_HARNESSES


	with gr.Blocks(title="ClawBench Leaderboard", theme=gr.themes.Soft()) as demo:
	    gr.Markdown(INTRO)

	    with gr.Tabs():
	        with gr.TabItem("📊 Leaderboard"):
	            with gr.Row():
	                with gr.Accordion("Citation", open=False):
	                    gr.Textbox(value=CITATION, label="BibTeX", lines=8, interactive=False)
	            gr.Markdown(TABLE_INTRO)
	            with gr.Row():
	                search_bar = gr.Textbox(placeholder="Search models…", show_label=False, scale=3)
	                corpus_choice = gr.Radio(choices=["all", "v2", "v1"], value=DEFAULT_CORPUS, label="Corpus", scale=2)
	                harness_choice = gr.CheckboxGroup(
	                    choices=all_harnesses(),
	                    value=DEFAULT_HARNESSES,
	                    label="Harness",
	                )
	            # Empty placeholder: the real data is fetched fresh from the dataset on
	            # every page load via demo.load() below, so the table never shows a
	            # stale module-import snapshot when the Space container has been
	            # warm for a while.
	            table = gr.Dataframe(
	                value=None,
	                interactive=False,
	                wrap=True,
	                column_widths=["60px", "260px", "100px", "70px", "110px", "110px", "110px", "60px", "60px", "80px"],
	            )
	            with gr.Row():
	                refresh = gr.Button("🔄 Refresh from dataset")
	                reset = gr.Button("↺ Reset to V2 Hermes default")

	            for control in (search_bar, corpus_choice, harness_choice):
	                control.change(
	                    fn=filter_df,
	                    inputs=[search_bar, corpus_choice, harness_choice],
	                    outputs=table,
	                )
	            refresh.click(fn=filter_df, inputs=[search_bar, corpus_choice, harness_choice], outputs=table)
	            # Reset writes the defaults back into the three controls; their .change
	            # listeners then fan out and re-render the table from the new state.
	            reset.click(fn=reset_defaults, inputs=[], outputs=[search_bar, corpus_choice, harness_choice])

	        with gr.TabItem("📝 About"):
	            gr.Markdown(ABOUT)

	        with gr.TabItem("🚀 Submit here"):
	            gr.Markdown(SUBMIT)

	    # Force a fresh fetch from the dataset CSV on every page load. This bypasses
	    # any warm-container module-level cache so each visitor sees canonical
	    # numbers without having to click "Refresh from dataset".
	    demo.load(
	        fn=filter_df,
	        inputs=[search_bar, corpus_choice, harness_choice],
	        outputs=table,
	    )


	if __name__ == "__main__":
	    demo.launch()