AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

1Tel Aviv University 2University of Pennsylvania 3Allen Institute for AI 4University of Washington 5Princeton University

About AssistantBench

AssistantBench evaluates the ability of web agents to automatically solve realistic and time-consuming tasks. The benchmark includes 214 tasks covering multiple domains, sourced from more than 525 pages across 258 different websites. Please check out our paper for more details.

Tasks in AssistantBench require navigating the web to find relevant information

SeePlanAct (SPA)

Tasks in AssistantBench require planning their execution and transferring information between steps. We therefore introduce SeePlanAct (SPA), a new web agent built to tackle these tasks by equipping SeeAct with specialized planning and memory components.

We equip SeeAct with specialized planning and memory components (green)
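To make the role of these components concrete, below is a minimal, hypothetical sketch of an agent loop that combines a planner and a running memory with a SeeAct-style grounding step. All object and function names (browser, llm, plan, ground, summarize) are illustrative assumptions, not the actual SPA implementation.

# Hypothetical sketch of a SeePlanAct-style loop: a planner proposes the next
# sub-goal, a SeeAct-style step grounds it in the current page, and a memory
# carries intermediate results between steps. Names are illustrative only.

def run_spa(task: str, browser, llm, max_steps: int = 30) -> str:
    memory: list[str] = []                              # facts gathered so far
    for _ in range(max_steps):
        observation = browser.observe()                 # e.g. screenshot + page text
        plan = llm.plan(task=task, memory=memory,       # planning component: decide
                        observation=observation)        # the next sub-goal
        if plan.is_done:
            return plan.final_answer                    # stop once the task is solved
        action = llm.ground(plan=plan,                  # SeeAct-style grounding: map the
                            observation=observation)    # sub-goal to a browser action
        result = browser.execute(action)                # click / type / navigate
        memory.append(llm.summarize(plan, result))      # memory component: transfer
                                                        # information to later steps
    return ""                                           # no answer within the step budget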

Performance on AssistantBench

Even though SPA outperforms previous agents, AssistantBench remains challenging for current models, with the best accuracy currently at 25.2%.

Model                                            Accuracy   Answer rate   Precision   Exact match
SPA (ours) → Closed-book                             25.2          91.3         7.5           9.9
SeeAct → Closed-book                                 23.3          89.1        26.3           9.4
Closed-book LM (1-shot)                              22.2          89.1        25.0           8.2
Retrieval-augmented LM (1-shot) → Closed-book        19.4          92.5        21.2           6.1
Retrieval-augmented LM (0-shot) → Closed-book        18.7          93.6        20.0           6.7
Closed-book LM (0-shot)                              16.3          53.3        30.5           6.0
Retrieval-augmented LM (0-shot)                      11.7          59.8        19.8           5.5
SPA (ours)                                           11.0          38.8        29.0           5.5
Retrieval-augmented LM (1-shot)                      10.6          48.3        22.3           3.8
SeeAct                                                4.2          20.0        19.6           2.3

Performance on the AssistantBench test set with GPT-4-Turbo
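For reference, here is a rough sketch of how the table's columns relate to each other, assuming accuracy averages a per-task score over all tasks, answer rate is the fraction of tasks with a non-empty prediction, and precision averages the score over answered tasks only; the exact per-task scoring is defined in the paper.

# Rough sketch of the aggregate metrics under the assumptions stated above.
# `scores` holds a per-task score in [0, 1]; empty predictions count as unanswered.

def aggregate_metrics(predictions: list[str],
                      golds: list[str],
                      scores: list[float]) -> dict[str, float]:
    n = len(predictions)
    answered = [s for p, s in zip(predictions, scores) if p.strip()]
    return {
        "accuracy": 100 * sum(scores) / n,                     # averaged over all tasks
        "answer_rate": 100 * len(answered) / n,                # share of tasks with an answer
        "precision": (100 * sum(answered) / len(answered)) if answered else 0.0,
        "exact_match": 100 * sum(p.strip().lower() == g.strip().lower()
                                 for p, g in zip(predictions, golds)) / n,
    }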

How was AssistantBench collected?

To create AssistantBench, we first collected a seed set by asking study participants to share time-consuming web tasks they had recently performed. We then expanded the seed set by showing seed tasks as templates to crowdworkers, who wrote new tasks. Finally, domain experts contributed domain-specific tasks to increase diversity.

AssistantBench includes general and domain-specific realistic tasks

Getting started

To get started with AssistantBench, simply download our HuggingFace dataset. We provide a development set with task answers, URLs, and explanations. We keep the test set answers hidden for now and provide the option to submit predictions via our HuggingFace portal.
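For example, the dataset can be loaded in a few lines with the `datasets` library. This is a minimal sketch: the repository ID, split name, and field names below are assumptions, so check the dataset card on HuggingFace for the exact ones.

# Minimal sketch of loading the development split with HuggingFace `datasets`.
# The repo id, split name, and field names are assumptions; see the dataset card.
from datasets import load_dataset

dataset = load_dataset("AssistantBench/AssistantBench")   # assumed repo id
dev = dataset["validation"]                               # assumed name of the dev split

for example in dev.select(range(3)):
    # "task" and "answer" are assumed field names for the question and gold answer.
    print(example["task"], "->", example["answer"])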

BibTeX

@misc{yoran2024assistantbenchwebagentssolve,
      title={AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?},
      author={Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant},
      year={2024},
      eprint={2407.15711},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.15711},
}