AssistantBench evaluates the ability of web agents to automatically solve realistic and time-consuming tasks. The benchmark includes 214 tasks covering multiple domains, drawing on more than 525 pages from 258 different websites. Please check out our paper for more details.
As tasks in AssistantBench require long-horizon planning and transferring information between steps, we introduce SPA, a new web agent built to tackle tasks in AssistantBench by equipping SeeAct with specialized planning and memory components.
Even though SPA outperforms previous agents, AssistantBench remains challenging for current systems, with the best accuracy currently at 25.2%.
| Model | Accuracy | Answer rate | Precision | Exact match |
|---|---|---|---|---|
| SPA (ours) → Closed-book | 25.2 | 91.3 | 7.5 | 9.9 |
| SeeAct → Closed-book | 23.3 | 89.1 | 26.3 | 9.4 |
| Closed-book LM (1-shot) | 22.2 | 89.1 | 25.0 | 8.2 |
| Retrieval-augmented LM (1-shot) → CB | 19.4 | 92.5 | 21.2 | 6.1 |
| Retrieval-augmented LM (0-shot) → CB | 18.7 | 93.6 | 20.0 | 6.7 |
| Closed-book LM (0-shot) | 16.3 | 53.3 | 30.5 | 6.0 |
| Retrieval-augmented LM (0-shot) | 11.7 | 59.8 | 19.8 | 5.5 |
| SPA (ours) | 11.0 | 38.8 | 29.0 | 5.5 |
| Retrieval-augmented LM (1-shot) | 10.6 | 48.3 | 22.3 | 3.8 |
| SeeAct | 4.2 | 20.0 | 19.6 | 2.3 |
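To make the table's columns concrete, here is a minimal sketch of two of the simpler metrics: answer rate (the fraction of tasks for which a model emits any non-empty answer) and exact match (the fraction whose answer string equals the gold answer). The paper's accuracy metric is a softer, type-aware score and is not reproduced here; the `pred`/`gold` field names are illustrative, not the dataset schema.

```python
# Hedged sketch of answer rate and exact match.
# "pred" / "gold" are illustrative field names, not the dataset schema.

def answer_rate(records):
    """Fraction of tasks with a non-empty prediction."""
    return sum(1 for r in records if r["pred"].strip()) / len(records)

def exact_match(records):
    """Fraction of tasks whose prediction equals the gold answer exactly."""
    return sum(1 for r in records
               if r["pred"].strip() == r["gold"].strip()) / len(records)

records = [
    {"pred": "Paris", "gold": "Paris"},  # answered, exact match
    {"pred": "42",    "gold": "43"},     # answered, wrong
    {"pred": "",      "gold": "1998"},   # abstained (no answer)
]
print(answer_rate(records))  # 2 of 3 tasks answered
print(exact_match(records))  # 1 of 3 exact matches
```

Note how abstaining lowers the answer rate but can raise precision, which is why the closed-book fallback variants (→ CB) trade precision for coverage in the table above.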
To create AssistantBench, we first collected a seed set by asking participants in a study to share time-consuming web tasks they recently had. We then expanded the seed set by showing tasks from it as templates to crowd-workers. Finally, domain experts shared domain-specific tasks to increase diversity.
To get started with AssistantBench, simply download our HuggingFace dataset. We provide a development set with task answers, URLs, and explanations. We keep the test set answers hidden for now and provide the option to submit predictions via our HuggingFace portal.
BibTeX
@misc{yoran2024assistantbenchwebagentssolve,
title={AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?},
author={Ori Yoran and Samuel Joseph Amouyal and Chaitanya Malaviya and Ben Bogin and Ofir Press and Jonathan Berant},
year={2024},
eprint={2407.15711},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.15711},
}