BigLawBench
BigLawBench is Harvey AI's earlier evaluation suite for legal LLMs, focused on how well a model produces lawyer-quality work product when the task and supporting materials are provided in a single prompt. It predates the agent benchmark that this site explores.
Three components
- BigLawBench Core. Hand-written tasks split into Transactional and Litigation work, each with around eight subcategories such as drafting, due diligence, legal research, and case management. Each task ships with a long prose rubric scoring both substance and structure.
- BigLawBench Workflows. Composite, longer-running tasks where a model has to chain multiple steps to complete realistic deal or matter work.
- BigLawBench Retrieval. A test of how well a retrieval system can surface the right passage from a body of source documents, scored separately at the document and passage level; a toy scoring sketch follows this list.
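To make the two granularities concrete, here is a minimal Python sketch of how document-level and passage-level recall could be tallied. The data shapes, field names, and hit criteria are illustrative assumptions, not Harvey's published scoring harness.

```python
# Illustrative sketch of two-granularity retrieval scoring. The
# RetrievalResult fields and the hit criteria are assumptions for
# exposition, not taken from Harvey's harness.

from dataclasses import dataclass


@dataclass
class RetrievalResult:
    doc_id: str      # document the retrieved passage came from
    passage_id: str  # specific passage within that document


def score_query(results: list[RetrievalResult],
                gold_doc: str, gold_passage: str) -> tuple[bool, bool]:
    """Return (document hit, passage hit) for one query.

    Document-level: any retrieved passage comes from the right document.
    Passage-level: the exact gold passage itself was retrieved.
    """
    doc_hit = any(r.doc_id == gold_doc for r in results)
    passage_hit = any(r.doc_id == gold_doc and r.passage_id == gold_passage
                      for r in results)
    return doc_hit, passage_hit


def score_benchmark(per_query: list[tuple[bool, bool]]) -> dict[str, float]:
    """Aggregate per-query hits into the two headline recall numbers."""
    n = len(per_query)
    return {
        "doc_recall": sum(d for d, _ in per_query) / n,
        "passage_recall": sum(p for _, p in per_query) / n,
    }
```

A passage hit implies a document hit, so passage-level recall is always the stricter of the two numbers.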
How it differs from the agent benchmark
The agent benchmark explored on this site (harvey-labs) gives a model an environment with tools, files, and a workspace, and judges only whether the final deliverable meets a long checklist of objective pass-or-fail criteria. BigLawBench instead asks a question in a single prompt and judges the answer with a more open-ended rubric that scores both content and presentation. The two benchmarks are complementary: BigLawBench measures answer quality on a single turn, while the agent benchmark measures whether a model can drive multi-step legal work end to end.
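The difference in grading style can be sketched in a few lines. Both graders below are hypothetical stand-ins: the pass/fail `check` callable and the rubric `judge` callable are assumptions for illustration, not Harvey's actual implementation.

```python
# Hypothetical sketch of the two grading styles described above.

def grade_agent_task(deliverable: str, criteria: list[str], check) -> float:
    """Agent-benchmark style: every criterion is objective pass/fail.

    `check(deliverable, criterion)` -> bool is some deterministic test
    (string match, structural check, etc.); the score is the fraction
    of criteria passed.
    """
    passed = sum(check(deliverable, c) for c in criteria)
    return passed / len(criteria)


def grade_biglawbench_task(answer: str, rubric: str, judge) -> float:
    """BigLawBench style: one prose rubric, open-ended judgment.

    `judge(answer, rubric)` -> float in [0, 1] stands in for a human
    or LLM grader applying the rubric to both content and presentation.
    """
    return judge(answer, rubric)
```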
Public samples
Harvey publishes a small sample of BigLawBench tasks in its GitHub repository; the full evaluation set is held privately to prevent training-data contamination.
- blb-core/core-samples.csv — about 50 core task samples with prompts, attached documents, and rubrics (loaded in the sketch after this list).
- blb-retrieval/samples.csv — retrieval queries paired with their source domain.
- blb-workflows/spa — a sample stock purchase agreement workflow.
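A quick way to explore the published core samples, assuming only the repository layout listed above. The sketch does not assume specific column names; it prints whatever ships in the file.

```python
# Inspect the published BigLawBench core samples. The path comes from
# the file list above; the column layout is read from the CSV itself.

import pandas as pd

df = pd.read_csv("blb-core/core-samples.csv")
print(df.shape)          # roughly 50 rows of core task samples
print(list(df.columns))  # prompts, attached documents, rubrics, etc.
print(df.iloc[0])        # inspect one task end to end
```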