Are AI assistants already better than lawyers?

“Four of the seven legal tasks examined were performed better by AI than by lawyers.”

This is the striking conclusion of the first large-scale benchmark study of legal AI assistants, conducted by Vals AI (a company that builds independent, practice-oriented benchmarks to evaluate language models fairly and accurately on realistic, domain-specific tasks using its own secure evaluation infrastructure) in collaboration with Legaltech Hub. In the study, published in late February 2025, four widely used AI assistants were tested on seven concrete legal tasks, and their performance was systematically compared with the work of a control group of lawyers: the so-called ‘Lawyer Baseline’.

The results leave little room for doubt: generative AI is on its way to becoming a fully-fledged legal tool. Some AI assistants already outperform lawyers in tasks such as answering legal questions about documents, summarizing legal texts, and analyzing transcripts. At the same time, there remain clearly defined situations where human expertise still performs better.

What exactly does this research look like? How precisely do the AI assistants perform? And what can legal professionals learn from this when selecting and applying AI in their daily practice?

The study design

The study did not use hypothetical cases, but real questions and documents from the daily practice of law firms.

Key tasks

In collaboration with eight international firms, a list of seven diverse legal tasks was established:

  1. Data extraction

AI had to locate specific data (such as clauses, amounts, or conditions) in legal documents. Examples include finding a termination provision or drafting an overview of securities from multiple contracts.

  2. Answering questions about documents (Q&A)

The AI assistants were presented with legal questions about documents such as employment contracts, policy texts, and compliance materials (the study does not specify these further). The crucial factor was whether the answer was substantively correct and complete.

  3. Summarizing legal documents

AI was asked to summarize complex documents—such as legislative texts and contracts—into one or a few paragraphs.

  4. Redlining

This involved identifying deviations from a standard clause, assessing changes, or adjusting provisions themselves based on user instructions.

  5. Analysis of hearing transcripts

The AI assistants had to extract relevant information from court transcripts, such as which party was represented by whom or when certain remarks or statements were made during the hearing.

  6. Drafting timelines

AI had to place facts and events from a document in the correct order, including dates and descriptions.

  7. EDGAR Research

A complex task where AI had to answer questions based on public documents in the American EDGAR database, which contains financial documents such as annual reports and prospectuses.

Methodology and assessment

For each task, the participating firms collected sample questions, accompanying documents, and clear assessment criteria. In total, this involved more than 500 scenarios, primarily sourced from large international firms. This makes the dataset a representative reflection of real legal work.

Four AI assistants participated: Harvey, CoCounsel (Thomson Reuters), Vincent AI (vLex), and Oliver (Vecflow). Each provider chose which tasks their tool would be tested on. LexisNexis withdrew prior to publication.

To properly compare performance, a Lawyer Baseline was also established: a control group of experienced lawyers performed the same tasks without the aid of AI. Their results were then systematically compared with the output of the AI assistants.

All answers were automatically assessed using a specialized evaluation model developed by Vals AI. This model operates on the so-called ‘LLM-as-judge’ principle: it compares each answer with a pre-established reference and tests individual legal elements—such as correct statutory references, relevant facts, or appropriate application of a standard—against objective criteria. Each component is graded with a pass/fail score and contributes to the overall score of the answer. This method allows for the scalable, consistent, and objective assessment of hundreds of answers—something that would take over 400 hours manually and would be subject to differences in interpretation between legal professionals.
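The pass/fail rubric described above can be sketched in a few lines. The criterion names and the equal weighting below are illustrative assumptions; the study does not publish its rubric in this form:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str      # e.g. "cites the correct statutory provision"
    passed: bool   # the judge model's pass/fail verdict for this element

def score_answer(criteria: list[Criterion]) -> float:
    """Overall score = fraction of rubric elements the answer passed."""
    if not criteria:
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)

# Hypothetical rubric for one answer; names and verdicts are invented.
rubric = [
    Criterion("correct statutory reference", True),
    Criterion("all relevant facts mentioned", True),
    Criterion("standard applied appropriately", False),
    Criterion("no hallucinated clauses", True),
]
print(score_answer(rubric))  # 3 of 4 elements passed -> 0.75
```

The point of the sketch is the aggregation: each legal element is a binary check, and the answer's score is the share of checks it passes, which is what makes hundreds of answers comparable at scale.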

How do AI assistants perform per legal task area?

The study provides a detailed picture of the performance of AI assistants across the seven legal tasks.

  1. Data extraction

Automatically retrieving specific information from documents is a task that AI now handles well. AI assistants such as Harvey (75.1%) and CoCounsel (73.2%) scored higher than the Lawyer Baseline, which stood at 71.1%. In this task, questions included “What is the term of this lease agreement?” or “What does the clause on assignability state?”

Performance was particularly strong on simple questions over short documents. As soon as multiple documents had to be combined, or legal concepts such as “most favored nation” were referenced only implicitly, errors occurred more frequently. Nevertheless, AI proves to offer a valuable starting point here.

  2. Document Q&A

AI assistants excel at answering specific questions about legal documents. With an average score of 80.2%, this was the best-performing task category in the entire study. Harvey Assistant (94.8%) and CoCounsel (89.6%) set the standard.

Example questions include: “Can this contract be terminated in the event of default?” or “What are the landlord’s obligations according to Article X?” AI proved to be not only fast but also consistent in identifying relevant passages—often more accurately than the lawyers, who sometimes forgot to mention crucial details.

  3. Summarizing documents

Summarizing the main points of legal documents—for example, a prospectus or contract—is a task that AI performs remarkably well. All tested AI assistants scored higher than the Lawyer Baseline (50.3%). CoCounsel achieved the highest score at 77.2%, followed by Harvey at 72.1%.

Accuracy appeared to depend partly on how concise the summary needed to be. Assistants that provided longer answers often scored better because they were able to identify more relevant elements. AI can thus be a useful tool for initial document exploration.

  4. Redlining

Adjusting or analyzing contract texts based on a standard provision proves to be one of the most difficult tasks for AI. The Lawyer Baseline here was 79.7%—significantly higher than the top-scoring AI tool (Harvey: 65.0%).

The difference lies primarily in nuance. While lawyers carefully weigh how a provision should be adapted to the context, AI assistants often literally copy standard texts into the contract. Complex reformulations or considerations are lacking. For now, AI is not suitable as a standalone legal instrument for this purpose.

  5. Transcript Analysis

When analyzing hearing transcripts and procedural documents, AI delivered convincing results. Both Harvey and Vincent AI scored well above the Lawyer Baseline of 53.7%, with 77.8% and 64.8% respectively.

The challenge in this task lies in the messy formatting of transcriptions and the need to establish connections across multiple pages. AI proves surprisingly capable of linking speakers, context, and content—a task that is very time-consuming for legal professionals.

  6. Chronology Building

Generating a chronological overview of events—for example, in a dispute or compliance investigation—proved to be a task in which both AI and legal professionals scored strongly. Harvey and the Lawyer Baseline finished tied at 80.2%. CoCounsel followed with 78.0%.

Although the differences are small, this is a textbook example of the power of a ‘human + machine’ approach: AI provides an initial draft of the timeline, after which a legal professional checks the details and supplements them where necessary.
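The ‘human + machine’ division of labor can be illustrated with a minimal sketch: an AI extraction pass produces dated events, which are then sorted into a draft timeline for a lawyer to verify. The events below are invented examples, not items from the study's dataset:

```python
from datetime import date

# Stand-in for AI output: (date, description) pairs extracted from a document.
events = [
    (date(2023, 6, 1), "Notice of default sent to counterparty"),
    (date(2022, 3, 15), "Master services agreement signed"),
    (date(2023, 9, 30), "Termination letter received"),
]

# Sorting tuples orders them by date first; this draft is then human-reviewed.
timeline = sorted(events)
for when, what in timeline:
    print(f"{when.isoformat()}  {what}")
```

The mechanical step (ordering) is exactly what the machine does cheaply; judging whether the extracted dates and descriptions are correct and complete remains the lawyer's job.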

  7. EDGAR Research

Researching American stock exchange documentation via the EDGAR system was by far the most difficult task for AI. The only tool that ventured to participate in this was Oliver, which scored 55.2%, significantly lower than the Lawyer Baseline of 70.1%.
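For readers curious what EDGAR research involves programmatically, here is a minimal sketch that builds a filings-query URL for the SEC's public browse-edgar endpoint. The CIK shown (Apple Inc.) and the parameter choices are illustrative assumptions; the study does not describe how Oliver accesses EDGAR:

```python
from urllib.parse import urlencode

BASE = "https://www.sec.gov/cgi-bin/browse-edgar"

def filings_url(cik: str, form_type: str = "10-K", count: int = 10) -> str:
    """Build a browse-edgar query for a company's filings of a given form type."""
    params = {
        "action": "getcompany",  # list filings for one company
        "CIK": cik,              # SEC Central Index Key, e.g. Apple Inc. below
        "type": form_type,       # filter by form type (10-K, 10-Q, S-1, ...)
        "count": count,
        "output": "atom",        # machine-readable feed instead of HTML
    }
    return f"{BASE}?{urlencode(params)}"

print(filings_url("0000320193"))
```

Locating the right filings is only the first step of the benchmark task; answering a question then requires reading long, dense disclosure documents, which helps explain why this was the hardest task in the study.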

Conclusion

The benchmark study by Vals AI shows that AI assistants now outperform lawyers on four of the seven investigated legal tasks. AI shows its strongest side particularly in document Q&A, summaries, and transcript analysis. Nevertheless, clear boundaries remain: for tasks requiring more legal nuance or judgment, such as redlining or open searches in EDGAR, humans are still better for the time being.

The strength of this study lies in the combination of realistic practical situations, objective assessment, and a fair comparison with the work of experienced legal professionals. At the same time, there are limitations: some AI assistants were tested outside their optimal workflow, and the automatic assessment by Vals AI—based on the ‘LLM-as-judge’ method—is a reasonable approximation of human review, but not a replacement for an experienced legal professional. Where necessary, a second check was performed on erroneous scores, but full legal assessment remains desirable for a definitive judgment.

Nevertheless, the conclusion is clear: generative AI is rapidly developing into a fully-fledged tool within legal practice. By combining these AI assistants with human legal knowledge and experience, a powerful new way of working emerges in which time savings, quality, and accessibility can go hand in hand.

LegalMike in Action

Every two weeks on Friday afternoons, we organize a digital knowledge session. During these sessions, we demonstrate how to optimally utilize LegalMike in your legal practice, from real-world examples to practical tips.

The next knowledge session will take place on April 10.
