Can you trust generative AI for legal work?

“The central question we asked at the outset was simple yet fundamental: can you trust these four systems for daily legal work? The answer turns out to be nuanced but clear: yes, provided you deploy the systems where their strengths lie and systematically verify their output.”

This is the conclusion of a recent study by ICTRecht, a legal consultancy specializing in ICT and law. In the study, a team of legal experts tested four generative AI systems: ChatGPT, Claude, Copilot, and Gemini. The goal? To gain insight into how reliable and useful these systems are for performing legal work. The team looked not only at the performance of the various systems but also at the differences between free and paid versions.

The results are unambiguous: the free versions of the AI systems studied, for example, are unsuitable for legal work. They lack consistency, give incorrect answers more frequently, and offer limited functionality. But what does this mean for legal professionals who want to integrate AI into their work processes? Which AI system performs best in the legal field? What are the greatest risks of using AI for legal analyses? And how do you ensure that AI becomes a valuable addition rather than a source of errors?

The setup of the study

Since legal work requires precision, nuance, and a deep understanding of laws and regulations, the ICTRecht study was deliberately set up as a practical test aligned with the daily working methods of legal professionals. The four AI systems were tested without optimized prompts or extensive additional instructions, using only direct questions of the kind an average lawyer would ask. The test results therefore reflect the baseline performance of the AI systems.

To obtain a broad picture of AI performance in different legal contexts, five areas of law were selected:

  1. Privacy law
  2. Contract law
  3. Intellectual property law
  4. Corporate law
  5. Employment law

Three types of questions were presented per area of law:

  • Basic questions about legislation – For example: “What transfer mechanisms does the GDPR recognize?”
  • Practice-oriented questions with case law – For example: “How does the protection of databases under the Database Directive differ from protection under copyright law?”
  • Complex cases – For example: “An employee with a non-compete clause wants to move to a new employer after a merger, but the new position partially overlaps (since the merger) with the prohibited field of work. What are the consequences of this?”

The full list of questions per area of law has been published as a separate appendix to the study.

Furthermore, an assessment model was used that measures three core aspects of legal quality:

  • Accuracy: are the statutory articles, deadlines, and requirements correct?
  • Relevance: does the AI answer the core of the question?
  • Completeness: is the legal analysis sufficiently in-depth?

The generated answers were assessed as a corporate lawyer would review legal advice: not only on correctness but also on applicability and substantiation.

The four AI systems were tested in two phases:

  • December 2024 – Free versions (each question was asked twice to measure consistency).
  • January 2025 – Paid versions (a single test round, as the answers proved to be more consistent).

In addition, testing was done exclusively through the standard chat interface, without extra system prompts or settings.

Insights from the study

The research results show that the four AI systems can play a supporting role within the legal sector, provided the systems are deployed correctly. The performance of the AI systems varies significantly, with paid models in particular distinguishing themselves in accuracy and consistency. At the same time, there remains a structural risk of errors and inaccuracies when using the tested AI systems.

Key limitations and risks when using the four AI systems for legal work:

  • AI lacks legal depth – paid AI models generate extensive answers but often lack the necessary legal nuance. Although relevance is high, answers regularly contain irrelevant digressions.
  • Blind trust in AI is risky – all tested systems have a tendency to generate non-existent case law and statutory articles. This risk is smaller with paid versions. Some AI answers contain legal reasoning based on incorrect assumptions.
  • Free versions are not usable for legal work – the free versions show substandard performance in virtually all areas, and their use in legal practice is therefore strongly discouraged.
  • Confidentiality – free versions of AI systems often do not contain explicit confidentiality provisions and may process client input for further training of the models. As a result, it is not always clear how the information is used or stored. The enterprise versions of AI systems in some cases offer explicit confidentiality provisions, comparable to a Non-Disclosure Agreement (NDA). This provides some protection but requires organizations to maintain clear internal guidelines for the use of AI with confidential files.
  • Data protection and privacy – the four tested AI systems run on American servers. This means that any legal question or document entered into these systems may be processed outside the EU. This raises GDPR compliance issues, especially due to US laws such as the CLOUD Act, which can give US authorities access to stored data, even if it is located in the EU. Paid versions, and particularly enterprise licenses, often contain stricter provisions regarding data security, confidentiality, and liability. Microsoft’s Copilot Enterprise with EU data boundaries offers the most robust safeguards, but other providers are also introducing security measures.

The strengths of the four AI systems when used for legal work:

  • AI is suitable for exploratory legal work – paid AI models can be effectively deployed for structuring legal information, such as drafting checklists for due diligence and categorizing contract provisions. This does require systematic verification of all source references.
  • Claude 3.5 Sonnet and ChatGPT 4o Plus perform best – these paid models score highest on legal accuracy and consistency. Claude 3.5 Sonnet excels in completeness and analytical capability. Microsoft Copilot and Google Gemini deliver varying results, with significant quality differences between answers.
  • AI performs better in complex analyses than in basic questions – contrary to expectations, AI models score better on complex, logically structured issues than on simple legislative questions. This suggests that the tested systems are better suited for legal analyses and argumentation structures than for simply reproducing legal texts and fixed rules.
  • European law is better understood than national law – the four tested AI systems perform strongly in harmonized areas of law such as privacy law and intellectual property law.
  • AI is a tool, not a replacement – the systems are useful for orientational legal research and can help in quickly exploring new areas of law.

Conclusion

Generative AI has the potential to support legal professionals in structuring and analyzing information, but the ICTRecht study shows that this technology is not yet flawless. The paid versions of the four tested AI models, such as Claude 3.5 Sonnet and ChatGPT 4o Plus, deliver the best performance in terms of legal accuracy and consistency. Yet limitations remain: AI lacks consistent legal depth, sometimes generates incorrect sources, and struggles with national law.

The biggest pitfall lies in blind trust in AI output. Systems present answers convincingly, even when they are legally incorrect. This means that every source must be checked and that AI output cannot simply be used as legal advice. Additionally, there are significant compliance challenges, especially regarding data protection and confidentiality. Many AI systems run on US servers and fall under laws such as the CLOUD Act. This can entail legal risks.

In short, generative AI can support legal professionals, but human expertise remains indispensable. The future lies in a combination of legal knowledge and technological skills, where AI is not the decision-maker but a powerful tool in the hands of the trained legal professional.

LegalMike in Action

Every two weeks on Friday afternoons, we organize a digital knowledge session. During these sessions, we demonstrate how to optimally utilize LegalMike in your legal practice, from real-world examples to practical tips.

The next knowledge session will take place on April 10.
