Italian-Legal-BERT Brings Domain-Specific NLP to Italian Law

Researchers at Scuola Superiore Sant'Anna in Pisa built Italian-Legal-BERT, a 110M-parameter model trained on 3.7GB of Italian court decisions that outperforms general Italian BERT on legal NLP tasks.

Researchers at Scuola Superiore Sant'Anna in Pisa released Italian-Legal-BERT, a domain-adapted BERT model trained on 3.7 gigabytes of Italian civil law cases from the National Jurisprudential Archive. The 110-million parameter model outperforms general-purpose Italian BERT on legal NLP tasks like term prediction, holding extraction, and document classification. It's available on Hugging Face under the Academic Free License with seven model variants, a companion dataset, and a demo notebook.

TL;DR

  • 110M-parameter BERT model domain-adapted on 3.7GB of Italian court decisions from the National Jurisprudential Archive
  • Outperforms general Italian BERT (dbmdz) on legal tasks - correctly predicts legal terms like "ricorrente" (applicant) at 72.6% confidence
  • Seven variants: base, from-scratch (6.6GB corpus), distilled, and long-sequence (4K/16K tokens)
  • Built by Daniele Licari and Giovanni Comandé at Sant'Anna, Pisa - published at EKAW 2022 and in Computer Law & Security Review (2024)
  • ~1,600 downloads/month on Hugging Face - the most prominent non-English legal BERT model

Italian legal language is a dialect unto itself. Court decisions preserve Latin-derived vocabulary that doesn't appear in modern standard Italian. Sentences routinely span entire paragraphs with deeply nested subordinate clauses. Civil law, criminal law, and administrative law each carry distinct terminologies. And Italian court judgments follow rigid rhetorical structures - fatto (facts), diritto (legal reasoning), dispositivo (ruling) - that general language models haven't learned to parse.

Standard Italian BERT, trained on Wikipedia and web text, has poor coverage of this vocabulary. When asked to fill a masked token in "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento" (The [MASK] requested the payment obligation be revoked), general BERT guesses common nouns. Italian-Legal-BERT predicts "ricorrente" (applicant) at 72.6% confidence, followed by "convenuto" (defendant) at 9.6% and "resistente" (respondent) at 4.0% - exactly the terms a lawyer would expect.
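The masked-token comparison above can be reproduced with the standard Hugging Face fill-mask pipeline. A minimal sketch, assuming the model id `dlicari/Italian-Legal-BERT` (verify against the model card):

```python
# Sketch: querying Italian-Legal-BERT's masked-language-model head.
# Model id assumed from Hugging Face; exact scores may vary by version.
from transformers import pipeline

fill = pipeline("fill-mask", model="dlicari/Italian-Legal-BERT")
sentence = "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento"

# Print the top three candidate tokens with their confidence scores
for pred in fill(sentence, top_k=3):
    print(f"{pred['token_str']:>12}  {pred['score']:.1%}")
```

Running the same prompt through the general `dbmdz` checkpoint makes the vocabulary gap easy to see side by side.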

How It Was Built

The researchers took two approaches. The primary model, Italian-Legal-BERT, starts from dbmdz/bert-base-italian-xxl-cased - a strong general Italian BERT trained on 81GB of text by the Bavarian State Library - and continues pre-training on 3.7GB of preprocessed Italian civil law text for 4 epochs (8.4 million steps) on a single NVIDIA V100.

The from-scratch variant, Italian-Legal-BERT-SC, uses a CamemBERT (RoBERTa) architecture with a custom SentencePiece tokenizer trained on legal text. It was pre-trained on a larger 6.6GB corpus covering both civil and criminal cases for 1 million steps across 8x NVIDIA A100 GPUs.

Both approaches validate the same finding from English Legal-BERT research: domain-specific continued pre-training consistently beats general models on downstream legal tasks, even when the domain corpus is relatively small.

Model Variants

The team released a full family of models on Hugging Face:

| Model | Purpose | Size | Downloads/mo |
|---|---|---|---|
| Italian-Legal-BERT | Domain-adapted base | 110M | ~1,600 |
| Italian-Legal-BERT-SC | Trained from scratch | 110M | ~40 |
| distil-ita-legal-bert | Distilled for embeddings | 4 layers | ~590 |
| lsg16k-Italian-Legal-BERT | Long sequence (16K tokens) | 110M | ~85 |
| lsg4k-Italian-Legal-BERT | Long sequence (4K tokens) | 110M | ~1 |

The distilled variant produces 768-dimensional sentence embeddings optimized for semantic similarity in legal text - useful for finding related case law. The long-sequence variants extend context from BERT's standard 512 tokens to 4K or 16K, addressing the paragraph-length sentences common in Italian court decisions.

Applications

The model has been applied to several legal NLP tasks:

  • Legal holding extraction: Identifying the key legal principle in court decisions, evaluated on the ITA-CASEHOLD dataset (1,101 Italian judgment-holding pairs from 2019-2022)
  • Rhetorical role classification: Automatically labeling sections of legal documents as facts, reasoning, or rulings
  • Named entity recognition: Extracting parties, courts, statutes, and legal concepts from text
  • Banking regulation Q&A: A 2024 paper applied the model to supervisory regulation question answering
  • Sentence similarity: Finding semantically related provisions across documents

A third-party fine-tune for Italian question answering (SQuAD-IT format) also exists on Hugging Face.
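Tasks like rhetorical role classification follow the standard sequence-classification fine-tuning pattern. A minimal sketch; the label set (fatto, diritto, dispositivo) mirrors the structure described above, but the head is untrained here, so predictions are meaningless until the model is fine-tuned on labeled data:

```python
# Sketch: Italian-Legal-BERT with a classification head for rhetorical
# roles. Labels are illustrative; the head below still needs fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["fatto", "diritto", "dispositivo"]
tok = AutoTokenizer.from_pretrained("dlicari/Italian-Legal-BERT")
clf = AutoModelForSequenceClassification.from_pretrained(
    "dlicari/Italian-Legal-BERT", num_labels=len(labels))

def classify(text: str) -> str:
    """Return the predicted rhetorical role for a passage."""
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = clf(**inputs).logits
    return labels[int(logits.argmax(dim=-1))]
```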

Context

Italian-Legal-BERT is the most prominent non-English legal BERT model available. English Legal-BERT from the Athens University of Economics (Chalkidis et al., EMNLP 2020), trained on 12GB of EU, UK, and US law, gets 232,000 downloads per month and established the paradigm. Similar models exist for Portuguese (Brazilian law) and Romanian, but among non-English legal models, Italian-Legal-BERT leads in community adoption, with 38 likes on Hugging Face.

The work comes from LIDER-LAB, an interdisciplinary legal research lab at Sant'Anna founded by co-author Giovanni Comandé, a comparative law professor with an LL.M. from Harvard. The lab's focus on predictive justice and AI governance positions the model as part of a broader research agenda on computational legal analysis in civil law systems.

For a deeper look at the architecture, training, and benchmark evaluations, see the model card.

About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.