Italian-Legal-BERT Brings Domain-Specific NLP to Italian Law

Researchers at Scuola Superiore Sant'Anna in Pisa built Italian-Legal-BERT, a 110M-parameter model trained on 3.7GB of Italian court decisions that outperforms general Italian BERT on legal NLP tasks.

Researchers at Scuola Superiore Sant'Anna in Pisa released Italian-Legal-BERT, a domain-adapted BERT model trained on 3.7 gigabytes of Italian civil law cases from the National Jurisprudential Archive. The 110-million parameter model outperforms general-purpose Italian BERT on legal NLP tasks like term prediction, holding extraction, and document classification. It's available on Hugging Face under the Academic Free License with seven model variants, a companion dataset, and a demo notebook.

TL;DR

  • 110M-parameter BERT model domain-adapted on 3.7GB of Italian court decisions from the National Jurisprudential Archive
  • Outperforms general Italian BERT (dbmdz) on legal tasks - correctly predicts legal terms like "ricorrente" (applicant) at 72.6% confidence
  • Seven variants: base, from-scratch (6.6GB corpus), distilled, and long-sequence (4K/16K tokens)
  • Built by Daniele Licari and Giovanni Comandé at Sant'Anna, Pisa - published at EKAW 2022 and in Computer Law & Security Review (2024)
  • ~1,600 downloads/month on Hugging Face - the most prominent non-English legal BERT model

Italian legal language is a dialect unto itself. Court decisions preserve Latin-derived vocabulary that doesn't appear in modern standard Italian. Sentences routinely span entire paragraphs with deeply nested subordinate clauses. Civil law, criminal law, and administrative law each carry distinct terminologies. And Italian court judgments follow rigid rhetorical structures - fatto (facts), diritto (legal reasoning), dispositivo (ruling) - that general language models haven't learned to parse.

Standard Italian BERT, trained on Wikipedia and web text, has poor coverage of this vocabulary. When asked to fill a masked token in "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento" (The [MASK] requested the payment obligation be revoked), general BERT guesses common nouns. Italian-Legal-BERT predicts "ricorrente" (applicant) at 72.6% confidence, followed by "convenuto" (defendant) at 9.6% and "resistente" (respondent) at 4.0% - exactly the terms a lawyer would expect.
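The masked-token comparison above can be reproduced with the standard Hugging Face fill-mask pipeline. A minimal sketch, assuming the model id `dlicari/Italian-Legal-BERT` (verify against the model card):

```python
# Sketch: querying Italian-Legal-BERT's masked-language-model head.
# Model id assumed from Hugging Face; exact scores may vary by version.
from transformers import pipeline

fill = pipeline("fill-mask", model="dlicari/Italian-Legal-BERT")
sentence = "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento"

# Print the top three candidate tokens with their confidence scores
for pred in fill(sentence, top_k=3):
    print(f"{pred['token_str']:>12}  {pred['score']:.1%}")
```

Running the same prompt through the general `dbmdz` checkpoint makes the vocabulary gap easy to see side by side.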

How It Was Built

The researchers took two approaches. The primary model, Italian-Legal-BERT, starts from dbmdz/bert-base-italian-xxl-cased - a strong general Italian BERT trained on 81GB of text by the Bavarian State Library - and continues pre-training on 3.7GB of preprocessed Italian civil law text for 4 epochs (8.4 million steps) on a single NVIDIA V100.

The from-scratch variant, Italian-Legal-BERT-SC, uses a CamemBERT (RoBERTa) architecture with a custom SentencePiece tokenizer trained on legal text. It was pre-trained on a larger 6.6GB corpus covering both civil and criminal cases for 1 million steps across 8x NVIDIA A100 GPUs.

Both approaches validate the same finding from English Legal-BERT research: domain-specific continued pre-training consistently beats general models on downstream legal tasks, even when the domain corpus is relatively small.

Model Variants

The team released a full family of models on Hugging Face:

| Model | Purpose | Size | Downloads/mo |
|---|---|---|---|
| Italian-Legal-BERT | Domain-adapted base | 110M | ~1,600 |
| Italian-Legal-BERT-SC | Trained from scratch | 110M | ~40 |
| distil-ita-legal-bert | Distilled for embeddings | 4 layers | ~590 |
| lsg16k-Italian-Legal-BERT | Long sequence (16K tokens) | 110M | ~85 |
| lsg4k-Italian-Legal-BERT | Long sequence (4K tokens) | 110M | ~1 |

The distilled variant produces 768-dimensional sentence embeddings optimized for semantic similarity in legal text - useful for finding related case law. The long-sequence variants extend context from BERT's standard 512 tokens to 4K or 16K, addressing the paragraph-length sentences common in Italian court decisions.

Applications

The model has been applied to several legal NLP tasks:

  • Legal holding extraction: Identifying the key legal principle in court decisions, evaluated on the ITA-CASEHOLD dataset (1,101 Italian judgment-holding pairs from 2019-2022)
  • Rhetorical role classification: Automatically labeling sections of legal documents as facts, reasoning, or rulings
  • Named entity recognition: Extracting parties, courts, statutes, and legal concepts from text
  • Banking regulation Q&A: A 2024 paper applied the model to supervisory regulation question answering
  • Sentence similarity: Finding semantically related provisions across documents

A third-party fine-tune for Italian question answering (SQuAD-IT format) also exists on Hugging Face.
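Tasks like rhetorical role classification follow the standard sequence-classification fine-tuning pattern. A minimal sketch; the label set (fatto, diritto, dispositivo) mirrors the structure described above, but the head is untrained here, so predictions are meaningless until the model is fine-tuned on labeled data:

```python
# Sketch: Italian-Legal-BERT with a classification head for rhetorical
# roles. Labels are illustrative; the head below still needs fine-tuning.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["fatto", "diritto", "dispositivo"]
tok = AutoTokenizer.from_pretrained("dlicari/Italian-Legal-BERT")
clf = AutoModelForSequenceClassification.from_pretrained(
    "dlicari/Italian-Legal-BERT", num_labels=len(labels))

def classify(text: str) -> str:
    """Return the predicted rhetorical role for a passage."""
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = clf(**inputs).logits
    return labels[int(logits.argmax(dim=-1))]
```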

Context

Italian-Legal-BERT is the most prominent non-English legal BERT model available. English Legal-BERT from the Athens University of Economics (Chalkidis et al., EMNLP 2020), trained on 12GB of EU, UK, and US law, gets 232,000 downloads per month and established the paradigm. Similar models exist for Portuguese (Brazilian law) and Romanian, but among non-English legal models, Italian-Legal-BERT leads in community adoption, with 38 likes on Hugging Face.

The work comes from LIDER-LAB, an interdisciplinary legal research lab at Sant'Anna founded by co-author Giovanni Comandé, a comparative law professor with an LL.M. from Harvard. The lab's focus on predictive justice and AI governance positions the model as part of a broader research agenda on computational legal analysis in civil law systems.

For a deeper look at the architecture, training, and benchmark evaluations, see the model card.

About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.