Italian-Legal-BERT

Italian-Legal-BERT is a domain-adapted BERT model purpose-built for Italian legal text. Created by Daniele Licari and Giovanni Comande at Scuola Superiore Sant'Anna in Pisa, it takes a strong general Italian BERT and continues pre-training on 3.7 gigabytes of civil law decisions from Italy's National Jurisprudential Archive. The result is a 110-million parameter model that correctly handles the archaic vocabulary, paragraph-length sentences, and rigid rhetorical structures of Italian court documents where general-purpose models fail.

TL;DR

  • Domain-adapted BERT trained on 3.7GB of Italian civil law from the National Jurisprudential Archive
  • Model family: domain-adapted base, from-scratch, distilled, and long-sequence (4K/16K) variants
  • Applied to holding extraction, NER, rhetorical classification, legal Q&A
  • ~1,600 downloads/month on Hugging Face - leading non-English legal BERT
  • AFL-3.0 license, runs on a single GPU

Key Specifications

| Specification | Details |
|---|---|
| Provider | Scuola Superiore Sant'Anna, Pisa |
| Authors | Daniele Licari, Giovanni Comande |
| Parameters | ~110M |
| Architecture | BERT (domain-adapted) / CamemBERT-RoBERTa (from-scratch) |
| Base Model | dbmdz/bert-base-italian-xxl-cased (81 GB Italian corpus) |
| Training Data | 3.7 GB Italian civil law (adapted); 6.6 GB civil + criminal (from-scratch) |
| Training Steps | 8.4M steps / 4 epochs (adapted); 1M steps (from-scratch) |
| Context Window | 512 tokens (base); 4,096 / 16,384 tokens (LSG variants) |
| Vocabulary | 32K tokens (from-scratch variant uses a custom SentencePiece tokenizer) |
| Hardware | 1x V100 16GB (adapted); 8x A100 40GB (from-scratch) |
| License | AFL-3.0 (Academic Free License) |
| Release | September 2022 (EKAW); extended paper 2024 (Computer Law & Security Review) |

Italian legal language (linguaggio giuridico) diverges sharply from standard Italian. Court decisions preserve Latin-derived terms - "ricorrente" (applicant), "convenuto" (defendant), "resistente" (respondent) - that rarely appear in web text or Wikipedia. Sentences routinely span entire paragraphs with deeply nested subordinate clauses. Each branch of law (civil, criminal, administrative) carries distinct vocabulary. And Italian judgments follow rigid rhetorical patterns: fatto (facts), diritto (legal reasoning), dispositivo (ruling).

General Italian BERT, trained on 81GB of Wikipedia and web crawl data, has near-zero coverage of this specialized vocabulary. Fill-mask tests demonstrate the gap clearly:

Prompt: "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento" (The [MASK] requested the payment obligation be revoked)

| Model | Top prediction | Confidence |
|---|---|---|
| Italian-Legal-BERT | ricorrente (applicant) | 72.6% |
| Italian-Legal-BERT | convenuto (defendant) | 9.6% |
| Italian-Legal-BERT | resistente (respondent) | 4.0% |
| General Italian BERT | comune (municipality) | - |

The legal model predicts the three most contextually appropriate legal parties. The general model guesses a common noun unrelated to the legal context.
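
The gap is easy to reproduce with the Hugging Face fill-mask pipeline. Below is a minimal sketch; the Hub ids dlicari/Italian-Legal-BERT and dbmdz/bert-base-italian-xxl-cased are assumed, and exact scores will vary slightly with library versions.

```python
from transformers import pipeline

# Compare the top-3 mask predictions of the legal and the general Italian model.
# Model ids are assumed from the Hugging Face Hub, not quoted from the paper.
legal_fill = pipeline("fill-mask", model="dlicari/Italian-Legal-BERT")
general_fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

prompt = "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento"

for name, fill in [("legal", legal_fill), ("general", general_fill)]:
    for pred in fill(prompt, top_k=3):
        print(f"{name:>7}: {pred['token_str']:<12} {pred['score']:.1%}")
```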

Model Variants

The research team released a full family of models addressing different use cases:

| Model | Approach | Training Data | Context | Downloads/mo |
|---|---|---|---|---|
| Italian-Legal-BERT | Domain-adapted | 3.7 GB civil law | 512 | ~1,600 |
| Italian-Legal-BERT-SC | From scratch | 6.6 GB civil + criminal | 512 | ~40 |
| distil-ita-legal-bert | Distilled embeddings | Same as base | 512 | ~590 |
| lsg16k-Italian-Legal-BERT | Long sequence | 3.7 GB civil | 16,384 | ~85 |
| lsg4k-Italian-Legal-BERT | Long sequence | 3.7 GB civil | 4,096 | ~1 |

Domain-adapted (base): Starts from dbmdz/bert-base-italian-xxl-cased and continues pre-training on legal text. This approach preserves general language understanding while specializing on legal vocabulary. Trained on a single V100 for 4 epochs.
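
The recipe is standard continued pre-training with a masked-language-modelling objective. A minimal sketch follows, assuming the dbmdz checkpoint and a toy in-memory dataset standing in for the 3.7 GB archive dump; the batch size and other hyperparameters are placeholders, not the authors' configuration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the general Italian checkpoint and keep pre-training on legal text.
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# Placeholder corpus: in practice, millions of sentences from court decisions.
corpus = Dataset.from_dict(
    {"text": ["Il ricorrente ha chiesto revocarsi l'obbligo di pagamento."]}
)
tokenized = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="italian-legal-bert-adapted",
        num_train_epochs=4,              # matches the 4 epochs reported above
        per_device_train_batch_size=8,   # placeholder value
    ),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens and builds the MLM labels on the fly.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15),
)
trainer.train()
```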

From-scratch (SC): Uses a CamemBERT/RoBERTa architecture with a custom SentencePiece tokenizer built from legal text (32K vocabulary). Trained on a larger 6.6GB corpus covering both civil and criminal law for 1M steps across 8x A100 GPUs. Captures criminal law terminology that the adapted version misses.

Distilled: A 4-layer model producing 768-dimensional sentence embeddings with mean pooling. Optimized for semantic similarity search across legal documents - useful for finding related case law quickly.
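
In practice this means encoding each document once and ranking by cosine similarity. A sketch, assuming the Hub id dlicari/distil-ita-legal-bert and that the checkpoint loads directly through sentence-transformers (otherwise mean pooling over token embeddings would need to be wired up manually):

```python
from sentence_transformers import SentenceTransformer, util

# Hub id assumed; each sentence is mapped to a 768-dimensional embedding.
model = SentenceTransformer("dlicari/distil-ita-legal-bert")

sentences = [
    "Il ricorrente ha chiesto la revoca dell'obbligo di pagamento",         # revoke the payment obligation
    "La parte attrice domanda l'annullamento dell'obbligo di versamento",   # same request, different wording
    "Il contratto di locazione è stato risolto per morosità",               # unrelated: lease terminated for arrears
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity matrix: the first two sentences should score highest.
print(util.cos_sim(embeddings, embeddings))
```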

Long-sequence (LSG): Extends the 512-token context limit to 4K or 16K tokens using Local-Sparse-Global attention. Critical for Italian legal text where a single sentence can exceed 512 tokens.
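
Loading an LSG variant looks like any other encoder, except that LSG attention is usually shipped as custom modelling code, so trust_remote_code=True is presumably needed; the Hub id below is an assumption.

```python
from transformers import AutoModel, AutoTokenizer

# 16K-token context variant; id and trust_remote_code requirement are assumptions.
model_id = "dlicari/lsg16k-Italian-Legal-BERT"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

judgment = "..."  # a full decision, often far beyond 512 tokens
inputs = tok(judgment, return_tensors="pt", truncation=True, max_length=16384)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```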

Downstream Tasks

The model has been evaluated on multiple legal NLP tasks:

Legal holding extraction: Extracting the key legal principle from court decisions. Evaluated on the ITA-CASEHOLD dataset - 1,101 Italian judgment-holding pairs from 2019-2022 civil law cases. Published at ICAIL 2023.

Rhetorical role classification: Automatically labeling sections of legal documents as facts, reasoning, rulings, or procedural history. Published at ASAIL@ICAIL 2023 using a TransformerOverBERT architecture.

Named entity recognition: Extracting parties, courts, statute references, and legal concepts from unstructured text.
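
No fine-tuned NER checkpoint is documented here, but the base encoder can back a token-classification head. The sketch below is hypothetical: the label set is illustrative and the head is randomly initialised, so it would still need fine-tuning on annotated judgments before its predictions mean anything.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative entity types: parties, courts, statute references.
labels = ["O", "B-PARTE", "I-PARTE", "B-ORGANO", "I-ORGANO", "B-NORMA", "I-NORMA"]

tok = AutoTokenizer.from_pretrained("dlicari/Italian-Legal-BERT")
model = AutoModelForTokenClassification.from_pretrained(
    "dlicari/Italian-Legal-BERT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),  # head weights are randomly initialised
)

text = "Il Tribunale di Milano ha accolto il ricorso ai sensi dell'art. 696 c.p.c."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, num_labels)
```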

Banking regulation Q&A: A 2024 paper (CLiC-it) applied the model in a multi-step prompt pipeline for answering questions about banking supervisory regulation.

Sentence similarity: The distilled variant enables fast semantic search across legal document collections for case law research.

A demo notebook released with the model walks through several of these downstream tasks.

Comparison with Other Legal BERT Models

Italian-Legal-BERT is the most widely adopted non-English legal BERT model:

| Model | Language | Training Data | Downloads/mo | Likes |
|---|---|---|---|---|
| Legal-BERT (AUEB Athens) | English | 12 GB EU/UK/US law | 232,000 | 300 |
| Italian-Legal-BERT | Italian | 3.7 GB Italian civil law | ~1,600 | 38 |
| CaseHOLD Legal-BERT | English | US case law | 16,000 | - |
| Legal-BERT Portuguese | Portuguese | Brazilian law | ~9 | - |
| Legal-BERT Romanian | Romanian | Romanian law | ~14 | - |

The English Legal-BERT (Chalkidis et al., EMNLP 2020) established that domain-adapted BERT consistently outperforms general BERT on legal tasks. Italian-Legal-BERT validates this finding for a civil law system with a language structurally different from English.

Strengths

  • Strong legal term prediction: 72.6% confidence on domain-specific vocabulary where general BERT fails
  • Full model family: multiple variants covering different use cases (speed, context length, embeddings)
  • Runs on a single GPU: The adapted model trains on a V100 and infers on consumer hardware
  • Companion dataset: ITA-CASEHOLD provides a standardized evaluation benchmark
  • Academic rigor: Published at EKAW 2022 and in Computer Law & Security Review (2024)
  • AFL-3.0 license: Permissive for both academic and commercial use

Weaknesses

  • Small training corpus: 3.7 GB is modest compared to English Legal-BERT's 12 GB
  • Civil law focus: The adapted model covers civil law only; criminal and administrative coverage requires the SC variant
  • 512-token limit: Base model's context window is restrictive for Italian legal text; LSG variants address this but see minimal adoption (~1-85 downloads/month)
  • No decoder/generation: BERT is encoder-only - useful for classification and extraction but not text generation
  • Limited benchmarks: Detailed performance numbers are in the paywalled 2024 journal paper
  • Niche adoption: ~1,600 downloads/month reflects the specialized Italian legal NLP market
  • No multilingual coverage: Covers Italian only; practitioners working across EU jurisdictions need separate models

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.