Italian-Legal-BERT

Italian-Legal-BERT is a domain-adapted BERT model purpose-built for Italian legal text. Created by Daniele Licari and Giovanni Comande at Scuola Superiore Sant'Anna in Pisa, it takes a strong general Italian BERT and continues pre-training on 3.7 gigabytes of civil law decisions from Italy's National Jurisprudential Archive. The result is a 110-million parameter model that correctly handles the archaic vocabulary, paragraph-length sentences, and rigid rhetorical structures of Italian court documents where general-purpose models fail.

TL;DR

  • Domain-adapted BERT trained on 3.7GB of Italian civil law from the National Jurisprudential Archive
  • Model family: domain-adapted base, from-scratch, distilled, and long-sequence (4K/16K) variants
  • Applied to holding extraction, NER, rhetorical classification, legal Q&A
  • ~1,600 downloads/month on Hugging Face - leading non-English legal BERT
  • AFL-3.0 license, runs on a single GPU

Key Specifications

| Specification | Details |
|---|---|
| Provider | Scuola Superiore Sant'Anna, Pisa |
| Authors | Daniele Licari, Giovanni Comande |
| Parameters | ~110M |
| Architecture | BERT (domain-adapted) / CamemBERT-RoBERTa (from-scratch) |
| Base Model | dbmdz/bert-base-italian-xxl-cased (81 GB Italian corpus) |
| Training Data | 3.7 GB Italian civil law (adapted); 6.6 GB civil + criminal (from-scratch) |
| Training Steps | 8.4M steps / 4 epochs (adapted); 1M steps (from-scratch) |
| Context Window | 512 tokens (base); 4,096 / 16,384 tokens (LSG variants) |
| Vocabulary | 32K tokens (from-scratch variant uses a custom SentencePiece tokenizer) |
| Hardware | 1x V100 16GB (adapted); 8x A100 40GB (from-scratch) |
| License | AFL-3.0 (Academic Free License) |
| Release | September 2022 (EKAW); extended paper 2024 (Computer Law & Security Review) |

Italian legal language (linguaggio giuridico) diverges sharply from standard Italian. Court decisions preserve Latin-derived terms - "ricorrente" (applicant), "convenuto" (defendant), "resistente" (respondent) - that rarely appear in web text or Wikipedia. Sentences routinely span entire paragraphs with deeply nested subordinate clauses. Each branch of law (civil, criminal, administrative) carries distinct vocabulary. And Italian judgments follow rigid rhetorical patterns: fatto (facts), diritto (legal reasoning), dispositivo (ruling).

General Italian BERT, trained on 81GB of Wikipedia and web crawl data, has near-zero coverage of this specialized vocabulary. Fill-mask tests demonstrate the gap clearly:

Prompt: "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento" (The [MASK] requested the payment obligation be revoked)

| Model | Top prediction | Confidence |
|---|---|---|
| Italian-Legal-BERT | ricorrente (applicant) | 72.6% |
| Italian-Legal-BERT | convenuto (defendant) | 9.6% |
| Italian-Legal-BERT | resistente (respondent) | 4.0% |
| General Italian BERT | comune (municipality) | - |

The legal model predicts the three most contextually appropriate legal parties. The general model guesses a common noun unrelated to the legal context.
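
The gap is easy to reproduce with the Hugging Face fill-mask pipeline. Below is a minimal sketch; the Hub ids dlicari/Italian-Legal-BERT and dbmdz/bert-base-italian-xxl-cased are assumed, and exact scores will vary slightly with library versions.

```python
from transformers import pipeline

# Compare the top-3 mask predictions of the legal and the general Italian model.
# Model ids are assumed from the Hugging Face Hub, not quoted from the paper.
legal_fill = pipeline("fill-mask", model="dlicari/Italian-Legal-BERT")
general_fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

prompt = "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento"

for name, fill in [("legal", legal_fill), ("general", general_fill)]:
    for pred in fill(prompt, top_k=3):
        print(f"{name:>7}: {pred['token_str']:<12} {pred['score']:.1%}")
```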

Model Variants

The research team released a full family of models addressing different use cases:

| Model | Approach | Training Data | Context | Downloads/mo |
|---|---|---|---|---|
| Italian-Legal-BERT | Domain-adapted | 3.7 GB civil law | 512 | ~1,600 |
| Italian-Legal-BERT-SC | From scratch | 6.6 GB civil + criminal | 512 | ~40 |
| distil-ita-legal-bert | Distilled embeddings | Same as base | 512 | ~590 |
| lsg16k-Italian-Legal-BERT | Long sequence | 3.7 GB civil | 16,384 | ~85 |
| lsg4k-Italian-Legal-BERT | Long sequence | 3.7 GB civil | 4,096 | ~1 |

Domain-adapted (base): Starts from dbmdz/bert-base-italian-xxl-cased and continues pre-training on legal text. This approach preserves general language understanding while specializing on legal vocabulary. Trained on a single V100 for 4 epochs.
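
The recipe is standard continued pre-training with a masked-language-modelling objective. A minimal sketch follows, assuming the dbmdz checkpoint and a toy in-memory dataset standing in for the 3.7 GB archive dump; the batch size and other hyperparameters are placeholders, not the authors' configuration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the general Italian checkpoint and keep pre-training on legal text.
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# Placeholder corpus: in practice, millions of sentences from court decisions.
corpus = Dataset.from_dict(
    {"text": ["Il ricorrente ha chiesto revocarsi l'obbligo di pagamento."]}
)
tokenized = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="italian-legal-bert-adapted",
        num_train_epochs=4,              # matches the 4 epochs reported above
        per_device_train_batch_size=8,   # placeholder value
    ),
    train_dataset=tokenized,
    # Randomly masks 15% of tokens and builds the MLM labels on the fly.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15),
)
trainer.train()
```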

From-scratch (SC): Uses a CamemBERT/RoBERTa architecture with a custom SentencePiece tokenizer built from legal text (32K vocabulary). Trained on a larger 6.6GB corpus covering both civil and criminal law for 1M steps across 8x A100 GPUs. Captures criminal law terminology that the adapted version misses.

Distilled: A 4-layer model producing 768-dimensional sentence embeddings with mean pooling. Optimized for semantic similarity search across legal documents - useful for finding related case law quickly.
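
In practice this means encoding each document once and ranking by cosine similarity. A sketch, assuming the Hub id dlicari/distil-ita-legal-bert and that the checkpoint loads directly through sentence-transformers (otherwise mean pooling over token embeddings would need to be wired up manually):

```python
from sentence_transformers import SentenceTransformer, util

# Hub id assumed; each sentence is mapped to a 768-dimensional embedding.
model = SentenceTransformer("dlicari/distil-ita-legal-bert")

sentences = [
    "Il ricorrente ha chiesto la revoca dell'obbligo di pagamento",         # revoke the payment obligation
    "La parte attrice domanda l'annullamento dell'obbligo di versamento",   # same request, different wording
    "Il contratto di locazione è stato risolto per morosità",               # unrelated: lease terminated for arrears
]

embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity matrix: the first two sentences should score highest.
print(util.cos_sim(embeddings, embeddings))
```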

Long-sequence (LSG): Extends the 512-token context limit to 4K or 16K tokens using Local-Sparse-Global attention. Critical for Italian legal text where a single sentence can exceed 512 tokens.
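
Loading an LSG variant looks like any other encoder, except that LSG attention is usually shipped as custom modelling code, so trust_remote_code=True is presumably needed; the Hub id below is an assumption.

```python
from transformers import AutoModel, AutoTokenizer

# 16K-token context variant; id and trust_remote_code requirement are assumptions.
model_id = "dlicari/lsg16k-Italian-Legal-BERT"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

judgment = "..."  # a full decision, often far beyond 512 tokens
inputs = tok(judgment, return_tensors="pt", truncation=True, max_length=16384)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```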

Downstream Tasks

The model has been evaluated on multiple legal NLP tasks:

Legal holding extraction: Extracting the key legal principle from court decisions. Evaluated on the ITA-CASEHOLD dataset - 1,101 Italian judgment-holding pairs from 2019-2022 civil law cases. Published at ICAIL 2023.

Rhetorical role classification: Automatically labeling sections of legal documents as facts, reasoning, rulings, or procedural history. Published at ASAIL@ICAIL 2023 using a TransformerOverBERT architecture.

Named entity recognition: Extracting parties, courts, statute references, and legal concepts from unstructured text.
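
No fine-tuned NER checkpoint is documented here, but the base encoder can back a token-classification head. The sketch below is hypothetical: the label set is illustrative and the head is randomly initialised, so it would still need fine-tuning on annotated judgments before its predictions mean anything.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative entity types: parties, courts, statute references.
labels = ["O", "B-PARTE", "I-PARTE", "B-ORGANO", "I-ORGANO", "B-NORMA", "I-NORMA"]

tok = AutoTokenizer.from_pretrained("dlicari/Italian-Legal-BERT")
model = AutoModelForTokenClassification.from_pretrained(
    "dlicari/Italian-Legal-BERT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),  # head weights are randomly initialised
)

text = "Il Tribunale di Milano ha accolto il ricorso ai sensi dell'art. 696 c.p.c."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, num_labels)
```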

Banking regulation Q&A: A 2024 paper (CLiC-it) applied the model in a multi-step prompt pipeline for answering questions about banking supervisory regulation.

Sentence similarity: The distilled variant enables fast semantic search across legal document collections for case law research.

A demo notebook released with the model walks through several of these downstream tasks.

Comparison with Other Legal BERT Models

Italian-Legal-BERT is the most widely adopted non-English legal BERT model:

| Model | Language | Training Data | Downloads/mo | Likes |
|---|---|---|---|---|
| Legal-BERT (AUEB Athens) | English | 12 GB EU/UK/US law | 232,000 | 300 |
| Italian-Legal-BERT | Italian | 3.7 GB Italian civil law | ~1,600 | 38 |
| CaseHOLD Legal-BERT | English | US case law | 16,000 | - |
| Legal-BERT Portuguese | Portuguese | Brazilian law | ~9 | - |
| Legal-BERT Romanian | Romanian | Romanian law | ~14 | - |

The English Legal-BERT (Chalkidis et al., EMNLP 2020) established that domain-adapted BERT consistently outperforms general BERT on legal tasks. Italian-Legal-BERT validates this finding for a civil law system with a language structurally different from English.

Strengths

  • Strong legal term prediction: 72.6% confidence on domain-specific vocabulary where general BERT fails
  • Full model family: multiple variants covering different use cases (speed, context length, embeddings)
  • Runs on a single GPU: The adapted model trains on a V100 and infers on consumer hardware
  • Companion dataset: ITA-CASEHOLD provides a standardized evaluation benchmark
  • Academic rigor: Published at EKAW 2022 and in Computer Law & Security Review (2024)
  • AFL-3.0 license: Permissive for both academic and commercial use

Weaknesses

  • Small training corpus: 3.7 GB is modest compared to English Legal-BERT's 12 GB
  • Civil law focus: The adapted model covers civil law only; criminal and administrative coverage requires the SC variant
  • 512-token limit: Base model's context window is restrictive for Italian legal text; LSG variants address this but see minimal adoption (~1-85 downloads/month)
  • No decoder/generation: BERT is encoder-only - useful for classification and extraction but not text generation
  • Limited benchmarks: Detailed performance numbers are in the paywalled 2024 journal paper
  • Niche adoption: ~1,600 downloads/month reflects the specialized Italian legal NLP market
  • No multilingual coverage: Covers Italian only; practitioners working across EU jurisdictions need separate models

About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.