German Legal NER - ONNX INT4 Quantized

4-bit quantized ONNX version of elenanereiss/bert-german-ler for Named Entity Recognition in German legal texts.

Model Details

- **Base model:** bert-base-german-cased, fine-tuned on German LER
- **Source model:** elenanereiss/bert-german-ler
- **Format:** ONNX with 4-bit weight quantization (MatMulNBits, block_size=128, symmetric)
- **Model size:** 134 MB (down from 415 MB fp32)
- **Max sequence length:** 512 tokens
- **License:** CC-BY-4.0

Performance

Metrics from the source model evaluated on the German LER test set:

| | Precision | Recall | F1 |
|---|---|---|---|
| Micro avg | 0.945 | 0.964 | 0.955 |
| Macro avg | 0.89 | 0.89 | 0.89 |

Per-entity F1 (test set)

| Entity | Code | F1 |
|---|---|---|
| Law | GS | 0.98 |
| Court | GRT | 0.98 |
| Court decision | RS | 0.97 |
| Judge | RR | 0.97 |
| Contract | VT | 0.96 |
| Country | LD | 0.96 |
| Legal literature | LIT | 0.96 |
| Institution | INN | 0.95 |
| EU norm | EUN | 0.95 |
| Lawyer | AN | 0.94 |
| Person | PER | 0.94 |
| Brand | MRK | 0.93 |
| Company | UN | 0.92 |
| Organization | ORG | 0.91 |
| Ordinance | VO | 0.90 |
| Regulation | VS | 0.86 |
| City | ST | 0.85 |
| Street | STR | 0.77 |
| Landscape | LDS | 0.61 |
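
Micro averaging pools true/false positives over all entity mentions, so frequent classes such as GS dominate the score; macro averaging computes F1 per class first and takes the unweighted mean, which is why rare, harder classes such as STR and LDS pull the macro figure below the micro one. A pure-Python sketch with toy counts (illustrative numbers, not from this evaluation):

```python
# Toy per-class counts: (true positives, false positives, false negatives).
# Illustrative only -- not taken from the German LER test set.
counts = {
    "GS":  (950, 30, 20),   # frequent class, high F1
    "LDS": (6, 4, 6),       # rare class, low F1
}

def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Micro: pool counts across classes, then compute one F1.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_f1 = f1(tp, fp, fn)

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

print(f"micro F1 = {micro_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

With these toy counts the micro score (~0.97) sits well above the macro score (~0.76), mirroring the gap between the two rows in the table above.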

Entity Types (19 classes)

| Code | German | English | Share in dataset |
|---|---|---|---|
| GS | Gesetz | Law / Statute | 34.53% |
| RS | Rechtsprechung | Court decision | 23.46% |
| GRT | Gericht | Court | 5.99% |
| LIT | Literatur | Legal literature | 5.60% |
| VT | Vertrag | Contract / Treaty | 5.34% |
| INN | Institution | Institution | 4.09% |
| PER | Person | Person | 3.26% |
| RR | Richter | Judge | 2.83% |
| EUN | EU-Norm | EU legal norm | 2.79% |
| LD | Land | Country / State | 2.66% |
| ORG | Organisation | Organization | 2.17% |
| UN | Unternehmen | Company | 1.97% |
| VO | Verordnung | Ordinance | 1.49% |
| ST | Stadt | City | 1.31% |
| VS | Vorschrift | Regulation | 1.13% |
| MRK | Marke | Brand | 0.53% |
| LDS | Landschaft | Landscape / Region | 0.37% |
| STR | Straße | Street | 0.25% |
| AN | Anwalt | Lawyer | 0.21% |

Usage

Requirements

```bash
pip install onnxruntime transformers numpy
```

Inference

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
import json

# Load tokenizer and model
# (model_int4.onnx and config.json are the files from this repository,
# present in the local working directory)
tokenizer = AutoTokenizer.from_pretrained("mayflowergmbh/bert-german-ler-onnx-int4")
session = ort.InferenceSession("model_int4.onnx", providers=["CPUExecutionProvider"])

# Load label mapping
with open("config.json") as f:
    config = json.load(f)
id2label = config["id2label"]

# Tokenize input
text = "Herr Müller verstieß gegen § 36 Abs. 7 IfSG und wurde vom Bundesgerichtshof verurteilt."
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    "token_type_ids": inputs["token_type_ids"].astype(np.int64),
})

# Decode predictions
predictions = np.argmax(outputs[0], axis=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

for token, pred_id in zip(tokens, predictions):
    label = id2label[str(pred_id)]
    if token not in ("[PAD]", "[CLS]", "[SEP]") and label != "O":
        print(f"{token:20s} {label}")
```

Output:

```
Müller               B-PER
§                    B-GS
36                   I-GS
Abs                  I-GS
.                    I-GS
7                    I-GS
I                    I-GS
##f                  I-GS
##SG                 I-GS
Bundes               B-GRT
##gerichtshof        I-GRT
```

Entity Extraction Helper

```python
def extract_entities(text, tokenizer, session, id2label):
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64),
    })
    predictions = np.argmax(outputs[0], axis=-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    entities = []
    current_entity = None
    current_tokens = []

    for token, pred_id in zip(tokens, predictions):
        if token in ("[PAD]", "[CLS]", "[SEP]"):
            continue
        label = id2label[str(pred_id)]

        if label.startswith("B-"):
            if current_entity:
                entities.append({
                    "entity": current_entity,
                    "text": tokenizer.convert_tokens_to_string(current_tokens).strip()
                })
            current_entity = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_entity == label[2:]:
            current_tokens.append(token)
        else:
            if current_entity:
                entities.append({
                    "entity": current_entity,
                    "text": tokenizer.convert_tokens_to_string(current_tokens).strip()
                })
                current_entity = None
                current_tokens = []

    if current_entity:
        entities.append({
            "entity": current_entity,
            "text": tokenizer.convert_tokens_to_string(current_tokens).strip()
        })

    return entities


entities = extract_entities(
    "Das Urteil des BGH vom 12.03.2021 (Az. III ZR 5/20) stützt sich auf § 280 Abs. 1 BGB.",
    tokenizer, session, id2label
)
for e in entities:
    print(f"[{e['entity']:>3s}] {e['text']}")
```

Output:

```
[GRT] BGH
[ RS] Az. III ZR 5 / 20
[ GS] § 280 Abs. 1 BGB
```

Batch Inference

```python
texts = [
    "Der Kläger berief sich auf Art. 6 EMRK.",
    "Die Richterin Dr. Schmidt verwies auf das BVerfG-Urteil.",
]

inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    "token_type_ids": inputs["token_type_ids"].astype(np.int64),
})

for i, text in enumerate(texts):
    predictions = np.argmax(outputs[0][i], axis=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
    # ... process as above
```
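
The per-sequence step can be sketched on a toy logits array, using the attention mask to drop padding positions; the shapes and the three-label set here are illustrative and no model run is required:

```python
import numpy as np

# Toy batch: 2 sequences, 5 positions, 3 labels (O, B-GS, I-GS).
# Values are made up -- they stand in for outputs[0] from the session.
id2label = {0: "O", 1: "B-GS", 2: "I-GS"}
logits = np.array([
    [[4.0, 0.1, 0.1], [0.1, 5.0, 0.2], [0.1, 0.3, 5.0], [4.0, 0.1, 0.1], [4.0, 0.1, 0.1]],
    [[4.0, 0.1, 0.1], [0.2, 6.0, 0.1], [4.0, 0.1, 0.1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])
attention_mask = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],   # last two positions are padding
])

batch_labels = []
for i in range(logits.shape[0]):
    pred_ids = np.argmax(logits[i], axis=-1)
    # Keep only real (non-padding) positions.
    real = attention_mask[i].astype(bool)
    batch_labels.append([id2label[int(p)] for p in pred_ids[real]])

print(batch_labels)
```

Masking with `attention_mask` matters in batches because padded positions still produce logits, and their argmax is meaningless.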

Quantization Details

The model was quantized from the original fp32 ONNX export using ONNX Runtime's MatMulNBitsQuantizer:

```python
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer
import onnx

model = onnx.load("model.onnx")
quant = MatMulNBitsQuantizer(
    model=model,
    block_size=128,
    is_symmetric=True,
    accuracy_level=4,
    bits=4,
)
quant.process()
# Persist the quantized graph (quant.model wraps the processed ONNX model)
quant.model.save_model_to_file("model_int4.onnx")
```

| | fp32 ONNX | INT4 ONNX |
|---|---|---|
| Size | 415 MB | 134 MB |
| Compression | 1x | ~3.1x |
| Quantization | - | 4-bit symmetric, block_size=128 |
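
The effect of block-wise symmetric quantization can be illustrated with a small numpy round trip: each 128-weight block gets one fp32 scale, and weights are rounded to signed 4-bit integers. This is a didactic sketch of the scheme, not the actual MatMulNBits packing or kernel:

```python
import numpy as np

def quantize_blockwise_int4(w, block_size=128):
    """Symmetric per-block 4-bit quantization of a 1-D weight vector (illustrative)."""
    pad = (-len(w)) % block_size
    blocks = np.pad(w, (0, pad)).reshape(-1, block_size)
    # One scale per block: map the largest magnitude in the block to 7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero on all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise_int4(q, scales, length):
    return (q.astype(np.float32) * scales).reshape(-1)[:length]

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=768).astype(np.float32)
q, s = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(q, s, len(w))
print("max abs round-trip error:", float(np.abs(w - w_hat).max()))
```

The round-trip error per weight is bounded by half a scale step (scale = max |w| in the block / 7), which is why smaller blocks trade more scale storage for lower error.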

Citation

If you use this model, please cite the original LER dataset paper:

```bibtex
@inproceedings{leitner2020dataset,
  title={A Dataset of German Legal Documents for Named Entity Recognition},
  author={Leitner, Elena and Rehm, Georg and Moreno-Schneider, Juli{\'a}n},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={4886--4893},
  year={2020},
  url={https://arxiv.org/abs/2003.13016}
}
```
