4-bit quantized ONNX version of [elenanereiss/bert-german-ler](https://huggingface.co/elenanereiss/bert-german-ler) for Named Entity Recognition in German legal texts, trained on the dataset from *A Dataset of German Legal Documents for Named Entity Recognition* (arXiv:2003.13016).
| Property | Value |
|---|---|
| Base model | bert-base-german-cased fine-tuned on German LER |
| Source model | elenanereiss/bert-german-ler |
| Format | ONNX with 4-bit weight quantization (MatMulNBits, block_size=128, symmetric) |
| Model size | 134 MB (down from 415 MB fp32) |
| Max sequence length | 512 tokens |
| License | CC-BY-4.0 |
Metrics from the source model evaluated on the German LER test set:
| | Precision | Recall | F1 |
|---|---|---|---|
| Micro avg | 0.945 | 0.964 | 0.955 |
| Macro avg | 0.89 | 0.89 | 0.89 |
Per-entity F1 scores:

| Entity | Code | F1 | Entity | Code | F1 |
|---|---|---|---|---|---|
| Law | GS | 0.98 | Court | GRT | 0.98 |
| Court decision | RS | 0.97 | Judge | RR | 0.97 |
| Contract | VT | 0.96 | Country | LD | 0.96 |
| Legal literature | LIT | 0.96 | Institution | INN | 0.95 |
| EU norm | EUN | 0.95 | Lawyer | AN | 0.94 |
| Person | PER | 0.94 | Brand | MRK | 0.93 |
| Company | UN | 0.92 | Organization | ORG | 0.91 |
| Ordinance | VO | 0.90 | Regulation | VS | 0.86 |
| City | ST | 0.85 | Street | STR | 0.77 |
| Landscape | LDS | 0.61 | | | |
The 19 entity classes of the German LER dataset:

| Code | German | English | Share in dataset |
|---|---|---|---|
| GS | Gesetz | Law / Statute | 34.53% |
| RS | Rechtsprechung | Court decision | 23.46% |
| GRT | Gericht | Court | 5.99% |
| LIT | Literatur | Legal literature | 5.60% |
| VT | Vertrag | Contract / Treaty | 5.34% |
| INN | Institution | Institution | 4.09% |
| PER | Person | Person | 3.26% |
| RR | Richter | Judge | 2.83% |
| EUN | EU-Norm | EU legal norm | 2.79% |
| LD | Land | Country / State | 2.66% |
| ORG | Organisation | Organization | 2.17% |
| UN | Unternehmen | Company | 1.97% |
| VO | Verordnung | Ordinance | 1.49% |
| ST | Stadt | City | 1.31% |
| VS | Vorschrift | Regulation | 1.13% |
| MRK | Marke | Brand | 0.53% |
| LDS | Landschaft | Landscape / Region | 0.37% |
| STR | Straße | Street | 0.25% |
| AN | Anwalt | Lawyer | 0.21% |
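The source model tags these classes with the BIO scheme: each class gets a `B-` (begin) and `I-` (inside) label, plus a single `O` for non-entity tokens, so the classifier head has 39 outputs. A quick sketch of how that label set is constructed:

```python
# The 19 entity codes of the German LER tag set (from the table above)
codes = [
    "GS", "RS", "GRT", "LIT", "VT", "INN", "PER", "RR", "EUN", "LD",
    "ORG", "UN", "VO", "ST", "VS", "MRK", "LDS", "STR", "AN",
]

# BIO scheme: one B- and one I- label per class, plus the outside tag "O"
labels = ["O"] + [f"{prefix}-{code}" for code in codes for prefix in ("B", "I")]

print(len(labels))  # 39
```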
```bash
pip install onnxruntime transformers numpy
```
```python
import json

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("mayflowergmbh/bert-german-ler-onnx-int4")
session = ort.InferenceSession("model_int4.onnx", providers=["CPUExecutionProvider"])

# Load label mapping
with open("config.json") as f:
    config = json.load(f)
id2label = config["id2label"]

# Tokenize input
text = "Herr Müller verstieß gegen § 36 Abs. 7 IfSG und wurde vom Bundesgerichtshof verurteilt."
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    "token_type_ids": inputs["token_type_ids"].astype(np.int64),
})

# Decode predictions
predictions = np.argmax(outputs[0], axis=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predictions):
    label = id2label[str(pred_id)]
    if token not in ("[PAD]", "[CLS]", "[SEP]") and label != "O":
        print(f"{token:20s} {label}")
```
Output:

```text
Müller               B-PER
§                    B-GS
36                   I-GS
Abs                  I-GS
.                    I-GS
7                    I-GS
I                    I-GS
##f                  I-GS
##SG                 I-GS
Bundes               B-GRT
##gerichtshof        I-GRT
```
```python
def extract_entities(text, tokenizer, session, id2label):
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"].astype(np.int64),
        "attention_mask": inputs["attention_mask"].astype(np.int64),
        "token_type_ids": inputs["token_type_ids"].astype(np.int64),
    })
    predictions = np.argmax(outputs[0], axis=-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    entities = []
    current_entity = None
    current_tokens = []
    for token, pred_id in zip(tokens, predictions):
        if token in ("[PAD]", "[CLS]", "[SEP]"):
            continue
        label = id2label[str(pred_id)]
        if label.startswith("B-"):
            # A B- tag closes any open entity and starts a new one
            if current_entity:
                entities.append({
                    "entity": current_entity,
                    "text": tokenizer.convert_tokens_to_string(current_tokens).strip(),
                })
            current_entity = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_entity == label[2:]:
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # O tag or mismatched I- tag: close any open entity
            if current_entity:
                entities.append({
                    "entity": current_entity,
                    "text": tokenizer.convert_tokens_to_string(current_tokens).strip(),
                })
            current_entity = None
            current_tokens = []
    if current_entity:
        entities.append({
            "entity": current_entity,
            "text": tokenizer.convert_tokens_to_string(current_tokens).strip(),
        })
    return entities


entities = extract_entities(
    "Das Urteil des BGH vom 12.03.2021 (Az. III ZR 5/20) stützt sich auf § 280 Abs. 1 BGB.",
    tokenizer, session, id2label,
)
for e in entities:
    print(f"[{e['entity']:>3s}] {e['text']}")
```
Output:

```text
[GRT] BGH
[ RS] Az. III ZR 5 / 20
[ GS] § 280 Abs. 1 BGB
```
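The `##` prefix marks WordPiece continuation pieces; `convert_tokens_to_string` glues them back onto the preceding piece. A minimal stand-in for that merge step (illustrative only, not the tokenizer's exact detokenizer, which also normalizes some punctuation spacing):

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, attaching '##' continuations to the previous piece."""
    out = ""
    for tok in tokens:
        if tok.startswith("##"):
            out += tok[2:]       # continuation piece: append without a space
        elif out:
            out += " " + tok     # new word: space-separated
        else:
            out = tok            # first token
    return out

print(merge_wordpieces(["Bundes", "##gerichtshof"]))  # Bundesgerichtshof
```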
```python
texts = [
    "Der Kläger berief sich auf Art. 6 EMRK.",
    "Die Richterin Dr. Schmidt verwies auf das BVerfG-Urteil.",
]
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
outputs = session.run(None, {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
    "token_type_ids": inputs["token_type_ids"].astype(np.int64),
})
for i, text in enumerate(texts):
    predictions = np.argmax(outputs[0][i], axis=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][i])
    # ... process as above
```
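In a padded batch, the per-sequence token count can be read off the attention mask, so the decoding loop only touches real tokens instead of padding. A small sketch with a synthetic mask (the shapes mirror the padded batch above):

```python
import numpy as np

# Synthetic attention mask for a batch of 2, padded to length 7:
# sequence 0 has 5 real tokens, sequence 1 has 7
attention_mask = np.array([
    [1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
])

# Number of real (non-padding) tokens per sequence
lengths = attention_mask.sum(axis=1)
print(lengths)  # [5 7]

# Decode only the real tokens of sequence i, e.g.:
# predictions_i = np.argmax(outputs[0][i][: lengths[i]], axis=-1)
```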
The model was quantized from the original fp32 ONNX export using ONNX Runtime's MatMulNBitsQuantizer:
```python
import onnx
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

model = onnx.load("model.onnx")
quant = MatMulNBitsQuantizer(
    model=model,
    block_size=128,
    is_symmetric=True,
    accuracy_level=4,
    bits=4,
)
quant.process()

# Write the quantized graph back to disk
quant.model.save_model_to_file("model_int4.onnx")
```
| | fp32 ONNX | INT4 ONNX |
|---|---|---|
| Size | 415 MB | 134 MB |
| Compression | 1x | ~3.1x |
| Quantization | - | 4-bit symmetric, block_size=128 |
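The compression is ~3.1x rather than the naive 8x for fp32 → int4 because `MatMulNBits` quantizes only MatMul weights; embedding tables and other tensors stay fp32, and each block of weights carries its own scale. A rough back-of-envelope helper for the quantized portion (the parameter count below is illustrative, not measured from this model):

```python
def int4_weight_bytes(n_weights: int, block_size: int = 128) -> int:
    """Approximate storage for symmetric 4-bit block quantization:
    4 bits per weight plus one fp16 scale per block (no zero points)."""
    packed = n_weights // 2                 # two 4-bit values per byte
    scales = (n_weights // block_size) * 2  # one fp16 scale (2 bytes) per block
    return packed + scales

# Example: 100M MatMul weights -> roughly 51.6 MB after quantization
print(int4_weight_bytes(100_000_000) / 1e6)
```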
If you use this model, please cite the original LER dataset paper:
```bibtex
@inproceedings{leitner2020dataset,
  title={A Dataset of German Legal Documents for Named Entity Recognition},
  author={Leitner, Elena and Rehm, Georg and Moreno-Schneider, Juli{\'a}n},
  booktitle={Proceedings of the 12th Language Resources and Evaluation Conference},
  pages={4886--4893},
  year={2020},
  url={https://arxiv.org/abs/2003.13016}
}
```