Latin BERT (Bamman & Burns 2020)
HuggingFace-compatible packaging of the Latin BERT model from:
Bamman, D., & Burns, P.J. (2020). Latin BERT: A Contextual Language Model for Classical Philology. arXiv preprint arXiv:2009.10053.
The original model and training code are available at github.com/dbamman/latin-bert. This repo repackages the same weights for use with HuggingFace transformers.
Note: This is an experimental repackaging. If you encounter any issues, please open a thread in the Discussion tab.
Model Details
- Architecture: BERT-base (12 layers, 768 hidden, 12 attention heads)
- Parameters: ~111M
- Vocab size: 32,900 (SubwordTextEncoder)
- Max sequence length: 512
- Training data: Latin texts (see paper for details)
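The ~111M figure follows directly from the architecture numbers above; a quick back-of-the-envelope check (standard BERT-base layout with a pooler, no separate output head counted):

```python
# Back-of-the-envelope parameter count for BERT-base with a 32,900-token vocab.
hidden, layers, vocab, max_pos, ffn = 768, 12, 32_900, 512, 3072

embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden  # word + position + type + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, output projections (weights + biases)
    + 2 * hidden                    # attention LayerNorm
    + hidden * ffn + ffn            # FFN up-projection
    + ffn * hidden + hidden         # FFN down-projection
    + 2 * hidden                    # output LayerNorm
)
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total:,}")  # 111,308,544 ≈ 111M
```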
Install
```bash
pip install transformers torch
```
Usage
Basic: Get contextual embeddings
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = AutoModel.from_pretrained("latincy/latin-bert")

inputs = tokenizer("Gallia est omnis divisa in partes tres", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, 768)
```
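For a single sentence-level vector, a common approach (not specific to this model; the helper below is a generic sketch) is mask-aware mean pooling over `last_hidden_state`:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid div-by-zero
    return summed / counts

# e.g. sentence_vec = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```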
Masked language model (fill-mask)
```python
from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = BertForMaskedLM.from_pretrained("latincy/latin-bert")

text = "Gallia est omnis [MASK] in partes tres"
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] position in the encoded input
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens for the masked position
top5 = logits[0, mask_idx, :].topk(5).indices.squeeze()
for token_id in top5:
    print(tokenizer.decode([token_id.item()]))
```
Custom Tokenizer
The original Latin BERT uses a tensor2tensor SubwordTextEncoder, not standard
WordPiece. This repo includes a faithful reimplementation as a HuggingFace
PreTrainedTokenizer; this is why trust_remote_code=True is required.
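At its core, a SubwordTextEncoder performs greedy longest-match subword splitting. The toy below (invented vocabulary, not the model's real 32,900-entry vocab, and a simplification of the actual tensor2tensor algorithm) illustrates the general idea:

```python
def greedy_subwords(word: str, vocab: set) -> list:
    """Greedy longest-prefix subword split (toy illustration only)."""
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocab entry that prefixes the remaining text
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

toy_vocab = {"divi", "sa", "di", "vi"}
print(greedy_subwords("divisa", toy_vocab))  # ['divi', 'sa']
```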
Verified against the original case studies from the paper:
POS tagging (Table 1)
| Treebank | Accuracy |
|---|---|
| Perseus | 95.2% |
| PROIEL | 98.2% |
| ITTB | 99.2% |
Masked word prediction (Table 3)
| Metric | Score |
|---|---|
| P@1 | 33.1% |
| P@10 | 62.2% |
| P@50 | 74.0% |
spaCy Integration
Works with spacy-transformers:
```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "latincy/latin-bert"

[components.transformer.model.tokenizer_config]
trust_remote_code = true
use_fast = false
```
Changelog
v1.1.1 - Bug fix: add do_lower_case=True to tokenizer
The original Latin BERT vocabulary was trained on lowercased text. All original
case studies (POS tagging, WSD, infilling) explicitly called .lower() before
tokenizing. The HF PreTrainedTokenizer wrapper was missing this step, causing
uppercase characters to be escaped to their ASCII codepoints (e.g. C → \67;),
inflating token counts ~4x and producing embeddings the model was never
trained on. The tokenizer now lowercases input by default (do_lower_case=True),
matching the original pipeline behavior.
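The escaping behavior described above can be sketched in a few lines (a simplified version of tensor2tensor-style out-of-alphabet escaping; the real encoder's rules are more involved):

```python
def escape_oov_chars(text: str, alphabet: set) -> str:
    """Replace characters outside the known alphabet with \\<codepoint>; escapes."""
    return "".join(c if c in alphabet else f"\\{ord(c)};" for c in text)

lowercase_alphabet = set("abcdefghijklmnopqrstuvwxyz ")
print(escape_oov_chars("C", lowercase_alphabet))       # \67;  (the bug: 'C' escaped)
print(escape_oov_chars("gallia", lowercase_alphabet))  # gallia (lowercased input is safe)
```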
v1.1.0 - HuggingFace repackaging
Repackaged the original tensor2tensor SubwordTextEncoder tokenizer and
PyTorch weights as a HuggingFace PreTrainedTokenizer + safetensors model.
v1.0.0 - Original model
Bamman & Burns (2020) Latin BERT weights and tensor2tensor tokenizer.
Citation
```bibtex
@article{bamman2020latin,
  title={Latin BERT: A Contextual Language Model for Classical Philology},
  author={Bamman, David and Burns, Patrick J},
  journal={arXiv preprint arXiv:2009.10053},
  year={2020}
}
```