Latin BERT (Bamman & Burns 2020)
HuggingFace-compatible packaging of the Latin BERT model from:
Bamman, D., & Burns, P.J. (2020). Latin BERT: A Contextual Language Model for Classical Philology. arXiv preprint arXiv:2009.10053.
The original model and training code are available at github.com/dbamman/latin-bert. This repo repackages the same weights for use with HuggingFace transformers.
Note: This is an experimental repackaging. If you encounter any issues, please open a thread in the Discussion tab.
Model Details
- Architecture: BERT-base (12 layers, 768 hidden, 12 attention heads)
- Parameters: ~111M
- Vocab size: 32,900 (SubwordTextEncoder)
- Max sequence length: 512
- Training data: Latin texts (see paper for details)
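The ~111M figure follows directly from the architecture numbers above; a quick back-of-the-envelope check (standard BERT-base layout with a pooler, no separate output head counted):

```python
# Back-of-the-envelope parameter count for BERT-base with a 32,900-token vocab.
hidden, layers, vocab, max_pos, ffn = 768, 12, 32_900, 512, 3072

embeddings = (vocab + max_pos + 2) * hidden + 2 * hidden  # word + position + type + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, output projections (weights + biases)
    + 2 * hidden                    # attention LayerNorm
    + hidden * ffn + ffn            # FFN up-projection
    + ffn * hidden + hidden         # FFN down-projection
    + 2 * hidden                    # output LayerNorm
)
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total:,}")  # 111,308,544 ≈ 111M
```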
Install
```bash
pip install transformers torch
```
Usage
Basic: Get contextual embeddings
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = AutoModel.from_pretrained("latincy/latin-bert")

inputs = tokenizer("Gallia est omnis divisa in partes tres", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, 768)
```
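For a single sentence-level vector, a common approach (not specific to this model; the helper below is a generic sketch) is mask-aware mean pooling over `last_hidden_state`:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid div-by-zero
    return summed / counts

# e.g. sentence_vec = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```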
Masked language model (fill-mask)
```python
from transformers import AutoTokenizer, BertForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "latincy/latin-bert", trust_remote_code=True
)
model = BertForMaskedLM.from_pretrained("latincy/latin-bert")

text = "Gallia est omnis [MASK] in partes tres"
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] position in the encoded input
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate tokens for the masked position
top5 = logits[0, mask_idx, :].topk(5).indices.squeeze()
for token_id in top5:
    print(tokenizer.decode([token_id.item()]))
```
Custom Tokenizer
The original Latin BERT uses a tensor2tensor SubwordTextEncoder, not standard
WordPiece. This repo includes a faithful reimplementation as a HuggingFace
PreTrainedTokenizer; this is why trust_remote_code=True is required.
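At its core, a SubwordTextEncoder performs greedy longest-match subword splitting. The toy below (invented vocabulary, not the model's real 32,900-entry vocab, and a simplification of the actual tensor2tensor algorithm) illustrates the general idea:

```python
def greedy_subwords(word: str, vocab: set) -> list:
    """Greedy longest-prefix subword split (toy illustration only)."""
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocab entry that prefixes the remaining text
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

toy_vocab = {"divi", "sa", "di", "vi"}
print(greedy_subwords("divisa", toy_vocab))  # ['divi', 'sa']
```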
Verified against the original case studies from the paper:
POS tagging (Table 1)
| Treebank | Accuracy |
|---|---|
| Perseus | 95.2% |
| PROIEL | 98.2% |
| ITTB | 99.2% |
Masked word prediction (Table 3)
| Metric | Score |
|---|---|
| P@1 | 33.1% |
| P@10 | 62.2% |
| P@50 | 74.0% |
spaCy Integration
Works with spacy-transformers:
```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "latincy/latin-bert"

[components.transformer.model.tokenizer_config]
trust_remote_code = true
use_fast = false
```
Changelog
v1.1.1 - Bug fix: add do_lower_case=True to tokenizer
The original Latin BERT vocabulary was trained on lowercased text. All original
case studies (POS tagging, WSD, infilling) explicitly called .lower() before
tokenizing. The HF PreTrainedTokenizer wrapper was missing this step, causing
uppercase characters to be escaped to their ASCII codepoints (e.g. C → \67;),
inflating token counts ~4x and producing embeddings the model was never
trained on. The tokenizer now lowercases input by default (do_lower_case=True),
matching the original pipeline behavior.
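The escaping behavior described above can be sketched in a few lines (a simplified version of tensor2tensor-style out-of-alphabet escaping; the real encoder's rules are more involved):

```python
def escape_oov_chars(text: str, alphabet: set) -> str:
    """Replace characters outside the known alphabet with \\<codepoint>; escapes."""
    return "".join(c if c in alphabet else f"\\{ord(c)};" for c in text)

lowercase_alphabet = set("abcdefghijklmnopqrstuvwxyz ")
print(escape_oov_chars("C", lowercase_alphabet))       # \67;  (the bug: 'C' escaped)
print(escape_oov_chars("gallia", lowercase_alphabet))  # gallia (lowercased input is safe)
```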
v1.1.0 - HuggingFace repackaging
Repackaged the original tensor2tensor SubwordTextEncoder tokenizer and
PyTorch weights as a HuggingFace PreTrainedTokenizer + safetensors model.
v1.0.0 - Original model
Bamman & Burns (2020) Latin BERT weights and tensor2tensor tokenizer.
Citation
```bibtex
@article{bamman2020latin,
  title={Latin BERT: A Contextual Language Model for Classical Philology},
  author={Bamman, David and Burns, Patrick J},
  journal={arXiv preprint arXiv:2009.10053},
  year={2020}
}
```