Studying Sociolects in BP
AI & ML interests
OCR, Topic Modeling, etc.
Law in British Periodicals
Law in British Periodicals is a collaborative digital humanities project investigating the representation of law, legal language, and heteroglossic discourse in 18th- and 19th-century British periodical literature. The project combines archival research with computational methods to trace how legal vocabulary, genres, and arguments circulated across the periodical press.
Team
| Name | Institution | Role |
|---|---|---|
| Clifford B. Anderson | Yale University | |
| Corey Brady | Southern Methodist University | |
| Sophie Hao | Boston University | |
| Mark Schoenfield | Vanderbilt University | |
Datasets
All datasets are private due to contractual restrictions. Each year in the corpus starts from a source dataset and passes through three processing stages, giving four datasets per year:
| Dataset | Description |
|---|---|
| LawInBritishPeriodicals/[year] | Source PDFs, one row per document |
| LawInBritishPeriodicals/[year]-images | Page images rasterized at 150 DPI, one row per page, with source_file, page_number, and total_pages metadata |
| LawInBritishPeriodicals/[year]-ocr | OCR transcriptions in Markdown format, generated with GLM-OCR |
| LawInBritishPeriodicals/[year]-classified | Topic classifications with confidence scores, generated with Qwen2.5-7B-Instruct |
Current holdings: 1770 · 1811
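The naming scheme above can be sketched as a small helper that derives the dataset identifiers for a given year. This is only an illustration: the function name `dataset_ids` and the stage keys are hypothetical, and only the organization name and suffixes come from the table.

```python
# Hypothetical helper: builds the per-year dataset identifiers
# from the naming scheme in the Datasets table.
ORG = "LawInBritishPeriodicals"

# Stage keys are illustrative; the suffixes come from the table.
STAGE_SUFFIXES = {
    "source": "",
    "images": "-images",
    "ocr": "-ocr",
    "classified": "-classified",
}


def dataset_ids(year: int) -> dict[str, str]:
    """Return a mapping from stage name to dataset identifier for one year."""
    return {stage: f"{ORG}/{year}{suffix}" for stage, suffix in STAGE_SUFFIXES.items()}


if __name__ == "__main__":
    for stage, repo_id in dataset_ids(1770).items():
        print(f"{stage:>10}: {repo_id}")
```

Because the datasets are private, loading any of these identifiers (for example with the `datasets` library) would presumably require an access token belonging to a project member.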
Pipeline
The project uses a fully reproducible, GPU-accelerated processing pipeline built on Hugging Face Jobs and uv scripts.
PDF documents
│
▼ pdf_to_images.py (CPU)
Page images (150 DPI PNG)
│
▼ glm-ocr-v2.py (L4 GPU)
OCR transcriptions (Markdown)
│
▼ classify_topics.py (L4 GPU)
Topic classifications (JSON with labels + confidence scores)
Pipeline scripts are stored at LawInBritishPeriodicals/scripts. Full documentation is included in that repository's README.
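As a rough illustration of the first stage's output, the sketch below builds one metadata row per page and works out the pixel size of a page rasterized at 150 DPI. It is not the project's `pdf_to_images.py`. The function names, the 1-based page numbering, and the `width_px`/`height_px` fields are assumptions; only `source_file`, `page_number`, and `total_pages` come from the dataset description. The arithmetic relies on the PDF default unit of 1/72 inch.

```python
# Illustrative sketch only, not the project's pdf_to_images.py.
PDF_POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def pixels_at_dpi(size_pt: float, dpi: int = 150) -> int:
    """Convert a length in PDF points to pixels at the given resolution."""
    return round(size_pt * dpi / PDF_POINTS_PER_INCH)


def page_rows(
    source_file: str,
    page_sizes_pt: list[tuple[float, float]],
    dpi: int = 150,
) -> list[dict]:
    """Build one metadata row per page of a single source document."""
    total = len(page_sizes_pt)
    rows = []
    for index, (width_pt, height_pt) in enumerate(page_sizes_pt, start=1):
        rows.append(
            {
                "source_file": source_file,
                "page_number": index,  # assumed to be 1-based
                "total_pages": total,
                # Hypothetical extra fields, shown for the DPI arithmetic.
                "width_px": pixels_at_dpi(width_pt, dpi),
                "height_px": pixels_at_dpi(height_pt, dpi),
            }
        )
    return rows


if __name__ == "__main__":
    # A two-page document with US Letter pages (612 x 792 points).
    for row in page_rows("example.pdf", [(612, 792), (612, 792)]):
        print(row)
```

At 150 DPI a 612 x 792 point page becomes 1275 x 1650 pixels, since each point maps to 150/72 pixels.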
OCR Model
Transcription uses GLM-OCR (zai-org, MIT License), a 0.9B-parameter multimodal OCR model that scores 94.62% on OmniDocBench v1.5, ranking #1 overall at the time of writing. It handles 18th- and 19th-century printed English effectively and supports multilingual output.
Topic Classification Model
Classification uses Qwen2.5-7B-Instruct with a project-specific taxonomy:
| Label | Scope |
|---|---|
| legal | Statutes, trials, legal commentary, court reports |
| dramatic | Theatre reviews, playbills, dramatic criticism |
| parliamentary | Parliamentary proceedings, political speeches, elections |
| commercial | Trade, prices, shipping, finance, advertisements |
| literary | Poetry, fiction, literary criticism, essays |
| religious | Sermons, moral philosophy, ecclesiastical affairs |
| natural | Natural history, medicine, science, technology |
| other | Does not fit any of the above |
Pages may receive multiple labels. Each label carries a confidence score (0.0 to 1.0) and a one-sentence rationale.
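A minimal sketch of how a page's classification output might be checked against the taxonomy follows. The JSON layout (a top-level `labels` list whose items have `label`, `confidence`, and `rationale` keys) is an assumption made for illustration, as is the function name `parse_classification`. The actual schema is documented in the scripts repository.

```python
import json

# Label set taken from the taxonomy table above.
TAXONOMY = {
    "legal", "dramatic", "parliamentary", "commercial",
    "literary", "religious", "natural", "other",
}


def parse_classification(raw: str) -> list[dict]:
    """Parse and validate one page's classification JSON.

    Assumes a hypothetical schema:
    {"labels": [{"label": ..., "confidence": ..., "rationale": ...}, ...]}
    """
    data = json.loads(raw)
    results = []
    for item in data["labels"]:
        label = item["label"]
        score = float(item["confidence"])
        if label not in TAXONOMY:
            raise ValueError(f"unknown label: {label!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence out of range for {label!r}: {score}")
        results.append(
            {
                "label": label,
                "confidence": score,
                "rationale": str(item.get("rationale", "")).strip(),
            }
        )
    return results


if __name__ == "__main__":
    sample = json.dumps(
        {
            "labels": [
                {"label": "legal", "confidence": 0.92,
                 "rationale": "Reports a trial at the Old Bailey."},
                {"label": "parliamentary", "confidence": 0.41,
                 "rationale": "Mentions a debate in the Commons."},
            ]
        }
    )
    for entry in parse_classification(sample):
        print(entry)
```

Rejecting unknown labels and out-of-range scores at parse time is one way to catch malformed model output before it reaches downstream analysis.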
Spaces
| Space | Description |
|---|---|
| LawInBritishPeriodicals/1770-dashboard | Interactive topic classification dashboard (label frequency, score distributions, co-occurrence heatmap, timeline) |
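The label frequency and co-occurrence views could, for instance, be derived from per-page label scores as sketched below. This is not the dashboard's actual code. The input shape (one dict of label to confidence per page), the function name `label_stats`, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def label_stats(
    pages: list[dict[str, float]],
    threshold: float = 0.5,  # hypothetical cut-off, not a project setting
) -> tuple[Counter, Counter]:
    """Count label frequency and pairwise co-occurrence across pages.

    Each page is a mapping from label to confidence score. Only labels
    at or above the threshold are counted.
    """
    frequency: Counter = Counter()
    cooccurrence: Counter = Counter()
    for scores in pages:
        kept = sorted(label for label, score in scores.items() if score >= threshold)
        frequency.update(kept)
        # Sorted labels give each unordered pair a single canonical key.
        cooccurrence.update(combinations(kept, 2))
    return frequency, cooccurrence


if __name__ == "__main__":
    pages = [
        {"legal": 0.9, "parliamentary": 0.7},
        {"legal": 0.8, "literary": 0.3},
        {"legal": 0.6, "parliamentary": 0.55},
    ]
    frequency, cooccurrence = label_stats(pages)
    print(frequency.most_common())
    print(cooccurrence.most_common())
```

The pair counts map directly onto the cells of a symmetric heatmap, with the frequency counts on the diagonal.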