Studying Sociolects in BP

AI & ML interests

OCR, Topic Modeling, etc.

Law in British Periodicals

Law in British Periodicals is a collaborative digital humanities project investigating the representation of law, legal language, and heteroglossic discourse in 18th- and 19th-century British periodical literature. The project combines archival research with computational methods to trace how legal vocabulary, genres, and arguments circulated across the periodical press.

Team

Name	Institution	Role
Clifford B. Anderson	Yale University
Corey Brady	Southern Methodist University
Sophie Hao	Boston University
Mark Schoenfield	Vanderbilt University

Datasets

All datasets are private due to contractual restrictions. Each year in the corpus is represented by four datasets: the source PDFs plus the output of each of the three pipeline stages.

Dataset	Description
`LawInBritishPeriodicals/[year]`	Source PDFs, one row per document
`LawInBritishPeriodicals/[year]-images`	Page images rasterized at 150 DPI, one row per page, with `source_file`, `page_number`, and `total_pages` metadata
`LawInBritishPeriodicals/[year]-ocr`	OCR transcriptions in Markdown format, generated with GLM-OCR
`LawInBritishPeriodicals/[year]-classified`	Topic classifications with confidence scores, generated with Qwen2.5-7B-Instruct

Current holdings: 1770 · 1811
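The naming convention in the table above can be expressed as a small helper. This is a minimal sketch: the function name `dataset_ids` and the constants are illustrative and are not part of the project's scripts.

```python
# Illustrative helper (not from the project repository): builds the four
# dataset repository IDs for a given year, following the naming convention
# described in the Datasets table.

ORG = "LawInBritishPeriodicals"

# Suffixes in pipeline order: source PDFs, page images, OCR, classifications.
STAGE_SUFFIXES = ("", "-images", "-ocr", "-classified")

# Years listed under "Current holdings".
CURRENT_HOLDINGS = (1770, 1811)


def dataset_ids(year: int) -> list[str]:
    """Return the dataset repository IDs for one year, in pipeline order."""
    return [f"{ORG}/{year}{suffix}" for suffix in STAGE_SUFFIXES]


if __name__ == "__main__":
    for year in CURRENT_HOLDINGS:
        for repo_id in dataset_ids(year):
            print(repo_id)
```

Because the datasets are private, loading any of these IDs (for example with the `datasets` library) would additionally require an authenticated Hugging Face account with access to the organization.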

Pipeline

The project uses a reproducible processing pipeline built on Hugging Face Jobs and uv scripts. Rasterization runs on CPU; OCR and topic classification run on L4 GPUs.

PDF documents
      │
      ▼  pdf_to_images.py  (CPU)
Page images (150 DPI PNG)
      │
      ▼  glm-ocr-v2.py  (L4 GPU)
OCR transcriptions (Markdown)
      │
      ▼  classify_topics.py  (L4 GPU)
Topic classifications (JSON with labels + confidence scores)

Pipeline scripts are stored at LawInBritishPeriodicals/scripts. Full documentation is included in that repository's README.
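To make the first stage concrete, the sketch below shows how a page's pixel size follows from rasterizing at 150 DPI and how the per-page metadata fields (`source_file`, `page_number`, `total_pages`) might be assembled. It is not taken from `pdf_to_images.py`: the function names are hypothetical, and 1-based page numbering is an assumption.

```python
from typing import Iterator

DPI = 150              # rasterization resolution stated in this card
POINTS_PER_INCH = 72   # PDF page sizes are measured in points (1/72 inch)


def pixel_size(width_pt: float, height_pt: float, dpi: int = DPI) -> tuple[int, int]:
    """Convert a PDF page size in points to pixel dimensions at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def page_rows(source_file: str, total_pages: int) -> Iterator[dict]:
    """Yield one metadata row per page.

    Assumption: page_number starts at 1; the real script may number from 0.
    """
    for page_number in range(1, total_pages + 1):
        yield {
            "source_file": source_file,
            "page_number": page_number,
            "total_pages": total_pages,
        }


if __name__ == "__main__":
    # A US Letter page (612 x 792 pt) becomes 1275 x 1650 px at 150 DPI.
    print(pixel_size(612, 792))
    # "example.pdf" is a made-up file name used only for illustration.
    for row in page_rows("example.pdf", total_pages=3):
        print(row)
```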

OCR Model

Transcription uses GLM-OCR (zai-org, MIT License), a 0.9B-parameter multimodal OCR model that scores 94.62% on OmniDocBench v1.5, ranking #1 overall at the time of writing. It handles 18th- and 19th-century printed English effectively and supports multilingual output.

Topic Classification Model

Classification uses Qwen2.5-7B-Instruct with a project-specific taxonomy:

Label	Scope
`legal`	Statutes, trials, legal commentary, court reports
`dramatic`	Theatre reviews, playbills, dramatic criticism
`parliamentary`	Parliamentary proceedings, political speeches, elections
`commercial`	Trade, prices, shipping, finance, advertisements
`literary`	Poetry, fiction, literary criticism, essays
`religious`	Sermons, moral philosophy, ecclesiastical affairs
`natural`	Natural history, medicine, science, technology
`other`	Does not fit any of the above

Pages may receive multiple labels. Each label carries a confidence score (0.0–1.0) and a one-sentence rationale.
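A consumer of the classification output might validate each page's labels against the taxonomy before using them. The sketch below assumes a JSON list of objects with the keys `label`, `confidence`, and `rationale`; these key names and the sample record are hypothetical, since the actual schema is defined in the project's scripts.

```python
import json

# The eight labels from the taxonomy table above.
TAXONOMY = {
    "legal", "dramatic", "parliamentary", "commercial",
    "literary", "religious", "natural", "other",
}


def parse_classification(raw: str) -> list[dict]:
    """Parse and validate one page's classification.

    Assumed (hypothetical) schema: a JSON list of objects with keys
    "label", "confidence", and "rationale".
    """
    validated = []
    for item in json.loads(raw):
        label = item["label"]
        confidence = float(item["confidence"])
        rationale = str(item["rationale"]).strip()
        if label not in TAXONOMY:
            raise ValueError(f"unknown label: {label!r}")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        if not rationale:
            raise ValueError(f"missing rationale for label {label!r}")
        validated.append(
            {"label": label, "confidence": confidence, "rationale": rationale}
        )
    return validated


if __name__ == "__main__":
    # Made-up example of a page that received two labels.
    sample = json.dumps([
        {"label": "legal", "confidence": 0.92,
         "rationale": "Reports the proceedings of a trial."},
        {"label": "parliamentary", "confidence": 0.41,
         "rationale": "Mentions a debate on the relevant bill."},
    ])
    for entry in parse_classification(sample):
        print(entry["label"], entry["confidence"])
```

Rejecting out-of-taxonomy labels and out-of-range scores early keeps malformed model output from propagating into downstream analysis.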

Spaces

Space	Description
`LawInBritishPeriodicals/1770-dashboard`	Interactive topic classification dashboard (label frequency, score distributions, co-occurrence heatmap, timeline)
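Two of the dashboard views, label frequency and the co-occurrence heatmap, rest on simple counts over multi-label pages. The following is a minimal sketch of how such counts could be computed; it is not the dashboard's code, and the function name `label_stats` and the sample input are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def label_stats(pages: list[list[str]]) -> tuple[Counter, Counter]:
    """Count label frequency and pairwise co-occurrence across pages.

    Each element of `pages` is the list of labels assigned to one page.
    Pairs are stored as alphabetically sorted tuples so that
    ("legal", "parliamentary") and ("parliamentary", "legal") are
    counted together.
    """
    frequency: Counter = Counter()
    cooccurrence: Counter = Counter()
    for labels in pages:
        unique = sorted(set(labels))  # ignore duplicate labels on a page
        frequency.update(unique)
        cooccurrence.update(combinations(unique, 2))
    return frequency, cooccurrence


if __name__ == "__main__":
    # Made-up labels for three pages.
    pages = [
        ["legal", "parliamentary"],
        ["legal"],
        ["parliamentary", "legal", "commercial"],
    ]
    frequency, cooccurrence = label_stats(pages)
    print(frequency.most_common())
    print(cooccurrence.most_common())
```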
