AI & ML interests

OCR, Topic Modeling, etc.

Law in British Periodicals

Law in British Periodicals is a collaborative digital humanities project investigating the representation of law, legal language, and heteroglossic discourse in 18th- and 19th-century British periodical literature. The project combines archival research with computational methods to trace how legal vocabulary, genres, and arguments circulated across the periodical press.

Team

| Name | Institution | Role |
| --- | --- | --- |
| Clifford B. Anderson | Yale University | |
| Corey Brady | Southern Methodist University | |
| Sophie Hao | Boston University | |
| Mark Schoenfield | Vanderbilt University | |

Datasets

All datasets are private due to contractual restrictions. Each year in the corpus is stored as four datasets: the source PDFs plus one dataset for the output of each of the three pipeline stages.

| Dataset | Description |
| --- | --- |
| LawInBritishPeriodicals/[year] | Source PDFs, one row per document |
| LawInBritishPeriodicals/[year]-images | Page images rasterized at 150 DPI, one row per page, with source_file, page_number, and total_pages metadata |
| LawInBritishPeriodicals/[year]-ocr | OCR transcriptions in Markdown format, generated with GLM-OCR |
| LawInBritishPeriodicals/[year]-classified | Topic classifications with confidence scores, generated with Qwen2.5-7B-Instruct |
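The naming scheme can be expressed as a small helper. This is a minimal sketch: `dataset_ids` and `page_records` are hypothetical names, not part of the project's scripts, and 1-based page numbering is an assumption. Only the repository suffixes and the three metadata field names come from the dataset descriptions.

```python
ORG = "LawInBritishPeriodicals"

# Suffixes taken from the dataset list; the empty suffix is the source-PDF dataset.
SUFFIXES = {"pdf": "", "images": "-images", "ocr": "-ocr", "classified": "-classified"}


def dataset_ids(year: int) -> dict[str, str]:
    """Return the four repository ids that hold one year of the corpus."""
    return {stage: f"{ORG}/{year}{suffix}" for stage, suffix in SUFFIXES.items()}


def page_records(source_file: str, total_pages: int) -> list[dict]:
    """Build the per-page metadata rows described for the -images dataset.

    Assumption: page_number starts at 1; the source does not say.
    """
    return [
        {"source_file": source_file, "page_number": n, "total_pages": total_pages}
        for n in range(1, total_pages + 1)
    ]


ids = dataset_ids(1770)
print(ids["ocr"])
print(page_records("example.pdf", 2))
```

Because the repositories are private, loading any of these ids would also require an access token with read permission on the organization.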

Current holdings: 1770 · 1811

Pipeline

The project uses a reproducible processing pipeline built on Hugging Face Jobs and uv scripts. The first stage runs on CPU; the OCR and classification stages run on an L4 GPU.

PDF documents
      │
      ▼  pdf_to_images.py  (CPU)
Page images (150 DPI PNG)
      │
      ▼  glm-ocr-v2.py  (L4 GPU)
OCR transcriptions (Markdown)
      │
      ▼  classify_topics.py  (L4 GPU)
Topic classifications (JSON with labels + confidence scores)

Pipeline scripts are stored at LawInBritishPeriodicals/scripts. Full documentation is included in that repository's README.
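As a rough illustration of how the three stages chain together, the sketch below lists each script with its hardware and the dataset it reads and writes. The `Stage` class and `plan` function are hypothetical and do not reflect the scripts' actual command-line interface; the pairing of each script with an input and output dataset is inferred from the pipeline diagram and the dataset naming scheme.

```python
from dataclasses import dataclass

ORG = "LawInBritishPeriodicals"


@dataclass(frozen=True)
class Stage:
    script: str    # script name from the pipeline diagram
    hardware: str  # "CPU" or "L4 GPU", as stated in the diagram
    reads: str     # dataset suffix consumed (inferred, not documented)
    writes: str    # dataset suffix produced (inferred, not documented)


STAGES = (
    Stage("pdf_to_images.py", "CPU", "", "-images"),
    Stage("glm-ocr-v2.py", "L4 GPU", "-images", "-ocr"),
    Stage("classify_topics.py", "L4 GPU", "-ocr", "-classified"),
)


def plan(year: int) -> list[tuple[str, str, str, str]]:
    """Return (script, hardware, input repo, output repo) in execution order."""
    return [
        (s.script, s.hardware, f"{ORG}/{year}{s.reads}", f"{ORG}/{year}{s.writes}")
        for s in STAGES
    ]


for script, hardware, src, dst in plan(1811):
    print(f"{script:20} {hardware:7} {src} -> {dst}")
```

Each stage's output repository is the next stage's input, so a new year can be processed by running the three scripts in order.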

OCR Model

Transcription uses GLM-OCR (zai-org, MIT License), a 0.9B-parameter multimodal OCR model that scores 94.62% on OmniDocBench v1.5, the top overall result at the time of writing. It handles 18th- and 19th-century printed English effectively and supports multilingual output.

Topic Classification Model

Classification uses Qwen2.5-7B-Instruct with a project-specific taxonomy:

| Label | Scope |
| --- | --- |
| legal | Statutes, trials, legal commentary, court reports |
| dramatic | Theatre reviews, playbills, dramatic criticism |
| parliamentary | Parliamentary proceedings, political speeches, elections |
| commercial | Trade, prices, shipping, finance, advertisements |
| literary | Poetry, fiction, literary criticism, essays |
| religious | Sermons, moral philosophy, ecclesiastical affairs |
| natural | Natural history, medicine, science, technology |
| other | Does not fit any of the above |

Pages may receive multiple labels. Each label carries a confidence score (0.0–1.0) and a one-sentence rationale.
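A classified page can be checked against these constraints before it is stored. The sketch below is an assumption about the record layout: the field names `labels`, `label`, `score`, and `rationale` are hypothetical, and `validate_page` is not part of the project's scripts. Only the eight label names and the 0.0 to 1.0 score range come from the taxonomy description.

```python
import json

# Labels from the project taxonomy.
TAXONOMY = {
    "legal", "dramatic", "parliamentary", "commercial",
    "literary", "religious", "natural", "other",
}


def validate_page(record: dict) -> list[str]:
    """Return a list of problems found in one classified page (empty if valid)."""
    problems = []
    for item in record.get("labels", []):
        label = item.get("label")
        score = item.get("score")
        if label not in TAXONOMY:
            problems.append(f"unknown label: {label!r}")
        # Reject non-numeric scores (including booleans) and values outside [0, 1].
        if (
            isinstance(score, bool)
            or not isinstance(score, (int, float))
            or not 0.0 <= score <= 1.0
        ):
            problems.append(f"bad score for {label!r}: {score!r}")
        if not str(item.get("rationale") or "").strip():
            problems.append(f"missing rationale for {label!r}")
    return problems


# A made-up example record, not project data.
raw = """
{"labels": [
  {"label": "legal", "score": 0.91, "rationale": "Reports a trial."},
  {"label": "parliamentary", "score": 0.40, "rationale": "Mentions a Commons debate."}
]}
"""
print(validate_page(json.loads(raw)))
```

Returning a list of problems instead of raising on the first one makes it easier to log every malformed label that the model produces for a page.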

Spaces

| Space | Description |
| --- | --- |
| LawInBritishPeriodicals/1770-dashboard | Interactive topic classification dashboard (label frequency, score distributions, co-occurrence heatmap, timeline) |
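The label-frequency and co-occurrence views can be derived from multi-label output with simple counting. This is a minimal sketch, assuming each page has already been reduced to a list of label strings; `label_stats` is a hypothetical helper and is not the dashboard's actual code.

```python
from collections import Counter
from itertools import combinations


def label_stats(pages: list[list[str]]) -> tuple[Counter, Counter]:
    """Count label frequency and pairwise co-occurrence across pages.

    Each page is a list of label strings. Pairs are stored in sorted order
    so that ("legal", "literary") and ("literary", "legal") count together.
    """
    frequency = Counter()
    cooccurrence = Counter()
    for labels in pages:
        unique = sorted(set(labels))  # ignore duplicate labels on one page
        frequency.update(unique)
        cooccurrence.update(combinations(unique, 2))
    return frequency, cooccurrence


# Made-up pages for illustration, not project data.
pages = [
    ["legal", "parliamentary"],
    ["legal", "literary"],
    ["legal", "parliamentary"],
    ["commercial"],
]
frequency, cooccurrence = label_stats(pages)
print(frequency.most_common(2))
print(cooccurrence[("legal", "parliamentary")])
```

The co-occurrence counter holds one entry per unordered label pair, which is the data a symmetric heatmap needs.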
