---
language: en
tags:
- vision
- text
- multimodal
- comics
- page-classification
- bert
license: mit
---
# CoSMo v4 (Comic Stream Modeling - Page Classifier)
CoSMo v4 is a highly specialized multimodal classifier designed to categorize pages within a comic book archive into distinct structural classes (e.g., story, cover, advertisement, credits).
It represents "Stage 2" of the Comic Analysis Framework v2.0, acting as the critical gatekeeper that filters raw comic archives down to pure narrative content for downstream sequence modeling.
This v4 iteration introduces the BookBERTMultimodal2 architecture, which replaces standard convolutional feature extractors with modern vision-language models, achieving state-of-the-art accuracy on unstructured comic data.
## Model Architecture
CoSMo v4 is based on the BookBERTMultimodal2 class. It treats a comic book as a "sequence" of pages and uses a Transformer encoder to understand the context of a page based on its position in the book.
- **Visual Features** (1152-dim): extracted using SigLIP (`google/siglip-so400m-patch14-384`).
- **Text Features** (1024-dim): extracted from OCR text using Qwen-Embedding (`Qwen/Qwen3-Embedding-0.6B`).
- **Projections**: deep MLP projection layers (visual: 1152 -> 3840 -> 1920 -> 768; text: 1024 -> 3584 -> 1792 -> 768) align both visual and text features into a common 768-dim space.
- **Contextual Encoding**: a 4-layer, 4-head BERT encoder (`transformers.BertModel`) processes the combined features across the entire length of the comic book, allowing the model to learn, for example, that an advertisement usually follows a story page, or that credits appear at the end.
- **Classification Head**: a deep sequential classifier maps each contextualized 768-dim token back to one of 9 distinct classes.
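The two-tokens-per-page layout fed to the BERT encoder can be illustrated with a small stand-alone sketch (toy tensors stand in for the projected features; this mirrors the reshaping in the model's `forward`, shown in full below):

```python
import torch

# Toy projected features for a "book" of 3 pages, embedding dim 4
batch, seq_len, dim = 1, 3, 4
text_feats = torch.arange(batch * seq_len * dim, dtype=torch.float32).view(batch, seq_len, dim)
visual_feats = -text_feats  # stand-in for the projected visual features

# Same reshaping the model uses: two (B, S, D) tensors -> (B, 2S, D),
# interleaving text and visual tokens page by page
combined = torch.stack([text_feats, visual_feats], dim=2).view(batch, seq_len * 2, dim)

print(combined.shape)  # torch.Size([1, 6, 4])
```

Even positions in the resulting sequence hold text tokens and odd positions hold visual tokens, so page `i` occupies tokens `2i` and `2i + 1`.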
## Output Classes
The model predicts one of 9 labels for every page:
- `advertisement`
- `cover`
- `story` (the primary narrative content)
- `textstory`
- `first-page`
- `credits`
- `art` (splash pages, pin-ups)
- `text` (editorial text)
- `back_cover`
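For the Stage 2 gatekeeping role described above, these labels are typically reduced to a keep/drop decision per page. A minimal sketch (which labels count as narrative is an assumption here; adjust the set to your pipeline):

```python
# Hypothetical filter: keep only narrative pages for downstream sequence modeling.
# Treating "story" and "first-page" as narrative is an assumption, not a spec.
NARRATIVE_LABELS = {"story", "first-page"}

def narrative_pages(labels):
    """Return the indices of pages whose predicted label is narrative content."""
    return [i for i, label in enumerate(labels) if label in NARRATIVE_LABELS]

page_labels = ["cover", "text", "story", "story", "advertisement", "story", "back_cover"]
print(narrative_pages(page_labels))  # [2, 3, 5]
```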
## Usage
Because CoSMo v4 requires pre-computed SigLIP and Qwen embeddings, inference is typically a two-step process. The complete codebase for embedding generation and Zarr-based inference is available in the Comic Analysis GitHub Repository under `src/cosmo/`.
### Quick Start Inference Snippet
If you already have your visual (1152-d) and text (1024-d) embeddings for a sequence of pages, you can run inference like this:
```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# 1. Define Architecture (must match the checkpoint exactly)
class BookBERT(nn.Module):
    def __init__(self, bert_input=768, num_classes=9, hidden_dim=512, dropout_p=0.0):
        super().__init__()
        config = BertConfig(
            hidden_size=bert_input, num_hidden_layers=4, num_attention_heads=4,
            intermediate_size=bert_input * 4, max_position_embeddings=1024
        )
        self.bert_encoder = BertModel(config)
        self.classifier = nn.Sequential(
            nn.Linear(bert_input, hidden_dim),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.LayerNorm(hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 2, hidden_dim // 4),
            nn.LayerNorm(hidden_dim // 4),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 4, num_classes)
        )

class BookBERTMultimodal2(BookBERT):
    def __init__(self, textual_dim=1024, visual_dim=1152, bert_dim=768, classes=9):
        super().__init__(bert_input=bert_dim, num_classes=classes, hidden_dim=512, dropout_p=0.0)
        sz1_v = (visual_dim + bert_dim) * 2
        self.visual_projection = nn.Sequential(
            nn.Linear(visual_dim, sz1_v), nn.LayerNorm(sz1_v), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v, sz1_v // 2), nn.LayerNorm(sz1_v // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v // 2, bert_dim)
        )
        sz1_t = (textual_dim + bert_dim) * 2
        self.textual_projection = nn.Sequential(
            nn.Linear(textual_dim, sz1_t), nn.LayerNorm(sz1_t), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t, sz1_t // 2), nn.LayerNorm(sz1_t // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t // 2, bert_dim)
        )
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, textual_features, visual_features):
        batch_size, seq_len, _ = textual_features.shape
        mask = torch.ones((batch_size, seq_len), device=textual_features.device)
        t_norm = self.norm(self.textual_projection(textual_features))
        v_norm = self.norm(self.visual_projection(visual_features))
        # Interleave text/visual tokens per page: (B, S, D) x 2 -> (B, 2S, D)
        combined = torch.stack([t_norm, v_norm], dim=2).view(batch_size, seq_len * 2, -1)
        exp_mask = mask.unsqueeze(2).expand(-1, -1, 2).reshape(batch_size, seq_len * 2)
        bert_out = self.bert_encoder(inputs_embeds=combined, attention_mask=exp_mask)
        # Classify each page from the visual token of its (text, visual) pair
        reshaped = bert_out.last_hidden_state.view(batch_size, seq_len, 2, -1)
        return self.classifier(reshaped[:, :, -1, :])

# 2. Load Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BookBERTMultimodal2().to(device)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/cosmo-v4/resolve/main/best_Multimodal_MultiToken_v4.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict, strict=True)
model.eval()

# 3. Inference (example: 1 comic book containing 24 pages)
# visual_embeddings shape: (1, 24, 1152) -> from SigLIP
# text_embeddings shape:   (1, 24, 1024) -> from Qwen
visual_embs = torch.randn(1, 24, 1152).to(device)
text_embs = torch.randn(1, 24, 1024).to(device)

with torch.inference_mode():
    logits = model(text_embs, visual_embs)
    predictions = torch.argmax(logits, dim=-1).squeeze(0)

class_names = ["advertisement", "cover", "story", "textstory", "first-page",
               "credits", "art", "text", "back_cover"]
for page_num, pred_idx in enumerate(predictions):
    print(f"Page {page_num}: {class_names[pred_idx]}")
```
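If you also want a confidence score per page rather than just the argmax label, the raw logits can be converted to class probabilities with a softmax. A small sketch, using random stand-in logits in place of real model output:

```python
import torch

logits = torch.randn(1, 24, 9)  # stand-in for the model's (batch, pages, classes) output
probs = torch.softmax(logits, dim=-1)        # each page's row sums to 1
confidence, predictions = probs.max(dim=-1)  # per-page top probability and class index

print(confidence.shape, predictions.shape)  # torch.Size([1, 24]) torch.Size([1, 24])
```

Low-confidence pages can then be flagged for manual review instead of being filtered automatically.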
## Intended Use
This model is designed to process entire comic books/issues as a single sequence. Due to the positional embeddings in the BERT encoder, feeding it pages completely out of order or feeding it a single page at a time without context will degrade performance.
Note: The BERT encoder has a hard limit of 1024 position embeddings, and each page consumes 2 tokens (one text, one visual), so a single forward pass can cover at most 512 pages. For massive omnibuses, chunking is required.
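Chunking can be as simple as slicing the page axis into windows of at most 512 pages before each forward pass. A sketch (non-overlapping windows; overlapping them slightly may help preserve context at the boundaries):

```python
def chunk_ranges(n_pages, max_pages=512):
    """Split a book of n_pages into (start, end) slices of at most max_pages each."""
    return [(start, min(start + max_pages, n_pages))
            for start in range(0, n_pages, max_pages)]

# e.g. a 1300-page omnibus becomes three forward passes
print(chunk_ranges(1300))  # [(0, 512), (512, 1024), (1024, 1300)]
```

Each `(start, end)` range would then index the embedding tensors along the page dimension, e.g. `visual_embs[:, start:end]`.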
## Citation
If you use this model or the framework, please reference the Comic Analysis GitHub Repository.
## Example of CoSMo v4 predictions
```
--- Predictions (First 25) ---
Page ID | Label
--------------------------------------------------------------------------------
#Guardian 001_#Guardian 001 - p000.jpg | cover
#Guardian 001_#Guardian 001 - p001.jpg | text
#Guardian 001_#Guardian 001 - p002.jpg | story
#Guardian 001_#Guardian 001 - p003.jpg | story
#Guardian 001_#Guardian 001 - p004.jpg | story
#Guardian 001_#Guardian 001 - p005.jpg | story
#Guardian 001_#Guardian 001 - p006.jpg | story
#Guardian 001_#Guardian 001 - p007.jpg | story
#Guardian 001_#Guardian 001 - p008.jpg | story
#Guardian 001_#Guardian 001 - p009.jpg | advertisement
#Guardian 001_#Guardian 001 - p010.jpg | story
#Guardian 001_#Guardian 001 - p011.jpg | story
#Guardian 001_#Guardian 001 - p012.jpg | story
#Guardian 001_#Guardian 001 - p013.jpg | story
#Guardian 001_#Guardian 001 - p014.jpg | story
#Guardian 001_#Guardian 001 - p015.jpg | story
#Guardian 001_#Guardian 001 - p016.jpg | story
#Guardian 001_#Guardian 001 - p017.jpg | story
#Guardian 001_#Guardian 001 - p018.jpg | story
#Guardian 001_#Guardian 001 - p019.jpg | story
#Guardian 001_#Guardian 001 - p020.jpg | story
#Guardian 001_#Guardian 001 - p021.jpg | story
#Guardian 001_#Guardian 001 - p022.jpg | story
#Guardian 001_#Guardian 001 - p023.jpg | story
#Guardian 001_#Guardian 001 - p024.jpg | text
```