---
language: en
tags:
- vision
- text
- multimodal
- comics
- page-classification
- bert
license: mit
---
# CoSMo v4 (Comic Stream Modeling - Page Classifier)
CoSMo v4 is a highly specialized multimodal classifier designed to categorize pages within a comic book archive into distinct structural classes (e.g., story, cover, advertisement, credits).
It represents "Stage 2" of the Comic Analysis Framework v2.0, acting as the critical gatekeeper that filters raw comic archives down to pure narrative content for downstream sequence modeling.
This v4 iteration introduces the BookBERTMultimodal2 architecture, which replaces standard convolutional feature extractors with modern vision-language models, achieving state-of-the-art accuracy on unstructured comic data.
## Model Architecture
CoSMo v4 is based on the BookBERTMultimodal2 class. It treats a comic book as a "sequence" of pages and uses a Transformer encoder to understand the context of a page based on its position in the book.
- **Visual Features** (1152-dim): extracted using SigLIP (`google/siglip-so400m-patch14-384`).
- **Text Features** (1024-dim): extracted from OCR text using Qwen-Embedding (`Qwen/Qwen3-Embedding-0.6B`).
- **Projections**: deep MLP projection layers (visual: 1152 -> 3840 -> 1920 -> 768; text: 1024 -> 3584 -> 1792 -> 768) align both visual and text features into a common 768-dim space.
- **Contextual Encoding**: a 4-layer, 4-head BERT encoder (`transformers.BertModel`) processes the combined features across the entire length of the comic book, allowing the model to learn, for example, that an advertisement usually follows a story page, or that credits appear at the end.
- **Classification Head**: a deep sequential classifier maps each contextualized 768-dim token back to one of 9 distinct classes.
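The two-tokens-per-page layout fed to the BERT encoder can be illustrated with a small stand-alone sketch (toy tensors stand in for the projected features; this mirrors the reshaping in the model's `forward`, shown in full below):

```python
import torch

# Toy projected features for a "book" of 3 pages, embedding dim 4
batch, seq_len, dim = 1, 3, 4
text_feats = torch.arange(batch * seq_len * dim, dtype=torch.float32).view(batch, seq_len, dim)
visual_feats = -text_feats  # stand-in for the projected visual features

# Same reshaping the model uses: two (B, S, D) tensors -> (B, 2S, D),
# interleaving text and visual tokens page by page
combined = torch.stack([text_feats, visual_feats], dim=2).view(batch, seq_len * 2, dim)

print(combined.shape)  # torch.Size([1, 6, 4])
```

Even positions in the resulting sequence hold text tokens and odd positions hold visual tokens, so page `i` occupies tokens `2i` and `2i + 1`.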
## Output Classes
The model predicts one of 9 labels for every page:
- `advertisement`
- `cover`
- `story` (the primary narrative content)
- `textstory`
- `first-page`
- `credits`
- `art` (splash pages, pin-ups)
- `text` (editorial text)
- `back_cover`
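For the Stage 2 gatekeeping role described above, these labels are typically reduced to a keep/drop decision per page. A minimal sketch (which labels count as narrative is an assumption here; adjust the set to your pipeline):

```python
# Hypothetical filter: keep only narrative pages for downstream sequence modeling.
# Treating "story" and "first-page" as narrative is an assumption, not a spec.
NARRATIVE_LABELS = {"story", "first-page"}

def narrative_pages(labels):
    """Return the indices of pages whose predicted label is narrative content."""
    return [i for i, label in enumerate(labels) if label in NARRATIVE_LABELS]

page_labels = ["cover", "text", "story", "story", "advertisement", "story", "back_cover"]
print(narrative_pages(page_labels))  # [2, 3, 5]
```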
## Usage
Because CoSMo v4 requires pre-computed SigLIP and Qwen embeddings, inference is typically a two-step process. The complete codebase for embedding generation and Zarr-based inference is available in the Comic Analysis GitHub Repository under `src/cosmo/`.
### Quick Start Inference Snippet
If you already have your visual (1152-d) and text (1024-d) embeddings for a sequence of pages, you can run inference like this:
```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# 1. Define Architecture (must match the checkpoint exactly)
class BookBERT(nn.Module):
    def __init__(self, bert_input=768, num_classes=9, hidden_dim=512, dropout_p=0.0):
        super().__init__()
        config = BertConfig(
            hidden_size=bert_input, num_hidden_layers=4, num_attention_heads=4,
            intermediate_size=bert_input * 4, max_position_embeddings=1024
        )
        self.bert_encoder = BertModel(config)
        self.classifier = nn.Sequential(
            nn.Linear(bert_input, hidden_dim),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.LayerNorm(hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 2, hidden_dim // 4),
            nn.LayerNorm(hidden_dim // 4),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 4, num_classes)
        )

class BookBERTMultimodal2(BookBERT):
    def __init__(self, textual_dim=1024, visual_dim=1152, bert_dim=768, classes=9):
        super().__init__(bert_input=bert_dim, num_classes=classes, hidden_dim=512, dropout_p=0.0)
        sz1_v = (visual_dim + bert_dim) * 2
        self.visual_projection = nn.Sequential(
            nn.Linear(visual_dim, sz1_v), nn.LayerNorm(sz1_v), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v, sz1_v // 2), nn.LayerNorm(sz1_v // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v // 2, bert_dim)
        )
        sz1_t = (textual_dim + bert_dim) * 2
        self.textual_projection = nn.Sequential(
            nn.Linear(textual_dim, sz1_t), nn.LayerNorm(sz1_t), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t, sz1_t // 2), nn.LayerNorm(sz1_t // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t // 2, bert_dim)
        )
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, textual_features, visual_features):
        batch_size, seq_len, _ = textual_features.shape
        mask = torch.ones((batch_size, seq_len), device=textual_features.device)
        t_norm = self.norm(self.textual_projection(textual_features))
        v_norm = self.norm(self.visual_projection(visual_features))
        # Interleave text/visual tokens per page: (B, S, D) x 2 -> (B, 2S, D)
        combined = torch.stack([t_norm, v_norm], dim=2).view(batch_size, seq_len * 2, -1)
        exp_mask = mask.unsqueeze(2).expand(-1, -1, 2).reshape(batch_size, seq_len * 2)
        bert_out = self.bert_encoder(inputs_embeds=combined, attention_mask=exp_mask)
        # Classify each page from the visual token of its (text, visual) pair
        reshaped = bert_out.last_hidden_state.view(batch_size, seq_len, 2, -1)
        return self.classifier(reshaped[:, :, -1, :])

# 2. Load Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BookBERTMultimodal2().to(device)
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/cosmo-v4/resolve/main/best_Multimodal_MultiToken_v4.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict, strict=True)
model.eval()

# 3. Inference (example: 1 comic book containing 24 pages)
# visual_embeddings shape: (1, 24, 1152) -> from SigLIP
# text_embeddings shape:   (1, 24, 1024) -> from Qwen
visual_embs = torch.randn(1, 24, 1152).to(device)
text_embs = torch.randn(1, 24, 1024).to(device)

with torch.inference_mode():
    logits = model(text_embs, visual_embs)
    predictions = torch.argmax(logits, dim=-1).squeeze(0)

class_names = ["advertisement", "cover", "story", "textstory", "first-page",
               "credits", "art", "text", "back_cover"]
for page_num, pred_idx in enumerate(predictions):
    print(f"Page {page_num}: {class_names[pred_idx]}")
```
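If you also want a confidence score per page rather than just the argmax label, the raw logits can be converted to class probabilities with a softmax. A small sketch, using random stand-in logits in place of real model output:

```python
import torch

logits = torch.randn(1, 24, 9)  # stand-in for the model's (batch, pages, classes) output
probs = torch.softmax(logits, dim=-1)        # each page's row sums to 1
confidence, predictions = probs.max(dim=-1)  # per-page top probability and class index

print(confidence.shape, predictions.shape)  # torch.Size([1, 24]) torch.Size([1, 24])
```

Low-confidence pages can then be flagged for manual review instead of being filtered automatically.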
## Intended Use
This model is designed to process entire comic books/issues as a single sequence. Due to the positional embeddings in the BERT encoder, feeding it pages completely out of order or feeding it a single page at a time without context will degrade performance.
Note: The BERT encoder has a hard limit of 1024 position embeddings, and each page consumes 2 tokens (one text, one visual), so a single forward pass can cover at most 512 pages. For massive omnibuses, chunking is required.
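Chunking can be as simple as slicing the page axis into windows of at most 512 pages before each forward pass. A sketch (non-overlapping windows; overlapping them slightly may help preserve context at the boundaries):

```python
def chunk_ranges(n_pages, max_pages=512):
    """Split a book of n_pages into (start, end) slices of at most max_pages each."""
    return [(start, min(start + max_pages, n_pages))
            for start in range(0, n_pages, max_pages)]

# e.g. a 1300-page omnibus becomes three forward passes
print(chunk_ranges(1300))  # [(0, 512), (512, 1024), (1024, 1300)]
```

Each `(start, end)` range would then index the embedding tensors along the page dimension, e.g. `visual_embs[:, start:end]`.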
## Citation
If you use this model or the framework, please reference the Comic Analysis GitHub Repository.
## Example of CoSMo v4 predictions
```
--- Predictions (First 25) ---
Page ID | Label
--------------------------------------------------------------------------------
#Guardian 001_#Guardian 001 - p000.jpg | cover
#Guardian 001_#Guardian 001 - p001.jpg | text
#Guardian 001_#Guardian 001 - p002.jpg | story
#Guardian 001_#Guardian 001 - p003.jpg | story
#Guardian 001_#Guardian 001 - p004.jpg | story
#Guardian 001_#Guardian 001 - p005.jpg | story
#Guardian 001_#Guardian 001 - p006.jpg | story
#Guardian 001_#Guardian 001 - p007.jpg | story
#Guardian 001_#Guardian 001 - p008.jpg | story
#Guardian 001_#Guardian 001 - p009.jpg | advertisement
#Guardian 001_#Guardian 001 - p010.jpg | story
#Guardian 001_#Guardian 001 - p011.jpg | story
#Guardian 001_#Guardian 001 - p012.jpg | story
#Guardian 001_#Guardian 001 - p013.jpg | story
#Guardian 001_#Guardian 001 - p014.jpg | story
#Guardian 001_#Guardian 001 - p015.jpg | story
#Guardian 001_#Guardian 001 - p016.jpg | story
#Guardian 001_#Guardian 001 - p017.jpg | story
#Guardian 001_#Guardian 001 - p018.jpg | story
#Guardian 001_#Guardian 001 - p019.jpg | story
#Guardian 001_#Guardian 001 - p020.jpg | story
#Guardian 001_#Guardian 001 - p021.jpg | story
#Guardian 001_#Guardian 001 - p022.jpg | story
#Guardian 001_#Guardian 001 - p023.jpg | story
#Guardian 001_#Guardian 001 - p024.jpg | text
```