ViT-B model for Illustrations

The vision encoder can be pretrained with an autoregressive language modeling objective alone: no contrastive loss, no dual-tower architecture, and no extra text decoder.
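Concretely, a single-tower objective of this kind can be sketched as below. This is a minimal illustration only, assuming the encoder sees projected image patches followed by caption tokens and is trained with plain next-token cross-entropy on the text; the class and parameter names (`AutoregressiveViT`, `trunk`, `vocab_size`) are hypothetical, not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveViT(nn.Module):
    """Single tower: the ViT trunk itself predicts the caption tokens."""

    def __init__(self, trunk: nn.Module, dim: int = 768, vocab_size: int = 32000):
        super().__init__()
        self.trunk = trunk                          # ViT-B encoder, assumed causal over the text span
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)   # next-token head, not a separate text decoder

    def forward(self, patch_embeds: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, P, dim) projected image patches; text_ids: (B, T) caption tokens
        x = torch.cat([patch_embeds, self.text_embed(text_ids)], dim=1)
        h = self.trunk(x)                           # (B, P + T, dim)
        P, T = patch_embeds.size(1), text_ids.size(1)
        # Standard next-token shift: the state at position P-1 predicts the
        # first caption token, and the state at P + t predicts token t + 1.
        logits = self.lm_head(h[:, P - 1 : P + T - 1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
```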

The model processes both visual and textual data in a single sequence. Text token positions are encoded with an ALiBi bias added to the attention scores (arXiv:2108.12409).
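The ALiBi bias itself is simple to construct. The sketch below follows the head-wise slope schedule from arXiv:2108.12409; this card does not specify how the bias interacts with image patch positions, so zero bias on patches is an assumption made here for illustration.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_patches: int, num_text: int, num_heads: int) -> torch.Tensor:
    # Bias added to the attention logits before softmax. Patch positions get
    # zero bias (an assumption here); text-to-text attention is penalized by
    # -slope * distance, so nearby caption tokens attend to each other more strongly.
    seq_len = num_patches + num_text
    bias = torch.zeros(num_heads, seq_len, seq_len)
    pos = torch.arange(num_text)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0).float()  # future keys are causally masked anyway
    bias[:, num_patches:, num_patches:] = -alibi_slopes(num_heads)[:, None, None] * dist
    return bias
```

The returned tensor is added to the raw attention scores of every layer before the softmax; per the ALiBi paper, this is what allows extrapolation to sequence lengths longer than those seen during training.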

Source data

  • danbooru 2025-26

References

  • 2108.12409
  • 2604.12012
  • 2605.00809