ViT-B model for Illustrations
The vision encoder can be pretrained with an autoregressive language modeling objective alone: no contrastive loss, no dual-tower architecture, and no extra text decoder.
The model predicts both image and text tokens; text token positions are encoded with an ALiBi bias added to the attention scores.
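The card only names ALiBi; as a reminder of what that bias looks like, here is a minimal dependency-free sketch of the standard formulation from the ALiBi paper (arXiv:2108.12409). The head count and sequence length are illustrative, and the power-of-two assumption on the number of heads follows the paper's basic slope recipe (it gives an interpolation scheme for other counts).

```python
import math

def alibi_slopes(num_heads):
    """Head-specific slopes from the ALiBi paper (arXiv:2108.12409).

    Assumes num_heads is a power of two; the paper describes an
    interpolation scheme for other head counts.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (k + 1) for k in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    """Additive attention bias: bias[h][i][j] = -slope_h * (i - j).

    The bias is added to the raw attention scores before softmax.
    In causal attention only positions j <= i are attended to; future
    positions are masked separately, so their entries never matter.
    """
    slopes = alibi_slopes(num_heads)
    return [[[m * (j - i) for j in range(seq_len)]
             for i in range(seq_len)]
            for m in slopes]
```

Each head penalizes attention linearly with distance, with no learned position embeddings; for 8 heads the slopes are 1/2, 1/4, ..., 1/256, so different heads decay attention at different rates.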
Source data
- danbooru 2025-26
References
- 2108.12409
- 2604.12012
- 2605.00809
