ViT-B model for Illustrations
The vision encoder can be pretrained with an autoregressive language modeling objective alone: no contrastive loss, no dual-tower architecture, and no extra text decoder.
The model predicts both image and text tokens; text token positions are encoded with an ALiBi bias added to the attention scores.
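The card only names ALiBi; as a reminder of what that bias looks like, here is a minimal dependency-free sketch of the standard formulation from the ALiBi paper (arXiv:2108.12409). The head count and sequence length are illustrative, and the power-of-two assumption on the number of heads follows the paper's basic slope recipe (it gives an interpolation scheme for other counts).

```python
import math

def alibi_slopes(num_heads):
    """Head-specific slopes from the ALiBi paper (arXiv:2108.12409).

    Assumes num_heads is a power of two; the paper describes an
    interpolation scheme for other head counts.
    """
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (k + 1) for k in range(num_heads)]

def alibi_bias(num_heads, seq_len):
    """Additive attention bias: bias[h][i][j] = -slope_h * (i - j).

    The bias is added to the raw attention scores before softmax.
    In causal attention only positions j <= i are attended to; future
    positions are masked separately, so their entries never matter.
    """
    slopes = alibi_slopes(num_heads)
    return [[[m * (j - i) for j in range(seq_len)]
             for i in range(seq_len)]
            for m in slopes]
```

Each head penalizes attention linearly with distance, with no learned position embeddings; for 8 heads the slopes are 1/2, 1/4, ..., 1/256, so different heads decay attention at different rates.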
Source data
- danbooru 2025-26
References
- 2108.12409
- 2604.12012
- 2605.00809
