Anime-Elite V2: From-Scratch Text-to-Anime-Face Diffusion
Built for mid-tier devices. Runs on 2 GB VRAM. Trained on a single RTX 5080. No pretrained components anywhere in the stack.
Recommended First Prompt: python inference.py --prompt "1girl, portrait, long hair, looking at viewer, red eyes" --seed 56 --steps 190 --guidance 2.4
V2 is the follow-up to Anime-Elite-V1. Same from-scratch philosophy (no pretrained VAE, no LoRA over base SD, no fine-tuning). What changed is reliability: V1 had high peaks but inconsistent average samples. V2 smooths that out with EMA-averaged weights, so the floor rises significantly and seed-to-seed variance drops.
If you want to use this model: head to Files and versions and download one of the .pt checkpoints + inference.py. That's it.
What changed from V1
- EMA weights (decay 0.999): samples come from an exponentially averaged shadow of the training weights instead of the noisy live weights. Single biggest quality lever for small diffusion. This is what fixed V1's inconsistency.
- 17k training images (V1 used 10k). More diversity, more characters, more tag combinations covered.
- 96×96 throughout: config consistent everywhere (V1 had mixed metadata).
- Same 67M-parameter architecture. Same Danbooru-style tag conditioning. Same MIT license.
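The EMA update behind the first bullet can be sketched in a few lines. This is a minimal, framework-free illustration, not the training code from this repo: the names ema_update, shadow, and live are hypothetical, and plain floats stand in for weight tensors. Only the decay of 0.999 and the once-per-step update come from this card.

```python
import random

def ema_update(shadow, live, decay=0.999):
    """Blend the live weights into the shadow copy in place (one call per training step)."""
    for name, value in live.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy demo: one "weight" that jitters around 1.0, standing in for noisy training updates.
random.seed(0)
live = {"w": 0.0}
shadow = dict(live)  # the shadow starts as a copy of the live weights
for step in range(10000):
    live["w"] = 1.0 + random.uniform(-0.5, 0.5)
    ema_update(shadow, live)

print(f"live={live['w']:.3f} shadow={shadow['w']:.3f}")
```

With decay 0.999 the shadow averages over roughly the last 1000 steps (1 / (1 - 0.999)), so it stays close to 1.0 while the live value keeps jumping around. Sampling from the shadow weights gives the same kind of smoothing.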
Showcase
The samples below are all from V2: single checkpoint, no post-processing, no upscaler. Just the model output as it comes out at 96×96.
These are cherry-picked but not unreachable: the reference seeds below reproduce these compositions consistently across GPUs (with minor bf16 numerical drift on different architectures).
Best sampling config
After a lot of seed sweeping, this is the config that gave the strongest results:
prompt: 1girl,portrait,long hair,<color> eyes,<color> hair
guidance: 1.8–2.5
DDIM steps: 50–200 (200 = sharper, 50 = faster)
seeds: 26, 56 (peak compositions on the reference checkpoint)
Higher guidance pushes tag adherence but can wash out colors. Stay in 1.8–2.5 for the cleanest results.
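For context, a guidance value usually enters sampling through the standard classifier-free guidance blend. The sketch below assumes inference.py follows this common formulation (the script is not reproduced here). The helper apply_guidance and the toy vectors are hypothetical stand-ins for real noise predictions.

```python
def apply_guidance(eps_uncond, eps_cond, guidance):
    """Classifier-free guidance: start at the unconditional prediction and
    add guidance * (conditional - unconditional)."""
    return [u + guidance * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.20, 0.05]  # toy prediction with the null condition
eps_cond = [0.30, -0.10, 0.00]    # toy prediction with the tag condition

for g in (1.0, 1.8, 2.5):
    print(g, [round(x, 3) for x in apply_guidance(eps_uncond, eps_cond, g)])
```

A guidance of 1.0 returns the conditional prediction unchanged. Values above 1.0 extrapolate past it, which fits the trade-off described above: stronger tag adherence, with a risk of washed-out colors as the value grows.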
How to use
pip install torch diffusers pillow
python inference.py --ckpt ckpt_e040.pt --prompt "1girl,portrait,long hair,red eyes" --seed 26
That's it. The script auto-detects EMA weights inside the checkpoint and uses them. Output saves to out/.
For batch sampling or to scan many seeds, see seed_sweep.py linked in the repo (generates a labeled grid of N seeds in one pass).
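If seed_sweep.py is not at hand, a small wrapper can call the CLI once per seed. This is a sketch, not the repo's sweep script: build_commands is a hypothetical helper, it uses only flags that appear in this README (--ckpt, --prompt, --seed, --steps, --guidance), and it assumes those flags can be combined and that inference.py and the checkpoint sit in the current directory.

```python
import subprocess
import sys

def build_commands(prompt, seeds, ckpt="ckpt_e040.pt", steps=200, guidance=2.4):
    """Return one inference.py command (as an argument list) per seed."""
    return [
        [sys.executable, "inference.py", "--ckpt", ckpt, "--prompt", prompt,
         "--seed", str(seed), "--steps", str(steps), "--guidance", str(guidance)]
        for seed in seeds
    ]

commands = build_commands("1girl,portrait,long hair,red eyes", seeds=[26, 56])
for cmd in commands:
    print(" ".join(cmd))

RUN = False  # set to True to actually launch the script once per seed
if RUN:
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Each run reloads the checkpoint, so this is slower than a single-pass sweep, but it needs nothing beyond the files already listed here.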
Tag prompting
This model uses Danbooru-style tags, comma-separated. Exact matches only: no CLIP, no natural language understanding.
- "1girl" not "girl". "red hair" not "red hairs". "blue eyes" not "blue eye".
- Stack 4–8 tags per prompt. Anchor with high-frequency tags first (1girl, portrait, long hair), then add specifics.
- Common tags in the vocab: 1girl, portrait, long hair, short hair, blue eyes, red eyes, green eyes, purple eyes, pink eyes, red hair, blue hair, brown hair, white hair, pink hair, purple hair, green hair, smile, blush, looking at viewer, floral background, choker, closed mouth, bangs.
- Print vocab after loading to see the full 512-tag list.
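Exact-match tag lookup can be pictured as follows. This is an illustrative sketch, not the tokenizer in inference.py: match_tags and the small VOCAB set are hypothetical, and VOCAB holds only a handful of the tags listed above rather than the full 512-tag vocabulary.

```python
# Hypothetical mini-vocabulary; the real one has 512 tags.
VOCAB = {"1girl", "portrait", "long hair", "red hair", "blue eyes", "red eyes", "smile"}

def match_tags(prompt, vocab=VOCAB):
    """Split a comma-separated prompt and keep only tags found verbatim in vocab."""
    tags = [t.strip() for t in prompt.split(",") if t.strip()]
    matched = [t for t in tags if t in vocab]
    ignored = [t for t in tags if t not in vocab]
    return matched, ignored

matched, ignored = match_tags("1girl, portrait, red hairs, blue eye, smile")
print("Matched tags:", matched)
print("Ignored:", ignored)
```

Here "red hairs" and "blue eye" land in the ignored list because neither appears verbatim in the vocabulary, which is the kind of silent drop worth checking for before sampling.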
Limitations
Honest list:
- 96×96 only. Small. Pair with a Real-ESRGAN or similar upscaler for usable sizes. The model produces clean 96px output; upscaling is a separate concern.
- Heavy female bias. Dataset is over 90% female anime characters. 1boy mostly produces feminine-coded outputs because the model rarely saw counter-examples.
- Tag exact-match. Typos = silent ignore. Always check the printed Matched tags line.
- Hit rate ~20–25%. Generate 5–10 candidates per prompt and pick the best; this is typical for small diffusion models. Don't judge from a single sample.
- No safety checker. Anime faces, no real people. Use at your own discretion.
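The advice to generate several candidates follows from simple arithmetic. Assuming each sample is an independent draw at the hit rate quoted above (an idealization, with p_at_least_one as a hypothetical helper), the chance of at least one good sample in n tries is 1 - (1 - p)^n:

```python
def p_at_least_one(hit_rate, n):
    """Probability that at least one of n independent samples is a hit."""
    return 1.0 - (1.0 - hit_rate) ** n

for hit_rate in (0.20, 0.25):
    for n in (1, 5, 10):
        chance = p_at_least_one(hit_rate, n)
        print(f"hit rate {hit_rate:.0%}, {n:2d} samples: {chance:.0%}")
```

Under that assumption, 5 samples give roughly a 67% to 76% chance of at least one hit, and 10 samples roughly 89% to 94%.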
What's in this repo
- ckpt_e040.pt, ckpt_e045.pt, ckpt_e050.pt: EMA-weight checkpoints from the final stretch of training. e040 is the recommended default. Each ~270 MB.
- inference.py: single-file CLI inference script
- peak1.png ... peak6.png: direct model outputs, no upscaling
- This README
Training details
- Optimizer: AdamW, lr=1e-4 constant, betas=(0.9, 0.999), wd=1e-6
- Scheduler: DDPM, 1000 steps, squaredcos_cap_v2
- Loss: noise prediction, plain MSE (no min-SNR weighting; V2 experiments showed it hurt at this scale)
- CFG dropout: 10% null condition during training
- Mixed precision: bf16 autocast
- Batch size: 16
- 50 epochs × 1062 steps per epoch = 53,100 total steps (~53k)
- EMA: decay 0.999, updated every step
- Hardware: RTX 5080 16 GB, Windows 11, ~3 hours wall clock
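For readers unfamiliar with the squaredcos_cap_v2 name, it refers to the cosine noise schedule with betas capped at 0.999. The sketch below reimplements that formula in plain Python for illustration. It is not the diffusers code itself, cosine_betas is a hypothetical name, and the 0.008 offset and 0.999 cap are the usual defaults for this schedule, assumed here rather than read from this repo's config.

```python
import math

def cosine_betas(num_steps=1000, max_beta=0.999, offset=0.008):
    """Betas derived from the squared-cosine alpha-bar curve, capped at max_beta."""
    def alpha_bar(t):
        # Fraction of signal remaining at normalized time t in [0, 1].
        return math.cos((t + offset) / (1.0 + offset) * math.pi / 2.0) ** 2

    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

betas = cosine_betas()
print(len(betas), f"first={betas[0]:.6f}", f"last={betas[-1]:.3f}")
```

The schedule adds very little noise in the early steps and hits the cap at the final step, where almost no signal remains.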
Acknowledgements
- Dataset: puruchinera/anime-faces-256
- Architecture: HuggingFace diffusers (UNet2DConditionModel, random init)
- Predecessor: Anime-Elite-V1
Built solo across a couple of late nights. The hardest part wasn't the training; it was finding the right single change to make on top of V1 (turned out to be EMA, just EMA, nothing else). Hope the from-scratch approach is useful to anyone exploring small-diffusion territory.