Anime-Elite V2 — From-Scratch Text-to-Anime-Face Diffusion

Built for mid-tier devices. Runs on 2 GB VRAM. Trained on a single RTX 5080. No pretrained components anywhere in the stack.

Recommended First Prompt: python inference.py --prompt "1girl, portrait, long hair, looking at viewer, red eyes" --seed 56 --steps 190 --guidance 2.4

V2 is the follow-up to Anime-Elite-V1. Same from-scratch philosophy (no pretrained VAE, no LoRA over base SD, no fine-tuning). What changed is reliability: V1 had high peaks but inconsistent average samples. V2 smooths that out by sampling from EMA (exponential moving average) weights, which raises the floor and reduces seed-to-seed variance.

To use this model, go to Files and versions and download one of the .pt checkpoints plus inference.py. That's it.

What changed from V1

EMA weights (decay 0.999): samples come from an exponential moving average (a shadow copy) of the training weights instead of the noisy live weights. For this model it was the single biggest quality lever, and it is what fixed V1's inconsistency.
17k training images (V1 used 10k). More diversity, more characters, more tag combinations covered.
96×96 throughout: the resolution is now consistent across every config (V1 had mixed metadata).
Same 67M-parameter architecture. Same Danbooru-style tag conditioning. Same MIT license.
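
The EMA update itself is small. Below is a minimal sketch in plain Python, assuming the usual rule shadow = decay * shadow + (1 - decay) * live applied once per training step. The name ema_update and the dict-of-floats weights are illustrative stand-ins for real tensors, not code from the training script.

```python
def ema_update(shadow, live, decay=0.999):
    """Blend the live weights into the shadow copy in place and return it."""
    for name, value in live.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow


# Toy run: the "live" weight jitters between 0.5 and 1.5 every step,
# while the shadow copy settles close to the mean of 1.0.
shadow = {"w": 0.0}
for step in range(10_000):
    noisy_live = {"w": 1.0 + (0.5 if step % 2 == 0 else -0.5)}
    ema_update(shadow, noisy_live)
```

With decay 0.999 the shadow effectively averages over roughly the last 1/(1 - 0.999) = 1000 steps, which is why it is much less sensitive to step-to-step noise than the live weights.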

Showcase

The samples below are all from V2, single-checkpoint, no post-processing, no upscaler. Just the model output as it comes out at 96×96.

These are cherry-picked but not unreachable — the reference seeds below reproduce these compositions consistently across GPUs (with minor bf16 numerical drift on different architectures).

Best sampling config

After a lot of seed sweeping, this is the config that gave the strongest results:

prompt:     1girl,portrait,long hair,<color> eyes,<color> hair
guidance:   1.8 – 2.5
DDIM steps: 50 – 200 (200 = sharper, 50 = faster)
seeds:      26, 56 (peak compositions on the reference checkpoint)

Higher guidance pushes tag adherence but can wash out colors. Stay in 1.8–2.5 for the cleanest results.
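
For intuition on what the guidance value does: classifier-free guidance is commonly written as eps = eps_uncond + g * (eps_cond - eps_uncond). The sketch below works on plain lists and assumes inference.py follows this common convention; the script's internals are not shown here, so treat that as an assumption. apply_guidance is an illustrative name.

```python
def apply_guidance(eps_uncond, eps_cond, guidance):
    """Combine unconditional and tag-conditioned noise predictions."""
    return [u + guidance * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for two "pixels".
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
guided = apply_guidance(eps_uncond, eps_cond, guidance=2.0)
```

Under this convention, guidance 1.0 returns the conditioned prediction unchanged, and larger values extrapolate further away from the unconditional one.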

How to use

pip install torch diffusers pillow
python inference.py --ckpt ckpt_e040.pt --prompt "1girl,portrait,long hair,red eyes" --seed 26

That's it. The script auto-detects EMA weights inside the checkpoint and uses them. Output saves to out/.

For batch sampling or scanning many seeds, see seed_sweep.py linked in the repo (generates a labeled grid of N seeds in one pass).
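
If seed_sweep.py is not to hand, a loop over the CLI gives a rough equivalent. This sketch uses only the flags that appear in this README (--ckpt, --prompt, --seed, --steps, --guidance). build_commands and run_sweep are hypothetical helpers; unlike seed_sweep.py, they reload the checkpoint for every seed and do not assemble a grid.

```python
import subprocess
import sys


def build_commands(prompt, seeds, ckpt="ckpt_e040.pt", steps=50, guidance=2.0):
    """Return one inference.py argument list per seed."""
    return [
        [sys.executable, "inference.py", "--ckpt", ckpt, "--prompt", prompt,
         "--seed", str(seed), "--steps", str(steps), "--guidance", str(guidance)]
        for seed in seeds
    ]


def run_sweep(prompt, seeds, **kwargs):
    """Run inference.py once per seed; stops at the first failing run."""
    for cmd in build_commands(prompt, seeds, **kwargs):
        subprocess.run(cmd, check=True)


# Example (run from the folder that contains inference.py and the checkpoint):
# run_sweep("1girl,portrait,long hair,red eyes", seeds=range(20, 30))
```

Whether each run writes a distinct file name into out/ depends on inference.py, so it is worth checking the first two outputs before launching a long sweep.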

Tag prompting

This model uses Danbooru-style tags, comma-separated. Exact matches only — no CLIP, no natural language understanding.

Use 1girl, not girl; red hair, not red hairs; blue eyes, not blue eye.
Stack 4–8 tags per prompt. Anchor with high-frequency tags first (1girl,portrait,long hair) then add specifics.
Common tags in the vocab: 1girl, portrait, long hair, short hair, blue eyes, red eyes, green eyes, purple eyes, pink eyes, red hair, blue hair, brown hair, white hair, pink hair, purple hair, green hair, smile, blush, looking at viewer, floral background, choker, closed mouth, bangs.
Print vocab after loading to see the full 512-tag list.
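
The exact-match behaviour can be pictured as a set lookup. This is a rough sketch, not the actual code in inference.py. match_tags is a hypothetical helper, VOCAB is a small subset of the common tags listed above, and the lowercasing and whitespace trimming are assumptions about normalization.

```python
VOCAB = {"1girl", "portrait", "long hair", "short hair", "red eyes",
         "blue eyes", "red hair", "smile", "looking at viewer"}


def match_tags(prompt, vocab=VOCAB):
    """Split a comma-separated prompt into (matched, ignored) tag lists."""
    tags = [t.strip().lower() for t in prompt.split(",") if t.strip()]
    matched = [t for t in tags if t in vocab]
    ignored = [t for t in tags if t not in vocab]
    return matched, ignored


matched, ignored = match_tags("1girl, portrait, red hairs, blue eye")
print("Matched tags:", matched)   # ['1girl', 'portrait']
print("Ignored:", ignored)        # ['red hairs', 'blue eye']
```

The two misspelled tags contribute nothing to the conditioning, which is why a prompt with typos can look like the model is ignoring the request.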

Limitations

Honest list:

96×96 only. Small. Pair with Real-ESRGAN or a similar upscaler for usable sizes. The model produces clean 96px output; upscaling is a separate concern.
Heavy female bias. Roughly 90% or more of the dataset is female anime characters. 1boy mostly produces feminine-coded outputs because the model rarely saw counter-examples.
Tags are exact-match. A typo is silently ignored, so always check the printed Matched tags line.
Hit rate ~20–25%. Generate 5–10 candidates per prompt and pick the best; this is common practice with small diffusion models. Don't judge from a single sample.
No safety checker. Anime faces, no real people. Use at your own discretion.
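
The 5–10 candidates figure follows from the hit rate. Assuming each sample is an independent draw with hit probability p (a simplification), the chance of at least one keeper in n samples is 1 - (1 - p)^n. p_at_least_one is an illustrative name.

```python
def p_at_least_one(p, n):
    """Probability that at least one of n independent samples is a hit."""
    return 1.0 - (1.0 - p) ** n


for p in (0.20, 0.25):
    for n in (1, 5, 10):
        print(f"p={p:.2f} n={n:2d} -> {p_at_least_one(p, n):.2f}")
```

At a 20% hit rate, 5 samples give roughly a 67% chance of at least one good image and 10 samples roughly 89%; at 25%, those figures are about 76% and 94%.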

What's in this repo

ckpt_e040.pt, ckpt_e045.pt, ckpt_e050.pt — EMA-weight checkpoints from the final stretch of training. e040 is the recommended default. Each ~270 MB.
inference.py — single-file CLI inference script
peak1.png ... peak6.png — direct model outputs, no upscaling
This README

Training details

Optimizer: AdamW, lr=1e-4 constant, betas=(0.9, 0.999), wd=1e-6
Noise scheduler: DDPM, 1000 timesteps, squaredcos_cap_v2 beta schedule
Loss: noise prediction with plain MSE (min-SNR weighting was dropped after V2 experiments showed it hurt at this scale)
CFG dropout: 10% null condition during training
Mixed precision: bf16 autocast
Batch size: 16
Schedule: 50 epochs × 1062 steps per epoch = 53,100 total steps (~53k)
EMA: decay 0.999, updated every step
Hardware: RTX 5080 16 GB, Windows 11, ~3 hours wall clock
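
For reference, squaredcos_cap_v2 is the name diffusers uses for the cosine beta schedule. Below is a pure-Python sketch of how those betas are commonly computed, assuming the usual offset s = 0.008 and a cap of 0.999 on each beta. alpha_bar and cosine_betas are illustrative names; this is not code from the training script.

```python
import math


def alpha_bar(t, s=0.008):
    """Cumulative signal level at normalized time t in [0, 1]."""
    return math.cos((t + s) / (1.0 + s) * math.pi / 2.0) ** 2


def cosine_betas(num_steps=1000, max_beta=0.999):
    """Per-step noise variances for the cosine schedule."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # beta is the fraction of remaining signal removed at this step,
        # capped so the final steps do not hit exactly 1.
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = cosine_betas()
print(len(betas), betas[0], betas[-1])
```

The schedule starts with very small betas and ends at the cap, so most of the image structure is destroyed late in the forward process.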

Acknowledgements

Dataset: puruchinera/anime-faces-256
Architecture: HuggingFace diffusers (UNet2DConditionModel, random init)
Predecessor: Anime-Elite-V1

Built solo across a couple of late nights. The hardest part wasn't the training — it was finding the right single change to make on top of V1 (turned out to be EMA, just EMA, nothing else). Hope the from-scratch approach is useful to anyone exploring small-diffusion territory.
