Anima Preview 3 SDNQ INT8 Diffusers Checkpoint

SDNQ int8 dynamic quantization of the Anima Preview 3 diffusion transformer, packaged as a full Diffusers pipeline. This is the fastest checkpoint measured in this comparison; the companion checkpoints are listed in the benchmark table below.

This repository is a separate, full Diffusers checkpoint for circlestone-labs/Anima Preview 3. The pipeline code and the non-transformer components are taken from the public Diffusers conversion CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers. The transformer/ component is the SDNQ-quantized diffusion transformer converted from WaveCut/Anima-Preview-3-SDNQ-int8.

Components

  • transformer/: SDNQ int8 quantized CosmosTransformer3DModel.
  • llm_adapter/: Anima LLM adapter required by the native Anima architecture.
  • text_encoder/: Qwen3 0.6B text encoder from the Diffusers conversion.
  • tokenizer/ and t5_tokenizer/: Qwen and T5 tokenizers used by the adapter pathway.
  • vae/: Qwen Image / Wan-style VAE used by Anima.
  • scheduler/: FlowMatchEulerDiscreteScheduler with shift 3.0.

Usage

Install current Diffusers/Transformers plus SDNQ support, then load the pipeline:

import torch
import sdnq  # importing sdnq registers the SDNQ quantization support needed to load the int8 weights
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "WaveCut/Anima-Preview-3-SDNQ-int8-diffusers",
    custom_pipeline="pipeline",  # use the custom pipeline.py shipped in this repo
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

prompt = "masterpiece, best quality, score_7, safe, 1girl, fern (sousou no frieren), purple hair, purple eyes, black robe, white dress, butterfly on hand, simple background, looking at viewer"
negative_prompt = "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, artist name"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=30,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(424242),  # fixed seed for reproducibility
).images[0]
image.save("anima-int8.png")

Because the Anima pipeline is custom code, pass custom_pipeline="pipeline"; trust_remote_code=True allows Diffusers to load pipeline.py from this repo.

Prompting

Anima was trained on Danbooru-style tags, natural-language captions, and mixtures of both. The upstream Anima Preview 3 card recommends generating at roughly 1 megapixel, for example 1024x1024, 896x1152, or 1152x896, with roughly 30-50 steps and a CFG scale of 4-5.
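The ~1 megapixel recommendation can be turned into a list of candidate resolutions programmatically. The sketch below is our own helper (the name mp_resolutions and the multiple-of-64 side constraint are assumptions, motivated by the fact that all three recommended shapes have sides divisible by 64):

```python
def mp_resolutions(target_px=1024 * 1024, step=64, tolerance=0.15):
    """Enumerate (width, height) pairs near target_px total pixels,
    with both sides a multiple of `step` (assumed constraint)."""
    sizes = []
    for w in range(512, 2048 + 1, step):
        for h in range(512, 2048 + 1, step):
            if abs(w * h - target_px) / target_px <= tolerance:
                sizes.append((w, h))
    return sizes

pairs = mp_resolutions()
# The three upstream-recommended shapes are among the candidates.
assert (1024, 1024) in pairs and (896, 1152) in pairs and (1152, 896) in pairs
```

Tightening `tolerance` narrows the list toward exactly 1 MP; widening it admits more extreme aspect ratios.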

Recommended positive prefix:

masterpiece, best quality, score_7, safe,

Recommended negative prompt:

worst quality, low quality, score_1, score_2, score_3, artist name

Use lowercase tags with spaces instead of underscores, except score tags such as score_7. For artist tags, prefix the artist with @.

1024x1024 Comparison Grid

Five prompt/seed pairs were generated with the original BF16 Diffusers checkpoint, the companion UINT4 checkpoint, and this INT8 checkpoint. The source JPEG is 3572x5576; every generated cell is exactly 1024x1024 and pasted 1:1 with no resizing.

[Grid image: Anima Original BF16 vs SDNQ UINT4 and INT8, 1024x1024]

Prompt IDs and seeds are printed in the left column of the grid. Raw benchmark data is available in benchmarks/benchmark_results_1024.json.

Benchmark

Measured on an RTX 5090 32GB with torch 2.8.0+cu128, diffusers 0.38.0, transformers 5.8.1, sdnq 0.1.8, torch.bfloat16, 24 steps, CFG 4.0, and 1024x1024 output. Network download is excluded. Each model was loaded in a separate process; one 1024x1024 warm-up image was discarded, then five prompt/seed pairs were measured. VRAM was sampled with nvidia-smi every 50 ms.
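The 50 ms VRAM sampling described above can be reproduced with a small poller. The helper below is our own sketch, not the benchmark script itself; it shells out to nvidia-smi with standard query flags, and the command is injectable so the parsing can be exercised without a GPU:

```python
import subprocess

def read_vram_mib(cmd=("nvidia-smi", "--query-gpu=memory.used",
                       "--format=csv,noheader,nounits")):
    """Return the highest per-GPU memory.used value (MiB) reported by cmd."""
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    return max(int(tok) for tok in out.split() if tok.isdigit())

# Sampling-loop sketch: call read_vram_mib() every 50 ms from a background
# thread while the pipeline generates, and keep the running maximum as the
# peak-VRAM figure.
```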

| Model | Repo size | Load time | Mean generation | Speed vs original | VRAM after load | Peak VRAM while generating |
| --- | --- | --- | --- | --- | --- | --- |
| Original BF16 (CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers) | 5.3 GiB | 10.04 s | 6.37 s/img | 1.00x | 6005 MiB | 10759 MiB |
| SDNQ UINT4 (WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers) | 2.7 GiB (-49.1%) | 11.96 s | 6.13 s/img | 1.04x (+3.9%) | 3285 MiB (-45.3%) | 8157 MiB (-24.2%) |
| SDNQ INT8 (WaveCut/Anima-Preview-3-SDNQ-int8-diffusers) | 3.5 GiB (-34.1%) | 22.41 s | 4.60 s/img | 1.38x (+38.4%) | 4111 MiB (-31.5%) | 8961 MiB (-16.7%) |

Quant-to-quant tradeoff in this run: UINT4 is 22.7% smaller than INT8 and uses 826 MiB less VRAM after load plus 804 MiB less peak generation VRAM. INT8 is 1.33x faster than UINT4 on this RTX 5090 setup.
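As a quick sanity check, the quoted speed ratios follow directly from the per-image times in the table:

```python
# Mean seconds per image, taken from the benchmark table above.
bf16, uint4, int8 = 6.37, 6.13, 4.60

# Speedups relative to the BF16 baseline.
assert round(bf16 / int8, 2) == 1.38   # INT8 vs BF16
assert round(bf16 / uint4, 2) == 1.04  # UINT4 vs BF16

# INT8 vs UINT4 throughput.
assert round(uint4 / int8, 2) == 1.33
```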

Notes

The original Anima split checkpoint is a ComfyUI-native model with a Qwen3 text encoder and a learned LLM adapter. Earlier transformer-only exports that load the checkpoint directly as CosmosTransformer3DModel ignore the llm_adapter.* weights; this repo keeps the adapter and full pipeline structure so generation follows the Anima architecture.

License follows the upstream Anima/CircleStone non-commercial license and the NVIDIA Cosmos derivative terms referenced by the upstream model card.
