atlas-nvfp4-paged-attention
Paged-KV attention kernels for the full-attention layers of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).
Ops
| Op | KV format | Use |
|---|---|---|
| `paged_decode_attn_bf16` | BF16 | Reference / debugging |
| `paged_decode_attn_fp8` | FP8 E5M2 + scales | Mainline FP8 deployment |
| `paged_decode_attn_nvfp4` | Block-scaled E2M1 | NVFP4 deployment |
| `rms_norm` | BF16 | Pre-attention / pre-FFN norm |
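To make concrete what a paged decode op computes, here is a minimal pure-Python sketch of single-head decode attention over a paged KV cache. It is a reference for the math only: the function names, the `block_table` layout, and the single-head simplification are illustrative assumptions, not the interface of the kernels listed above.

```python
import math

def softmax_weighted_sum(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(len(values[0]))]

def paged_decode_attn_ref(q, k_pages, v_pages, block_table, seq_len, block_size):
    """Gather K/V rows through the block table, then attend (hypothetical helper)."""
    keys, values = [], []
    for pos in range(seq_len):
        page = block_table[pos // block_size]        # logical block -> physical page
        slot = pos % block_size                      # row inside that page
        keys.append(k_pages[page][slot])
        values.append(v_pages[page][slot])
    return softmax_weighted_sum(q, keys, values)

def scatter_to_pages(rows, block_table, block_size, num_pages):
    """Place contiguous rows into physical pages according to the block table."""
    head_dim = len(rows[0])
    pages = [[[0.0] * head_dim for _ in range(block_size)] for _ in range(num_pages)]
    for pos, row in enumerate(rows):
        pages[block_table[pos // block_size]][pos % block_size] = list(row)
    return pages

# Toy sizes: 5 tokens, head_dim 4, 2 tokens per page, pages stored out of order.
block_size, num_pages, seq_len = 2, 3, 5
block_table = [2, 0, 1]
q = [1.0, 0.0, -1.0, 0.5]
keys = [[0.1 * (t + d) for d in range(4)] for t in range(seq_len)]
values = [[float(4 * t + d) for d in range(4)] for t in range(seq_len)]

k_pages = scatter_to_pages(keys, block_table, block_size, num_pages)
v_pages = scatter_to_pages(values, block_table, block_size, num_pages)

paged_out = paged_decode_attn_ref(q, k_pages, v_pages, block_table, seq_len, block_size)
dense_out = softmax_weighted_sum(q, keys, values)
```

The paged and contiguous results agree because the block table only changes where each row lives, not which rows are attended to. The quantized variants would additionally dequantize each gathered row before the dot product.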
Prefill counterparts (`inferspark_prefill_paged*`) are compiled into the
shared object, but their Torch bindings are not exposed yet. They are planned
for the next iteration, once chunked-prefill scheduling is settled.
Hardware
GB10 only (sm_121f, compute capability 12.1).
- The NVFP4 path uses Atlas's software E2M1 conversion, since `cvt.rn.satfinite.e2m1x2.f32` is missing on SM121.
- `block_size=16` and `head_dim=256` are the layouts that ship today.
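For readers unfamiliar with what a software E2M1 conversion has to do, the sketch below emulates round-to-nearest-even with finite saturation in plain Python, plus a simple per-group scale. It is not Atlas's implementation. The helper names, the nibble packing order, the 16-element scaling group, and the use of a Python float for the scale are all assumptions made for illustration; NaN inputs are not handled.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit),
# indexed by the low 3 bits of the 4-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def e2m1_encode(x):
    """Round a float to a 4-bit E2M1 code: nearest value, ties to even, saturating."""
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), 6.0)                     # saturate to the largest finite value
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # On a tie, prefer the code whose mantissa bit (LSB) is 0.
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best

def e2m1_decode(code):
    mag = E2M1_VALUES[code & 0x7]
    return -mag if code & 0x8 else mag

def quantize_group(xs):
    """Quantize an even-length group with one shared scale; two codes per byte."""
    amax = max(abs(x) for x in xs)
    scale = amax / 6.0 if amax > 0 else 1.0    # map the largest magnitude to 6.0
    codes = [e2m1_encode(x / scale) for x in xs]
    # Packing order (first element in the low nibble) is an arbitrary choice here.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return scale, packed

def dequantize_group(scale, packed):
    out = []
    for byte in packed:
        out.append(e2m1_decode(byte & 0xF) * scale)
        out.append(e2m1_decode(byte >> 4) * scale)
    return out

# A 16-element group whose values are exactly representable at scale 1.0.
group = [6.0, -3.0, 1.5, 0.0] * 4
scale, packed = quantize_group(group)
restored = dequantize_group(scale, packed)
```

Sixteen values pack into eight bytes plus one scale. In this sketch the scale stays a Python float, whereas the NVFP4 format stores block scales in FP8.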
Models tested
| Model | Attention layers | Heads (Q:KV) | Head dim |
|---|---|---|---|
| Qwen/Qwen3.6-27B | 16 | 24:4 | 256 |
| Qwen/Qwen3.6-35B-A3B | 10 | 16:2 | 256 |
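The table above is enough to estimate the KV-cache payload per token for each format. The sketch below assumes the "Attention layers" column is the number of full-attention layers that hold a paged KV cache, and it counts element payload only: the scale factors of the FP8 and NVFP4 formats are excluded because their granularity is not stated here. The function name is hypothetical.

```python
def kv_payload_bytes_per_token(attn_layers, kv_heads, head_dim, bits_per_element):
    """Bytes of K and V stored per token, ignoring any scale factors."""
    elements = 2 * attn_layers * kv_heads * head_dim   # factor 2: one K row, one V row
    return elements * bits_per_element // 8

# (attention layers, KV heads, head dim) taken from the table above.
MODELS = {
    "Qwen/Qwen3.6-27B": (16, 4, 256),
    "Qwen/Qwen3.6-35B-A3B": (10, 2, 256),
}
FORMAT_BITS = {"bf16": 16, "fp8": 8, "nvfp4": 4}

payload = {
    name: {fmt: kv_payload_bytes_per_token(*shape, bits)
           for fmt, bits in FORMAT_BITS.items()}
    for name, shape in MODELS.items()
}

for name, per_format in payload.items():
    print(name, per_format)
```

Under these assumptions, Qwen/Qwen3.6-27B stores 32,768 K/V elements per token, which is 64 KiB in BF16, 32 KiB in FP8, and 16 KiB in NVFP4 before scales.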
License
AGPL-3.0-only.
- OS: linux
- Arch: aarch64