atlas-nvfp4-dense-gemm
Dense NVFP4/FP8 GEMM kernels for the attention projections and dense FFN of Qwen3.6-27B on NVIDIA GB10 (DGX Spark, SM121).
Ops
| Op | Use |
|---|---|
| w4a16_gemm | NVFP4 weight × BF16 activation (standard layout) |
| w4a16_gemm_t | NVFP4 weight × BF16 activation (transposed B) |
| predequant_nvfp4_to_fp8 | Materialize NVFP4 weight as FP8 E4M3 |
| fp8_gemm_t | BF16 activation × FP8 weight (transposed B) |
| bf16_to_fp8 | BF16 → FP8 E4M3 pair-wise conversion |
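
As a rough reference for what the w4a16 ops compute, the sketch below dequantizes NVFP4 weights in pure Python and runs a naive GEMM. It relies on the published NVFP4 format (4-bit E2M1 elements, one scale per 16-element block, plus a per-tensor scale). Everything else is an assumption made for illustration and not taken from these kernels: the function names, the nibble order inside a byte, the `[N, K]` weight layout standing in for "transposed B", and the use of already-decoded float scales.

```python
# Reference semantics only; this is NOT the kernels' API or memory layout.

# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # number of NVFP4 elements that share one block scale


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_row(packed, block_scales, tensor_scale=1.0):
    """Dequantize one packed weight row of length 2 * len(packed).

    packed: bytes with two codes per byte
            (ASSUMPTION: the low nibble holds the earlier element).
    block_scales: one already-decoded float scale per 16 elements.
    """
    out = []
    for i, byte in enumerate(packed):
        for j, code in enumerate((byte & 0xF, byte >> 4)):
            idx = 2 * i + j
            scale = block_scales[idx // BLOCK] * tensor_scale
            out.append(decode_e2m1(code) * scale)
    return out


def w4a16_gemm_t_reference(a, packed_w, scales, tensor_scale=1.0):
    """Naive C[m][n] = sum_k A[m][k] * W[n][k].

    ASSUMPTION: "transposed B" means W is stored as N packed rows of length K.
    """
    w = [dequant_row(row, s, tensor_scale) for row, s in zip(packed_w, scales)]
    return [[sum(x * y for x, y in zip(a_row, w_row)) for w_row in w]
            for a_row in a]
```

A reference like this is mainly useful for checking a kernel's output on small shapes; it makes no attempt to model BF16 rounding of the activations or accumulator precision.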
Hardware
GB10 only (sm_121f, compute capability 12.1). Tile shapes
M_TILE=64, N_TILE_SM=64, N_TILE_LG=128 are tuned for the 27B layout
(hidden=5120, intermediate=17408, head_dim=256).
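
The arithmetic behind those tile shapes can be sketched as follows. The constants come from the text above; the `tile_grid` helper is hypothetical, and which N tile a given kernel picks for a given projection is not stated here, so both are evaluated.

```python
# Tile-count arithmetic only; the kernels' actual launch logic is not shown here.
M_TILE, N_TILE_SM, N_TILE_LG = 64, 64, 128
HIDDEN, INTERMEDIATE = 5120, 17408


def ceil_div(a, b):
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def tile_grid(m, n, n_tile):
    """Return (row_tiles, col_tiles) needed to cover an M x N output."""
    return ceil_div(m, M_TILE), ceil_div(n, n_tile)


if __name__ == "__main__":
    for n in (HIDDEN, INTERMEDIATE):
        for n_tile in (N_TILE_SM, N_TILE_LG):
            rows, cols = tile_grid(1, n, n_tile)
            print(f"N={n} n_tile={n_tile}: {rows} x {cols} tiles, "
                  f"even split: {n % n_tile == 0}")
```

Both 5120 and 17408 are exact multiples of 64 and 128, so neither N tile leaves a partial tile along the output dimension for these sizes.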
Model tested
| Model | Hidden | Intermediate | Heads (Q:KV) |
|---|---|---|---|
| Qwen/Qwen3.6-27B | 5120 | 17408 | 24:4 |
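
For orientation, the dimensions above imply the following per-layer GEMM shapes, assuming a conventional grouped-query attention block and a gated FFN with separate projections. The projection names are the usual Hugging Face style names and are hypothetical here; the real checkpoint may fuse projections or add gating that changes these sizes.

```python
# (in_features, out_features) per projection under a conventional layout.
# ASSUMPTION: unfused q/k/v/o and gate/up/down projections; names are illustrative.
HIDDEN, INTERMEDIATE, HEAD_DIM = 5120, 17408, 256
Q_HEADS, KV_HEADS = 24, 4


def projection_shapes():
    q_out = Q_HEADS * HEAD_DIM    # 24 * 256 = 6144
    kv_out = KV_HEADS * HEAD_DIM  # 4 * 256 = 1024
    return {
        "q_proj": (HIDDEN, q_out),
        "k_proj": (HIDDEN, kv_out),
        "v_proj": (HIDDEN, kv_out),
        "o_proj": (q_out, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }


if __name__ == "__main__":
    for name, (k, n) in projection_shapes().items():
        print(f"{name}: K={k}, N={n}")
```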
License
AGPL-3.0-only.
- OS: linux
- Arch: aarch64