Possible runtime-side KV pool accounting issue at native-max context

#6
by Mosai-Sys - opened

Hi, thanks again for publishing this release. I built on top of it while testing very long-context behavior on a single RTX 5090, and I wanted to share a small runtime-side issue I ran into in case it is useful.

This does not appear to be a model-weight issue. The problem showed up in the runtime-side KV pool accounting when pushing the model to native-max context. In the failing case, the global KV pool needed 16384 blocks but exposed only 16383 free blocks, so a 262144-token request was rejected even though it should have fit.

The root cause appears to be that the one reserved null block per physical pool was not being compensated for in the pool sizing logic. After adding that compensation, the same 262144-token case became admissible and completed successfully.
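To make the off-by-one concrete, here is a rough sketch of the accounting described above. All function names are hypothetical (the real logic lives in vLLM's KV cache utilities), and a block size of 16 tokens is assumed, which makes 262144 tokens come out to exactly 16384 blocks:

```python
def required_blocks(total_tokens: int, block_size: int = 16) -> int:
    # Blocks needed for the request, rounding up any partial block.
    return -(-total_tokens // block_size)

def free_blocks(pool_blocks: int, null_blocks_per_pool: int = 1) -> int:
    # The allocator reserves one null block per physical pool,
    # so the usable count is one less than the pool size.
    return pool_blocks - null_blocks_per_pool

# Buggy sizing: pool is sized to exactly the required block count,
# so the reserved null block eats one usable block.
need = required_blocks(262144)       # 16384
buggy = free_blocks(need)            # 16383 -> request rejected

# Fixed sizing: compensate for the reserved null block up front.
fixed = free_blocks(need + 1)        # 16384 -> request admitted
```

This is only an illustration of the shortfall, not the actual patch; the real fix adjusts the pool sizing path so the compensation happens once per physical pool.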

So I would describe this as a runtime-side hybrid KV / global-pool accounting edge case rather than a problem with the model itself. I thought it might be helpful to mention in case you are doing further runtime work around long-context support for this release.

Happy to share the patch or a clean diff if useful.


Hi @Mosai-Sys , I'm running Gemma 4 31B on a single RTX 5090D and have been hitting similar KV pool issues. I would really appreciate it if you could share the patch or a clean diff. Thanks!

I'm running vLLM 0.19.1 with the AWQ version.

Hi @clayboby , I would treat this as a vLLM runtime allocator issue, not a LilaRest/gemma-4-31B-it-NVFP4-turbo model issue.

The symptom I hit was a native-max rejection where the global KV pool was short by exactly one block (16384 required vs 16383 free), even though the request should have fit.

My exact patch was against a newer hybrid multi-pool allocator path, so it may not apply verbatim to a plain vLLM 0.19.1 AWQ tree. But if your failure is the same one-block shortfall at native max, it is probably the same bug class. As a temporary workaround, reduce max_model_len slightly. If you share the allocator log or your kv_cache_utils.py path, I can tell you whether the same fix logic should apply.
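For the workaround, the idea is just to back off the context length by at least one KV block so the one-block shortfall no longer matters. A minimal sketch, assuming a 16-token block size (the concrete numbers are for the 262144 native-max case above):

```python
block_size = 16          # assumed KV cache block size in tokens
native_max = 262144      # native-max context for this model

# Back off by one full block so the pool's one-block shortfall
# can no longer cause a rejection at the context limit.
workaround_max_model_len = native_max - block_size  # 262128
```

You would then pass that value via `--max-model-len` when launching the server (or `max_model_len` in the Python API), e.g. `--max-model-len 262128`.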

Mosai-Sys changed discussion status to closed
