
Yeah, there's no way to run the full 4.7 on 32 GB of VRAM. This Flash release is something I'm also waiting to try later tonight.


Why not? Run it with the latest vLLM and enable 4-bit quantization with bitsandbytes; it will quantize the original safetensors on the fly and fit in your VRAM.
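A rough sketch of what that invocation could look like (the `--quantization bitsandbytes` / `--load-format bitsandbytes` flags are from vLLM's bitsandbytes support; the Flash model repo name here is a guess, swap in the actual HF path):

```shell
# Serve the model with on-the-fly 4-bit bitsandbytes quantization.
# vLLM loads the original bf16 safetensors and quantizes weights at load time,
# so no pre-quantized checkpoint (GGUF/AWQ/GPTQ) is needed.
vllm serve zai-org/GLM-4.7-Flash \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --max-model-len 40960
```

On older vLLM versions both flags were required together; newer releases infer the load format from the quantization method.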


Because of how huge GLM 4.7 is: https://huggingface.co/zai-org/GLM-4.7


Except this is GLM 4.7 Flash, which has 32B total params, 3B active. It should fit with a decent context window of ~40k in 20 GB of VRAM at 4-bit weight quantization, and you can save even more by quantizing the activations and KV cache to 8-bit.
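Back-of-envelope math behind that claim (weights only; KV cache, activations, and framework overhead come on top, which is roughly where the remaining ~4 GB goes):

```python
# Rough VRAM needed just for model weights at a given quantization width.
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """GB of memory for `params_billion` billion params at `bits_per_weight` bits each."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 32B params at 4-bit -> 16 GB of weights, leaving headroom for
# KV cache and activations inside a 20 GB budget.
print(weight_gb(32, 4))   # 16.0
# Quantizing the KV cache to 8-bit halves its footprint vs fp16,
# which is what buys the larger context window.
```

Note this ignores per-tensor quantization overhead (scales/zero-points), which adds a few percent in practice.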


Yes, but the parent link was to the big GLM 4.7, which had a bunch of GGUFs; the new one didn't at the time of posting, nor does it now. I'm waiting on the Unsloth guys for 4.7 Flash.



