Quantization for 8GB GPU: Which Levels Actually Work and Which Ones Lie

You loaded the model, generation started, and everything looked fine. Twenty minutes into the session it dropped to 2 tokens per second. The VRAM number said it would fit. The quant level matched what the guide recommended. Nothing in the terminal told you something was wrong until the output was crawling. This is the specific failure mode that 8GB GPU owners run into, and it is not a configuration problem. It is a quant level choice that looked correct at load time and fell apart under session pressure. This post covers which quantization levels actually hold up on 8GB under real conditions and which ones only appear to work until they don’t.

What 8GB Actually Gives You to Work With

The number on the box is not the number you have available for inference. Before a model loads, your GPU has already committed memory to the CUDA context, driver overhead, and any background processes touching the card. On a typical Windows system with an RTX 4060 or RTX 3060 Ti 8GB, that overhead runs between 300MB and 700MB before Ollama or llama.cpp touches a single model weight. Some systems run higher depending on display configuration and driver version.

The KV cache is the second claim on your VRAM, and it is the one most guides leave out of the math. The KV cache stores key and value vectors for every token in the active context so the model does not have to recompute the full conversation history with each new token. It grows with every token generated and with every token in the input prompt. At 2K context on a 7B model, the KV cache adds roughly 500MB to 700MB of VRAM consumption on top of the model weights. At 8K context, that number grows to 2GB or more depending on the model architecture. On an 8GB card, that is not a footnote. That is the difference between a session that runs cleanly and one that starts spilling to system RAM halfway through a conversation.
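To see where numbers like these come from, here is a minimal sketch of the raw cache arithmetic. The layer count, head dimension, and KV head counts are assumed values for a generic 7B-class model, not measurements from any specific runtime, and the 2-byte element size assumes an unquantized fp16 cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Raw KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed shapes: 32 layers, head_dim 128. The number of KV heads is the
# architecture-dependent part: 32 for full multi-head attention, 8 for a
# grouped-query model.
for label, n_kv_heads in (("multi-head (32 KV heads)", 32),
                          ("grouped-query (8 KV heads)", 8)):
    for n_ctx in (2048, 8192):
        gib = kv_cache_bytes(32, n_kv_heads, 128, n_ctx) / 2**30
        print(f"{label:28s} ctx={n_ctx:5d}  ~{gib:.2f} GiB")
```

Under these assumptions the raw cache spans about 0.25 GiB to 1 GiB at 2K context and 1 GiB to 4 GiB at 8K, depending on how many KV heads the architecture keeps. Runtimes typically allocate working buffers on top of this, so treat the result as a floor rather than a prediction.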

The practical inference budget on an 8GB card is not 8GB. After overhead and a working KV cache at moderate context, you are operating with roughly 5.5GB to 6.5GB of usable headroom for model weights. Every quant level decision starts from that number, not from the number on the spec sheet.
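The budget arithmetic is simple enough to write down. In this sketch the overhead bounds come from the figures above, while the two KV cache allowances are assumptions picked to stand in for "moderate context"; swap in your own numbers.

```python
def weight_budget_gb(total_gb, overhead_gb, kv_cache_gb):
    """VRAM left for model weights after fixed overhead and the KV cache."""
    return total_gb - overhead_gb - kv_cache_gb


# Overhead of 0.3 to 0.7 GB is from the text; the 1.2 and 1.8 GB KV cache
# allowances are illustrative assumptions, not measurements.
best = weight_budget_gb(8.0, overhead_gb=0.3, kv_cache_gb=1.2)
worst = weight_budget_gb(8.0, overhead_gb=0.7, kv_cache_gb=1.8)
print(f"weight budget: {worst:.1f} GB to {best:.1f} GB")
```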

The Quant Levels That Work Cleanly on 8GB

Q4_K_M on a 7B or 8B model is where 8GB hardware actually has room to operate. Mistral 7B at Q4_K_M loads at approximately 4.4GB. Qwen 3.5 8B at Q4_K_M loads at approximately 5GB. Both numbers leave genuine headroom for the KV cache to grow across a long session without hitting the VRAM ceiling. Generation speed on these loads sits at 15 to 25 tokens per second on cards like the RTX 4060 and RTX 3060 Ti, which is fast enough to feel responsive for chat, coding assistance, and document work.

The reason Q4_K_M works here is not just that it fits. It fits with margin, and that margin is what keeps generation speed stable as the context grows. A model that loads at 4.4GB on an 8GB card leaves about 3.6GB, or roughly 3GB of working space for the KV cache once overhead is subtracted. At normal session lengths and context windows, that 3GB buffer is enough that VRAM never fully saturates. Generation speed stays consistent from the first token to the last token in a long session. That consistency is what the number on the spec sheet does not tell you and what the VRAM calculator cannot predict for your specific session behavior.

Q4_K_S on the same 7B models saves a small amount of VRAM compared to Q4_K_M, typically under 200MB, at a measurable quality cost. On 8GB where Q4_K_M already fits cleanly, Q4_K_S is not a meaningful optimization. The headroom it recovers does not change what the card can run. If Q4_K_M fits, use Q4_K_M.

The Quant Levels That Work Dangerously on 8GB

Q4_K_M on a 14B model loads at approximately 8.5GB. On an 8GB card, that number already exceeds the available VRAM before overhead and the KV cache enter the equation. What actually happens is partial CPU offload: the layers that do not fit in VRAM get pushed to system RAM, and inference runs at a mix of GPU and CPU speed. The model appears to load. Generation starts. For short prompts and brief responses, it may feel acceptable.

The failure mode arrives when context grows. As the KV cache expands with each exchange, more of the model gets pushed to CPU offload to make room. Generation speed does not drop cleanly from 15 tokens per second to 10 tokens per second. It drops from 10 to 5 to 2 as more layers cross the VRAM boundary. By the time a session has accumulated 3K to 4K tokens of context, a 14B model at Q4_K_M on 8GB is generating slowly enough to be practically unusable for interactive work. The model loaded. It never worked the way a model that actually fits works.
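A rough way to picture partial offload is to ask how many layers fit in the VRAM that is left for weights. This is a sketch under loud assumptions: it treats every layer as the same size, ignores embeddings and runtime buffers, and the 48-layer count and 6.0GB budget are illustrative values, not figures for a specific model.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, weight_budget_gb):
    """Estimate how many equally sized layers fit in the VRAM left for weights."""
    per_layer_gb = weights_gb / n_layers
    return min(n_layers, math.floor(weight_budget_gb / per_layer_gb))


# 8.5 GB is the 14B Q4_K_M load quoted above; 48 layers and a 6.0 GB
# weight budget are assumptions for illustration.
n_layers = 48
on_gpu = gpu_layers_that_fit(8.5, n_layers, 6.0)
print(f"{on_gpu} of {n_layers} layers on GPU, {n_layers - on_gpu} on CPU")
```

In llama.cpp the number of layers placed on the GPU is what the `--n-gpu-layers` (`-ngl`) flag controls. Shrink the weight budget to make room for a larger KV cache and the estimate drops with it, which is the same squeeze described above.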

Q5_K_M on a 7B model is a different version of the same trap. A 7B model at Q5_K_M loads at approximately 5GB, which technically fits on 8GB with overhead. The problem is that Q5_K_M on 7B costs you context headroom, and you almost certainly need that headroom more than the quality improvement Q5 provides over Q4_K_M on a 7B model. At Q4_K_M, the same 7B model left you 3GB of working space for KV cache. At Q5_K_M, that working space shrinks to roughly 2.4GB. The quality difference between Q4_K_M and Q5_K_M on a 7B model for general chat and writing is not meaningful enough to justify surrendering about a fifth of your session headroom. The quantization comparison between Q4, Q5, and Q8 covers this tradeoff in detail.

The Quant Levels That Do Not Work on 8GB

Q8_0 on a 7B model loads at approximately 7.7GB. On a theoretical 8GB budget that number is close. On a real 8GB card with overhead already consuming 300MB to 700MB, it does not fit entirely in VRAM. What loads is a model in partial offload that generates at CPU-bound speeds. Q8 on 8GB hardware is not a quality upgrade in practice. Whatever quality edge it holds over Q4_K_M arrives at a fraction of the generation speed, while Q4_K_M gets close at a fraction of the VRAM cost.

Q5_K_M on a 14B model needs approximately 10GB to 11GB. It will not fit in VRAM on an 8GB card. It will offload aggressively to system RAM and generate at speeds that make it impractical for any interactive use. If you have ever pulled a 14B model at Q5 on 8GB hardware and watched Ollama or llama.cpp hang for 30 seconds before producing output, you have already run this experiment. The terminal output looks normal. The VRAM reading is not the whole story. The moment system RAM becomes part of the inference path, the generation speed tells you what happened even if the logs do not.
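If generation speed is the signal, it helps to measure it instead of eyeballing it. Ollama's generate and chat responses report `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) in the final response object. The sketch below turns those into tokens per second; the two sample responses are made-up values for illustration, not measurements.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


# Hypothetical final-response fields from early and late in one session.
early = {"eval_count": 200, "eval_duration": 10_000_000_000}
late = {"eval_count": 200, "eval_duration": 100_000_000_000}

print(f"early: {tokens_per_second(early):.1f} tok/s")
print(f"late:  {tokens_per_second(late):.1f} tok/s")
```

Logging this number per exchange makes the slide from comfortable to crawling visible as it happens, rather than twenty minutes in.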

The distinction that matters is this: “it loaded” and “it works” are not the same statement on 8GB hardware. A model that loaded with partial CPU offload is not running at the speed or consistency that a model fully resident in VRAM produces. The numbers in the terminal at idle do not reflect what happens when the KV cache starts growing and VRAM competition increases. For a deeper look at why inference speed degrades in these conditions, the breakdown in LLM inference explained covers what is actually happening in memory during a session.

The Context Length Factor Most 8GB Guides Skip

Every VRAM estimate you will find for a quantized model is a load-time number. It tells you how much VRAM the model weights consume when the model initializes. It does not account for what happens to VRAM consumption as the session runs. The KV cache is the variable that load-time numbers do not capture, and on 8GB hardware it is the variable that determines whether a session stays stable.
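One way to get past load-time numbers is to read VRAM usage while the session is running. The sketch below assumes an NVIDIA card with `nvidia-smi` on the PATH and reads only the first GPU; the function names are illustrative.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, without headers or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) into a pair of ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def read_vram():
    """Query the first GPU; raises if nvidia-smi is missing or fails."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[0])
```

Calling `read_vram()` right after load and again a few thousand tokens into a conversation shows how much of the card the session is actually claiming, which is the number the load-time estimate leaves out.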

At 2K context on Mistral 7B at Q4_K_M, the KV cache adds roughly 500MB on top of the 4.4GB base load, bringing total VRAM consumption to approximately 5GB. That leaves over 2GB of headroom on an 8GB card after overhead, which is enough to run a long session without VRAM pressure. At 8K context on the same model, the KV cache grows to approximately 2GB, bringing total consumption to around 6.5GB. The headroom is getting thin but the session is still fully in VRAM on most 8GB cards.

The same math on a 14B model at Q4_K_M starts from a base load that already exceeds 8GB. There is no headroom for context growth. The first few exchanges in a session run at partial offload speeds. As the context grows, the KV cache competes for the limited VRAM the card can actually provide to the model, and the degradation accelerates. The practical fix when VRAM is tight on 8GB is to explicitly set a lower context window in Ollama or llama.cpp rather than relying on the model default. Reducing the context window from 8K to 2K frees roughly 1.5GB of VRAM headroom on a 7B model, which can make the difference between a stable session and one that degrades. This is a deliberate tool for 8GB owners, not a last resort. How inference speed and memory bandwidth interact across different session conditions is covered in the LLM inference speed guide.
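Setting the context window explicitly looks something like this against Ollama's HTTP API, where `num_ctx` is passed under `options`. This is a minimal sketch: it assumes a local Ollama server on its default port, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address


def build_request(model, prompt, num_ctx=2048):
    """Build a non-streaming generate request with an explicit context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(model, prompt, num_ctx=2048):
    """Send the request; needs a running Ollama server with the model pulled."""
    body = json.dumps(build_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The same setting can be baked into an Ollama Modelfile with `PARAMETER num_ctx 2048`, and in llama.cpp it is the `-c` (`--ctx-size`) flag.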

The Actual Decision for 8GB Owners

The rule for 8GB is not complicated once the session-behavior math is clear. Q4_K_M on a well-chosen 7B or 8B model is the standard. Everything else requires you to give something up, and most of the time what you give up costs more than what you gain.

Q5_K_M on 7B is available if the quality improvement on a specific task is worth surrendering context headroom. It is the right call for coding work where you have confirmed Q4_K_M is producing errors, and you are running short sessions where context accumulation is not the constraint. It is not the right call for general use where session length is unpredictable.

Q4_K_M on 14B is available if you need the model capability and are running short, focused prompts with low context requirements. It is not a substitute for a 14B model that actually fits. If you regularly need 14B model quality for sustained work, the correct answer is a 12GB card, not a configuration adjustment on 8GB hardware. The full breakdown of which models fit cleanly at each VRAM tier is in the local AI model guide.

Q8 on 8GB is not a real option for anything above a 3B model. The numbers do not work in VRAM, and a partially offloaded Q8 model generates more slowly than a fully resident Q4_K_M model while consuming more of the VRAM you have. The tradeoff moves in the wrong direction on every axis.

The LLM quantization hub covers the full picture of what quantization does across VRAM tiers beyond 8GB for readers who want to understand how the decision changes as hardware improves. For readers on even tighter hardware, the 4GB VRAM quantization breakdown covers what is actually possible at the floor most guides pretend does not exist.