What is quantization? How AI models get smaller without getting much worse

Quantization is what lets a 70B model fit on consumer hardware. What it actually is, the math in one paragraph, the methods that matter (GPTQ, AWQ, GGUF, bitsandbytes, FP8), what you lose, and when to care.

By Kenji Tanaka, Insightful AI Desk

A modern language model is mostly a very large table of numbers. Llama 3 70B has 70 billion of them. Each one of those numbers, in the version the researchers shipped, occupies 16 bits of memory. Multiply it out and the model takes about 140 gigabytes of GPU RAM just to load — before you have served a single request, before any user has typed anything, before any of the actual compute that the GPU is built to do has started. That memory cost is the reason a 70-billion-parameter model historically could not run on a laptop, why a research lab had to buy a node of eight 80-GB GPUs to host it, and why a startup with one consumer-grade card had no shot at self-hosting frontier-class open weights.

And yet, in 2026, Llama 3 70B does run on consumer hardware. At 4 bits it runs on a MacBook with 64 GB of unified memory, and with more aggressive quantization or partial CPU offload it runs on a single consumer GPU. The same 70-billion-parameter table of numbers fits in something like a quarter of the space it originally needed, and on most benchmarks the smaller version is within one or two percentage points of the original. The technique that did this, compressing the table without throwing the model away, is quantization. It is one of the highest-leverage ideas in practical machine learning of the last five years, it sits underneath the entire local-LLM ecosystem, and it is, in 2026, the difference between "you can run this model" and "you cannot."

This post explains what quantization actually is, the math in one paragraph, the families of methods that matter (GPTQ, AWQ, GGUF, bitsandbytes, FP8), what it costs you in accuracy, who uses what, and when you should care versus when you should not.

1. What quantization actually is

Imagine you are storing the average temperature in each city in Thailand, every hour, for a year. The real temperature is a continuous number: 28.4317 degrees, 30.0921 degrees, and so on. You could store it as a 64-bit double-precision float and capture every digit. You could also notice that you do not actually need fifteen digits of precision to track weather, and decide to round each reading to the nearest tenth of a degree and store it as a 16-bit integer count of tenths. Now each reading fits in a much smaller field, the file is one-quarter the size, and for every practical purpose (graphing, summarising, comparing months) the lossy version is just as good as the lossless one.

Quantization is that idea, applied to the weights of a neural network. The weights of a model like Llama 3 70B are not arbitrary precise quantities; they are the output of a noisy optimisation process, they are statistically clustered around zero, and the model's behaviour is robust to small perturbations of any individual weight. The question quantization asks is: given that the model is already approximate, how few bits per weight can we get away with before the output stops being usable?

The original training weights are typically stored in 32-bit floating point (FP32, four bytes per weight) or 16-bit floating point (FP16 or BF16, two bytes per weight). Quantization compresses each weight down to 8 bits (one byte), 4 bits (a nibble), or in aggressive variants 3, 2, or even fewer bits per weight. The savings are linear in the bit width. Going from FP16 to 4-bit integer quantization shrinks the model by a factor of four. A 70-billion-parameter model that needed 140 GB in FP16 needs roughly 35 GB at 4-bit, plus a small amount of overhead for the metadata that tells the model how to convert the integers back into approximate floats at inference time.

The compute story is similar but more nuanced. Modern GPUs have hardware-accelerated paths for INT8 and FP8 arithmetic that are faster than the FP16 path on the same silicon, often by a factor of about two at peak, depending on the chip and the operation. Most 4-bit LLM schemes, by contrast, are weight-only: the weights are stored in 4 bits to save memory and bandwidth, then dequantised back to FP16 just before the matrix multiply. Even there, the bandwidth win (moving a quarter as many bytes from GPU memory into the compute units) is often what dominates real-world throughput.
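To make the bandwidth argument concrete, here is a back-of-envelope sketch. It assumes single-stream decoding is memory-bound, so every generated token requires reading every weight once. The function name and the 400 GB/s bandwidth figure are illustrative assumptions, not measurements of any particular device.

```python
def decode_tokens_per_second(n_params: float, bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed when weight reads dominate."""
    model_gb = n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
    return bandwidth_gb_s / model_gb


N_PARAMS = 70e9
BANDWIDTH_GB_S = 400.0  # assumed memory bandwidth, for illustration only

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    tps = decode_tokens_per_second(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{label}: at most ~{tps:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with bit width, which is why a 4-bit model can feel several times faster than FP16 even when the arithmetic itself still runs in FP16.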

2. The bit-width staircase

The standard precisions used in practice form a staircase from highest to lowest:

FP32 (32 bits, 4 bytes/param). The default for traditional training. Rarely used for inference in 2026 because FP16 output is nearly indistinguishable at half the memory.
FP16 / BF16 (16 bits, 2 bytes/param). The default training precision for almost every modern LLM. BF16 differs from FP16 in giving up some mantissa precision in exchange for a wider exponent range, which matters during training. For inference the two are roughly interchangeable.
FP8 (8 bits, 1 byte/param). Two variants exist on NVIDIA's Hopper and Blackwell silicon — E4M3 (four exponent bits, three mantissa bits) and E5M2 (five exponent, two mantissa). E4M3 is preferred for forward-pass weights and activations; E5M2 for backward-pass gradients during training. Now common in production training and inference at frontier labs.
INT8 (8 bits, 1 byte/param). The first quantization step that most teams try because the hardware path is well-supported and the accuracy hit is usually negligible on standard benchmarks (typically well under two percentage points).
INT4 (4 bits, 0.5 bytes/param). The current sweet spot for local inference. Accuracy degradation is small for most modern LLMs — typically one to four points on aggregate benchmarks — and the memory savings are decisive. Most "Q4_K_M" or "AWQ-INT4" model files you see on Hugging Face live here.
INT3 / INT2 / sub-2-bit (aggressive). Useful for extreme memory-constrained deployment but the accuracy cost grows sharply. INT2 typically loses five to fifteen points across benchmarks; sub-2-bit is research-grade and rarely shipped.
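The two FP8 variants in the staircase above trade range against precision, and a few lines of arithmetic show how. The sketch below computes the largest finite value and the spacing just above 1.0 for each layout. It assumes the usual conventions for these formats as I understand them: E5M2 reserves the top exponent code for infinities and NaN, while E4M3 gives up only the single all-ones bit pattern to NaN. The helper names are my own.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format (assumed conventions)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2-style).
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Top exponent code holds normal numbers; only the all-ones
        # mantissa in that code is NaN (E4M3-style).
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    return max_mantissa * 2.0 ** max_exp


def step_near_one(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


for name, e, m, ieee in [("E4M3", 4, 3, False), ("E5M2", 5, 2, True)]:
    print(name, "max finite:", max_finite(e, m, ieee),
          "step near 1.0:", step_near_one(m))
```

E4M3 has the finer step and the smaller range, which fits its use for weights and activations. E5M2 has the wider range, which suits gradients.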

The memory formula is simple. For a model with $N$ parameters and bit-width $b$:

$$\text{memory} \approx N \times \frac{b}{8} \text{ bytes}$$

For Llama 3 70B at 4 bits, that is $70 \times 10^{9} \times 0.5 = 35 \times 10^{9}$ bytes, or 35 GB, before any per-block metadata. The total file size on disk is usually 10–20% larger because of the scaling factors and zero-points that quantization requires.
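The formula is easy to tabulate. This sketch applies it to a 70-billion-parameter model at each step of the staircase. The `overhead` parameter is an assumed stand-in for scale and zero-point metadata, set here to the 10–20% range mentioned above.

```python
def weight_memory_gb(n_params: float, bits: float, overhead: float = 0.0) -> float:
    """Approximate weight memory in GB: N * b / 8 bytes, plus fractional overhead."""
    return n_params * bits / 8 / 1e9 * (1 + overhead)


N_PARAMS = 70e9

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8),
                    ("INT4", 4), ("INT2", 2)]:
    print(f"{label:10s} {weight_memory_gb(N_PARAMS, bits):6.1f} GB")

# INT4 with an assumed 10% and 20% metadata overhead.
low = weight_memory_gb(N_PARAMS, 4, overhead=0.10)
high = weight_memory_gb(N_PARAMS, 4, overhead=0.20)
print(f"INT4 with metadata: {low:.1f} to {high:.1f} GB")
```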

3. The math, in one paragraph

The core operation is a linear mapping from a high-precision range to a low-precision integer range. Given a tensor of floating-point weights with values spanning roughly $[w_{\min}, w_{\max}]$, quantization picks a scale $s$ and a zero-point $z$ such that every weight $w$ is encoded as the integer:

$$q = \text{round}\!\left(\frac{w}{s}\right) + z$$

At inference time, the integer is converted back to an approximate float via:

$$\hat{w} = s \cdot (q - z)$$

The reconstructed $\hat{w}$ differs from the true $w$ by at most half a step, $s/2$, as long as $w$ falls inside the representable range and is not clipped; that is the quantization error. For $b$-bit integers the usual min-max choice is $s = (w_{\max} - w_{\min})/(2^b - 1)$ with $z = \text{round}(-w_{\min}/s)$, which maps $w_{\min}$ to 0 and $w_{\max}$ to $2^b - 1$. The whole game is choosing $s$ and $z$ well, and choosing how to group weights so that one $(s, z)$ pair covers a group of weights whose value range is small enough that the rounding error stays acceptable. The three common groupings are per-tensor (one scale for the whole weight matrix, simplest and least accurate), per-channel (one scale per row or column, the common middle ground), and per-group (one scale per consecutive block of 32–128 weights, most accurate and what GGUF and AWQ use).
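The skeleton above fits in a few lines of code. The sketch below uses min-max scaling on plain Python lists, with a synthetic weight vector containing one deliberate outlier, to compare per-tensor and per-group quantization. The function names, the Gaussian test data, and the group size of 32 are illustrative choices, not taken from any particular library.

```python
import random


def quantize(weights, bits=4):
    """Min-max affine quantization of one group: q = round(w / s) + z."""
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0  # guard against constant groups
    zero_point = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Reconstruct approximate floats: w_hat = s * (q - z)."""
    return [scale * (qi - zero_point) for qi in q]


def mean_abs_error(weights, group_size, bits=4):
    """Average reconstruction error when each group gets its own (s, z)."""
    total = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        q, s, z = quantize(group, bits)
        total += sum(abs(a - b) for a, b in zip(group, dequantize(q, s, z)))
    return total / len(weights)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
weights[5] = 1.0  # a single outlier stretches the per-tensor range

print("per-tensor error:", mean_abs_error(weights, group_size=256))
print("per-group error: ", mean_abs_error(weights, group_size=32))
```

With per-tensor scaling the outlier sets the step size for every weight. With groups of 32, only the outlier's own group pays that cost.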

Everything else — calibration data, error compensation, mixed precision per layer, non-uniform code tables — is a refinement on top of this skeleton.

4. The two regimes: PTQ and QAT

Quantization methods split cleanly into two camps based on when the quantization happens.

Post-training quantization (PTQ) takes a model that is already trained in FP16 or FP32 and converts it offline, in a separate pass after training is done. PTQ typically runs in minutes to hours, requires only a small amount of calibration data (a few hundred to a few thousand text samples), and does not touch the original training pipeline. Because it is cheap, fast, and applies to any pretrained model, PTQ is what the open-weight ecosystem uses almost exclusively. Almost every quantized Llama, Mistral, Qwen, DeepSeek, or Gemma model you have ever downloaded was produced via PTQ; the exceptions are the handful of QAT checkpoints that model developers have released themselves.

Quantization-aware training (QAT) bakes the quantization step into the training loop itself. During the forward pass, weights and sometimes activations are simulated as their quantized values, so the model learns parameters that are robust to the rounding noise. QAT typically produces a more accurate quantized model than PTQ at the same bit width — the gap can be one or two percentage points at INT4, which is the difference between a usable INT4 model and a noticeably-worse-than-FP16 one. The cost is real: QAT requires re-running a significant portion of training, and for billion-parameter LLMs that bill is enormous. As a result, QAT is rare for general-purpose LLMs and more common for production deployments of smaller models where the training run is affordable.
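The mechanism behind QAT can be sketched with a single parameter. In the toy below, the forward pass uses a rounded ("fake-quantized") weight while the update is applied to a full-precision latent weight, treating the rounding step as if its gradient were 1. That trick is commonly called the straight-through estimator. The data, grid step, and learning rate are made-up values; real QAT runs inside an autograd framework over billions of weights.

```python
STEP = 0.25          # assumed quantization grid spacing
LEARNING_RATE = 0.01

xs = [1.0, 2.0, 3.0]
ys = [0.73 * x for x in xs]  # the ideal weight, 0.73, is not on the grid


def fake_quant(w: float) -> float:
    """Snap a latent weight to the nearest grid point."""
    return round(w / STEP) * STEP


def loss_and_grad(w_used: float):
    """Mean squared error and its gradient with respect to the weight used."""
    n = len(xs)
    loss = sum((w_used * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w_used * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


latent_w = 0.0
initial_loss, _ = loss_and_grad(fake_quant(latent_w))

for _ in range(200):
    w_q = fake_quant(latent_w)            # forward pass sees the quantized weight
    _, grad = loss_and_grad(w_q)
    latent_w -= LEARNING_RATE * grad      # straight-through: update the latent weight

final_w_q = fake_quant(latent_w)
final_loss, _ = loss_and_grad(final_w_q)
print("quantized weight:", final_w_q, "loss:", final_loss)
```

The latent weight settles near a rounding boundary, so the quantized weight lands on one of the two grid points nearest the target.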

A middle option that has matured in 2024–2025 is fine-tune-with-quantization, in which a pretrained FP16 model is taken through a short additional training pass — sometimes only a few thousand steps — to recover accuracy lost to PTQ. The QLoRA paper popularised this for the open-weight community: rather than train a full QAT model, you load the model in 4-bit and train only a small adapter on top. The original weights stay quantized; the adapter is FP16. The result is a quantized model that, for fine-tuning purposes, behaves like an FP16 one. This is now the standard recipe for low-VRAM fine-tuning on consumer hardware.
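The structure of that recipe, a frozen quantized base plus a small trainable adapter, can be sketched without any framework. This toy uses symmetric absmax 4-bit integers for the base weights, a simplification I have swapped in for the NF4 format the QLoRA paper uses. The class and method names are hypothetical.

```python
import random


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def quantize_row(row, bits=4):
    """Symmetric absmax quantization of one weight row (stand-in for NF4)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax or 1.0
    return [round(w / scale) for w in row], scale


class QuantizedLinearWithAdapter:
    def __init__(self, weight, rank, rng):
        d_out, d_in = len(weight), len(weight[0])
        self.q_rows = [quantize_row(r) for r in weight]  # frozen, 4-bit
        # Low-rank adapter: A starts small and random, B starts at zero,
        # so the adapter is initially a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def base_forward(self, x):
        # Dequantize on the fly: scale * (integer row . x)
        return [s * sum(q * xi for q, xi in zip(qs, x)) for qs, s in self.q_rows]

    def forward(self, x):
        delta = matvec(self.B, matvec(self.A, x))
        return [b + d for b, d in zip(self.base_forward(x), delta)]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.5) for _ in range(4)] for _ in range(3)]
layer = QuantizedLinearWithAdapter(weight, rank=2, rng=rng)
x = [1.0, -2.0, 0.5, 3.0]
print(layer.forward(x), "trainable:", layer.trainable_parameters())
```

Only `A` and `B` would receive gradient updates during fine-tuning. The integer rows and their scales never change.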

5. The methods that matter

Within PTQ there is a small group of methods that ship the bulk of real-world quantized models in 2026. Each one is a different way of choosing scales, groupings, and error-compensation strategies. The fact that several can coexist is a sign that the problem is genuinely workload-dependent — there is no single best method, and the trade-offs between accuracy, file size, hardware support, and inference speed are real.

GPTQ (Frantar et al., 2022) is the original heavyweight method. It quantizes the model one weight column at a time, and after each column compensates the remaining un-quantized columns to absorb the rounding error introduced. The compensation is computed from an approximation of the inverse Hessian of the per-layer loss. GPTQ is mathematically principled, produces excellent accuracy at INT4, and is the method most published "fair comparison" benchmarks anchor against. The costs are calibration time (a 70B model takes hours to quantize on a single GPU) and flexibility: every weight in a layer gets the same bit width.
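The error-compensation idea can be shown with just two weights. The toy below quantizes the first weight, then shifts the second by an amount derived from the inverse Hessian of the layer's squared output error before rounding it. It illustrates the compensation update only. It is not the full GPTQ algorithm, which works on blocks of columns with damping and a Cholesky factorisation. The inputs, weights, and grid step are made-up values.

```python
STEP = 0.1  # assumed quantization grid spacing


def round_to_grid(w: float) -> float:
    return round(w / STEP) * STEP


# Calibration inputs (rows) with two strongly correlated features.
X = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9]]
w = [0.24, 0.24]


def output_error(w_hat):
    """Squared error of the layer output X @ w_hat against X @ w."""
    return sum(
        sum((a - b) * x for a, b, x in zip(w, w_hat, row)) ** 2 for row in X
    )


# Hessian of the squared output error is proportional to X^T X.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
c = sum(r[1] * r[1] for r in X)
det = a * c - b * b
h_inv = [[c / det, -b / det], [-b / det, a / det]]

# Baseline: round each weight independently.
naive = [round_to_grid(w[0]), round_to_grid(w[1])]

# Compensated: quantize weight 0, push its error onto weight 1, then round.
q0 = round_to_grid(w[0])
err = w[0] - q0
w1_adjusted = w[1] - err * h_inv[1][0] / h_inv[0][0]
compensated = [q0, round_to_grid(w1_adjusted)]

print("naive      ", naive, output_error(naive))
print("compensated", compensated, output_error(compensated))
```

Because the two input features are correlated, an error on the first weight can be largely cancelled by nudging the second, and the compensated version ends up with a much smaller output error on this data.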

AWQ (Activation-aware Weight Quantization, Lin et al., 2023) takes a different angle. Instead of trying to compensate for quantization error after the fact, AWQ observes that a small minority of weight channels carry most of the model's signal — they are the channels whose activations are large during inference. AWQ identifies those salient channels (using a small amount of calibration data) and scales them up before quantization so that the rounding error on the channels that matter is small, then scales the corresponding activations down at inference time so the product is preserved. The result is INT4 quantization with accuracy typically within half a point of GPTQ and significantly faster quantization runs.
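A four-weight toy shows why scaling a salient channel helps. The sketch below quantizes one weight row twice, once as-is and once with the salient weight scaled up first, then undoes the scaling afterward. Undoing it on the weight is mathematically the same as dividing the matching activation. The weights, the activations, and the scale factor of 2 are made-up values. AWQ itself searches for per-channel scales using calibration data.

```python
def quantize_dequantize(row, bits=4):
    """Symmetric absmax round trip for one group of weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax
    return [round(w / scale) * scale for w in row]


weights = [0.9, 0.17, -0.5, 0.33]
activations = [1.0, 10.0, 1.0, 1.0]  # channel 1 sees large activations
SALIENT, S = 1, 2.0                   # hypothetical salient index and scale

# Plain round-to-nearest.
plain = quantize_dequantize(weights)

# AWQ-style: scale the salient weight up, quantize, scale back down.
scaled = list(weights)
scaled[SALIENT] *= S
awq_like = quantize_dequantize(scaled)
awq_like[SALIENT] /= S  # equivalent to dividing the activation by S

err_plain = abs(weights[SALIENT] - plain[SALIENT])
err_awq = abs(weights[SALIENT] - awq_like[SALIENT])
print("salient-channel error, plain:", err_plain)
print("salient-channel error, scaled:", err_awq)
```

The group's largest weight is unchanged, so the step size stays the same. The salient weight is simply measured on a grid that is effectively twice as fine.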

GGUF (GPT-Generated Unified Format, Gerganov and contributors) is the format that the llama.cpp ecosystem uses, and it deserves attention because it is the format most non-cloud users will encounter. GGUF supports a family of quantization types: the "k-quant" variants (Q2_K, Q3_K, Q4_K, Q4_K_M, Q5_K_M, Q6_K) alongside simpler formats such as Q8_0. Each is a different bit-width-and-grouping recipe, and the mixed variants use non-uniform precision across tensors. The recipes assign more bits to the tensors that matter most (in Q4_K_M, for example, part of the attention value and feed-forward down-projection tensors is stored at higher precision) and fewer to the tensors that are most quantization-tolerant. The Q4_K_M variant, in particular, has emerged as the default sweet spot for local LLM use: roughly 4.8 bits per weight on average, with accuracy typically within one or two points of FP16, and small enough to run on consumer hardware.
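The "roughly 4.8 bits per weight" figure comes from block metadata plus the mix of tensor types. The sketch below counts bits for the Q4_K and Q6_K super-block layouts as I understand them from the llama.cpp sources: 256 weights per super-block, with quantized sub-block scales and FP16 super-block scales. The 17% share of weights stored as Q6_K is an assumed illustrative figure, not a measured one.

```python
SUPER_BLOCK = 256  # weights per super-block


def q4_k_bits_per_weight() -> float:
    weight_bits = SUPER_BLOCK * 4
    sub_block_meta = 8 * (6 + 6)   # 8 sub-blocks, 6-bit scale + 6-bit min each
    super_block_meta = 2 * 16      # FP16 scale and FP16 min
    return (weight_bits + sub_block_meta + super_block_meta) / SUPER_BLOCK


def q6_k_bits_per_weight() -> float:
    weight_bits = SUPER_BLOCK * 6
    sub_block_meta = 16 * 8        # 16 sub-blocks, 8-bit scale each
    super_block_meta = 16          # FP16 scale
    return (weight_bits + sub_block_meta + super_block_meta) / SUPER_BLOCK


Q6_SHARE = 0.17  # assumed fraction of weights kept at Q6_K in a mixed recipe

mixed_bpw = (1 - Q6_SHARE) * q4_k_bits_per_weight() + Q6_SHARE * q6_k_bits_per_weight()
file_gb = 70e9 * mixed_bpw / 8 / 1e9

print("Q4_K:", q4_k_bits_per_weight(), "bits/weight")
print("Q6_K:", q6_k_bits_per_weight(), "bits/weight")
print(f"mixed: {mixed_bpw:.2f} bits/weight, ~{file_gb:.0f} GB for 70B parameters")
```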

bitsandbytes (Dettmers and collaborators) is the library that made 4-bit quantization easy in PyTorch. It does on-the-fly quantization at model-load time, supports the NF4 format (a non-uniform 4-bit representation that exploits the normal distribution of weights), and pairs naturally with QLoRA fine-tuning. The accuracy is competitive with GPTQ for most workloads, and the integration is one line in a Hugging Face model load call, which is why it is the de-facto default in research and academic projects.
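For reference, a typical load call looks roughly like the sketch below. It assumes recent versions of `torch`, `transformers`, and `bitsandbytes` are installed and that a supported GPU is available. The wrapper function name is my own, and the imports sit inside it so the snippet can be pasted into a module without those packages being imported at load time.

```python
def load_model_nf4(model_id: str):
    """Load a causal LM with NF4 weights quantized at load time."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 code table
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Calling `load_model_nf4("your-org/your-model")`, where the model id is a placeholder, downloads the 16-bit checkpoint and quantizes it as it is placed on the device.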

FP8 is the hardware-native 8-bit floating-point family that NVIDIA's H100 and B100 chips support directly. FP8 is not, technically, "quantization" in the integer sense — it is a smaller floating-point format. But the use case is the same: less memory and bandwidth per weight. FP8 inference is increasingly common at frontier labs because the hardware path is efficient and the accuracy is essentially indistinguishable from FP16. DeepSeek pushed FP8 further in 2024–2025 with custom kernels and FP8 training, demonstrating that even the training-time bill could be paid in 8-bit floats without quality regression on properly tuned recipes.

6. What you lose: the failure modes

The marketing-level summary of quantization — "almost no accuracy loss at INT4" — is true on aggregate benchmarks like MMLU or HellaSwag. It is misleading in two specific ways that matter in production.

The first is that aggregate benchmark scores hide tail failures. A 4-bit Llama 3 70B may score 78% on MMLU compared to 80% for the FP16 version. The two-point drop is genuine but small — and yet, dive into the per-task breakdown and you typically find that the loss is concentrated in a few categories. Math word problems, multi-step reasoning, code generation under tight specifications, and long-context retrieval are the categories that tend to degrade disproportionately. If your real-world use case lives in one of those categories, the aggregate-benchmark number is comfortingly close to FP16 in a way that does not reflect what your users will experience.

The second is that quantization changes the failure mode, not just the failure rate. A quantized model is more likely to produce confidently-wrong outputs in domains where it would have hedged in FP16. It is more likely to mis-handle rare tokens (non-English scripts, technical notation, specific named entities) because those tokens are under-represented in the calibration data and so carry little weight in the statistics that pick the quantization scales. Production teams who switch from FP16 to INT4 typically see a small drop in pass rate on their eval suite and a larger drop in user satisfaction for non-English or technical workloads, even when overall accuracy looks unchanged.

Two other gotchas are worth flagging. Activations sometimes have outlier channels: a handful of dimensions where the activation values are an order of magnitude larger than the typical channel. Naively quantizing activations to INT8 with a single per-tensor scale ends up wasting most of the dynamic range on the outliers and under-resolving everything else. The SmoothQuant technique (Xiao et al., 2022) mitigates this by mathematically migrating the outlier magnitude from activations into weights, where it can be absorbed by per-channel scales. Modern quantization stacks bake this in by default; if you are working with an older codebase, activation outliers are the most common reason an INT8 model with quantized activations is unexpectedly worse than a weight-only INT4 counterpart.
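The migration step can be illustrated on a two-channel example. The sketch below applies the per-channel smoothing factor from the SmoothQuant paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$, dividing activations by it and multiplying the matching weight rows by it. The matrices and the choice of $\alpha = 0.5$ are made-up illustrative values.

```python
ALPHA = 0.5  # migration strength; 0.5 splits the difficulty evenly

# Activations: rows are tokens, columns are channels. Channel 1 is an outlier.
X = [[0.5, 80.0], [-1.0, 100.0], [0.8, -60.0]]
# Weights: row j multiplies input channel j.
W = [[0.4, -0.9], [1.0, 0.3]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def channel_absmax(acts, j):
    return max(abs(row[j]) for row in acts)


n_channels = len(W)
smooth = [
    channel_absmax(X, j) ** ALPHA / max(abs(v) for v in W[j]) ** (1 - ALPHA)
    for j in range(n_channels)
]

X_smooth = [[v / smooth[j] for j, v in enumerate(row)] for row in X]
W_smooth = [[v * smooth[j] for v in W[j]] for j in range(n_channels)]

ratio_before = channel_absmax(X, 1) / channel_absmax(X, 0)
ratio_after = channel_absmax(X_smooth, 1) / channel_absmax(X_smooth, 0)
print("outlier ratio before:", ratio_before, "after:", ratio_after)
```

The layer output is unchanged because each division on the activation side is cancelled by a multiplication on the weight side. The activation tensor, though, is now much easier to cover with a single INT8 scale.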

Finally, calibration data matters. Most PTQ methods need a small dataset — typically 128 to 2,048 text samples — to estimate weight and activation distributions. The standard practice is to use a generic mix like the C4 corpus, and for most models this is fine. But if your deployment workload is far from the calibration distribution — heavy code, heavy non-English, heavily structured input — the calibration-driven choice of scales may be visibly mis-tuned for what you actually do. A simple recipe in those cases is to recalibrate on a sample of your real production prompts, which often recovers a measurable fraction of the accuracy gap.

7. Quantization in the real world

The deployment patterns that have settled out by 2026 look roughly like this.

For local single-user inference — running a model on a laptop, a desktop with a consumer GPU, or a Mac with unified memory — the canonical stack is llama.cpp serving a GGUF file, usually in the Q4_K_M variant. A 70-billion-parameter model in that variant is about 42 GB on disk and runs comfortably on a 64-GB Mac at roughly 5–10 tokens per second of output. The same model in FP16 would not even load. This is the regime that quantization most dramatically opened up. Five years ago, "run a frontier-class LLM on your own hardware" was a sentence about server farms; today it is a sentence about laptops, because of quantization plus the inference engineering of llama.cpp.

For self-hosted production inference — a startup or enterprise serving an open-weight model behind their own API — the typical stack is vLLM or SGLang serving AWQ-quantized or GPTQ-quantized weights. INT4 is common for memory-bound deployments where the goal is to fit a larger model on existing hardware. INT8 or FP8 is common for compute-bound deployments where the goal is to maximise throughput. The choice between methods often comes down to which one the underlying inference engine supports best for the specific model architecture — the ecosystem of (engine × method × model architecture) is uneven, and "this model is supported in AWQ but not GPTQ on vLLM" is a sentence you see more often than the marketing implies. Our explainer on vLLM digs into the serving-side considerations.

For frontier training — the closed-source labs and the research-heavy open ones — the move toward FP8 has been the dominant story of 2024–2025. NVIDIA's Transformer Engine made FP8 training accessible on H100; DeepSeek published recipes for FP8 training on commercial-scale runs; subsequent open releases from Meta and others have adopted at least partial FP8 in their training stack. The accuracy bar at training time is higher than at inference (gradients are more sensitive than weights), and FP8 training is still considered a careful art rather than a turnkey switch, but the trajectory is clear: 8-bit floats are now a first-class training precision, not just an inference one.

For fine-tuning on consumer hardware — the world of LoRAs, QLoRAs, and hobbyist customisation — quantization is the only reason this category exists at all. QLoRA's recipe of "load the base model in 4-bit, train a small FP16 adapter" lets a single 24 GB consumer GPU fine-tune a 13-billion-parameter model. Without quantization, that workload would require 60 GB of VRAM and a workstation card. The community ecosystem that produces specialised models for narrow domains — coding assistants, role-play, region-specific chatbots — runs on this recipe.

8. When you should care (and when not)

The pragmatic question for most teams is not "is quantization good," it is "is quantization good for what I am doing." A short heuristic:

Care about quantization if you are running an open-weight LLM and any of these are true: memory is your bottleneck; you want to run a bigger model than your hardware would otherwise allow; you are fine-tuning on consumer hardware; you are serving high-volume inference and the per-request cost matters more than the last point of accuracy; or you are deploying on an edge device where every byte of bandwidth counts.

Do not particularly care about quantization if you are using a hosted frontier API (OpenAI, Anthropic, Google) — the provider handles whatever quantization is happening, and the relevant axis for you is price-per-token, not bit-width. You also do not need to care if your workload is one where the accuracy gap is decisive — high-precision math, novel reasoning, safety-critical decisions, or any task where a one-point accuracy drop has a meaningful real-world cost.

The single most useful piece of advice for teams new to this space is to measure on your own data. The published benchmark numbers for quantized models are honest but generic. The accuracy gap on your real workload — your prompts, your evaluation suite, your domain — is what matters, and it is rarely identical to the public benchmark gap. A few hours of evaluation harness work to compare an INT4 model against the FP16 baseline on your eval set will tell you more than a week of reading other people's blog posts.
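A minimal version of such a harness might look like the sketch below. It treats each model as a plain callable from prompt to answer and reports the pass-rate gap per category, since that is where quantization losses tend to hide. The exact-match scoring, the case format, and the stub models are all placeholder assumptions; a real suite would call your inference endpoints and use task-appropriate scoring.

```python
from collections import defaultdict


def pass_rate_by_category(model, cases):
    """Exact-match pass rate for each category in the eval set."""
    totals, passed = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case["category"]] += 1
        if model(case["prompt"]).strip() == case["expected"]:
            passed[case["category"]] += 1
    return {cat: passed[cat] / totals[cat] for cat in totals}


def accuracy_gap(baseline, quantized, cases):
    """Quantized minus baseline pass rate, per category."""
    base = pass_rate_by_category(baseline, cases)
    quant = pass_rate_by_category(quantized, cases)
    return {cat: quant[cat] - base[cat] for cat in base}


# Stub "models" standing in for real FP16 and INT4 endpoints.
cases = [
    {"category": "math", "prompt": "2+2", "expected": "4"},
    {"category": "math", "prompt": "17*3", "expected": "51"},
    {"category": "chat", "prompt": "capital of France", "expected": "Paris"},
]
fp16_stub = {"2+2": "4", "17*3": "51", "capital of France": "Paris"}.get
int4_stub = {"2+2": "4", "17*3": "41", "capital of France": "Paris"}.get

gaps = accuracy_gap(fp16_stub, int4_stub, cases)
print(gaps)
```

In this stub data the aggregate drop is one case in three, but all of it sits in the math category. That is the kind of pattern a single headline number would hide.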

A second piece of useful advice is to match the method to the model. The accuracy of AWQ vs GPTQ vs GGUF on a specific model is not predictable from first principles; it depends on the model architecture, the calibration data used by the model's quantizer, the bit-width, and the grouping. The pragmatic move is to download two or three quantized variants of the same model from Hugging Face — often the most popular variants are pre-quantized by community members who have already tuned the parameters — and benchmark them against each other.

9. Where to read next

The primary papers, in order of how much each one rewards reading:

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) is the canonical reference and the most cited method. The math is heavier than the others but the systems intuition is excellent.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) explains the salient-channel observation that makes the method work, and is a cleaner read than GPTQ if you only have time for one.

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) is the paper that opened up consumer-hardware fine-tuning. The introduction is one of the best short summaries of why quantization matters that has been written.

SmoothQuant (Xiao et al., 2022) is the paper to read if you want to understand activation outliers and why naive INT8 quantization sometimes fails.

The llama.cpp repository is the practitioner reference for GGUF and the k-quant variants — the README and the discussion threads are an unusually good source of empirical knowledge about which quant works for which model. The bitsandbytes documentation on Hugging Face is the easiest entry point for runtime quantization in PyTorch.

For the broader picture of how quantization fits into the rest of the inference stack, our explainer on vLLM covers the serving side, and the AI Encyclopedia curriculum is the place to read next — quantization sits in Phase 105 (Model Compression and Efficiency) alongside pruning, distillation, and the other techniques that make models smaller without making them worse.

The one-line takeaway, if you keep one thing: quantization is what lets a 70-billion-parameter model fit on hardware its designers never imagined, and the modern open-weight ecosystem exists because the accuracy cost turned out to be small enough to pay.

Further reading: GPTQ (Frantar et al., 2022), AWQ (Lin et al., 2023), QLoRA (Dettmers et al., 2023), SmoothQuant (Xiao et al., 2022), and the llama.cpp project documentation.