[VPTQ] Part 1: An Overview of Vector Post-Training Quantization
I've finally carved out a moment from the busy project I've been working on to share some of what we've learned. This post gives an overview of the VPTQ project; more detailed discussions of the specifics and related work will follow. Let me set the stage for you. 🙂
The project is ongoing, so feel free to drop any questions in the comments! You might notice a mix of English and Chinese in the blog — bear with me! 😄
Too Long; Didn’t Read (TL;DR)
Using our VPTQ project, a large model with 70/72 billion parameters can run smoothly on a 24G 4090.
In a nutshell: VPTQ can rapidly (in just a few hours) quantize large language models (LLMs) to extremely low bit-widths (1–4 bits) while maintaining good model accuracy. By storing indices into vector lookup tables (codebooks), VPTQ can decompress the ultra-low-bit data back to the original weights at inference time using nothing but table lookups.
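To make that concrete, here is a toy sketch of the inference-time lookup in PyTorch. It is illustrative only: the layer shape, codebook size, and dtype are made up, and this is not the actual VPTQ CUDA kernel.

```python
import torch

# Toy dequantization by table lookup (not the real VPTQ kernel; sizes are made up).
vector_len, num_centroids = 8, 65536              # 8 weights per vector, 2^16 centroids
codebook = torch.randn(num_centroids, vector_len)

# One index per group of 8 consecutive weights of a hypothetical 4096 x 4096 layer.
indices = torch.randint(0, num_centroids, (4096 * 4096 // vector_len,))

# Recovering the (approximate) weights is just a gather + reshape; the usual
# matmul/GEMV then runs on the dequantized weights.
weight = codebook[indices].reshape(4096, 4096)
x = torch.randn(4096)
y = weight @ x
```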
Resources
- Paper: [arXiv](https://arxiv.org/abs/2409.17066) / [Hugging Face](https://huggingface.co/papers/2409.17066)
- Code: [GitHub](https://github.com/microsoft/VPTQ)
- Community-released models:
  - [Hugging Face](https://huggingface.co/VPTQ-community) includes Llama 3.1 8B/70B/405B and Qwen 2.5 7B/14B/72B models (at 4-bit/3-bit/2-bit/~1-bit).
More updates are coming, so stay tuned!
Background (For more details, see upcoming posts 😉)
0. Technical Classification of VPTQ
Weight-Only Quantization: VPTQ compresses only the model weights/parameters; it does not simplify the computation itself. This is crucial because:
- The sheer volume of weights/parameters is the primary challenge in deploying LLMs. Current GPUs and accelerators offer ample floating-point and fixed-point compute (FLOPs/OPs), but if the on-board memory (DRAM/HBM) cannot hold the model weights, or the interconnect is weak, high-performance inference becomes infeasible.
- A major bottleneck in LLM inference is memory bandwidth. During the prefill stage, large amounts of weights must be streamed from HBM/DRAM into shared memory/registers; the decode stage is clearly memory-bound, dominated by GEMV operations. Weight-only quantization therefore pays off directly (the back-of-the-envelope numbers below make this concrete).
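To put rough numbers on this, here is a quick back-of-the-envelope sketch; it counts only weight storage and ignores codebook/index overhead and the KV cache.

```python
# Rough weight-memory footprint of a 70B-parameter model at different bit-widths.
params = 70e9

def weight_gib(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1024**3

print(f"fp16 : {weight_gib(16):6.1f} GiB")  # ~130 GiB -- far beyond a 24 GiB RTX 4090
print(f"4-bit: {weight_gib(4):6.1f} GiB")   # ~33 GiB  -- still does not fit
print(f"2-bit: {weight_gib(2):6.1f} GiB")   # ~16 GiB  -- fits, with room left for the KV cache
```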
Post-Training Quantization: VPTQ employs post-training quantization, a lightweight and fast method compared to Quantization-Aware Training (QAT).
- PTQ does not require backpropagation through the model (though, for a fair comparison with fine-tuning-based methods, we did lightly fine-tune a small number of parameters in our experiments).
- Given the massive size of LLMs, end-to-end training/fine-tuning/QAT would require significant computational resources, which limits the applicability of such methods. For instance, quantizing a 405B model would need at least two nodes just to run the teacher model, which is a steep requirement.
1. Related Concepts: Vector Quantization (VQ)
- Vector Quantization: Widely used in traditional communication and in neural networks (e.g., VQ-VAE), the core idea of VQ (learn more [here](https://en.wikipedia.org/wiki/Vector_quantization)) is to organize data into vectors, cluster them, and store the cluster centroids as a codebook; the original data is then represented by indices into that codebook. During LLM inference, we only need to look up the indices in the codebook, right before the current operator runs, to reconstruct the (approximate) original weights (see the toy sketch below).
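Here is a toy round trip with plain k-means to make the idea concrete. This is a minimal sketch with made-up settings; VPTQ's actual algorithm is more involved and, as discussed below, is guided by second-order information.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy vector quantization of one weight matrix (illustrative settings only).
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)

vector_len, num_centroids = 8, 256
vectors = W.reshape(-1, vector_len)              # group weights into length-8 vectors

kmeans = KMeans(n_clusters=num_centroids, n_init=4, random_state=0).fit(vectors)
codebook = kmeans.cluster_centers_               # (256, 8) lookup table
indices = kmeans.labels_.astype(np.uint8)        # one 8-bit index per 8 weights ~= 1 bit/weight

# Dequantization is a table lookup followed by a reshape.
W_hat = codebook[indices].reshape(W.shape)
print("reconstruction MSE:", float(np.mean((W - W_hat) ** 2)))
```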
Works such as GPTVQ, AQLM, and QuIP# have also been gradually applying VQ methods to quantize model weights, each with its own distinctive features. Stay tuned for more detailed introductions in subsequent posts!
2. Related Concepts: Second-Order Optimization-based PTQ
- Second-Order Optimization-based PTQ: Personally, I favor this classification, which traces back to LeCun's seminal 1989 paper, "Optimal Brain Damage". The core idea is to formulate model quantization as an optimization problem: expand the optimization target with a Taylor series into first- and second-order terms and discard the higher-order terms. The first-order term is assumed to be zero (the trained model sits at a local optimum), so the second-order term becomes the optimization target that guides the design of the quantization algorithm (written out below).
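Written out in my own schematic notation (not necessarily the exact formulation of any specific paper), the setup looks like this:

```latex
\Delta \mathcal{L} \;\approx\;
  \underbrace{g^{\top} \Delta w}_{\approx\, 0 \text{ for a trained model}}
  \;+\; \tfrac{1}{2}\, \Delta w^{\top} H \, \Delta w,
  \qquad \Delta w = \hat{w} - w
```

Methods in this family then choose the quantized weights \(\hat{w}\) to minimize the quadratic term, with the Hessian \(H\) typically approximated layer by layer from a small calibration set.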
Playing with Demos
Instead of getting lost in the details, why not check out the demo directly? Visit [GitHub](https://github.com/microsoft/VPTQ)'s README to have some fun with it!
Environment Setup
Before you start, you might need to set your CUDA path.
export PATH=/usr/local/cuda-12/bin/:$PATH # Customize based on your environment.
Installing the VPTQ library may take a few minutes because it compiles the CUDA kernels. The default build currently targets only SM80, SM86, and SM90; earlier architectures (Volta/V100, i.e., SM70, and later) can also be supported, but compilation takes longer. Modify setup.py as needed.
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
Try It Out
Why not try a model from the open-source community, such as VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft (roughly equivalent to 2-bit quantization):
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
The model will produce an explanation of this famous poem.
Or start a chatbot:
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k65536-0-woft --chat
In Closing
Future posts may cover related PTQ work, the core algorithm design guided by the optimization problem, kernel optimization for the inference path, and some more serious discussions about quantization. If there's anything else you'd like to see, just leave a reply. The VPTQ project is still ongoing, with plenty of improvements and applications in the works; if you're interested, feel free to reach out!