The ever-increasing complexity of large language models (LLMs) often comes at a steep cost: greater computational requirements, increased energy consumption, and slower inference times. Enter model quantization - a powerful technique that can substantially reduce model size and accelerate inference without significantly sacrificing accuracy. Model quantization is increasingly being adopted in real-world applications where efficiency is critical. Think of a self-driving car’s real-time object detection system, which demands split-second decisions with limited onboard processing power, or a voice assistant embedded in a low-cost IoT device, where both memory and energy are constrained. In this article we explore how quantization works, the key techniques involved, and the trade-offs it introduces. We also fine-tune a model using the Databricks Foundation Model Fine-tuning API, quantize it, and evaluate the performance of the baseline, fine-tuned, and quantized models.
What is model quantization?
Model quantization is a technique for replacing a trained model’s weights with values of lower precision. The idea is that using more compact data types reduces the model's overall memory footprint and makes it less compute-intensive. If done correctly, quantization should reduce the cost of running inference without a significant degradation in the model's performance.
Quantization, in the sense we use the term here, was formally introduced in Shannon’s foundational paper A Mathematical Theory of Communication [1], but quantization techniques can be traced all the way back to the 19th century, when discretization was used as an approximation technique. Today, in the context of neural networks, model quantization can be used both in the training and the inference phase. In this brief article we focus on the inference part of the pipeline, but keep in mind that various architectures leverage precision reduction during training by including quantization as part of the model architecture.
Neural nets are remarkably suitable for quantization. They are typically heavily overparameterized, which makes them memory-hungry and computationally intensive, but the presence of such a vast number of parameters also allows us to reduce precision without impacting the quality of the model. Moreover, reducing the precision all the way down to integer arithmetic not only makes the model faster for inference but enables it to run on embedded devices, which often support integer data types only.
The quantization operator
The following rounding function is one of the most basic quantization operators:
Q(r) = Δ ⋅ round( r / Δ )
where r ∈ ℝ is the input that needs to be quantized, Δ is the quantization step-size, and round denotes rounding to the nearest integer. An essential property of the quantization operator is that the range of Q(⋅) is typically finite¹ and smaller than its domain, which in the extreme case can be uncountable. For example, setting Δ to 1 results in an operator that takes any real value r and produces an integer output.
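As a concrete illustration, here is a minimal NumPy sketch of the uniform quantizer above; the step-sizes and the sample inputs are arbitrary choices for demonstration:

import numpy as np

def uniform_quantize(r: np.ndarray, delta: float) -> np.ndarray:
    """Uniform quantizer Q(r) = delta * round(r / delta)."""
    return delta * np.round(r / delta)

r = np.array([-1.73, -0.42, 0.08, 0.55, 2.31])
print(uniform_quantize(r, delta=1.0))   # integer-valued outputs: [-2. -0.  0.  1.  2.]
print(uniform_quantize(r, delta=0.25))  # finer step-size -> lower quantization error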
Figure 1 - The effect of quantization noise can be mitigated by increasing the number of intervals in the low amplitude regions by varying the quantization step-size. You can see a uniform quantizer (a) and a non-uniform quantizer (b) depicted above.
Note that the quantization process is by its nature lossy, as it performs a many-to-few mapping (e.g. 32-bit floating-point numbers being mapped to a narrow range of 8-bit integers). An approximation of r can be reconstructed from the output of Q(r) (dequantization), but the recovered value will not be exact. Another interesting property of the rounding function above is that it implements a uniform quantizer - i.e. the spacing between its output values is constant. This is not always what we want, as there are cases where the inputs of the quantization operator are more likely to fall in one region than in another. For example, most of the activations of a ReLU layer might cluster near zero due to the sparsity introduced by ReLU. In that case it is better to assign more options (i.e. more intervals) for the quantized values to that region. A common example of a non-uniform quantizer is the deadzone quantizer [2], which has a broader interval around the zero output (the dead zone). This enables it to map a range of low-level signals to zero, thus reducing unwanted noise. Another example of a non-uniform quantizer is the μ-law algorithm [3], which applies a logarithmic compression function to the input of Q(⋅).
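For intuition, here is a small sketch of a non-uniform quantizer built by compressing the input with the μ-law function before applying the uniform rounding operator; μ = 255 is the common telephony choice, and the inputs (assumed to lie in [-1, 1]) are purely illustrative:

import numpy as np

def mu_law_compress(x, mu=255.0):
    # Logarithmic companding: expands resolution near zero, compresses large magnitudes
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def non_uniform_quantize(x, delta=0.25, mu=255.0):
    # Compress, quantize uniformly in the compressed domain, then expand back
    compressed = mu_law_compress(x, mu)
    quantized = delta * np.round(compressed / delta)
    return np.sign(quantized) * np.expm1(np.abs(quantized) * np.log1p(mu)) / mu

x = np.array([-0.9, -0.05, 0.01, 0.2, 0.8])
print(non_uniform_quantize(x))  # small-amplitude inputs are reconstructed more finely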
Figure 2 - Symmetric (a) vs. asymmetric (b) quantization.
Note that both the rounding and the deadzone quantizers are symmetric around 0. This doesn’t necessarily need to be the case. In symmetric quantization we form the quantization range by taking the maximum absolute value of the inputs and making the range symmetric around zero, mirroring the output of the quantization operator (Figure 2 (a)). In asymmetric quantization, we set the domain of Q(r) to exactly match the min/max values of its range: the minimum value of the input maps directly to the minimum value of the quantized range, and likewise for the maximum. This is achieved by introducing an offset, also known as the zero-point, implemented by injecting an extra term z ≠ 0 into the quantization operator:
Q(r) = Δ ⋅ round( r / Δ + z)
We can clearly see that symmetric quantization is a special case of asymmetric quantization with the zero point set to 0. When z ≠ 0, it skews the rounding away from the center, making the quantizer asymmetric.
There are clear tradeoffs between symmetric and asymmetric quantization, which is why both approaches appear in different model architectures. Symmetric quantization is simple to implement and provides equal resolution on both sides of zero. On the other hand, it may not be optimal for inputs whose distribution is not centered around zero or where certain ranges of values are more critical². Asymmetric quantization, on the other hand, can be tailored to better fit the actual distribution of the signal values and is more effective for signals with non-zero means. However, it may be more difficult to implement and more computationally intensive, and it can also introduce uneven quantization error, leading to more distortion in certain signal ranges.
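To make the distinction concrete, here is a small NumPy sketch that derives the step-size and zero-point for both schemes from the observed min/max of a toy, non-zero-mean tensor (the helper names are our own):

import numpy as np

def symmetric_params(x, n_bits=8):
    # Range is symmetric around zero: [-127, 127] for int8, zero-point always 0
    q_max = 2 ** (n_bits - 1) - 1
    return np.max(np.abs(x)) / q_max, 0

def asymmetric_params(x, n_bits=8):
    # Maps [x.min(), x.max()] onto the full [-128, 127] grid via a zero-point
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    delta = (x.max() - x.min()) / (q_max - q_min)
    zero_point = np.round(q_min - x.min() / delta)
    return delta, zero_point

def quantize(x, delta, zero_point, q_min=-128, q_max=127):
    return np.clip(np.round(x / delta + zero_point), q_min, q_max).astype(np.int8)

def dequantize(q, delta, zero_point):
    return delta * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32) + 2.0  # non-zero-mean input
for params in (symmetric_params, asymmetric_params):
    delta, zp = params(x)
    err = np.abs(x - dequantize(quantize(x, delta, zp), delta, zp)).mean()
    print(params.__name__, "mean abs error:", err)  # asymmetric uses the grid more efficiently here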
Deciding on the specifics of scaling and mapping the range of values during quantization can be a daunting task. The process of determining optimal ranges for each tensor, in order to minimize the loss of precision and maintain model accuracy, is known as “calibration”. There are generally two ways of performing calibration. We can use a representative dataset to analyze the distributions of activations and weights and use the observed ranges to compute scaling factors and zero-points (static calibration). Alternatively, ranges can be determined dynamically during inference based on the data seen at runtime (dynamic calibration). There are known tradeoffs around accuracy, flexibility, processing costs, and deterministic behaviour between static and dynamic calibration that we need to keep in mind when designing the quantization pipeline [8].
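As a minimal sketch of static calibration, assume we simply track running min/max statistics over a few representative batches and then freeze the derived parameters; the observer class and the synthetic batches below are illustrative:

import numpy as np

class MinMaxObserver:
    """Tracks the running range of a tensor over a calibration dataset."""
    def __init__(self):
        self.min_val, self.max_val = np.inf, -np.inf

    def observe(self, x):
        self.min_val = min(self.min_val, float(x.min()))
        self.max_val = max(self.max_val, float(x.max()))

    def quant_params(self, n_bits=8):
        # Asymmetric parameters derived from the observed range
        q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
        delta = (self.max_val - self.min_val) / (q_max - q_min)
        zero_point = round(q_min - self.min_val / delta)
        return delta, zero_point

observer = MinMaxObserver()
for _ in range(10):                    # representative calibration batches
    observer.observe(np.random.randn(64, 128))
print(observer.quant_params())         # (delta, zero_point), frozen and reused at inference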
How does quantization work in the context of neural networks?
Quantizing a neural network follows the general signal quantization approach covered in the previous section: moving from a high-precision representation of the synaptic weights (typically 32-bit floating point, or float32) to a lower-precision data type. From the perspective of matrix-vector multiplication, a typical feedforward neural network with a single hidden layer can be defined as³:
y = W ⋅ x + b
Where y is the output of the network, x is the input, W is the weight matrix in the hidden layer, and b is a bias term. Such processing blocks can be combined to perform larger matrix-matrix multiplications and convolutions, which are characteristic of deeper and more complex neural networks. From a hardware perspective, an implementation of this block leverages processing elements and accumulators and is typically defined as:
Aₙ = bₙ + ∑ₘ Cₙ,ₘ,   where Cₙ,ₘ = Wₙ,ₘ ⋅ xₘ
This operation is known as Multiply-Accumulate (MAC). It works by first loading the bias term bₙ into accumulator Aₙ, after which the individual products Wₙ,ₘ ⋅ xₘ of the matrix-vector multiplication are computed by the processing elements Cₙ,ₘ and added to the accumulator [4].
Neural networks are usually trained using the float32 data type for their weights and biases. Clearly, the accumulator should be sufficiently large and also provide floating point precision to be able to store the results of the input multiplication and bias addition. If the MAC operations take place on a dedicated accelerator (e.g. a GPU), we also need to factor in the data transfer operations from memory to the processing elements. Using lower-precision data types like int8, which only stores whole numbers in the range [-128, 127], speeds up the MAC operations and simultaneously reduces the transfer times [5].
The two most common approaches are to quantize the network weights from float32 to float16 or from float32 to int8. However, we need to be mindful that the data type of the accumulator should be sufficiently large to prevent overflows (e.g., the sum of two int8 values can easily exceed the range of the int8 data type). Therefore, it is common practice to extend the precision of the accumulator like so (a short numerical illustration follows the list):
- for float16 processing elements, the accumulator type is set to float16
- for bfloat16 processing elements, the accumulator type is set to float32
- for int16 processing elements, the accumulator type is set to int32
- for int8 processing elements, the accumulator type is set to int32
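The need for a wider accumulator is easy to see numerically. Below is a minimal NumPy sketch (the vector length and random values are arbitrary) comparing an int32 accumulator with a narrow int8 accumulator for the same dot product:

import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=1024, dtype=np.int8)  # int8 weights
x = rng.integers(-128, 128, size=1024, dtype=np.int8)  # int8 activations

# Exact MAC: widen the operands so the products and the running sum fit comfortably
products = w.astype(np.int32) * x.astype(np.int32)
acc_int32 = products.sum(dtype=np.int32)

# Narrow MAC: both the products and the running sum wrap around modulo 256
acc_int8 = products.astype(np.int8).sum(dtype=np.int8)

print(acc_int32, acc_int8)  # the int8 accumulator silently produces a wrong total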
The float32 to int8 quantization may be challenging, due to the sheer narrowness of int8 - we have to compact a very wide floating point range into only 256 integer values. In the case of symmetric quantization the integer range is even smaller, as it is reduced to [-127, 127] (i.e. -128 is omitted for the purpose of symmetry). To address this we typically select a quantization interval from the float32 range and use it as the domain of Q(r).
Values that fall outside of the domain are typically handled using clipping [6,7], which replaces any number outside the interval with the smallest or largest representable value of the domain. Clipping inevitably impacts model performance, so it needs to be applied with care.
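To illustrate, here is a minimal sketch of clipping before quantization, assuming a hand-picked interval [α, β] = [-4, 4] and an unsigned 8-bit grid for simplicity:

import numpy as np

def clip_and_quantize(x, alpha, beta, n_bits=8):
    # Values outside [alpha, beta] are saturated to the interval edges before rounding
    delta = (beta - alpha) / (2 ** n_bits - 1)
    x_clipped = np.clip(x, alpha, beta)
    return np.round((x_clipped - alpha) / delta).astype(np.uint8), delta

x = np.array([-50.0, -3.2, 0.0, 1.7, 42.0], dtype=np.float32)
q, delta = clip_and_quantize(x, alpha=-4.0, beta=4.0)
print(q)  # the outliers -50.0 and 42.0 saturate to 0 and 255 respectively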
Quantization for neural networks can take one of the following three forms:
- Post-Training Quantization (PTQ): this is the most widely adopted scheme, where quantization is performed after the neural network has been fully trained. At this point quantization is applied to the fitted model parameters, typically using int8 as the lower precision target.
- Quantization-Aware Training (QAT): this technique applies quantization during the training process itself. The advantage with this approach is that the model is trained to account for the quantization process, which typically leads to better performance as the model learns to adapt to the quantization effects.
- Dynamic Quantization: in this mode, quantization is applied to the activations during the inference phase. The key benefit of dynamic quantization is that it can determine the quantization step-size Δ dynamically based on the data range observed during inference. This ensures that the quantization operator is tuned in a way that retains as much signal as possible and minimizes the loss of information (a minimal PyTorch sketch follows this list).
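As an illustration of the dynamic mode, here is a minimal PyTorch sketch that applies post-training dynamic quantization to the linear layers of a toy model; the layer sizes are arbitrary and the model is a stand-in for a real network:

import torch
import torch.nn as nn

# Toy float32 model standing in for a real network
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Weights of the nn.Linear modules are converted to int8 ahead of time, while
# activation ranges are observed and quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)  # same interface, smaller and faster linear layers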
How can Databricks facilitate quantization?
Let’s look at a simple example of performing post-training quantization in Databricks. Due to the flexibility of this type of model compression, many open source models are available in various quantized forms on popular public sharing platforms such as Hugging Face. In particular, user TheBloke is a prolific contributor of models quantized via GPTQ [9]. These models can be easily imported, logged, and served using Databricks and MLflow, thanks to the mlflow-extensions package.
However, where quantized versions of models are not publicly available, for example when one is fine-tuning their own models, there are several popular libraries for applying GPTQ. We demonstrate below how to use this method by leveraging the Python library AutoGPTQ. In the process, we extend the dbdemos.ai example on extracting drug names. The code used here runs on a single V100 GPU (p3.2xlarge) with DBR 15.4 LTS, and we use Llama 3.2 1B Instruct as the base model.
Step one is to fine-tune the model. In this case, we use the Databricks Foundation Model Fine-tuning API:
from databricks.model_training import foundation_model as fm

run = fm.create(
    data_prep_cluster_id=get_current_cluster_id(),
    model=base_model_path,
    train_data_path=f"{catalog}.{schema}.ner_chat_completion_training_dataset",
    eval_data_path=f"{catalog}.{schema}.ner_chat_completion_eval_dataset",
    task_type="CHAT_COMPLETION",
    register_to=registered_model_name,
    training_duration="10ep",  # train for 10 epochs
)
The fine-tuned model can then be accessed from the experiment and loaded directly into the cluster:
ft_mdl_pipe = mlflow.transformers.load_model(f'models:/{registered_model_name}/1')
We will compare the performance of the baseline model, the fine-tuned model, and the quantized fine-tuned model in terms of inference time and extraction accuracy.
AutoGPTQ provides wrapper classes around the transformers library’s AutoModelForCausalLM class, so loading the model is as simple as:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

qnt_config = BaseQuantizeConfig(bits=4, group_size=128)  # 4-bit weights, quantized in groups of 128
model = AutoGPTQForCausalLM.from_pretrained(local_path + 'model', quantize_config=qnt_config).to(device)
As evident from the code snippet above, we are using 4-bit quantization in this case. AutoGPTQ also allows us to provide calibration text, and here we can leverage training data from the fine-tuning run to calibrate the quantization. AutoGPTQ then exposes a quantize method to run the quantization, after which the quantized model can be saved.
import json

# Use a few training examples from the fine-tuning dataset as calibration data
calibration_json = (
    spark.table("ner_chat_completion_training_dataset").limit(3).toJSON().collect()
)
calibration_texts = [json.loads(c)["messages"] for c in calibration_json]

# Tokenize the chat-formatted examples so they can be fed to the quantizer
examples = []
for item in calibration_texts:
    text = tokenizer.apply_chat_template(
        item, tokenize=False, add_generation_prompt=False
    )
    inputs = tokenizer(text, return_tensors="pt")
    examples.append(inputs)

# Run GPTQ quantization using the calibration examples
model.quantize(examples)
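Once the quantization run completes, the compressed weights can be written out. A minimal sketch, assuming a local output directory of our choosing (the directory name below is illustrative):

quantized_model_dir = local_path + "model-gptq-4bit"  # hypothetical output location

# Persist the 4-bit weights (and the tokenizer alongside them) for later serving
model.save_quantized(quantized_model_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_model_dir)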
Comparing the results post-quantization, we see that the F1 score remains consistent between the fine-tuned and the quantized model (0.9, up from 0.78 for the base model). At the same time, inference speed increases significantly.
Databricks applies some quantization to the model produced by fine-tuning; hence, the fine-tuned model already runs inference faster than the base model. Quantizing to 4-bit reduces inference time further, making it approximately 30% faster than the base model.
The results support our expectation that quantization speeds up inference and that, when done right, the loss of model performance is minimal.
Summary
In this article we looked at model quantization, a transformative approach for enhancing the efficiency of large language models (LLMs) during inference. By reducing the precision of model weights, quantization significantly decreases memory usage and computational overhead while maintaining near-original performance. The Databricks platform offers robust tools to facilitate this process, leveraging its infrastructure for tasks such as post-training quantization and fine-tuning.
We included a detailed example demonstrating the use of Databricks to fine-tune a model, apply 4-bit quantization with the AutoGPTQ library, and achieve faster inference speeds with consistent accuracy metrics. The results revealed up to a 30% improvement in inference speed compared to the base model without performance degradation, underscoring the potential of quantization for scalable, efficient LLM deployment on Databricks.
Model quantization has become a game-changer in edge AI applications, where low-power, low-latency performance is crucial [10, 11]. In such systems, quantization reduces model size, allowing models to execute efficiently on specialized hardware like GPUs or FPGAs. It enables faster response times and extends battery life without sacrificing critical accuracy. Another major application is in large-scale cloud services that serve millions of requests per second, such as recommendation systems, search engines, and content delivery platforms. By deploying quantized models, cloud providers can significantly lower computational costs and energy consumption.
While model quantization offers significant benefits, such as faster inference and reduced memory usage, it comes with a few notable drawbacks. One major concern is the potential loss of accuracy, as reducing parameter precision can degrade model performance, especially in sensitive applications like medical diagnosis. Additionally, not all hardware fully supports low-precision operations, limiting deployment options and requiring extra engineering effort. Certain quantization techniques, like quantization-aware training, may also necessitate retraining, adding complexity and increasing development time [12, 13]. Despite these challenges, ongoing advancements continue to improve quantization’s viability across diverse applications.
For those eager to explore further, the blog Serving Quantized LLMs on NVIDIA H100 Tensor Core GPUs dives deep into how quantization approaches, particularly using NVIDIA H100 GPUs, significantly enhance performance metrics like throughput and latency, while maintaining model quality.
The complete code for the fine-tuning and quantization example is available here: https://github.com/nmanchev/databricks-blog/tree/main
Notes
¹ or countably infinite
² An example of this is the quantization of the outputs of the ReLU function. The range of the ReLU function is completely biased towards one side, but a symmetric quantization operator would dedicate a significant part of the quantized range to values that never appear.
³ We omit the nonlinear activation in this expression for the purposes of simplicity.
References
[1] Shannon, Claude Elwood (July 1948). "A Mathematical Theory of Communication”. Bell System Technical Journal. 27 (3): 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x. hdl:11858/00-001M-0000-002C-4314-2.
[2] Sayood, Khalid (2012) Introduction to Data Compression, Fourth Edition (4th. ed.), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[3] Tu, Yung-Ping et al. (2024). “A Novel Alternating μ-Law Companding Algorithm for PAPR Reduction in OFDM Systems”. Electronics.
[4] Nagel, Markus et al. (2021) “A White Paper on Neural Network Quantization”, ArXiv, https://doi.org/10.48550/arXiv.2106.08295
[5] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)", 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 2014, pp. 10-14, doi: 10.1109/ISSCC.2014.6757323.
[6] C Sakr, S Dai, R Venkatesan, B Zimmer, W Dally, B Khailany (2022), “Optimal clipping and magnitude-aware differentiation for improved quantization-aware training”, International Conference on Machine Learning, 19123-19138
[7] C. Liqun and H. Lei (2023). "Clipping-based Neural Network Post Training Quantization for Object Detection", 2023 IEEE International Conference on Control, Electronics and Computer Technology (ICCECT), Jilin, China, pp. 1192-1196, doi: 10.1109/ICCECT57938.2023.10141287.
[8] Ahn, Hyunho, et al. Performance Characterization of Using Quantization for DNN Inference on Edge Devices, 2023 IEEE 7th International Conference on Fog and Edge Computing (ICFEC). IEEE, 2023.
[9] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, The Eleventh International Conference on Learning Representations (ICLR), 2023
[10] Giazitzis, S., Sakwa, M., Leva, S., Ogliari, E., Badha, S., & Rosetti, F. (2024). A Case Study of a Tiny Machine Learning Application for Battery State-of-Charge Estimation. Electronics, 13(10), 1964. https://doi.org/10.3390/electronics13101964
[11] Zhuo, S., Chen, H., Ramakrishnan, R. K., Chen, T., Feng, C., Lin, Y., ... & Shen, L. (2022). An empirical study of low precision quantization for TinyML. arXiv preprint arXiv:2203.05492.
[12] Bondarenko, Y., Nagel, M., & Blankevoort, T. (2021). Understanding and overcoming the challenges of efficient transformer quantization. arXiv preprint arXiv:2109.12948
[13] Roy, S. (2023). Understanding the Impact of Post-Training Quantization on Large-scale Language Models. arXiv preprint arXiv:2309.05210.