Making AI Lighter and Smarter: The Changes Brought by 'Asymmetric Quantization'

AI Summary

This article explains how 'asymmetric quantization,' a data compression technique, can drastically reduce AI model storage requirements while maintaining high information accuracy.

Imagine this: you are searching through tens of thousands of documents on your smartphone, and an AI finds the answer in the blink of an eye. What if I told you that this AI stores those documents in 1/32 of the space it needed before? It is like compressing the contents of a vast library into a single thin sheet of paper while losing almost none of the information. Today, I am introducing a seemingly magical technology called “Asymmetric Quantization,” which drastically reduces storage requirements while preserving the core of an AI’s intelligence.

Why Is This Technology Important?

Recently, AI models have been growing to enormous sizes. As models become smarter, the amount of information they contain has become vast. However, this also means that users’ smartphones and corporate servers require massive amounts of storage space. Picture a device built to serve 100 people that can barely hold the data for one: that is the kind of inefficiency oversized models create.

This technology allows AI to be used more freely in everyday small devices. Reduced storage requirements also mean lower operating costs. Consequently, this lays a solid foundation for smart devices around us to possess smarter AI features even without an internet connection. Source 12

Easy Understanding: How to Put Data on a Diet

“Quantization,” in simple terms, is similar to downscaling a high-resolution photo to a lower resolution while keeping it as close to the original as possible. It means taking data that was represented by very precise 32-bit numbers and storing it as simpler numbers, such as 8-bit values. Source 15
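To get a feel for the savings, here is a small back-of-the-envelope sketch. The figure of one billion stored values is a made-up example, not a number from the sources, and the calculation ignores the small extra space needed for per-block scales and offsets.

```python
def storage_bytes(num_values, bits_per_value):
    """Raw storage for num_values numbers at the given bit width."""
    return num_values * bits_per_value / 8


# Hypothetical model size, chosen only to make the arithmetic easy to read.
num_values = 1_000_000_000

for bits in (32, 8, 1):
    size_gb = storage_bytes(num_values, bits) / 1e9
    shrink = 32 // bits          # how many times smaller than 32-bit storage
    saved = 1 - bits / 32        # fraction of the original space freed up
    print(f"{bits:>2}-bit: {size_gb:.3f} GB ({shrink}x smaller, {saved:.1%} saved)")
```

Going from 32 bits to 8 bits cuts storage to a quarter; going all the way to 1 bit per value cuts it to 1/32.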

Traditional “symmetric quantization” assumes the values are centered on zero and spreads its limited set of levels evenly on both sides. “Asymmetric quantization” acknowledges that real data is often skewed to one side. By analogy, it is like adjusting the brightness of a photo by setting the darkest and brightest points individually to preserve fine details. This technology stores a scale and an offset (a correction that shifts the reference point) for each block of values, so the available levels cover only the range the data actually occupies and subtle differences are preserved much more precisely. Source 8, Source 13
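The difference is easier to see with numbers. The following is a minimal sketch, not the method of any particular library: the function names, the 8-bit width, and the sample block of values between 10 and 12 are all assumptions chosen for illustration.

```python
def quantize_asymmetric(values, bits=8):
    """Map floats in [min, max] onto integer codes 0 .. 2**bits - 1."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo             # lo is the stored offset


def dequantize_asymmetric(codes, scale, offset):
    return [c * scale + offset for c in codes]


def quantize_symmetric(values, bits=8):
    """Map floats in [-max_abs, +max_abs] onto signed codes centered on zero."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / qmax
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_symmetric(codes, scale):
    return [c * scale for c in codes]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# A skewed block: every value sits between 10 and 12, far away from zero.
block = [10.0 + 2.0 * i / 63 for i in range(64)]

codes, scale, offset = quantize_asymmetric(block)
asym_err = max_error(block, dequantize_asymmetric(codes, scale, offset))

sym_codes, sym_scale = quantize_symmetric(block)
sym_err = max_error(block, dequantize_symmetric(sym_codes, sym_scale))

print(f"asymmetric max error: {asym_err:.5f}")
print(f"symmetric  max error: {sym_err:.5f}")
```

On this skewed block the symmetric version spends most of its levels on values that never occur (everything from -12 up to 10), so its steps are coarser and its worst-case error is larger. The asymmetric version pays for its accuracy with one extra stored number per block, the offset.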

In document retrieval systems, the effect is even more dramatic. Here the asymmetry lies between the two sides of the search: the “query vector” that lets the AI understand the question is kept very precise, while each “document vector,” the target of the search, is stored as very simple “binary symbols” (combinations of 0s and 1s). Storing one bit instead of a 32-bit number shrinks document storage by a factor of 32 while keeping search accuracy nearly the same. Source 11
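Here is a toy sketch of that idea. The sources do not spell out how documents are turned into bits, so keeping only the sign of each dimension is an assumption made for this example, as are the function names and the tiny 8-dimensional vectors.

```python
def binarize(vector):
    """Assumed scheme: keep only the sign of each dimension (1 if positive, else 0)."""
    return [1 if x > 0 else 0 for x in vector]


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 dimensions per byte."""
    out = bytearray((len(bits) + 7) // 8)
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def score(query, packed_doc):
    """Score a full-precision query against a 1-bit document.

    Each stored bit is read back as +1 or -1 and multiplied by the query value.
    """
    total = 0.0
    for i, q in enumerate(query):
        bit = (packed_doc[i // 8] >> (i % 8)) & 1
        total += q if bit else -q
    return total


# Made-up document vectors; only their packed 1-bit form is kept in the index.
docs = {
    "doc_a": [0.9, -0.2, 0.4, -0.7, 0.1, 0.8, -0.5, 0.3],
    "doc_b": [-0.6, 0.5, -0.3, 0.9, -0.4, -0.1, 0.7, -0.8],
}
index = {name: pack_bits(binarize(vec)) for name, vec in docs.items()}

# The query stays in full precision.
query = [0.8, -0.1, 0.5, -0.6, 0.2, 0.7, -0.4, 0.2]
best = max(index, key=lambda name: score(query, index[name]))

float_bytes = 4 * len(query)          # 32-bit floats: 4 bytes per dimension
binary_bytes = len(index["doc_a"])    # 1 bit per dimension, packed into bytes
print(best, float_bytes, binary_bytes)
```

Because the query keeps its exact values, strong query dimensions still count for more than weak ones even though each document contributes only a sign. That is one intuition for why accuracy can stay close to the full-precision search.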

Where Do We Stand Now?

Currently, asymmetric quantization is being used as a practical tool to maximize the efficiency of AI models. In Large Language Models (LLMs), this technology is primarily applied to the model’s “activations” (the intermediate values produced while the model processes its input). This is because applying it to activations yields a more distinct improvement than applying it to weights (the model’s stored knowledge). Source 5

In fact, models using asymmetric quantization can reduce storage by up to 97% compared to the original, while keeping accuracy at a level where the difference is essentially imperceptible to humans. Source 12, Source 13

What Will the Future Look Like?

AI will continue to become lighter and faster. We are approaching an era where our smartphones, laptops, and even home appliances will feature much smarter AI than they do now. Technologies like asymmetric quantization will accelerate the “normalization of AI,” moving it out of massive servers in the cloud and into the small devices in our hands. As AI models become lighter, the technology will become more familiar and useful.

MindTickleBytes’ AI Reporter View

No matter how smart a technology is, it is useless if it is too heavy to use. Asymmetric quantization is a clever way of catching two birds with one stone: AI “intelligence” and “efficiency.” From now on, the core benchmark in AI competition will shift from simply “how big the model is” to “how efficiently it compresses and utilizes information.”

References

Statistically-Lossless Quantization of Large Language Models
A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Asymmetric Deep Semantic Quantization for Image Retrieval
[1903.12493] Asymmetric Deep Semantic Quantization for Image Retrieval
A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Reducing Storage of Pretrained Neural Networks by Rate- …
Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct
Towards 10 Million Context Length LLM Inference with KV …
Asymmetric Quantization: Near-Lossless Late Interaction Retrieval…
[Asymmetric Quantization: Near-Lossless Retrieval… Hacker News](https://news.ycombinator.com/item?id=48724127)
Asymmetric Quantization Techniques
[LLM Quantization Guide: Run 70B Models… Space Services Research](https://spaceservices.org/learn/llm-quantization-compression)
A Visual Guide to Quantization - by Maarten Grootendorst

Test Your Understanding

Q1. What is an advantage of asymmetric quantization over traditional symmetric quantization?

It blindly deletes data
It uses offsets to reduce information loss
It increases storage requirements

Asymmetric quantization uses offsets to preserve information more precisely, thereby reducing loss.

Q2. What is a benefit of applying this technology to document retrieval systems?

Search speed becomes 100 times slower
Storage space can be reduced by up to 32 times
Accuracy drops to zero

Document vectors are compressed into binary symbols, reducing storage space by up to 32 times.

Q3. Where is asymmetric quantization primarily applied in LLMs?

Mainly to activations
In the hardware devices themselves
In network cables

It is primarily applied to activations rather than weights, because it yields a greater improvement there.