Table of Contents
1. Limitations of standard product quantization
2. Anisotropic loss function (mathematical deep dive)
3. Python implementation
4. Results
5. Next
6. References
Recently I was asked to implement this paper during an interview. Although I had read the paper a few years ago, I couldn’t complete an implementation in time. So I decided to re-read it and go beyond a high-level understanding: examine the limitations of previous methods, understand the mathematical derivations of the theorems, implement it in Python, and discuss the limitations and caveats.
This is a deep dive, so if you want a high-level understanding before diving in, refer to [Paper review] Accelerating Large-Scale Inference with Anisotropic Vector Quantization, a really good summary by data scientist Haneul Kim.
The code will be posted on my GitHub.
We will be using GloVe 1.2M with an embedding dimension of 50 [4].
The authors state that the GloVe dataset was originally designed to be used with the cosine similarity metric, so we unit-normalize the item embeddings to make the inner product equivalent to the cosine of the angle between vectors.
Let’s initialize our variables. I will explain later why η (eta) = 2.04 instead of the 4.125 used in the paper.
Split vector space into subspaces.
Now, for each subspace, we will learn a codebook (a set of cluster centroids); the different ways of creating this codebook are what differentiate the various quantization methods.
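Roughly, the setup looks like this. This is a minimal, self-contained sketch: I use a small random matrix as a stand-in for the loaded GloVe embeddings, and the variable names (X, X_sub, n_subspaces, etc.) are my own choices for this post.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the loaded GloVe matrix so the snippet runs on its own
# (N = 10,000 matches the shapes in the figures below; the real dataset has ~1.2M rows).
N, D = 10_000, 50
X = rng.normal(size=(N, D)).astype(np.float32)

# Unit-normalize so that inner product == cosine similarity.
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Hyperparameters used throughout this post.
n_subspaces = 10                  # number of contiguous subspaces
subvector_dim = D // n_subspaces  # 5 dimensions per subvector
n_clusters = 8                    # codebook size per subspace
eta = 2.04                        # from Theorem 3.4 with T = 0.2 (explained later)

# Split every vector into contiguous subvectors: (N, n_subspaces, subvector_dim).
X_sub = X.reshape(N, n_subspaces, subvector_dim)
```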
1. Limitations of standard product quantization
In order to understand why the authors of the paper proposed a new anisotropic loss function, we must first understand the limitations of standard product quantization (SPQ).
Note: “Anisotropic” basically means varying in magnitude according to the direction of measurement. Therefore, we can think of SPQ, which uses an L2 loss to learn the codebook, as using an isotropic loss function.
Quick recap: SPQ splits a high-dimensional vector into contiguous smaller subvectors and then quantizes (clusters using K-means) each subvector independently.
SPQ works optimally under two assumptions.
1. Independence: The subspaces are mutually independent
2. Balance: The total variance within each subspace is balanced
Let’s break down why:
- The Problem of Correlated Dimensions: If dimensions that are highly correlated with each other end up in different subspaces, the subspaces are no longer independent. This makes the quantization process inefficient: it fails to exploit the redundancy, forcing the quantizer to waste resources encoding the same information multiple times and increasing the overall error (distortion). For example, with 16-dimensional subspaces, if dimension_1 = c · dimension_19, then d_1 and d_19 land in different subspaces, so the quantizer does not know they are redundant and wastes resources encoding both.
- The Problem of Imbalanced Variance: While clustering naturally emphasizes high-variance dimensions, our goal is to learn optimal quantized vectors, so quantization error should be balanced across all subspaces. If one sub-quantizer is assigned a subspace with extremely high variance while another gets a low-variance subspace, the first has a much harder task and will produce a large error, while the second has an easy one.
Therefore, a preprocessing step should be added to satisfy the two assumptions. Here are some methods:
- Linear transformation: apply an orthogonal (rotation) matrix to decorrelate the dimensions before splitting [3].
- Dimensionality reduction: but this must be done with caution, since methods like PCA result in different variances across principal components.
- Naive selection: basically, hand-picking which dimensions should go into the same subspace.
There are multiple methods to make sure the vectors satisfy the two assumptions; however, since SPQ is not the focus of this blog, let’s just stick to vanilla SPQ.
I’m pretty sure everyone is aware of how K-means works, but briefly: it initializes k centroids, assigns each data point to its nearest centroid, recomputes each centroid from its assigned points, and repeats. Since it measures Euclidean distance, it learns by minimizing the squared Euclidean (L2) loss function.
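For reference, a vanilla SPQ codebook can be learned with plain K-means per subspace. Here is a minimal sketch using scikit-learn; this is a stand-in for comparison, not the implementation used later in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_spq_codebooks(X_sub, n_clusters=8, seed=42):
    """X_sub: (N, n_subspaces, subvector_dim). Returns one L2-optimal codebook per subspace."""
    _, n_subspaces, subvector_dim = X_sub.shape
    codebooks = np.empty((n_subspaces, n_clusters, subvector_dim), dtype=X_sub.dtype)
    for m in range(n_subspaces):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        km.fit(X_sub[:, m, :])                 # cluster each subspace independently
        codebooks[m] = km.cluster_centers_     # centroids minimizing squared L2 error
    return codebooks
```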
2. Anisotropic loss function (mathematical deep dive)
Another limitation of SPQ is that its objective minimizes reconstruction error, formulated as ∑||xi − x̃i||² (the squared Euclidean distance between each vector and its quantized version).
As you can see, it treats all points equally, which is not what we want in a retrieval task. For example, in recommender systems or relevance search, only the top-k items are shown to the user due to space constraints. This observation suggests that we should focus on items with high potential (i.e., those with high inner product). The only scenario where minimizing reconstruction error is theoretically equivalent to optimizing for MIPS is under the strict assumption that query vectors are isotropic; however, this is hardly ever true. Many embeddings originate from a highly structured, non-uniform distribution where semantically similar words are clustered together.
To mitigate this, the authors propose a score-aware quantization loss that is robustly aligned with the MIPS objective.
It’s important to note that while the general form of this loss is flexible, it requires the assumption that queries are uniformly distributed on the unit sphere to make it computationally tractable.
- The loss function is an expectation over all queries q in the query distribution Q, i.e., an integral over all possible query vectors. The challenge is to solve this integral to get a formula that doesn’t require us to loop through all possible queries during training.
- w(〈q, xi〉) is a weight function that takes the inner product as its input.
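For reference, written out the way I read it in the paper, the score-aware loss is the expected, weighted squared error in the inner product:

```latex
\ell(x_i, \tilde{x}_i, w)
  = \mathbb{E}_{q \sim Q}\!\left[\,
      w\big(\langle q, x_i \rangle\big)\,
      \langle q,\; x_i - \tilde{x}_i \rangle^{2}
    \right]
```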
To make computation tractable, the authors apply some beautiful mathematical tricks to simplify the loss function to η·||r‖||² + ||r⊥||². This simplification, backed by mathematical proof, is where science turns into art.
Here are the steps:
step 1: transform the score-aware loss function into a closed form with two components, parallel and orthogonal (the anisotropic loss function).
The paper proves Lemma 7.1, which gives the expected error conditioned on the inner product 〈q, xi〉 taking a specific value t. This is where the split happens. The proof of this lemma works by decomposing the query vector q itself into components that are parallel (q‖) and orthogonal (q⊥) to the vector xi. The weights h‖ and h⊥ are scalars derived from complex integrals that depend on the weighting function w and the norm of the data point xi.
Because the authors assume queries are uniformly distributed on the unit sphere, these components have predictable properties that allow the expected value to be calculated, resulting in the expression above: a weighted sum of the squared norms of the parallel and orthogonal residual errors.
We subtract the quantized vector x̃i from xi to get the residual vector. In the graph below, I’ve denoted x̃1 as c1 (I find it more intuitive to think in terms of cluster centroids).
By hypothetically rotating the triangle, we can see that r‖ is indeed the projection of the residual onto the original vector x. The orthogonal residual error, r⊥, is defined as what is “left over” from the total error once the parallel component has been accounted for:

r⊥(xi, x̃i) = (xi − x̃i) − r‖(xi, x̃i)
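As a quick sanity check, here is a tiny sketch of this decomposition for a single made-up vector and centroid (the numbers are arbitrary, just for illustration):

```python
import numpy as np

x = np.array([0.6, 0.8])    # original (unit-norm) vector x_i
c = np.array([0.55, 0.75])  # its quantized version x̃_i (a centroid)

r = x - c                              # total residual
r_parallel = (r @ x) / (x @ x) * x     # projection of the residual onto x_i
r_orthogonal = r - r_parallel          # what is "left over"

# The two components are orthogonal and sum back to the residual.
assert np.isclose(r_parallel @ r_orthogonal, 0.0)
assert np.allclose(r_parallel + r_orthogonal, r)
```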
step 2: Simplify
We know h‖ and h⊥ are scalar weights derived from complex integrals. The paper re-parameterizes the loss from a sum with two distinct weights (h‖ and h⊥) into a simpler form where the relative importance of the two components is captured by a single ratio. This allows us to eliminate the complex integrals.
From our previous loss function, we factor out h⊥. By doing that we get h‖/h⊥. Since both are constants, we define η = h‖/h⊥.
The goal is to find the best quantizer x̃i that minimizes the loss function f(x). Minimizing f(x) and c·f(x), where c is a positive constant, converges to the same point. This is a crucial observation and should be remembered by all researchers (especially me). Because of this proportionality, we can set h⊥ = 1.
This is the final form of the loss function that will be minimized:
Loss ∝ η * ||r‖(xi, x̃i)||² + ||r⊥(xi, x̃i)||²
3. Python implementation
For each subspace, we randomly choose centroids from our data points. You can also use k-means or another clustering algorithm to create better initialization points.
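A minimal sketch of that initialization (the function and variable names here are my own, reusing X_sub from the setup above):

```python
import numpy as np

def init_codebooks(X_sub, n_clusters=8, seed=42):
    """Pick n_clusters random datapoints per subspace as the initial centroids."""
    rng = np.random.default_rng(seed)
    N, n_subspaces, subvector_dim = X_sub.shape
    codebooks = np.empty((n_subspaces, n_clusters, subvector_dim), dtype=X_sub.dtype)
    for m in range(n_subspaces):
        idx = rng.choice(N, size=n_clusters, replace=False)  # random points as seeds
        codebooks[m] = X_sub[idx, m, :]
    return codebooks
```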
(Figure 3) In the left-most matrix, the 0th axis represents the number of items (N) and the 2nd axis represents the dimension of the subvector (d).
Figure 3. is simply a visualization of the matrix operations to make them clearer.
We subtract each of the 8 cluster centroids (the codebook that quantizes the vectors) from each subvector, resulting in a residual matrix of shape (10000, 8, 5).
- residual_matrix[0, 0, :] is the residual vector from the 0th subvector to the 0th cluster centroid.
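In NumPy this is a single broadcasted subtraction. A self-contained sketch for one subspace, with random stand-in data so it runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)
subvectors = rng.normal(size=(10_000, 5)).astype(np.float32)      # one subspace's subvectors
codebook = subvectors[rng.choice(10_000, size=8, replace=False)]  # its 8 centroids

# (10000, 1, 5) - (1, 8, 5) broadcasts to the (10000, 8, 5) residual matrix:
# residual_matrix[i, k, :] is the residual of subvector i w.r.t. centroid k.
residual_matrix = subvectors[:, None, :] - codebook[None, :, :]
print(residual_matrix.shape)   # (10000, 8, 5)
```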
The loss function code is straightforward as we already went over mathematical details.
x_norm_sq[x_norm_sq == 0] = 1.0 is used to avoid division by zero for zero-norm vectors. Our numerator, inner_prod, will be 0 if x is a zero vector, so when we divide by x_norm_sq, a zero vector should result in 0. To make that happen, we set x_norm_sq to 1 wherever it is 0.
Take the inner product, then compute the projection.
- projection_matrix[0, 0, :] represents the residual to the 0th centroid projected onto the original subvector
- projection_matrix[0, 1, :] represents the residual to the 1st centroid projected onto the original subvector
- and so on…
Now we compute the squared norm of the projection vector, resulting in a (10000, 8, 1) matrix; r_orthogonal_sq_norm has the same shape.
We return η * ||r‖(xi, x̃i)||² + ||r⊥(xi, x̃i)||².
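Putting the steps above together, here is a sketch of the loss function for a single subspace. It follows the per-subspace shapes from Figure 3 and the variable names used above (x_norm_sq, inner_prod, r_orthogonal_sq_norm); the code in the repository may differ slightly.

```python
import numpy as np

def anisotropic_loss(subvectors, codebook, eta=2.04):
    """
    subvectors: (N, d) subvectors of one subspace.
    codebook:   (K, d) centroids of that subspace.
    Returns an (N, K) matrix of score-aware losses.
    """
    # (N, 1, d) - (1, K, d) -> (N, K, d) residuals.
    residual = subvectors[:, None, :] - codebook[None, :, :]

    # Squared norm of each subvector, guarded against zero-norm vectors.
    x_norm_sq = np.sum(subvectors**2, axis=1, keepdims=True)    # (N, 1)
    x_norm_sq[x_norm_sq == 0] = 1.0

    # <r, x> for every (subvector, centroid) pair.
    inner_prod = np.einsum("nkd,nd->nk", residual, subvectors)  # (N, K)

    # Parallel residual: projection of the residual onto the original subvector.
    r_parallel = (inner_prod / x_norm_sq)[:, :, None] * subvectors[:, None, :]  # (N, K, d)
    r_orthogonal = residual - r_parallel

    r_parallel_sq_norm = np.sum(r_parallel**2, axis=-1)         # (N, K)
    r_orthogonal_sq_norm = np.sum(r_orthogonal**2, axis=-1)     # (N, K)

    return eta * r_parallel_sq_norm + r_orthogonal_sq_norm
```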
η (eta) = 2.04 is the result of computing Theorem 3.4. Since we unit-normalized the vectors, their norms are 1. Setting T = 0.2, we compute η as:
η = (D − 1) · (T/||x||)² / (1 − (T/||x||)²) = 49 × 0.04 / 0.96 ≈ 2.04
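As a quick check on the arithmetic:

```python
D = 50        # embedding dimension
T = 0.2       # threshold used in this post for Theorem 3.4
x_norm = 1.0  # vectors were unit-normalized

t = T / x_norm
eta = (D - 1) * t**2 / (1 - t**2)
print(round(eta, 2))   # -> 2.04
```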
Now we use the losses to assign each subvector to its nearest centroid (the subvector has now been quantized). This time, instead of using the Euclidean distance from point to centroid, we use the score-aware loss.
Just like k-means, we adjust the centroids (codebook) using the newly assigned points.
Putting it all together
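A condensed sketch of the training loop, reusing init_codebooks and anisotropic_loss from the sketches above (the full version with loss tracking will be in the posted code):

```python
import numpy as np

def train_codebooks(X_sub, n_clusters=8, eta=2.04, n_iters=20, seed=42):
    """Alternate score-aware assignment and centroid updates, per subspace."""
    N, n_subspaces, d = X_sub.shape
    codebooks = init_codebooks(X_sub, n_clusters, seed)       # from the earlier sketch
    for m in range(n_subspaces):
        sub = X_sub[:, m, :]
        for _ in range(n_iters):
            losses = anisotropic_loss(sub, codebooks[m], eta)  # (N, K) score-aware losses
            assignments = np.argmin(losses, axis=1)            # assign to lowest-loss centroid
            for k in range(n_clusters):
                members = sub[assignments == k]
                if len(members) > 0:                           # keep empty clusters unchanged
                    codebooks[m, k] = members.mean(axis=0)     # k-means-style mean update
    return codebooks
```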
Using the above code, we can see our loss converges for all subspaces.
Now we can encode each subvector and merge the codes to make the final quantized vector. These quantized vectors will be used for ANN search.
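Encoding is then just an argmin of the score-aware loss per subspace; a sketch, again reusing anisotropic_loss from above:

```python
import numpy as np

def encode(X_sub, codebooks, eta=2.04):
    """Return (N, n_subspaces) uint8 codes: the chosen centroid index per subspace."""
    N, n_subspaces, _ = X_sub.shape
    codes = np.empty((N, n_subspaces), dtype=np.uint8)
    for m in range(n_subspaces):
        losses = anisotropic_loss(X_sub[:, m, :], codebooks[m], eta)
        codes[:, m] = np.argmin(losses, axis=1)
    return codes

def decode(codes, codebooks):
    """Rebuild quantized vectors from codes by concatenating the chosen centroids."""
    n_subspaces = codes.shape[1]
    parts = [codebooks[m][codes[:, m]] for m in range(n_subspaces)]
    return np.concatenate(parts, axis=1)   # (N, n_subspaces * d)
```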
Note that we not only speed up the search but also save memory by roughly a factor of 10. Since we set n_clusters to 8, in theory we could use 3 bits (if using unsigned integers) to cover the range [0, 8). However, because modern CPUs operate on bytes (8-bit units) as the smallest addressable chunk of memory, we stick to 8-bit integers.
4. Results
The ground truth is the nearest vector retrieved using exact (brute-force) search.
We can see that the codebook trained with the anisotropic loss function outperforms standard product quantization on both Recall@K and Recall 1@K.
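Roughly, the evaluation looks like this. This is a sketch of Recall 1@K, where queries is a held-out query set, X the original unit-normalized embeddings, and X_quantized the decoded quantized vectors; the names are mine.

```python
import numpy as np

def recall_1_at_k(X, X_quantized, queries, k=10):
    """Fraction of queries whose true top-1 (exact MIPS) appears in the quantized top-k."""
    exact_top1 = np.argmax(queries @ X.T, axis=1)              # brute-force ground truth
    approx_scores = queries @ X_quantized.T                    # scores against quantized vectors
    approx_topk = np.argsort(-approx_scores, axis=1)[:, :k]    # approximate top-k per query
    hits = [exact_top1[i] in approx_topk[i] for i in range(len(queries))]
    return float(np.mean(hits))
```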
5. Next
- Correlated Dimensions: While this paper reframes the quantization objective to be score-aware, it does not address the problem of correlated dimensions. One improvement would be to combine this work with the preprocessing techniques from Optimized Product Quantization [3]: first learn an optimal rotation to decorrelate subspaces and balance their variance, then apply the anisotropic loss function for training.
- Adaptive η: The loss function introduces a new hyperparameter η, and training is very sensitive to its value. Exploring an adaptive η, learned per subspace, could improve stability and performance.
- Data-Driven Query Distribution Models: replace the uniform query assumption entirely. By modeling the empirical query distribution with techniques like Gaussian Mixture Models (GMMs) or normalizing flows, we could derive a loss function tailored to real, non-uniform query patterns.
Finally, I think quantization is a very useful tool to have in your arsenal. Embeddings are widely used across domains: they are trained and applied in information retrieval, recommendation, computer vision, and especially large language models. As datasets continue to grow, techniques like approximate nearest neighbor search and quantization, which save memory and accelerate computation without significant loss of information, will become essential.