A Gentle Introduction of NTT  Part III: The Kyber Trick
15 Dec 2023In the previous post, we have seen that NTT is a special way of evaluating a polynomial (i.e. converting a polynomial from coefficient form to evaluation form) such that the evaluation points are all the $d$th roots of unity. We then examined a few efficient ways (recursive and iterative) to implement NTT and iNTT routines. Lastly we learned that by carefully picking the right twiddle factors (generators) as roots of the polynomial modulus, we can perform modular polynomial multiplication in NTT form in the same degree as the multiplicand (without the need to extend the degree and perform additional reduction).
Putting everything we’ve learned so far, we can put together an efficient algorithm for multiply polynomials in a ring in $O(n \log{n})$ time. The recipe is as simple as:

Perform NTT on polynomial $P, Q$ with the roots of the polynomial modulus $\phi$ as twiddle factors and obtain $\hat{P}, \hat{Q}$. Note that this requires the modulus to have a special structure such that there exists a generator that generates all of its roots.

Perform coordinatewise multiplication on $\hat{P}, \hat{Q}$ and obtain $\hat{R}$.

Perform Inverse NTT on $\hat{R}$ and recover the final multiplication result $R = P \times Q$.
NWC, revisited
From the previous post, we also saw that there are two commonly used cyclotomic polynomial moduli in NTT  $x^d  1$ that corresponds to circular convolution (CC), and $x^d + 1$ that corresponds to negativewrapped convolution (NWC).
In this post, we will continue our discussion on the NegativeWrapped Convolution (NWC).
One of the key speedup for polynomial multiplication brought by NTT is that, if the evaluation points happen to be the roots of the polynomial modulus (in our case, $x^d \pm 1$), then the multiplication result is automatically reduced to the target degree, and hence we don’t have to double the length/degree of the polynomial we are multiplying to preserve the higher degree coefficients for the reduction.
In the case of $x^d  1$, we found out that the distinct powers of a $d$th primitive root of unity $\omega_d^i$ (or alternatively, the set of field elements generated by the generator $\omega_d$) also happens to be all the roots for the polynomial modulus:
\[[\omega_d^0, \omega_d^1, \omega_d^2, \dots, \omega_d^\frac{d}{2}, \omega_d^{\frac{d}{2} + 1}, \dots, \omega_d^{d  1}]\]How we obtained this result is through recursively splitting the $x^d  1$ term, and applying the fact that $\omega_d^\frac{d}{2} \equiv 1$ to convert the addition term after the split back into a subtraction term so we can split further. For example, $x^\frac{d}{2} + 1 \rightarrow x^\frac{d}{2}  \omega_d^\frac{d}{2}$. We can split totally $\log{d}$ times and end up with $d$ unsplittable degree1 terms in which we can easily extract the roots of the original polynomial from. As seen from the last post, we can represent the splitting operation as a binary tree shown below (ignoring the $\mathbb{Z}_q[x]/$ part of the expression and $n = d$):
Now moving to the case of NWC. We perform the same kind of binary split and result in the following tree ($\omega_d = \phi_{2d}^2$):
One problem with the same approach that we pointed out last time is that, starting from the second layer from the top of the tree, the degree of the $\omega$ term on the right is always one half of the degree of the $x$ term on the left. This means that we will run out of $\omega$ terms to split with before we have fully splitted the $x$ terms down to degree1. As seen on the second layer from the bottom of the tree, $x^2  \omega_d$ cannot be broken down further, until the point we introduced a finer generator $\phi_{2d}$ which can help us split $\omega_d$ one more time. With $\phi_{2d}$ as the new $2d$th root of unity, we found the roots of the original polynomial, which are distinct odd powers of the new generator.
Performance Penalty
The roots for the NWC case indeed work for our polynomial multiplication task. However, we are putting more constraints on the properties of the underlying field $\mathbb{Z}_q$: to make sure everything is welldefined, we need to make sure that the $2d$th primitive root of unity actually exists in the field. So far, we’ve kind of taken it for granted assuming that $\omega_d$ exists in $\mathbb{Z}_q$. However, it doesn’t necessarily mean that we can also find the $2d$th one.
For example, the field $q$ used in Kyber is 3329, in which number 17 is a qualifying 256th primitive root of unity (one can easily check that $17^{256} \equiv 1 \text{ mod } 3329, 17^{128} \equiv 1 \text{ mod } 3329$). Sadly, the 512th one doesn’t exist. If we happen to work in the NWC polynomial modulus, we can only perform degree128 polynomial multiplications! If we really want to do degree256 polynomial multiplications, we will have to make the underlying field much larger. The smallest field that gives us a 512th primitive root of unity is 12289 (with number 3 as an example generator such that $3^{512} \equiv 1 \text{ mod } 12289, 3^{256} \equiv 1 \text{ mod } 12289$), which is the field used in another key encapsulation system called NewHope.
The bump from 3329 to 12289 to accommodate degree256 polynomial multiplication can cause potential performance impacts. Sice $3329 \approx 2^{11.7}$ and $12289 \approx 2^{13.6}$, we are looking at a 2bit increase of the representation of the underlying field size, and a 4x timing/memory degrade if we have to precompute things in the field and store them offline.
This naturally brings us to the next big question: Can we do better?
Just split something else
The short answer to the question is: Yes, but not without paying some price.
To avoid paying the extra cost in the field, we will just have to figure out a way to work without the assumption of the existence of $\phi_{2d}$. Now we are back in the binary split tree, and stuck at the layer where we are left with a term that looks like $x^2  \omega_d$.
To get us unblocked, here’s some life advice: If you cannot have what you want, you should want what you can have.
Halving the dimensions
If the best thing we can have is the existence of a primitive $d$th root of unity $\omega_d$, then using the same NTT with NWC approach, we will see what we can do with it.
Here is a simple idea: let’s constrain everything we work with to only be $\frac{d}{2}$degree polynomials, and define the NTT operation on them. Because we only have $\frac{d}{2}$ coefficients, we only need half the evaluations. Since we know $\omega_d$ already, we can easily set $\omega_\frac{d}{2} = \omega_d^2$ to be the new $\frac{d}{2}$th primitive root of unity, and conversely $\phi_d = \omega_d$. We call this operation the halfNTT, as it operates on input vectors half the size of our intended targets.
Now, applying everything we’ve known previously, we can now do degree$\frac{d}{2}$ polynomial multiplication coordinatewise without reductions with $x^\frac{d}{2} + 1$ as the polynomial modulus. Effectively, we halve the size of $d$ to make sure the $2d$th primitive root of unity exists in the field.
Split the target polynomial
How can we make use of the newly defined halfNTT procedures? Here is another intuition: can we break down polynomial multiplications in degree$d$ into several polynomial multiplications in degree$\frac{d}{2}$?
Take a degree4 polynomial $P$ as an example:
\[\begin{align*} P(x) &= p_3 x^3 + p_2 x^2 + p_1 x + p_0 \\ &= (p_2 x^2 + p_0) + x (p_3 x^2 + p_1) \\ &= P^\text{even}(x^2) + x P^\text{odd}(x^2) \end{align*}\]We see that we can reorder the terms and factor the common parts and essentially interpret a polynomial as a combination of a polynomial with only even coefficients and a polynomial with only odd coefficients. This is very similar to the FFTlike recurrence relation trick that we did in the previous post to speed up NTT.
The important part is that, since each of the smaller polynomials $P^\text{even}, P^\text{odd}$ has only half of the coefficients of $P$, we know that it has degree $\frac{d}{2}$. Therefore, our halfNTT procedure would work on them!
Let’s now set up the halfNTT procedure programmatically. We first import the efficient iterative NTT/iNTT implementations from the previous post. We also directly work in the Kyber field $q = 3329$ to demonstrate how it works IRL.
# We use the Kyber modulus
MODULUS = 3329
GEN = 17
GEN_INV = pow(GEN, 1, MODULUS)
# 17 is a primitive 256th root of unity of modulus 3329.
# Therefore, we will be working with polynomials with length 256 (degree 255).
assert pow(GEN, 256, MODULUS) == 1
assert pow(GEN, 128, MODULUS) != 1
# We import the NTT routines from last post.
# BEGIN: REFERENCE
import math
def brv(x, n):
""" Reverses a nbit number """
return int(''.join(reversed(bin(x)[2:].zfill(n))), 2)
def ntt_iter(a, gen=GEN, modulus=MODULUS):
deg_d = len(a)
# Start with stride = 1.
stride = 1
# Shuffle the input array in bitreversal order.
nbits = int(math.log2(deg_d))
res = [a[brv(i, nbits)] for i in range(deg_d)]
# Precompute the generators used in different stages of the recursion.
gens = [pow(gen, pow(2, i), modulus) for i in range(nbits)]
# The first layer uses the lowest (2nd) root of unity, hence the last one.
gen_ptr = len(gens)  1
# Iterate until the last layer.
while stride < deg_d:
# For each stride, iterate over all N//(stride*2) slices.
for start in range(0, deg_d, stride * 2):
# For each pair of the CT butterfly operation.
for i in range(start, start + stride):
# Compute the omega multiplier. Here j = i  start.
zp = pow(gens[gen_ptr], i  start, modulus)
# CooleyTukey butterfly.
a = res[i]
b = res[i+stride]
res[i] = (a + zp * b) % modulus
res[i+stride] = (a  zp * b) % modulus
# Grow the stride.
stride <<= 1
# Move to the next root of unity.
gen_ptr = 1
return res
def intt_iter(a, gen=GEN, modulus=MODULUS):
deg_d = len(a)
# Start with stride = N/2.
stride = deg_d // 2
# Shuffle the input array in bitreversal order.
nbits = int(math.log2(deg_d))
res = a[:]
# Precompute the inverse generators used in different stages of the recursion.
gen = pow(gen, 1, modulus)
gens = [pow(gen, pow(2, i), modulus) for i in range(nbits)]
# The first layer uses the highest (dth) root of unity, hence the first one.
gen_ptr = 0
# Iterate until the last layer.
while stride > 0:
# For each stride, iterate over all N//(stride*2) slices.
for start in range(0, deg_d, stride * 2):
# For each pair of the CT butterfly operation.
for i in range(start, start + stride):
# Compute the omega multiplier. Here j = i  start.
zp = pow(gens[gen_ptr], i  start, modulus)
# GentlemanSande butterfly.
a = res[i]
b = res[i+stride]
res[i] = (a + b) % modulus
res[i+stride] = ((a  b) * zp) % modulus
# Grow the stride.
stride >>= 1
# Move to the next root of unity.
gen_ptr += 1
# Scale it down before returning.
scaler = pow(deg_d, 1, modulus)
# Reverse shuffle and return.
return [(res[brv(i, nbits)] * scaler) % modulus for i in range(deg_d)]
def mul_poly_naive_q_nwc(a, b, q, d):
tmp = [0] * (d * 2  1) # intermediate polynomial has degree 2d2
# schoolbook multiplication
for i in range(len(a)):
# perform a_i * b
for j in range(len(b)):
tmp[i + j] = (tmp[i + j] + a[i] * b[j]) % q
# take polynomial modulo x^d + 1
for i in range(d, len(tmp)):
tmp[i  d] = (tmp[i  d]  tmp[i]) % q
tmp[i] = 0
return tmp[:d]
# END: REFERENCE
Next, we define the split procedure as well as the halfNTT trick.
GEN_POW = [pow(GEN, i, MODULUS) for i in range(256)]
GEN_INV_POW = [pow(GEN_INV, i, MODULUS) for i in range(256)]
def ntt_half_nwc(a, gen=GEN, modulus=MODULUS):
deg_d = len(a)
# Computes the new rootofunity.
gen2 = pow(gen, 2, modulus)
# Multiplies each coefficient with \phi_d (for the NWC ring)
aa = [a[i] * GEN_POW[i] for i in range(deg_d)]
# Computes halfNTT
a_ntt = ntt_iter(aa, gen2, modulus)
return a_ntt
def intt_half_nwc(a_ntt, gen=GEN, modulus=MODULUS):
deg_d = len(a_ntt)
# Computes the new rootofunity.
gen2 = pow(gen, 2, modulus)
# Undo halfNTT
a = intt_iter(a_ntt, gen2, modulus)
# Multiplies each coefficient with \phi_d (for the NWC ring)
aa = [(a[i] * GEN_INV_POW[i]) % modulus for i in range(deg_d)]
return aa
def split_poly(p):
deg = len(p)
even = [p[i] for i in range(0, deg, 2)]
odd = [p[i] for i in range(1, deg, 2)]
return even, odd
p = list(range(256))
p_even, p_odd = split_poly(p)
p_even_ntt, p_odd_ntt = ntt_half_nwc(p_even), ntt_half_nwc(p_odd)
p_even2, p_odd2 = intt_half_nwc(p_even_ntt), intt_half_nwc(p_odd_ntt)
assert(p_even == p_even2)
assert(p_odd == p_odd2)
Multiplying two decomposed polynomials
Now, let’s try to multiply two degree$d$ polynomials using this decomposition, and just see how far we can get:
\[\begin{align*} P(x) \cdot Q(x) &= (P^\text{even}(x^2) + x P^\text{odd}(x^2)) \cdot (Q^\text{even}(x^2) + x Q^\text{odd}(x^2)) \\ &= (P^\text{even} \cdot Q^\text{even})(x^2) + x(P^\text{even} \cdot Q^\text{odd} + P^\text{odd} \cdot Q^\text{even})(x^2) + x^2(P^\text{odd} \cdot Q^\text{odd})(x^2) \\ &= R(x) \text{ mod } (x^d + 1) \\ &= R^\text{even}(x^2) + x R^\text{odd}(x^2) \end{align*}\]We use $R(x)$ to represent the final result polynomial after the reduction, and also decompose it into even and odd subparts for comparison. By simply staring at the expression, we can come up with a solution for $R^\text{odd}$:
\[\begin{align*} R^\text{odd}(x) &= (P^\text{even} \cdot Q^\text{odd} + P^\text{odd} \cdot Q^\text{even})(x) \text{ mod } (x^\frac{d}{2} + 1) \\ \Rightarrow x R^\text{odd}(x^2) &= x(P^\text{even} \cdot Q^\text{odd} + P^\text{odd} \cdot Q^\text{even})(x^2) \text{ mod } (x^d + 1) \end{align*}\]We were able to deduce that because of the parity of the powers of $x$ here, as the other terms can only produce even powers of $x$. If the top expression holds, the bottom expression will also hold. It also appears that we can efficiently compute everything in this expression, as $P^\text{even/odd} \cdot Q^\text{odd/even}$ can be multiplied coordinatewise after halfNTT, and the addition is trivial.
We now define $R^\text{odd}(x)$ in the halfNTT domain, using $\tilde{R^\text{odd}}$ to represent the halfNTT of its vectorized coefficients \(\mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{r}^\text{odd})_j\).
\[\begin{align*} \tilde{R^\text{odd}}_j &= \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{odd}}_j + \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{even}}_j \end{align*}\]Below is the corresponding implementation.
def ntt_half_nwc_mul_odd(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt, modulus=MODULUS):
# Coordinatewise multiplication of the cross terms.
peqo = [(i * j) % modulus for i, j in zip(p_even_ntt, q_odd_ntt)]
poqe = [(i * j) % modulus for i, j in zip(p_odd_ntt, q_even_ntt)]
# Coordinatewise addition.
return [(i + j) % modulus for i, j in zip(peqo, poqe)]
p = list(range(256))
p_even, p_odd = split_poly(p)
p_even_ntt, p_odd_ntt = ntt_half_nwc(p_even), ntt_half_nwc(p_odd)
q = [1]*256
q_even, q_odd = split_poly(q)
q_even_ntt, q_odd_ntt = ntt_half_nwc(q_even), ntt_half_nwc(q_odd)
r_odd_ntt = ntt_half_nwc_mul_odd(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt)
r_odd = intt_half_nwc(r_odd_ntt)
# Verify result against textbook naive multiplication.
r_naive = mul_poly_naive_q_nwc(p, q, MODULUS, 256)
r_naive_even, r_naive_odd = split_poly(r_naive)
assert(r_odd == r_naive_odd)
Now, we subtract the equality from the original expression, and we can get a similar solution for $R^\text{even}$ as well:
\[\begin{align*} R^\text{even}(x) &= (P^\text{even} \cdot Q^\text{even})(x) + x(P^\text{odd} \cdot Q^\text{odd})(x) \text{ mod } (x^\frac{d}{2} + 1) \\ \Rightarrow R^\text{even}(x^2) &= (P^\text{even} \cdot Q^\text{even})(x^2) + x^2(P^\text{odd} \cdot Q^\text{odd})(x^2) \text{ mod } (x^d + 1) \end{align*}\]The first term $P^\text{even} \cdot Q^\text{even}$ can be done via halfNTT just like above. However, the second term is a little bit interesting: it is equivalent to say that the second term is the product of $P^\text{odd} \cdot Q^\text{odd}$ and a predetermined fixeddegree polynomial $K(x) = x$. Therefore, we can simply multiply them coordinatewise in halfNTT form.
Let’s take polynomial $F = P^\text{odd} \cdot Q^\text{odd} \text{ mod } (x^d + 1)$ and $\mathbf{f}$ as the vectorized coefficient representation of $F$. We inspect it in its halfNTT form:
\[\mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{f})_j = \sum_{i=0}^{\frac{d}{2}1} f_i \psi_{d}^{i \cdot (2j+1)}\]where $\psi_d$ is a $d$th primitive root of unity, and $\omega_\frac{d}{2} = \psi_d^2$.
Now, let’s try to write the halfNTT form of $K$.
\[\begin{align*} \mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{k})_j &= \sum_{i=0}^{\frac{d}{2}1} k_i \psi_{d}^{i \cdot (2j+1)}\\ &= \psi_{d}^{2j+1} \end{align*}\]We see that the halfNTT terms of $K$ are as simple as a series of powers of $\psi_d$! Now, let’s write down the coordinatewise multiplication of $F$ and $K$:
\[\begin{align*} \mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{k})_j \cdot \mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{f})_j &= \psi_d^{2j+1} \mathbf{NTT}^\text{NWC}_{\omega_\frac{d}{2}} (\mathbf{f})_j \end{align*}\]Therefore, we can write down the overall expression for $R^\text{even}$, again using the $\tilde{R^\text{even}}$ abbreviation for the halfNTT procedure.
\[\begin{align*} \tilde{R^\text{even}}_j &= \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{even}}_j + \psi_d^{2j+1} \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{odd}}_j \end{align*}\]We notice that, the multiplier $\psi_d^{2j+1}$ is different for distinct $j$ index of the halfNTT vector.
Let’s now implement this part as well.
GEN_POW_ODD = [GEN_POW[i] for i in range(1, 256, 2)]
def ntt_half_nwc_mul_even(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt, gen=GEN, modulus=MODULUS):
# Coordinatewise multiplication of the eveneven, oddodd terms.
peqe = [(i * j) % modulus for i, j in zip(p_even_ntt, q_even_ntt)]
poqo = [(i * j) % modulus for i, j in zip(p_odd_ntt, q_odd_ntt)]
# Coordinatewise addition and multiplication by psi_d^{2j+1}.
return [(i + j * k) % modulus for i, j, k in zip(peqe, poqo, GEN_POW_ODD)]
r_even_ntt = ntt_half_nwc_mul_even(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt)
r_even = intt_half_nwc(r_even_ntt)
assert(r_even == r_naive_even)
Putting everything together
Congratulations! We have just demonstrated computing the even part and the odd part of the product of two degree$d$ polynomials $P, Q$, using only halfNTT subroutines that can work with degree$\frac{d}{2}$ polynomials in the NWC field of $(x^\frac{d}{2} + 1)$.
Now, let’s formally define the steps we need to take in order to multiply the two degree$d$ polynomials $P, Q$:

Fix a $d$th primitive root of unity $\omega_d$ in field $\mathbb{Z}_q$. Instantiate the halfNTT procedure using $\omega_d^2$ as the twiddle factor (hence handling polynomials half the size).

Split $P, Q$ into the even and odd polynomials, apply halfNTT to convert into NTT domain and obtain $\tilde{P^\text{even}}, \tilde{P^\text{odd}}, \tilde{Q^\text{even}}, \tilde{Q^\text{odd}}$ with half the size of the original polynomial.

For each coordinate $j$, compute $\tilde{R^\text{even}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{even}}_j + \psi_d^{2j+1} \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{odd}}_j$.

For each coordinate $j$, compute $\tilde{R^\text{odd}}_j = \tilde{P^\text{even}}_j \cdot \tilde{Q^\text{odd}}_j + \tilde{P^\text{odd}}_j \cdot \tilde{Q^\text{even}}_j$.

Convert $\tilde{R^\text{even}}, \tilde{R^\text{odd}}$ back to the coefficients form $R^\text{even}, R^\text{odd}$ via halfiNTT.

Recover the final polynomial coefficients by interleaving coefficients of $R^\text{even}, R^\text{odd}$:
def ntt_nwc_split_mul(p, q, gen=GEN, modulus=MODULUS):
# Split P and Q.
p_even, p_odd = split_poly(p)
q_even, q_odd = split_poly(q)
# Perform halfNTTtransform.
p_even_ntt, p_odd_ntt = ntt_half_nwc(p_even), ntt_half_nwc(p_odd)
q_even_ntt, q_odd_ntt = ntt_half_nwc(q_even), ntt_half_nwc(q_odd)
# Compute R_even_ntt and R_odd_ntt.
r_even_ntt = ntt_half_nwc_mul_even(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt)
r_odd_ntt = ntt_half_nwc_mul_odd(p_even_ntt, p_odd_ntt, q_even_ntt, q_odd_ntt)
# Undo halfNTTtransform.
r_even = intt_half_nwc(r_even_ntt)
r_odd = intt_half_nwc(r_odd_ntt)
# Recombine the two halfs together.
return [r_even[i//2] if i % 2 == 0 else r_odd[i//2] for i in range(len(p))]
# Let's test this!
import random
random.seed(0xdeadbeef)
for _ in range(20):
p = [random.randint(0, MODULUS1) for i in range(256)]
q = [random.randint(0, MODULUS1) for i in range(256)]
r = ntt_nwc_split_mul(p, q)
assert r == mul_poly_naive_q_nwc(p, q, MODULUS, 256)
Performance
Let’s now do a performance analysis of our new polynomial multiplication algorithm.
Looking at the code, the first and last parts are splitting and joining the polynomials. This is utmost linear $O(d)$ cost, or even with no cost if we are purely reindexing the original polynomials without actually splitting.
The second and fourth sections are performing two halfNTT (or inverse) operations, which have a runtime of $O (d \log{d})$ given the recurrence relation we show in the last post. This is more or less similar to what we have before.
The third section is the part that’s slightly different. Compared to the original coordinatewise single multiplication over $d$ elements which is linear $O(d)$ cost, we are doing totally 5 field multiplications and two additions over $\frac{d}{2}$ elements which is $O(2.5d)$ cost. However, asymptotically, we can still say it’s linear $O(d)$ cost.
Overall, we see that the new approach performs pretty much the same asymptotically! However, practically, we are spending 2.5x more time in the coordinatewise multiplication due to the fact that we have to work with smaller degree polynomial multiplications. This is the additional price we are paying for choosing a more compact field.
Parallelizing halfNTTs
Practically, there’s one more thing we can potentially optimize, which is the two halfNTT calls on the even and odd polynomials. Since they are the same halfNTT operation sharing the same twiddle factors, we can combine two transformations into one to halve the iterations needed for the step.
Once we put the two halfNTTs sidebyside, we would realize that this looks identical to the full degree$d$ iterative NTT, but without doing the first layer such that the even and the odd coefficients are not being coupled. We can also fold all the scaling factors into the iterative process, such that we don’t need any prior/post modification on the input/output.
Eventually, with all the bells and whistles applied, we finally reach the (i)NTT implementation used in Kyber from its specification:
INV_2 = pow(2, 1, MODULUS)
def ntt_iter_kyber(x, gen=17, modulus=3329):
res = x[:]
stride = 128
zz = 0
while stride > 1:
for start in range(0, 256, stride * 2):
# Compute twiddle factor and NWC \psi_d^i scalar.
zz += 1
zp = pow(gen, brv(zz, 7), modulus)
for i in range(start, start + stride):
# CT butterfly.
a = res[i]
b = res[i+stride]
t = (zp * b) % modulus
res[i] = (a + t) % modulus
res[i+stride] = (a  t) % modulus
stride >>= 1
return res
def intt_iter_kyber(x, gen=17, modulus=3329):
res = x[:]
stride = 2
zz = 128
while stride < 256:
for start in range(0, 256, stride * 2):
# Compute twiddle factor and NWC \psi_d^i scalar.
zz = 1
zp = pow(gen, brv(zz, 7), modulus)
for i in range(start, start + stride):
a = res[i]
b = res[i+stride]
res[i] = (INV_2*(a + b)) % MODULUS
res[i+stride] = (INV_2 * zp * (b  a)) % MODULUS
stride <<= 1
return res
def ntt_mul_kyber(p, q, gen=17, modulus=3329):
res = [0]*256
for i in range(0, len(p), 2):
p_even, p_odd = p[i], p[i+1]
q_even, q_odd = q[i], q[i+1]
z = pow(gen, 2*brv(i//2, 7)+1, modulus)
res[i] = (p_even * q_even + z * p_odd * q_odd) % modulus
res[i+1] = (p_odd * q_even + p_even * q_odd) % modulus
return res
# Let's test this!
import random
random.seed(0xdeadbeef)
for _ in range(20):
p = [random.randint(0, MODULUS1) for i in range(256)]
q = [random.randint(0, MODULUS1) for i in range(256)]
r = intt_iter_kyber(ntt_mul_kyber(ntt_iter_kyber(p), ntt_iter_kyber(q)))
assert r == mul_poly_naive_q_nwc(p, q, MODULUS, 256)
Kyber’s implementation looks slightly different from our code earlier. This is due to the following reasons:

Kyber works in bitreversed order in its NTT form, and normal order in its coefficient form. Our implementation is the opposite due to the way the iterative NTT circuit is built. This will not affect the correctness or performance of the implementation, as long as it’s consistent throughout the software.

Like the optimizations we mentioned above, Kyber merges the multiplication with $\psi_d^{\pm i}$ with the iterative NTT procedure, so no pre/post processing is needed.

Kyber also performs all the operations inplace, so it never actually “splits” the polynomial into two.
Generalizing the Kyber trick
Before we wrap up, I will have to admit that I intentionally omitted a large concept when writing this post. We’ve managed to come this far only via working with polynomials in the coefficient form and the evaluation form.
However, if we want to generalize the trick used in Kyber and further reduce the field, like maybe to the point that we can only handle degree$\frac{d}{4}$ multiplications, then we will perhaps run into a lot of difficulties interpreting everything in the way we used to do.
The missing concept here is called the Chinese Remainder Theorem (CRT) form. I will maybe write another post that deep dives into the topic, but all we need to know right now is that there exists an isomorphism between the CRT form and the evaluation form, which means anything we’ve able to achieve with evaluating polynomials have another corresponding meaning in the CRT realm. This isomorphism allows us to generalize the Kyber trick further.
Summary of learnings
If you were able to read to this part, huge congrats! This is all you need to know about NTT in a smaller field used in Kyber.
Let’s do a quick recap of what we have achieved in this post:

First, we started off with the results from the last post: NTT defined in the CC modulus and the NWC modulus. To enable coordinatewise multiplication between polynomials, the NTT procedure needs to be slightly different depending on the binary split tree produced by the polynomial modulus.

Second, we looked into the problem that in order to multiply degree$d$ polynomials in the NWC modulus, there has to exist some $2d$th primitive root of unity $\psi_{2d}$. Without this, we cannot split the second last layer in the binary split tree, and cannot obtain the roots to evaluate on. However, having $\psi_{2d}$ is a stronger assumption than having $\omega_d$, and it will make the underlying field much larger than the CC case (3329 => 12289).

Third, we took a step back, and start thinking about whether we can still do multiplications in NWC modulus without the existence of $\psi_2d$. The solution we had is to define a halfNTT procedure that supports polynomial multiplication with degree$\frac{d}{2}$. With the halfNTT procedure, we decompose the degree$d$ polynomial into its even and odd subparts, each with dimension$\frac{d}{2}$. Multiplication is defined as several crossmultiplications and additions between the smaller subparts.

Lastly, we implemented the halfNTT and crossmultiplication trick, and reasoned that its runtime complexity is asymptotically the same but practically a bit more expensive than vanilla NWC NTT. We also saw how Kyber implements this trick in the end.
At this point, we have covered most of the interesting items about NTT. What we’ve learned will build a solid foundation for efficient implementation of polynomialbased cryptosystems, such as Kyber and Falcon.
If you are still curious about NTT, I would definitely recommend checking out this amazing survey paper which inspired this series.
Up next, we will take a stab at more efficient implementation of NTT. Here is a hint of what we will look at: Notice that in our current implementation, there exists many \% operators that performs modular reduction in the finite field $\mathbb{Z}_q$. However, the reductions themselves are not that easy to do, and also hard to be done in a secure way. We will be checking out a few strategies to make the reductions behave nicer.
See you next time!