Numbertheoretic transform (integer DFT)
Introduction
The NTT is a generalization of the classic DFT to finite fields. With a lot of work, it basically lets one perform fast convolutions on integer sequences without any roundoff errors, guaranteed. Convolutions are useful for multiplying large numbers or long polynomials, and the NTT is asymptotically faster than other methods like Karatsuba multiplication.
Review of the complex DFT
The classic discrete Fourier transform (DFT) operates on vectors of complex numbers:
Suppose the input vector has length \(n\). The output vector will also have length \(n\).
Let \(ω\) (omega) be a primitive \(n\)th root of unity. In other words, \(ω^n = 1\), but \(ω^k \ne 1\) for all integers \(1 \le k < n\). The standard choice for the DFT is \(ω = e^{2πi/n}\).
Each output element equals a particular weighted sum of all input elements, using some powers of \(ω\) as weights. Denoting the input vector as \(X\) and the output vector as \(Y\), we have (with 0based indexing): \(Y(k) = X(0) ω^{0k} + X(1) ω^{1k} + X(2) ω^{2k} + ... + X(n1) ω^{(n1)k}\).
The inverse transform, which restores the original vector, is given by: \(X(k) n = Y(0) ω^{0k} + Y(1) ω^{1k} + Y(2) ω^{2k} + \cdots + Y(n1) ω^{(n1)k}\).
Additional notes:
Linear algebra guarantees that if \(ω\) is indeed a primitive \(n\)th root of unity of the working field, then the inverse transform is correct.
One application of the DFT is to compute a circular convolution of two samelength vectors \(X\) and \(Y\). To do so, compute the forward DFT on \(X\) and \(Y\), then pointwise multiply the two vectors together, compute the inverse transform on this vector, and finally divide by a scaling factor of \(n\). Again, this procedure will work in any suitable field of numbers.
The DFT’s most popular application of analyzing/representing a signal as a sum of sine waves doesn’t work in the numbertheoretic transform.
Procedure for the NTT
Suppose the input vector is a sequence of \(n\) nonnegative integers.
Choose a minimum working modulus \(M\) such that \(1 \le n < M\) and every input value is in the range \([0, M)\).
Select some integer \(k \ge 1\) and define \(N = kn + 1\) as the working modulus. We require \(N \ge M\), and \(N\) to be a prime number. Dirichlet’s theorem guarantees that for any \(n\) and \(M\), there exists some choice of \(k\) to make \(N\) be prime.
Because \(N\) is prime, the multiplicative group of \(\mathbb{Z}_N\) has size \(φ(N) = N1 = kn\). Furthermore, the group must have at least one generator \(g\), which is also a primitive \((N1)\)th root of unity.
Define \(ω \equiv g^k \text{ mod } N\). We have \(ω^n = g^{kn} = g^{N1} = g^{φ(N)} \equiv 1 \text{ mod } N\) due to Euler’s theorem. Furthermore because \(g\) is a generator, we know that \(ω^i = g^{ik} \not\equiv 1\) for \(1 \le i < n\), because \(ik < nk = N1\). Hence \(ω\) is a primitive \(n\)th root of unity, as required by the DFT of length \(n\).
The rest of the procedure for the forward and inverse transforms is identical to the complex DFT. Moreover, the NTT can be modified to implement a fast Fourier transform algorithm such as CooleyTukey.
Additional notes:
When convolving two vectors of length \(n\) where each input value is at most \(m\), the upper bound on each output value is \(m^2 n\). Choosing a minimum working modulus of \(M = m^2 n + 1\) is sufficient to always avoid overflow in the worst case.

For the prime field \(\mathbb{Z}_N\), we can either find a generator of the field then derive a primitive \(n\)th root of unity (as mentioned in the procedure), or directly find a primitive \(n\)th root of unity.
Such a field has \(φ(φ(N)) = φ(N1) = φ(kn)\) generators but \(φ(n)\) primitive \(n\)th roots of unity. The first number is greater than or equal to the second number. So if we are sampling random candidates in the range \([0, N)\), we are more likely (or equally likely) to succeed in finding a generator than finding a primitive \(n\)th root.
To find a generator, first fully factorize \(N1\) and collect its set of unique prime factors. A candidate \(a \in (1, N)\) is a generator of \(\mathbb{Z}_N\) if and only if {for each \(p\) in that set of unique prime factors, \(a^{(N1)/p} \not\equiv 1 \text{ mod } N\)}.
To find a primitive root directly, first fully factorize \(n\) and collect its set of prime factors. A candidate \(a \in (1, N)\) is a primitive \(n\)th root of unity in \(\mathbb{Z}_N\) if and only if {\(a^n \equiv 1 \text{ mod } N\), and for each \(p\) in that set of unique prime factors, \(a^{n/p} \not\equiv 1 \text{ mod } N\)}.
Although the procedure only describes prime fields for simplicity, it might also be possible to operate on composite rings (e.g. \(\mathbb{Z}_{100}\)) if a primitive \(n\)th root of unity exists. This could be useful if a composite modulus is much smaller than a prime modulus of the form \(N = kn + 1\).
Furthermore, it should be possible to operate on primepower fields such as \(\text{GF}(2^8)\). However, this is only useful if the input vector is composed of elements from that field, instead of being plain integers.
Computing an NTT requires many modular multiplications. It is possible to apply Montgomery reductions (or the less efficient Barrett reductions) to speed up the modular arithmetic in an NTT.
Examples

Input vector \(X = (6, 0, 10, 7, 2)\), with length \(n = 5\). Compute a forward transform.
Define the minimum working modulus of \(M = 11\). We have \(1 \le n < M\), and every input value is in the range \([0, 11)\).
Select \(k = 2\) so that with \(N = kn + 1 = 2 × 5 + 1 = 11\), we have \(N \ge M\), and \(N\) is prime.
Try to find a generator of \(\mathbb{Z}_{11}\). Fully factorize \(N1 = 10 = 2 × 5\). Choose a candidate \(a = 6\). We have {\(a^{10/2} = a^5 \equiv 10 \not\equiv 1 \text{ mod } 11\)} and {\(a^{10/5} = a^2 \equiv 3 \not\equiv 1 \text{ mod } 11\)}, so \(g = 6\) is indeed a generator.
Calculate \(ω = g^k = 6^2 \equiv 3 \text{ mod } 11\). This is our primitive \(5\)th root of unity.
Compute the output vector \(Y\):
\(Y(0) = X(0)ω^{0×0} + X(1)ω^{0×1} + X(2)ω^{0×2} + X(3)ω^{0×3} + X(4)ω^{0×4} \equiv 3 \text{ mod } 11\).
\(Y(1) = X(0)ω^{1×0} + X(1)ω^{1×1} + X(2)ω^{1×2} + X(3)ω^{1×3} + X(4)ω^{1×4} \equiv 7 \text{ mod } 11\).
\(Y(2) = X(0)ω^{2×0} + X(1)ω^{2×1} + X(2)ω^{2×2} + X(3)ω^{2×3} + X(4)ω^{2×4} \equiv 0 \text{ mod } 11\).
\(Y(3) = X(0)ω^{3×0} + X(1)ω^{3×1} + X(2)ω^{3×2} + X(3)ω^{3×3} + X(4)ω^{3×4} \equiv 5 \text{ mod } 11\).
\(Y(4) = X(0)ω^{4×0} + X(1)ω^{4×1} + X(2)ω^{4×2} + X(3)ω^{4×3} + X(4)ω^{4×4} \equiv 4 \text{ mod } 11\).

Input vector \(Y = (3, 7, 0, 5, 4)\), with length \(n = 5\). Compute the corresponding inverse transform for the example above.
We must use the same values of \(k = 2\), \(N = 11\), and \(ω = 3\) as in the forward transform. Note that \(ω^{1} \equiv 4 \text{ mod } 11\).
Compute the unscaled output vector \(X n\):
\(X(0) n = Y(0)ω^{0×0} + Y(1)ω^{0×1} + \cdots + Y(4)ω^{0×4} \equiv 8 \text{ mod } 11\).
\(X(1) n = Y(0)ω^{1×0} + Y(1)ω^{1×1} + \cdots + Y(4)ω^{1×4} \equiv 0 \text{ mod } 11\).
\(X(2) n = Y(0)ω^{2×0} + Y(1)ω^{2×1} + \cdots + Y(4)ω^{2×4} \equiv 6 \text{ mod } 11\).
\(X(3) n = Y(0)ω^{3×0} + Y(1)ω^{3×1} + \cdots + Y(4)ω^{3×4} \equiv 2 \text{ mod } 11\).
\(X(4) n = Y(0)ω^{4×0} + Y(1)ω^{4×1} + \cdots + Y(4)ω^{4×4} \equiv 10 \text{ mod } 11\).Finally, multiply each element by \(n^{1} = 5^{1} \equiv 9 \text{ mod } 11\) to get back the original values: \((8, 0, 6, 2, 10) × 9 \equiv (6, 0, 10, 7, 2) \text{ mod } 11\).

Input vectors \(X = (4, 1, 4, 2, 1, 3, 5, 6)\) and \(Y = (6, 1, 8, 0, 3, 3, 9, 8)\), with length \(n = 8\). Compute their circular convolution.
Define the minimum working modulus of \(M = 9 × 9 × 8 + 1 = 649\), because each input value is a singledigit integer (less than 10) and we want to avoid the convolution output from overflowing.
Select \(k = 84\) so that with \(N = 84 × 8 + 1 = 673\), we have \(N \ge M\), and \(N\) is prime.
Try to find a primitive \(8\)th root of unity directly. Fully factorize \(n = 8 = 2 × 2 × 2\). Choose a candidate \(a = 326\). We have {\(a^8 \equiv 1 \text{ mod } 673\)} and {\(a^{8/2} = a^4 \equiv 672 \not\equiv 1 \text{ mod } 673\)}, so \(ω = 326\) is indeed a primitive \(8\)th root of unity.
Compute the forward transforms:
\(X' = \text{NTT}(X) = (26, 338, 228, 115, 2, 457, 437, 448)\).
\(Y' = \text{NTT}(Y) = (38, 594, 224, 157, 14, 201, 433, 406)\).Compute the pointwise multiplication of the two vectors modulo \(N\): \(Z' = X' Y' = (315, 218, 597, 557, 28, 329, 108, 178)\).
Compute the inverse NTT with scaling:
\(Z = \text{INTT}(Z') = (123, 120, 106, 92, 139, 144, 140, 124)\).This result represents a circular convolution because, for example, \(Z(0) = X(0)Y(0) + X(1)Y(7) + X(2)Y(6) + \cdots + X(7)Y(1)\). That is to say, each \(Z(i) \equiv \displaystyle\sum_{j=0}^{n1} X(j)Y((i  j) \text{ mod } n) \text{ mod } N\).
Source code
 Python
 Java (
int
size)  Java (
BigInteger
size)
Library features:
 Finding suitable prime modulus
 Finding primitive \(n\)th root of unity
 Computing forward and inverse transforms (naive \(Θ(n^2)\) algorithm)
 Computing fast forward transform (CooleyTukey \(Θ(n \log n)\) algorithm)
 Computing circular convolution (using naive algorithm)
 Unit tests for all functions (some hardcoded vectors, some randomized cases)
Proof of DFT/NTT correctness
The forward NTT is given by the following matrix:
\(A = \left[\begin{matrix} ω^0 & ω^0 & ω^0 & \cdots & ω^0 \\ ω^0 & ω^1 & ω^2 & \cdots & ω^{n1} \\ ω^0 & ω^2 & ω^4 & \cdots & ω^{2(n1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ω^0 & ω^{n1} & ω^{2(n1)} & \cdots & ω^{(n1)^2} \end{matrix}\right]\).The unscaled inverse NTT is given by the following matrix:
\(B = \left[\begin{matrix} ω^{0} & ω^{0} & ω^{0} & \cdots & ω^{0} \\ ω^{0} & ω^{1} & ω^{2} & \cdots & ω^{(n1)} \\ ω^{0} & ω^{2} & ω^{4} & \cdots & ω^{2(n1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ω^{0} & ω^{(n1)} & ω^{2(n1)} & \cdots & ω^{(n1)^2} \end{matrix}\right]\).We want to show that their product is a scaled identity matrix, i.e. \(C = AB = nI\), where
\(I = \left[\begin{matrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{matrix}\right]\).Examine an arbitrary cell on the main diagonal:
\(C_{i,i} = [A \text{ row } i] · [B \text{ col } i] = \displaystyle\sum_{k=0}^{n1} ω^{ik} ω^{ik} = \displaystyle\sum_{k=0}^{n1} ω^0 = \displaystyle\sum_{k=0}^{n1} 1 = n\).
Now examine an arbitrary cell off the main diagonal, with \(i ≠ j\), letting \(δ = i  j\), where \(n < δ < n\) and \(δ ≠ 0\):
\(C_{i,j} = [A \text{ row } i] · [B \text{ col } j] = \displaystyle\sum_{k=0}^{n1} ω^{ik} ω^{jk} = \displaystyle\sum_{k=0}^{n1} ω^{(ij)k} = \displaystyle\sum_{k=0}^{n1} ω^{δk}\).This sum is a geometric series. We know that in general,
\(1 + x + x^2 + x^3 + \cdots + x^{n1} = \displaystyle\frac{x^n  1}{x  1}\).Hence \(C_{i,j} = \displaystyle\sum_{k=0}^{n1} ω^{δk} = \displaystyle\frac{ω^{δn}  1}{ω^δ  1} = 0\), which requires a bit more explanation.
In the numerator, \(ω^n = 1\) because \(ω\) is a (primitive) \(n\)th root of unity, so \((ω^n)^δ = 1\) and the overall numerator is \(0\).
In the denominator, \(ω\) is a primitive \(n\)th root of unity and \(n < δ < n\) and \(δ ≠ 0\), thus \(ω^δ ≠ 1\), which avoids division by zero.
Therefore we have proven that the matrix product \(AB\) is a scaled identity matrix, which shows that the numbertheoretic transform is invertible up to a scale factor.
More info
 apfloat: Number theoretic transforms
 Wikipedia: Discrete Fourier transform (general)  Numbertheoretic transform
 University of California, Los Angeles: EE133A (Applied Numerical Computing)  Orthogonal matrices