# Residue Number System

# 1 Conversion

# 1.1 From Integer

# 1.2 To Integer

# 1.2.1 Chinese Remainder Theorem

The standard method is specified as a sum of n large numbers M_i multiplied by small numbers, with a lookup table of n small multiplicative inverses:

For a moduli set \({m_i}\), \(i=1,n\) with the dynamic range \(M = \prod_{i=1}^n m_i\), the residue number \((x_1, x_2, x_3, ..., x_n)\) can be converted into the decimal number \(X\), according to the Chinese Reminder Theorem, as follows: \[X = \left| \sum_{i=1}^n M_i \left| M_i^{-1} x_i \right|_{m_i} \right|_M\] where \(M_i = M / m_i\), and \(M_i^{-1}\) is the multiplicative inverse of \(M_i\) with respect to \(m_i\). – Kazeem Alagbe Gbolagade and Sorin Dan Cotofana, “An Efficient RNS to Binary Converter Using the Moduli Set \(\{2n + 1, 2n, 2n − 1\}\)”, https://ce-publications.et.tudelft.nl/publications/521_an_efficient_rns_to_binary_converter_using_the_moduli_set_2.pdf

# 1.2.2 Mixed-Radix Conversion

Using the moduli as bases for a mixed-radix number system, conversion takes O(n²) work but can be parallelized to O(n) time on O(n) processors.

… we introduce an RNS to Mixed Radix Conversion (MRC) technique, which addresses the computation of Mixed Radix (MR) digits in such a way that enables the MRC parallelization. – Kazeem Alagbe Gbolagade and Sorin Dan Cotofana, “An O(n) Residue Number System to Mixed Radix Conversion Technique”, https://ce-publications.et.tudelft.nl/publications/334_an_on_residue_number_system_to_mixed_radix_conversion_tech.pdf

The least significant digit is computed first, and the most significant digit is computed last.

Serial computation (ibid formula (2) p522):

\[ X = a_1 + a_2 m_1 + a_3 m_2 _m1 + ... + a_n m_{m-1} ... m_2 m_1 \] \[ a_1 = x_1 \] \[ a_2 = \left| (x_2 - a_1) \left|m_1^{-1}\right|_{m_2} \right|_{m_2} \] \[ a_3 = \left| \left((x_3 - a_1) \left|m_1^{-1}\right|_{m_3} - a_2 \right)\left|m_2^{-1}\right|_{m_3}\right|_{m_3} \] \[ ... \] \[ a_n = \left| \left(\left(...(x_n - a_1) \left|m_1^{-1}\right|_{m_m} - a_2 \right)\left|m_2^{-1}\right|_{m_m} - ... - a_{n-1}\right)\left|m_{n-1}^{-1}\right|_{m_n}\right|_{m_n} \]

Alternatively, conversion takes O(n²) work parallelized to O(log(n)) time on O(n²) processors, using parallel prefix carry propagation.

Appendix – Markus A. Hitz and Erich Kaltofen, “Integer Division in Residue Number Systems”, in IEEE Transactions on Computers, vol. 44, no. 8, pp. 983-989, Aug. 1995, https://doi.org/10.1109/12.403714, https://users.cs.duke.edu/~elk27/bibliography/95/HiKa95.pdf, https://web.archive.org/web/20050217165652/http://www.cs.rpi.edu/research/ps/93-9.ps

# 1.2.3 Conversion Using Tables

Shenoy1998 describes a method only applicable to small numbers:

As a result RNS-to-binary conversion in dynamic ranges up to 64 bits can be achieved using only modular look-up tables and using small moduli. This dynamic range is adequate for most of the signal processing applications. However in applications where dynamic range requirements are larger, binary adders other than look-up tables are required to implement this technique as are required in the process of mixed radix conversion – A. P. Shenoy and R. Kumaresan, “Residue to binary conversion for RNS arithmetic using only modular look-up tables,” in IEEE Transactions on Circuits and Systems, vol. 35, no. 9, pp. 1158-1162, Sept. 1988, doi: 10.1109/31.7577.

# 1.3 To Float

\(x/M\) can be computed more easily than \(x\) itself: if we compute \(h_i\) as the inverse of \(\prod_{j\not=i} m_j\) modulo \(m_i\) with modular arithmetic, then \(x/M\) is the fractional part of \(\sum_i x_i h_i / m_i\). – Emil Jeřábek, “How to compare two numbers in RNS (residue number system)?”, MathOverflow, Sep 4 2025, https://mathoverflow.net/a/499938

However it needs \(\log M\) bits of accuracy to determine the integer part, so isn’t a useful as at first glance.

# 2 Basic Operations

# 2.1 Addition

Provided no overflow will occur, add the residues (modulo moduli); O(n) work perfectly parallelizable.

Overflows wrap around modulo M.

# 2.2 Subtraction

Provided no overflow will occur, subtract the residues (modulo moduli); O(n) work perfectly parallelizable.

Overflows wrap around modulo M.

# 2.3 Multiplication

Provided no overflow will occur, multiply the residues (modulo moduli); O(n) work perfectly parallelizable.

Overflows wrap around modulo M.

# 3 Division

# 3.1 Scaling

An extended RNS with twice the number of moduli provides the range required for multiplication and scaling. – Markus A. Hitz and Erich Kaltofen, “Integer Division in Residue Number Systems”, in IEEE Transactions on Computers, vol. 44, no. 8, pp. 983-989, Aug. 1995, https://doi.org/10.1109/12.403714, https://users.cs.duke.edu/~elk27/bibliography/95/HiKa95.pdf, https://web.archive.org/web/20050217165652/http://www.cs.rpi.edu/research/ps/93-9.ps

Have two sets of n moduli with ranges M < M’, then base extension of one set of residues computes the residues for the other set of moduli, in O(log(n)) time on O(n²) processors, using O(n²) space for lookup tables, given the mixed radix representation as input.

Then floor(U / M) = M^-1 (U - (U mod M)), where U mod M is base extended from the first set of moduli, then U - (U mod M) is a multiple of M so all the first set residues are 0, then M^-1 (U - (U mod M)) is computed in the second set of residues and base-extended back to the first set.

Overall O(n²) work in O(log(n)) time on O(n²) processors.

# 4 Comparison

# 4.1 Sign Estimation

Hung1994 proposes a method that needs O(n log n) space for lookup tables and takes O(log n) time with O(n) processors, where n is the number of moduli.

… a sign estimation procedure that computes the sign of a residue number to be positive, negative, or indeterminate. In the last case, magnitude of the number is guaranteed to be in a limited interval whose size is related to the cost of the sign estimation process. – C.Y. Hung, B. Parhami, “An approximate sign detection method for residue numbers and its application to RNS division,” Computers & Mathematics with Applications, Volume 27, Issue 4, 1994, Pages 23-35, ISSN 0898-1221, https://doi.org/10.1016/0898-1221(94)90052-3. (https://www.sciencedirect.com/science/article/pii/0898122194900523)

# 4.2 Interval Estimation

Isopov2018 proposes using fixed-precision floating point intervals to bound numbers, which can be calculated taking O(n) work in O(log n) time on O(n) processors.

The estimation, which is called the interval floating-point characteristic (IFC), is represented by two directed rounded bounds that are fixed-precision numbers. Generally, the time complexities of serial and parallel computations of IFC are linear and logarithmic functions of the size of the moduli set, respectively. The new method requires only small-integer and fixed-precision floating-point operations and focuses on arbitrary moduli sets with large dynamic ranges (~2¹⁰⁰⁰). – Konstantin Isupov and Vladimir Knyazkov, “Interval Estimation of Relative Values in Residue Number System”, in Journal of Circuits, Systems, and Computers Vol. 27, No. 1 (World Scientific Publishing Company, 2018) https://www.researchgate.net/profile/Vladimir-Knazkov/publication/339507972_Interval_Estimation_of_Relative_Values_in_Residue_Number_System/links/5e56650c299bf1bdb83b611a/Interval-Estimation-of-Relative-Values-in-Residue-Number-System.pdf

# 5 Applications

# 5.1 Ozaki Scheme II

fp64 emulation for matrix multiplication using int8 tensor cores on GPU.

… This study proposes a novel GEMM-based emulation method for matrix multiplication that leverages the Chinese Remainder Theorem. The proposed method inherits the computational efficiency of highly optimized GEMM routines and further enables control over the number of matrix multiplications, which can enhance computational accuracy. We present numerical experiments featuring INT8 Tensor Core operations on GPUs and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64 emulation using the proposed method achieves performance levels of up to 7.4 to 9.8 TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200, exceeding the measured performance of native FP64 arithmetic. … – Katsuhisa Ozaki, Yuki Uchino, Toshiyuki Imamura, “Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique”, https://arxiv.org/abs/2504.08009