Performing fast 32-bit multiplications is crucial for many embedded and IoT applications using Cortex-M0/M0+/M1 chips. This article provides an in-depth guide on optimizing 32×32 bit multiplications on these microcontrollers.
The Basics of 32-bit Multiplication
Conceptually, multiplying two 32-bit numbers requires multiplying each bit of one operand with each bit of the other operand. This generates 32×32 = 1024 partial products. These partial products are shifted and added together in a tree-like structure to produce the final 64-bit result.
In hardware, this process is implemented efficiently using carry-lookahead adders and fast shifters. Cortex-M0/M0+/M1, however, implement the ARMv6-M instruction set, which provides only MULS (a 32×32 multiply that keeps just the low 32 bits of the result) and no long-multiply instructions such as UMULL. A full 64-bit product therefore has to be assembled in software, and on implementations configured with the small 32-cycle multiplier option, even MULS itself is slow.
A naive software implementation would emulate the hardware process bit by bit – a shift-and-add loop over all 32 bits of one operand, conditionally accumulating a shifted copy of the other. This is extremely slow.
Instead, an optimized approach breaks the 32-bit operands into smaller chunks that the hardware can handle. A 16×16 product fits entirely in 32 bits, so it can be computed exactly with the MULS instruction that ARMv6-M does provide. Writing A = (Ah << 16) + Al and B = (Bh << 16) + Bl: A x B = (Ah x Bh << 32) + ((Ah x Bl + Al x Bh) << 16) + Al x Bl
Each of the four 16×16 products above can be computed with a single MULS. The results are shifted into position and added, with explicit carry propagation, to produce the 64-bit output.
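As a concrete sketch, the decomposition above can be written in portable C. The function name `mul32x32_16` is made up for illustration, and the `uint64_t` is used here only to hold the result for checking; a real ARMv6-M routine would keep the high and low halves in two 32-bit registers:

```c
#include <stdint.h>

/* 32x32 -> 64 multiply built from 32x32 -> 32 multiplies only
 * (the MULS class available on ARMv6-M, which lacks UMULL).
 * Each 16x16 partial product fits in 32 bits. */
static uint64_t mul32x32_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* occupies bits  0..31 */
    uint32_t lh = al * bh;   /* occupies bits 16..47 */
    uint32_t hl = ah * bl;   /* occupies bits 16..47 */
    uint32_t hh = ah * bh;   /* occupies bits 32..63 */

    /* Sum the two middle terms; a carry out of 32 bits lands at bit 48. */
    uint32_t mid       = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;

    uint64_t result = (uint64_t)hh << 32;
    result += (uint64_t)mid_carry << 48;
    result += (uint64_t)mid << 16;
    result += ll;
    return result;
}
```

The explicit `mid_carry` test is the carry propagation mentioned above: the sum of the two middle products can overflow 32 bits, and that bit must be added into the high word.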
Optimizing the Software
Further optimizations can significantly speed up the software implementation:
Breaking the operands into 8-bit chunks gives sixteen 8×8 partial products – four chunks per operand – instead of four 16×16 products. This does not reduce the number of multiplications, but each 8×8 product fits in 16 bits and is cheap to compute by shift-and-add or table lookup, which can pay off on implementations configured with the slow 32-cycle multiplier. Writing A = A3 x 2^24 + A2 x 2^16 + A1 x 2^8 + A0 (and similarly for B): A x B = sum over all i, j of Ai x Bj x 2^(8(i+j))
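A minimal sketch of the 8-bit-chunk variant, with a hypothetical name `mul32x32_8`; the 8×8 products are written with plain `*` here, but on a core with a slow multiplier each one would be replaced by a shift-and-add sequence or a table lookup:

```c
#include <stdint.h>

/* 32x32 -> 64 multiply from 8-bit chunks: 4 chunks per operand,
 * sixteen 8x8 -> 16 partial products shifted into position. */
static uint64_t mul32x32_8(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ai = (a >> (8 * i)) & 0xFFu;
        for (int j = 0; j < 4; j++) {
            uint32_t bj = (b >> (8 * j)) & 0xFFu;
            /* Each 8x8 product fits in 16 bits; weight is 2^(8(i+j)). */
            acc += (uint64_t)(ai * bj) << (8 * (i + j));
        }
    }
    return acc;
}
```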
Unrolling the loops that compute and sum the partial products removes branch overhead and lets more intermediate values stay in registers. With the 16-bit breakdown there are only four products, so the whole routine is best written as straight-line code with no loop at all: partial = Ah x Bh; acc = partial; partial = Ah x Bl; acc += partial; … With the 8-bit breakdown, unrolling the inner product loop by a factor of two or four has a similar effect.
Carefully ordering the partial product computation lets operands stay in registers. For example, computing Ah x Bl and Ah x Bh back to back reuses Ah without reloading it – worthwhile on Thumb, where most instructions can only address the eight low registers.
Replace expensive operations like multiplication with cheaper ones like addition/subtraction/shifts whenever possible. A classic example is Karatsuba's trick for the two middle terms: Ah x Bl + Al x Bh = (Ah + Al) x (Bh + Bl) – Ah x Bh – Al x Bl, which reduces the four 16×16 multiplications to three at the cost of a few additions and subtractions.
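A sketch of the Karatsuba identity in C (hypothetical name `mul32x32_karatsuba`). One caveat the sketch papers over: Ah + Al and Bh + Bl are 17-bit values, so their product can exceed 32 bits; it is computed in 64-bit arithmetic here, while a real Cortex-M0 routine would need explicit carry fix-ups for that extra bit:

```c
#include <stdint.h>

/* Karatsuba-style 32x32 -> 64: three multiplies instead of four.
 * mid = (Ah+Al)(Bh+Bl) - Ah*Bh - Al*Bl  equals  Ah*Bl + Al*Bh. */
static uint64_t mul32x32_karatsuba(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t lo  = al * bl;
    uint32_t hi  = ah * bh;
    uint64_t mid = (uint64_t)(ah + al) * (bh + bl)  /* 17x17-bit product */
                 - lo - hi;                          /* = ah*bl + al*bh  */

    return ((uint64_t)hi << 32) + (mid << 16) + lo;
}
```

Whether the saved multiply wins depends on the multiplier configuration: against a single-cycle MULS the extra additions usually cost more than they save, against the 32-cycle option the trade is clearly favorable.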
Small lookup tables can replace certain partial products with simple memory reads. A full 8×8 product table would be far too large (65,536 entries), but a 256-byte table holding all 4×4-bit products lets an 8×8 multiplication be assembled from four loads, shifts and adds – attractive on cores configured with the 32-cycle multiplier.
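A sketch of the table-driven 8×8 multiply described above. The names `mul4_tab` and `mul8x8_lut` are invented for illustration, and the table is filled lazily here for simplicity; in a real firmware image it would be a precomputed `const` array placed in flash:

```c
#include <stdint.h>

/* 256-byte table of all 4x4-bit products (max 15*15 = 225 fits in a byte). */
static uint8_t mul4_tab[256];

/* 8x8 -> 16 multiply using four table lookups, no multiply instruction. */
static uint16_t mul8x8_lut(uint8_t a, uint8_t b)
{
    static int ready = 0;
    if (!ready) {                      /* lazy init, sketch only */
        for (int i = 0; i < 16; i++)
            for (int j = 0; j < 16; j++)
                mul4_tab[(i << 4) | j] = (uint8_t)(i * j);
        ready = 1;
    }

    uint8_t al = a & 0xFu, ah = a >> 4;
    uint8_t bl = b & 0xFu, bh = b >> 4;

    uint16_t r = mul4_tab[(al << 4) | bl];           /* weight 1   */
    r += (uint16_t)mul4_tab[(al << 4) | bh] << 4;    /* weight 16  */
    r += (uint16_t)mul4_tab[(ah << 4) | bl] << 4;    /* weight 16  */
    r += (uint16_t)mul4_tab[(ah << 4) | bh] << 8;    /* weight 256 */
    return r;
}
```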
Putting it Together
Combining the optimizations above provides significant performance improvements. Some key steps are:
- Break operands into 16-bit chunks whose products fit the 32-bit MULS result (or 8-bit chunks for table-driven variants)
- Reorder partial products to maximize reuse
- Unroll loops to pipeline computations
- Employ strength reduction and lookup tables
Proper operand ordering and strength reduction alone can provide a 2-3x speedup over a straightforward implementation.