Performing fast 32-bit multiplications is crucial for many embedded and IoT applications using Cortex-M0/M0+/M1 chips. This article provides an in-depth guide on optimizing 32×32 bit multiplications on these microcontrollers.
The Basics of 32-bit Multiplication
Conceptually, multiplying two 32-bit numbers requires multiplying each bit of one operand with each bit of the other operand. This generates 32×32 = 1024 partial products. These partial products are shifted and added together in a tree-like structure to produce the final 64-bit result.
In hardware, this process is implemented efficiently using carry-lookahead adders and fast shifters. Cortex-M0/M0+/M1, however, implement the ARMv6-M instruction set, which provides only MULS (a 32×32 multiply that keeps just the low 32 bits of the result) and no long-multiply instructions such as UMULL. A full 64-bit product therefore has to be assembled in software, and on implementations configured with the small 32-cycle multiplier option, even MULS itself is slow.
A naive software implementation would emulate the hardware process bit by bit – a shift-and-add loop over all 32 bits of one operand, conditionally accumulating a shifted copy of the other. This is extremely slow.
Instead, an optimized approach breaks the 32-bit operands into smaller chunks that the hardware can handle. A 16×16 product fits entirely in 32 bits, so it can be computed exactly with the MULS instruction that ARMv6-M does provide. Writing A = (Ah << 16) + Al and B = (Bh << 16) + Bl: A x B = (Ah x Bh << 32) + ((Ah x Bl + Al x Bh) << 16) + Al x Bl
Each of the four 16×16 products above can be computed with a single MULS. The results are shifted into position and added, with explicit carry propagation, to produce the 64-bit output.
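As a concrete sketch, the decomposition above can be written in portable C. The function name `mul32x32_16` is made up for illustration, and the `uint64_t` is used here only to hold the result for checking; a real ARMv6-M routine would keep the high and low halves in two 32-bit registers:

```c
#include <stdint.h>

/* 32x32 -> 64 multiply built from 32x32 -> 32 multiplies only
 * (the MULS class available on ARMv6-M, which lacks UMULL).
 * Each 16x16 partial product fits in 32 bits. */
static uint64_t mul32x32_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t ll = al * bl;   /* occupies bits  0..31 */
    uint32_t lh = al * bh;   /* occupies bits 16..47 */
    uint32_t hl = ah * bl;   /* occupies bits 16..47 */
    uint32_t hh = ah * bh;   /* occupies bits 32..63 */

    /* Sum the two middle terms; a carry out of 32 bits lands at bit 48. */
    uint32_t mid       = lh + hl;
    uint32_t mid_carry = (mid < lh) ? 1u : 0u;

    uint64_t result = (uint64_t)hh << 32;
    result += (uint64_t)mid_carry << 48;
    result += (uint64_t)mid << 16;
    result += ll;
    return result;
}
```

The explicit `mid_carry` test is the carry propagation mentioned above: the sum of the two middle products can overflow 32 bits, and that bit must be added into the high word.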
Optimizing the Software
Further optimizations can significantly speed up the software implementation:
Breaking the operands into 8-bit chunks gives sixteen 8×8 partial products – four chunks per operand – instead of four 16×16 products. This does not reduce the number of multiplications, but each 8×8 product fits in 16 bits and is cheap to compute by shift-and-add or table lookup, which can pay off on implementations configured with the slow 32-cycle multiplier. Writing A = A3 x 2^24 + A2 x 2^16 + A1 x 2^8 + A0 (and similarly for B): A x B = sum over all i, j of Ai x Bj x 2^(8(i+j))
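A minimal sketch of the 8-bit-chunk variant, with a hypothetical name `mul32x32_8`; the 8×8 products are written with plain `*` here, but on a core with a slow multiplier each one would be replaced by a shift-and-add sequence or a table lookup:

```c
#include <stdint.h>

/* 32x32 -> 64 multiply from 8-bit chunks: 4 chunks per operand,
 * sixteen 8x8 -> 16 partial products shifted into position. */
static uint64_t mul32x32_8(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ai = (a >> (8 * i)) & 0xFFu;
        for (int j = 0; j < 4; j++) {
            uint32_t bj = (b >> (8 * j)) & 0xFFu;
            /* Each 8x8 product fits in 16 bits; weight is 2^(8(i+j)). */
            acc += (uint64_t)(ai * bj) << (8 * (i + j));
        }
    }
    return acc;
}
```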
Unrolling the loops that compute and sum the partial products removes branch overhead and lets more intermediate values stay in registers. With the 16-bit breakdown there are only four products, so the whole routine is best written as straight-line code with no loop at all: partial = Ah x Bh; acc = partial; partial = Ah x Bl; acc += partial; … With the 8-bit breakdown, unrolling the inner product loop by a factor of two or four has a similar effect.
Carefully ordering the partial product computation lets operands stay in registers. For example, computing Ah x Bl and Ah x Bh back to back reuses Ah without reloading it – worthwhile on Thumb, where most instructions can only address the eight low registers.
Replace expensive operations like multiplication with cheaper ones like addition/subtraction/shifts whenever possible. A classic example is Karatsuba's trick for the two middle terms: Ah x Bl + Al x Bh = (Ah + Al) x (Bh + Bl) – Ah x Bh – Al x Bl, which reduces the four 16×16 multiplications to three at the cost of a few additions and subtractions.
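A sketch of the Karatsuba identity in C (hypothetical name `mul32x32_karatsuba`). One caveat the sketch papers over: Ah + Al and Bh + Bl are 17-bit values, so their product can exceed 32 bits; it is computed in 64-bit arithmetic here, while a real Cortex-M0 routine would need explicit carry fix-ups for that extra bit:

```c
#include <stdint.h>

/* Karatsuba-style 32x32 -> 64: three multiplies instead of four.
 * mid = (Ah+Al)(Bh+Bl) - Ah*Bh - Al*Bl  equals  Ah*Bl + Al*Bh. */
static uint64_t mul32x32_karatsuba(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;

    uint32_t lo  = al * bl;
    uint32_t hi  = ah * bh;
    uint64_t mid = (uint64_t)(ah + al) * (bh + bl)  /* 17x17-bit product */
                 - lo - hi;                          /* = ah*bl + al*bh  */

    return ((uint64_t)hi << 32) + (mid << 16) + lo;
}
```

Whether the saved multiply wins depends on the multiplier configuration: against a single-cycle MULS the extra additions usually cost more than they save, against the 32-cycle option the trade is clearly favorable.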
Small lookup tables can replace certain partial products with simple memory reads. A full 8×8 product table would be far too large (65,536 entries), but a 256-byte table holding all 4×4-bit products lets an 8×8 multiplication be assembled from four loads, shifts and adds – attractive on cores configured with the 32-cycle multiplier.
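A sketch of the table-driven 8×8 multiply described above. The names `mul4_tab` and `mul8x8_lut` are invented for illustration, and the table is filled lazily here for simplicity; in a real firmware image it would be a precomputed `const` array placed in flash:

```c
#include <stdint.h>

/* 256-byte table of all 4x4-bit products (max 15*15 = 225 fits in a byte). */
static uint8_t mul4_tab[256];

/* 8x8 -> 16 multiply using four table lookups, no multiply instruction. */
static uint16_t mul8x8_lut(uint8_t a, uint8_t b)
{
    static int ready = 0;
    if (!ready) {                      /* lazy init, sketch only */
        for (int i = 0; i < 16; i++)
            for (int j = 0; j < 16; j++)
                mul4_tab[(i << 4) | j] = (uint8_t)(i * j);
        ready = 1;
    }

    uint8_t al = a & 0xFu, ah = a >> 4;
    uint8_t bl = b & 0xFu, bh = b >> 4;

    uint16_t r = mul4_tab[(al << 4) | bl];           /* weight 1   */
    r += (uint16_t)mul4_tab[(al << 4) | bh] << 4;    /* weight 16  */
    r += (uint16_t)mul4_tab[(ah << 4) | bl] << 4;    /* weight 16  */
    r += (uint16_t)mul4_tab[(ah << 4) | bh] << 8;    /* weight 256 */
    return r;
}
```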
Putting it Together
Combining the optimizations above provides significant performance improvements. Some key steps are:
- Break operands into 16-bit chunks whose products fit the 32-bit MULS result (or 8-bit chunks for table-driven variants)
- Reorder partial products to maximize reuse
- Unroll loops to pipeline computations
- Employ strength reduction and lookup tables
Proper operand ordering and strength reduction alone can provide a 2-3x speedup over a straightforward implementation.