Optimizing 32×32 bit Multiplication on Cortex-M0/M0+/M1

David Moore
Last updated: October 5, 2023 9:58 am

Performing fast 32-bit multiplications is crucial for many embedded and IoT applications using Cortex-M0/M0+/M1 chips. This article provides an in-depth guide on optimizing 32×32 bit multiplications on these microcontrollers.

Contents
  • The Basics of 32-bit Multiplication
  • Software Implementation
  • Optimizing the Software
  • Operand Breakdown
  • Loop Unrolling
  • Operand Reordering
  • Strength Reduction
  • Lookup Tables
  • Putting it Together

The Basics of 32-bit Multiplication

Conceptually, multiplying two 32-bit numbers requires multiplying each bit of one operand with each bit of the other operand. This generates 32×32 = 1024 partial products. These partial products are shifted and added together in a tree-like structure to produce the final 64-bit result.

In hardware, this process is implemented efficiently using carry-lookahead adders and fast shifters. Cortex-M0/M0+/M1 do provide a single-width MULS instruction (32×32 producing the low 32 bits, implemented as either a single-cycle or a 32-cycle multiplier depending on the configuration), but they have no long-multiply instructions such as UMULL. A full 32×32 → 64-bit multiplication therefore has to be assembled in software from single-width operations, the ALU, and the barrel shifter.

Software Implementation

A naive software implementation mirrors the hardware process bit by bit: walk the 32 bits of one operand and, for each set bit, add a correspondingly shifted copy of the other operand into a 64-bit accumulator. This always costs on the order of 32 loop iterations and is slow.
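As a baseline, the bit-serial approach can be sketched in C (the function name and structure are illustrative, not taken from the original article):

```c
#include <stdint.h>

/* Naive shift-and-add 32x32 -> 64 multiply: inspect one bit of b per
 * iteration and conditionally accumulate a shifted copy of a.
 * Always takes 32 iterations, independent of the operand values. */
uint64_t mul32_shift_add(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    uint64_t sa = a;            /* a, widened so shifts cannot overflow */

    for (int i = 0; i < 32; i++) {
        if (b & 1u)
            acc += sa;
        sa <<= 1;
        b >>= 1;
    }
    return acc;
}
```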

Instead, an optimized approach breaks the 32-bit operands into smaller chunks that can be multiplied with single-width instructions. A 16×16 bit multiplication produces at most a 32-bit result, so it maps directly onto the MULS instruction available on these cores (UMULL, the long multiply with a 64-bit result, only exists on larger cores such as Cortex-M3/M4). Splitting each operand into high and low 16-bit halves:

A = Ah x 2^16 + Al
B = Bh x 2^16 + Bl
A x B = (Ah x Bh) x 2^32 + (Ah x Bl + Al x Bh) x 2^16 + Al x Bl

Each of the four 16×16 products above fits in 32 bits and can be computed with a single MULS. The results are then shifted and added, with carries propagated into the upper word, to produce the final 64-bit output.
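A minimal C sketch of this four-product scheme (names are illustrative; a hand-tuned assembly version would keep everything in registers):

```c
#include <stdint.h>

/* 32x32 -> 64 multiply from four 16x16 -> 32 products, following the
 * high/low-half decomposition described above. Each 16x16 product fits
 * in 32 bits, so only single-width multiplies are required. */
uint64_t mul32_from_halves(uint32_t a, uint32_t b)
{
    uint32_t ah = a >> 16, al = a & 0xFFFFu;
    uint32_t bh = b >> 16, bl = b & 0xFFFFu;

    uint32_t ll = al * bl;  /* contributes to bits  0..31 */
    uint32_t lh = al * bh;  /* contributes to bits 16..47 */
    uint32_t hl = ah * bl;  /* contributes to bits 16..47 */
    uint32_t hh = ah * bh;  /* contributes to bits 32..63 */

    uint64_t r = ((uint64_t)hh << 32) + ll;
    r += (uint64_t)lh << 16;  /* 64-bit adds propagate the carries */
    r += (uint64_t)hl << 16;
    return r;
}
```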

Optimizing the Software

Further optimizations can significantly speed up the software implementation:

Operand Breakdown

Breaking each operand into four 8-bit chunks instead of two 16-bit halves yields sixteen 8×8 partial products, each of which fits in 16 bits. This increases the multiply count but keeps every input byte-sized, which can pay off when the core ships with the slower 32-cycle multiplier option, or when the 8×8 products are replaced by table lookups. Writing A = A3 x 2^24 + A2 x 2^16 + A1 x 2^8 + A0 (and likewise for B), the product is the sum of all Ai x Bj terms, each shifted left by 8 x (i + j) bits.
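The byte-chunk decomposition can be expressed in C as follows (an illustrative sketch; a real implementation would unroll the loops and keep the chunks in registers):

```c
#include <stdint.h>

/* 32x32 -> 64 multiply from sixteen 8x8 -> 16 partial products.
 * Chunk i of a and chunk j of b contribute at bit position 8*(i+j). */
uint64_t mul32_from_bytes(uint32_t a, uint32_t b)
{
    uint64_t r = 0;

    for (int i = 0; i < 4; i++) {
        uint32_t ai = (a >> (8 * i)) & 0xFFu;
        for (int j = 0; j < 4; j++) {
            uint32_t bj = (b >> (8 * j)) & 0xFFu;
            r += (uint64_t)(ai * bj) << (8 * (i + j));
        }
    }
    return r;
}
```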

Loop Unrolling

Unrolling the loops that perform the partial products and summations reduces per-iteration branch and counter overhead and gives the pipeline more independent work. For example, in a bit-serial multiply the loop body can be replicated to handle several multiplier bits per iteration instead of just one.
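A sketch of a 4x-unrolled bit-serial loop (illustrative; the unroll factor is a tuning parameter, traded against code size):

```c
#include <stdint.h>

/* Bit-serial multiply unrolled 4x: four multiplier bits are handled per
 * iteration, so the branch and counter update execute 8 times, not 32. */
uint64_t mul32_unrolled4(uint32_t a, uint32_t b)
{
    uint64_t acc = 0;
    uint64_t sa = a;

    for (int i = 0; i < 8; i++) {
        if (b & 1u) acc += sa;
        sa <<= 1;
        if (b & 2u) acc += sa;
        sa <<= 1;
        if (b & 4u) acc += sa;
        sa <<= 1;
        if (b & 8u) acc += sa;
        sa <<= 1;
        b >>= 4;
    }
    return acc;
}
```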

Operand Reordering

Carefully ordering the partial-product computation lets common sub-expressions and register contents be reused. For example, once Ah has been extracted into a register, both Ah x Bl and Ah x Bh can be computed before moving on, avoiding re-extracting Ah.

Strength Reduction

Replace expensive operations like multiplication with cheaper additions, subtractions, and shifts whenever possible. This applies chiefly when one operand is a known constant: a multiplication by 10 becomes two shifts and an add, and a multiplication by a power of two becomes a single shift.
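For a known constant multiplier this looks like the following (a standard textbook example, not specific to this article):

```c
#include <stdint.h>

/* Strength-reduced multiply by the constant 10: since 10 = 8 + 2,
 * x * 10 becomes two shifts and an add instead of a multiply. */
static inline uint32_t mul_by_10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```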

Lookup Tables

Small lookup tables can replace certain partial products with simple memory reads. A full 8×8 product table is impractical (256 × 256 two-byte entries is 128 KB), but a quarter-square table of roughly 1 KB yields any 8×8 product as the difference of two table reads.
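A quarter-square table is one concrete way to do this: the identity a x b = floor((a+b)^2/4) - floor((a-b)^2/4) holds exactly for integers. A sketch of the 8-bit variant (table size and names are assumptions of this sketch):

```c
#include <stdint.h>

/* Quarter-square table: qsq[n] = floor(n*n / 4) for n = 0..510.
 * 511 entries x 2 bytes = 1022 bytes of flash or RAM. */
static uint16_t qsq[511];

static void qsq_init(void)
{
    for (uint32_t n = 0; n < 511; n++)
        qsq[n] = (uint16_t)((n * n) / 4);
}

/* 8x8 -> 16 multiply as two table reads and one subtraction, using
 * a*b == floor((a+b)^2/4) - floor((a-b)^2/4), exact for integers. */
static uint16_t mul8x8_lut(uint8_t a, uint8_t b)
{
    unsigned d = (a > b) ? (unsigned)(a - b) : (unsigned)(b - a);
    return (uint16_t)(qsq[(unsigned)a + b] - qsq[d]);
}
```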

Putting it Together

Combining the optimizations above provides significant performance improvements. Some key steps are:

  1. Break operands into 16-bit halves (or 8-bit chunks for table-driven variants) so every product fits in a single-width multiply
  2. Reorder partial products to maximize reuse
  3. Unroll loops to pipeline computations
  4. Employ strength reduction and lookup tables

Proper operand ordering and strength reduction alone can provide up to a 2-3x speedup.
