What are Multiply instructions (32-bit result/64-bit result) in Arm Cortex-M series?

The ARM Cortex-M series of processors supports several multiply instructions that produce 32-bit or 64-bit results. These instructions multiply data values held in registers, and which of them are available depends on the core. Knowing when to use a 32-bit versus a 64-bit multiply can help optimize code for performance and precision.

Contents

Overview of ARM Cortex-M Multiply Instructions
When to Use 32-bit vs 64-bit Multiply
32-bit Multiply Instructions
MUL
MLA
SMULBB, SMULBT, SMULTB, SMULTT
64-bit Multiply Instructions
UMULL and UMULLS
SMULL and SMULLS
Multiplying Constants
Choosing between MUL and UMULL/SMULL
Compiler Intrinsics for Multiply Operations
Multiply Instruction Timing
Considerations for Smaller Cortex-M Cores
Multiplying Floating Point Values
Summary

Overview of ARM Cortex-M Multiply Instructions

Here is a quick overview of the main multiply instructions in Cortex-M processors:

MUL: 32-bit x 32-bit multiply, 32-bit result
MLA: Multiply with accumulate, 32-bit operands, 32-bit result
SMULBB, SMULBT, SMULTB, SMULTT: Signed 16-bit x 16-bit multiply, 32-bit result (DSP extension)
UMULL: Unsigned 32-bit x 32-bit multiply, 64-bit result
SMULL: Signed 32-bit x 32-bit multiply, 64-bit result

The 32-bit multiply instructions like MUL and MLA perform a 32-bit x 32-bit multiply and keep only the low 32 bits of the product. This is the efficient choice when the result is known to fit in 32 bits.

The 64-bit multiply instructions like UMULL and SMULL perform a 32-bit x 32-bit multiply but produce the full 64-bit result. This preserves every bit of the product, at the cost of a second destination register and, on some cores, extra cycles.

When to Use 32-bit vs 64-bit Multiply

Choosing between 32-bit and 64-bit multiply depends on the data types and precision needed:

Use 32-bit multiply when multiplying 32-bit (unsigned int, signed int) values whose product is known to fit in 32 bits.
Use 64-bit multiply when multiplying 32-bit values whose product can need up to 64 bits.
Use 64-bit multiply when the product can exceed 2^32 - 1, or when the algorithm works modulo a value larger than 2^32.
Prefer 32-bit multiply when performance is critical, since it needs one destination register instead of two and takes fewer cycles on cores such as Cortex-M3.
Prefer 64-bit multiply when precision is critical since it maintains the full result.
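The difference between the two result widths is easy to see in C. The following is a minimal sketch, with hypothetical helper names (mul_low32, mul_full64); a compiler targeting a core with UMULL would typically use MUL for the first function and UMULL for the second, though the exact code generated is up to the toolchain.

```c
#include <stdint.h>

/* 32-bit result: only the low 32 bits of the product survive,
   because unsigned arithmetic wraps modulo 2^32. */
static uint32_t mul_low32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* 64-bit result: widening one operand before the multiply asks
   the compiler for the full 32 x 32 -> 64-bit product. */
static uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

For example, 100000 * 100000 is 10000000000, which does not fit in 32 bits: mul_low32 returns the wrapped value 1410065408, while mul_full64 returns the exact product.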

32-bit Multiply Instructions

Let’s look at some common 32-bit multiply instructions in more detail:

MUL

MUL performs a 32-bit x 32-bit multiply and produces a 32-bit result. The low 32 bits of a product are the same whether the operands are treated as signed or unsigned, so one instruction serves both cases. For example: MUL R0, R1, R2

This multiplies the values in R1 and R2, keeps the low 32 bits of the product, and stores them in R0. R1 and R2 remain unchanged.
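The low 32 bits of a product do not depend on whether the operands are read as signed or unsigned, which is why a single MUL covers both cases. The sketch below checks this in portable C; the helper names are hypothetical, and the signed version multiplies in 64 bits only to avoid signed overflow in C.

```c
#include <stdint.h>

/* Low 32 bits of the product, operands read as unsigned. */
static uint32_t low32_unsigned(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Same bit patterns read as signed values. The multiply is done in
   64 bits so that C signed overflow cannot occur, then the result is
   truncated to its low 32 bits. */
static uint32_t low32_signed(uint32_t a, uint32_t b)
{
    int64_t wide = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)wide;
}
```

With a = 0xFFFFFFFF (4294967295 unsigned, -1 signed) and b = 3, both functions return 0xFFFFFFFD.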

MLA

MLA performs a 32-bit x 32-bit multiply with accumulate. It multiplies two 32-bit values, adds a 32-bit accumulate value, and produces a 32-bit result. Like MUL, it works for both signed and unsigned operands. For example: MLA R0, R1, R2, R3

This multiplies the values in R1 and R2, adds the 32-bit accumulate value in R3, keeps the low 32 bits of the sum, and stores them in R0. R1, R2 and R3 remain unchanged.
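A multiply-accumulate is the core of a dot product. The sketch below models MLA in C with a hypothetical helper, mla_model, and uses it in a loop; whether the compiler actually emits MLA for the loop body depends on the target core and optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of MLA Rd, Rn, Rm, Ra: Rd = (Ra + Rn * Rm) modulo 2^32. */
static uint32_t mla_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* Dot product of two arrays; each iteration is one multiply-accumulate. */
static uint32_t dot_u32(const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = mla_model(x[i], y[i], acc);
    }
    return acc;
}
```

Because the accumulator is only 32 bits wide, long sums of large products wrap around silently, just as they do with the instruction itself.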

SMULBB, SMULBT, SMULTB, SMULTT

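These perform signed 16-bit x 16-bit multiplies and produce a 32-bit result. The last two letters select which halfword of each operand is used: B is the bottom halfword (bits [15:0]) and T is the top halfword (bits [31:16]). No shift is applied to the result. For example:
SMULBB R1, R2, R3 // R1 = R2[15:0] * R3[15:0]
SMULBT R1, R2, R3 // R1 = R2[15:0] * R3[31:16]
SMULTB R1, R2, R3 // R1 = R2[31:16] * R3[15:0]
SMULTT R1, R2, R3 // R1 = R2[31:16] * R3[31:16]

This supports efficient signed multiplies on 16-bit data packed two to a register. These instructions belong to the DSP extension, so they are available on cores such as Cortex-M4 and Cortex-M7 but not on Cortex-M0/M0+ or Cortex-M3.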
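The halfword selection can be modelled in portable C. This is a sketch with hypothetical function names (bottom_half, top_half, smulbb_model and so on), where B means bits [15:0] and T means bits [31:16] of each operand, both read as signed 16-bit values. It assumes the usual two's complement conversion when narrowing to int16_t.

```c
#include <stdint.h>

/* Signed value of the bottom halfword, bits [15:0]. */
static int32_t bottom_half(uint32_t r) { return (int16_t)(r & 0xFFFFu); }

/* Signed value of the top halfword, bits [31:16]. */
static int32_t top_half(uint32_t r) { return (int16_t)(r >> 16); }

/* A 16 x 16-bit signed product always fits in 32 bits. */
static int32_t smulbb_model(uint32_t rn, uint32_t rm) { return bottom_half(rn) * bottom_half(rm); }
static int32_t smulbt_model(uint32_t rn, uint32_t rm) { return bottom_half(rn) * top_half(rm); }
static int32_t smultb_model(uint32_t rn, uint32_t rm) { return top_half(rn) * bottom_half(rm); }
static int32_t smultt_model(uint32_t rn, uint32_t rm) { return top_half(rn) * top_half(rm); }
```

With rn = 0x0003FFFE (top = 3, bottom = -2) and rm = 0xFFFC0005 (top = -4, bottom = 5), the four models return -10, 8, 15 and -12 respectively.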

64-bit Multiply Instructions

Here are some key 64-bit multiply instructions:

UMULL and UMULLS

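UMULL performs an unsigned 32-bit x 32-bit multiply and produces a 64-bit result in a pair of registers. The first register receives the low word and the second the high word: UMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (unsigned long long result, R1 = low word, R2 = high word)

UMULLS is the flag-setting form in the classic ARM (A32) instruction set. It is not available in the Thumb instruction set that Cortex-M processors execute, so UMULL on Cortex-M never updates the condition flags.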
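The instruction syntax is UMULL RdLo, RdHi, Rn, Rm, so the first destination register holds the low word of the product. A minimal C model, using a hypothetical function name umull_model, makes the register split explicit:

```c
#include <stdint.h>

/* Model of UMULL RdLo, RdHi, Rn, Rm: the 64-bit unsigned product of
   Rn and Rm is split into a low word and a high word. */
static void umull_model(uint32_t rn, uint32_t rm,
                        uint32_t *rd_lo, uint32_t *rd_hi)
{
    uint64_t product = (uint64_t)rn * rm;
    *rd_lo = (uint32_t)product;          /* bits [31:0]  */
    *rd_hi = (uint32_t)(product >> 32);  /* bits [63:32] */
}
```

For 0xFFFFFFFF * 0xFFFFFFFF the product is 0xFFFFFFFE00000001, so the low word is 0x00000001 and the high word is 0xFFFFFFFE.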

SMULL and SMULLS

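SMULL performs a signed 32-bit x 32-bit multiply and produces a 64-bit result in a pair of registers, low word first: SMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (signed long long result, R1 = low word, R2 = high word)

SMULLS is the flag-setting form in the classic ARM (A32) instruction set. As with UMULLS, it is not available in Thumb, so SMULL on Cortex-M never updates the condition flags.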
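A common use of the signed 64-bit product is fixed-point arithmetic. The sketch below multiplies two Q31 values (signed fractions scaled by 2^31); the function name q31_mul is hypothetical, the code does not saturate, and it assumes the compiler implements right shift of negative values as an arithmetic shift, which mainstream Arm compilers do.

```c
#include <stdint.h>

/* Q31 multiply: both inputs represent value / 2^31.
   The 64-bit intermediate product is a natural fit for SMULL.
   Not saturating: -1.0 * -1.0 overflows the Q31 range. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;   /* full signed 64-bit product */
    return (int32_t)(product >> 31);    /* rescale back to Q31 */
}
```

For example, 0x40000000 represents 0.5 in Q31, and q31_mul(0x40000000, 0x40000000) returns 0x20000000, which represents 0.25. A plain 32-bit multiply would have lost the upper bits that carry this result.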

Multiplying Constants

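When multiplying by a constant, consider using left shifts instead of multiply. MUL has no immediate form, so the constant must first be loaded into a register. For example:
MOVS R2, #16
MUL R0, R1, R2 // R0 = R1 * 16

Could be replaced with: LSL R0, R1, #4 // R0 = R1 << 4 = R1 * 16

The LSL (logical shift left) instruction needs no extra register for the constant and is never slower than a multiply, so it is usually the better choice for powers of two.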
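In C this substitution rarely needs to be written by hand, since optimizing compilers usually turn a multiply by a power-of-two constant into a shift. The sketch below, with hypothetical function names, only shows that the two forms agree for unsigned values, including when the result wraps:

```c
#include <stdint.h>

/* Multiply by 16 written as a multiplication. */
static uint32_t times16_mul(uint32_t value)
{
    return value * 16u;
}

/* The same operation written as a left shift by 4, since 16 = 2^4. */
static uint32_t times16_shift(uint32_t value)
{
    return value << 4;
}
```

Both functions keep only the low 32 bits of the result, so they also agree when the multiplication overflows 32 bits.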

Choosing between MUL and UMULL/SMULL

To summarize, follow these guidelines when choosing between 32-bit and 64-bit multiply in Cortex-M code:

Use MUL when a 32-bit result is enough for the multiplication.
Use UMULL/SMULL when you need the full 64-bit multiplication result.
Use MUL when performance is critical and you only need a 32-bit result.
Use UMULL/SMULL for crypto code or multiplications modulo a value larger than 2^32.
Use SMULL for signed multiplies and UMULL for unsigned multiplies.
Use shift instructions instead of MUL when multiplying by a power of two.

Proper use of 32-bit versus 64-bit multiply instructions can help optimize Cortex-M code for both performance and precision.

Compiler Intrinsics for Multiply Operations

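Here are some compiler intrinsics that map to the multiply instructions. Names vary by toolchain: CMSIS-Core spells its DSP intrinsics in uppercase (for example __SMLAD), while the Arm C Language Extensions (ACLE) use lowercase (for example __smulbb). Most of them need a core with the DSP extension, such as Cortex-M4 or Cortex-M7.

__SMULBB, __SMULBT, __SMULTB, __SMULTT – map to SMULxx instructions
__PKHBT, __PKHTB – pack halfwords, useful for preparing operands for dual 16-bit multiplies
__SMLABB, __SMLABT, __SMLATB, __SMLATT – signed 16 x 16-bit multiply accumulate
__SMLAD, __SMLADX – signed dual multiply accumulate
__SMLAL, __SMLALBB, __SMLALBT, __SMLALTB, __SMLALTT – signed multiply accumulate with a 64-bit accumulator
__SMLALD, __SMLALDX – signed dual multiply accumulate long (64-bit accumulator)
__SMLAWB, __SMLAWT – signed 32 x 16-bit multiply accumulate, keeping the top 32 bits of the 48-bit product
__SMLSD, __SMLSDX – signed dual multiply subtract with accumulate
__SMLSLD, __SMLSLDX – signed dual multiply subtract with a 64-bit accumulator
__SMMLA, __SMMLAR – signed most significant word multiply accumulate (R = rounded)
__SMMLS, __SMMLSR – signed most significant word multiply subtract (R = rounded)
__SMMUL, __SMMULR – signed most significant word multiply (R = rounded)
__SMUAD, __SMUADX – signed dual multiply add
__SMUSD, __SMUSDX – signed dual multiply subtract

The X suffix on the dual forms swaps the halfwords of the second operand before multiplying. The plain 32-bit and 64-bit multiplies normally need no intrinsic: a C expression such as (uint64_t)a * b compiles to UMULL on cores that have it.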

Check your compiler documentation for full details on these intrinsics. They can be useful for optimizing multiplies in Cortex-M code.
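To make the behaviour of one of these intrinsics concrete, here is a portable C model of the dual multiply accumulate that CMSIS exposes as __SMLAD(x, y, acc). The function name smlad_model is hypothetical, and the model leaves out the Q flag that the real instruction sets on overflow. It assumes the usual two's complement conversions.

```c
#include <stdint.h>

/* Portable model of a signed dual multiply accumulate:
   acc + x[15:0] * y[15:0] + x[31:16] * y[31:16]
   Each halfword is read as a signed 16-bit value. */
static int32_t smlad_model(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t x_lo = (int16_t)(x & 0xFFFFu);
    int32_t x_hi = (int16_t)(x >> 16);
    int32_t y_lo = (int16_t)(y & 0xFFFFu);
    int32_t y_hi = (int16_t)(y >> 16);

    /* Sum in unsigned arithmetic so that wraparound is well defined in C. */
    uint32_t sum = (uint32_t)acc
                 + (uint32_t)(x_lo * y_lo)
                 + (uint32_t)(x_hi * y_hi);
    return (int32_t)sum;
}
```

A model like this can serve as a fallback on cores without the DSP extension, or as a reference when unit testing code that uses the intrinsic on the target.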

Multiply Instruction Timing

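Cycle counts depend on the core, so treat the figures below as typical values and check the Technical Reference Manual for your part:

Cortex-M0/M0+: MULS – 1 cycle with the fast multiplier option, 32 cycles with the small iterative multiplier
Cortex-M3: MUL – 1 cycle; MLA – 2 cycles; UMULL/SMULL – 3 to 5 cycles, depending on operand values
Cortex-M4: MUL, MLA, SMULxx, UMULL/SMULL – 1 cycle each

So on Cortex-M3 the 64-bit UMULL and SMULL cost several times as much as MUL, which is another reason to prefer MUL when a 32-bit result is enough. On Cortex-M4 the cycle difference disappears, and the remaining cost of a 64-bit multiply is the extra destination register.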

Considerations for Smaller Cortex-M Cores

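The smaller Cortex-M cores like Cortex-M0/M0+ (ARMv6-M) support only a small subset of the multiply instructions. Key differences include:

Only MULS is available: a 32-bit x 32-bit multiply with a 32-bit result that always updates the flags
MULS has a two-register form, so the destination must be the same register as one of the operands
No MLA instruction
No UMULL or SMULL support, so a 64-bit product has to be built in software
No SMULxx or other DSP multiplies
Same sixteen core registers, but most instructions can only use R0-R7

So for the smaller cores, keep products within 32 bits where you can, and reserve 64-bit multiplication for places where it is really needed, since the compiler has to synthesize it from several instructions or a runtime library call. Also check which multiplier your device implements: with the small 32-cycle multiplier option, shifts and adds can beat MULS for simple constants.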
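On a core without UMULL, a full 64-bit product can be assembled from four 16 x 16-bit partial products, each of which fits in 32 bits and can therefore be computed with a 32-bit multiply. The sketch below shows the idea in portable C; the function name is hypothetical, and a real toolchain would normally do this for you through its runtime library when you write (uint64_t)a * b.

```c
#include <stdint.h>

/* 32 x 32 -> 64-bit unsigned multiply built from 16 x 16-bit pieces.
   With a = a_hi * 2^16 + a_lo and b = b_hi * 2^16 + b_lo:
   a * b = hh * 2^32 + (lh + hl) * 2^16 + ll */
static uint64_t umul32x32_64_by_parts(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Each partial product is at most 0xFFFE0001, so it fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    return ((uint64_t)hh << 32)
         + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16)
         + ll;
}
```

The four multiplies plus the shifts and carries make this noticeably more expensive than a single UMULL, which is why it pays to keep results within 32 bits on these cores.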

Multiplying Floating Point Values

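To multiply floating point values, Cortex-M cores with an FPU (for example Cortex-M4F, Cortex-M7, Cortex-M33) provide instructions like VMUL and VMLA. They operate on the FPU registers (S0-S31 for single precision, D0-D15 for double precision), not on the core registers R0-R15. Older VFP documentation uses pre-UAL names such as FMULS and FMULD for the same operations. Double precision needs a double-precision FPU, which is an option on cores such as Cortex-M7; cores without an FPU fall back to software floating point routines.

VMUL performs a 32-bit or 64-bit float multiply:
VMUL.F32 S0, S1, S2 // Single precision (32-bit) multiply: S0 = S1 * S2
VMUL.F64 D0, D1, D2 // Double precision (64-bit) multiply: D0 = D1 * D2

VMLA performs a floating point multiply accumulate:
VMLA.F32 S0, S1, S2 // 32-bit float multiply accumulate: S0 = S0 + S1 * S2
VMLA.F64 D0, D1, D2 // 64-bit float multiply accumulate: D0 = D0 + D1 * D2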

The floating point multiply instructions are useful when doing digital signal processing, matrix math, 3D math, or other numerically intensive algorithms.
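In C, floating point multiplies are written with ordinary operators and the compiler selects the FPU instructions when the build targets a core with hardware floating point. The sketch below is a single-precision dot product with a hypothetical function name; whether each iteration becomes a separate multiply and add or a fused multiply-accumulate depends on the compiler and its flags.

```c
#include <stddef.h>

/* Single-precision dot product: one multiply-accumulate per element. */
static float dot_f32(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}
```

Using float rather than double matters on cores with a single-precision FPU, because double arithmetic there is done in software.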

Summary

In summary:

Use 32-bit MUL when a 32-bit result is enough
Use 64-bit UMULL/SMULL when you need the full 64-bit result
MUL is never slower than UMULL/SMULL and is faster on cores such as Cortex-M3; UMULL/SMULL keep the full product
Prefer shifts over multiply when factor is power of two
Use VMUL and VMLA for floating point math on cores with an FPU
On Cortex-M0/M0+ only the 32-bit MULS is available, so keep products within 32 bits where possible

Properly utilizing the ARM Cortex-M multiply instructions can help optimize code for both efficiency and precision in a variety of applications.
