What are Multiply instructions (32-bit result/64-bit result) in Arm Cortex-M series?

The ARM Cortex-M series of processors supports various multiply instructions that produce 32-bit or 64-bit results. These instructions allow efficient multiplication of data values held in registers. Knowing when to use a 32-bit versus a 64-bit multiply can help optimize code for performance and precision.

Contents

Overview of ARM Cortex-M Multiply Instructions
When to Use 32-bit vs 64-bit Multiply
32-bit Multiply Instructions
MUL
MLA
SMULBB, SMULBT, SMULTB, SMULTT
64-bit Multiply Instructions
UMULL and UMULLS
SMULL and SMULLS
Multiplying Constants
Choosing between MUL and UMULL/SMULL
Compiler Intrinsics for Multiply Operations
Multiply Instruction Timing
Considerations for Smaller Cortex-M Cores
Multiplying Floating Point Values
Summary

Overview of ARM Cortex-M Multiply Instructions

Here is a quick overview of the main multiply instructions in Cortex-M processors:

MUL: 32-bit x 32-bit multiply, 32-bit result (the low 32 bits of the product)
MLA: Multiply with accumulate, 32-bit operands, 32-bit result
SMULBB, SMULBT, SMULTB, SMULTT: Signed 16-bit x 16-bit multiply, 32-bit result (cores with the DSP extension, such as Cortex-M4 and Cortex-M7)
UMULL: Unsigned 32-bit x 32-bit multiply, 64-bit result
SMULL: Signed 32-bit x 32-bit multiply, 64-bit result

The 32-bit multiply instructions like MUL and MLA perform a 32-bit x 32-bit multiply and keep only the low 32 bits of the result. This is the efficient choice when the result is known to fit in 32 bits.

The 64-bit multiply instructions like UMULL and SMULL perform a 32-bit x 32-bit multiply but produce the full 64-bit result in a pair of registers. This preserves every bit of the product, at the cost of an extra destination register and, on some cores such as Cortex-M3, extra cycles.

When to Use 32-bit vs 64-bit Multiply

Choosing between 32-bit and 64-bit multiply depends on the data types and precision needed:

Use 32-bit multiply when multiplying 32-bit (unsigned int, signed int) values whose product is known to fit in 32 bits.
Use 64-bit multiply when multiplying 32-bit values whose product can exceed 32 bits and the upper bits matter.
Use 64-bit multiply for multi-word arithmetic, where the high 32 bits of each partial product must be carried into the next word.
Prefer 32-bit multiply when performance is critical, since it needs fewer registers and, on some cores, fewer cycles.
Prefer 64-bit multiply when precision is critical, since it keeps the full result.
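The difference between the two result widths can be sketched in C. This is a minimal illustration, not code from any particular library: the function names mul_low32 and mul_full64 are hypothetical, and the mapping to MUL and UMULL is what compilers typically generate on cores that have those instructions.

```c
#include <stdint.h>

/* 32-bit result: only the low 32 bits of the product survive,
   which is what a MUL instruction computes. */
uint32_t mul_low32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* 64-bit result: widening one operand before the multiply keeps the
   full product. Compilers typically emit UMULL for this pattern. */
uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

Multiplying 0x10000 by 0x10000 shows the trade-off: the 32-bit version returns 0 because the only set bit is bit 32, while the 64-bit version returns 0x100000000.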

32-bit Multiply Instructions

Let’s look at some common 32-bit multiply instructions in more detail:

MUL

MUL performs a 32-bit x 32-bit multiply and keeps the low 32 bits of the product. Because the low 32 bits are identical for signed and unsigned operands, the same instruction serves both. For example:

MUL R0, R1, R2

This multiplies the values in R1 and R2, discards any bits above bit 31, and stores the result in R0. R1 and R2 remain unchanged.
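A point worth checking is that the low 32 bits of a product do not depend on whether the operands are read as signed or unsigned, so MUL serves both cases. The sketch below models this in C; mul_model and mul_model_signed are hypothetical names, and the conversion back to int32_t assumes the usual two's complement behavior of mainstream compilers.

```c
#include <stdint.h>

/* Model of MUL: low 32 bits of the product. Unsigned arithmetic is
   used so that wraparound is well defined in C. */
uint32_t mul_model(uint32_t rn, uint32_t rm)
{
    return rn * rm;
}

/* Signed view of the same bits. The cast of an out-of-range value to
   int32_t is implementation-defined; common compilers wrap. */
int32_t mul_model_signed(int32_t rn, int32_t rm)
{
    return (int32_t)mul_model((uint32_t)rn, (uint32_t)rm);
}
```

Here -3 times 7 gives -21 through the signed wrapper, and the same bit patterns passed to mul_model give 0xFFFFFFEB, which is -21 in two's complement.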

MLA

MLA performs a 32-bit x 32-bit multiply with accumulate. It multiplies two 32-bit values, adds a 32-bit accumulate value, and produces a 32-bit result. Like MUL, it gives the same result bits for signed and unsigned operands. For example:

MLA R0, R1, R2, R3

This multiplies the values in R1 and R2, adds the 32-bit accumulate value in R3, keeps the low 32 bits, and stores the result in R0.
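As a rough model, MLA R0, R1, R2, R3 computes R0 = R1 * R2 + R3 with everything reduced to 32 bits. The C sketch below uses a hypothetical function name, mla_model, and unsigned arithmetic so the wraparound is well defined.

```c
#include <stdint.h>

/* Model of MLA Rd, Rn, Rm, Ra: Rd = (Rn * Rm + Ra) modulo 2^32. */
uint32_t mla_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return rn * rm + ra;
}
```

A typical use is the inner step of a sum of products, where ra holds the running total and the destination register is the same as the accumulator.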

SMULBB, SMULBT, SMULTB, SMULTT

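These perform signed 16-bit x 16-bit multiplies and produce a 32-bit result. The two suffix letters select the bottom (B, bits [15:0]) or top (T, bits [31:16]) halfword of the first and second operand. For example:

SMULBB R1, R2, R3 // R1 = R2[15:0] * R3[15:0]
SMULBT R1, R2, R3 // R1 = R2[15:0] * R3[31:16]
SMULTB R1, R2, R3 // R1 = R2[31:16] * R3[15:0]
SMULTT R1, R2, R3 // R1 = R2[31:16] * R3[31:16]

This supports efficient signed multiplies on 16-bit data packed two to a register. These instructions belong to the DSP extension, so they are available on cores such as Cortex-M4 and Cortex-M7 but not on Cortex-M0/M0+ or Cortex-M3.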
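The halfword selection can be modeled in portable C. This is a sketch under the architectural definition that B selects bits [15:0] and T selects bits [31:16], each treated as a signed 16-bit value; the helper and function names are hypothetical and are not the compiler intrinsics.

```c
#include <stdint.h>

/* Extract a halfword and sign-extend it. The cast to int16_t assumes
   two's complement conversion, as on mainstream compilers. */
static int32_t bottom16(uint32_t r) { return (int16_t)(r & 0xFFFFu); }
static int32_t top16(uint32_t r)    { return (int16_t)(r >> 16); }

/* First letter selects the halfword of rn, second letter that of rm.
   A 16 x 16 signed product always fits in 32 bits. */
int32_t smulbb_model(uint32_t rn, uint32_t rm) { return bottom16(rn) * bottom16(rm); }
int32_t smulbt_model(uint32_t rn, uint32_t rm) { return bottom16(rn) * top16(rm); }
int32_t smultb_model(uint32_t rn, uint32_t rm) { return top16(rn) * bottom16(rm); }
int32_t smultt_model(uint32_t rn, uint32_t rm) { return top16(rn) * top16(rm); }
```

With rn = 0x0003FFFE (top 3, bottom -2) and rm = 0xFFFB0004 (top -5, bottom 4), the four variants give -8, 10, 12, and -15 respectively.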

64-bit Multiply Instructions

Here are some key 64-bit multiply instructions:

UMULL and UMULLS

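UMULL performs an unsigned 32-bit x 32-bit multiply and produces a 64-bit result. The first destination register receives the low word and the second receives the high word:

UMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (unsigned long long result, R1 = low word, R2 = high word)

UMULLS, the flag-setting form, exists in the 32-bit ARM instruction set but has no Thumb encoding, so it is not available on Cortex-M processors.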
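The register pair written by UMULL can be pictured as the two halves of a 64-bit product. The following C sketch is a model only; the u64_pair type and umull_model name are hypothetical, with lo standing for the first destination register and hi for the second.

```c
#include <stdint.h>

/* Hypothetical holder for the two destination registers. */
typedef struct {
    uint32_t lo;  /* first destination register: bits [31:0] of the product */
    uint32_t hi;  /* second destination register: bits [63:32] of the product */
} u64_pair;

/* Model of UMULL RdLo, RdHi, Rn, Rm using a 64-bit intermediate. */
u64_pair umull_model(uint32_t rn, uint32_t rm)
{
    uint64_t product = (uint64_t)rn * rm;
    u64_pair result = { (uint32_t)product, (uint32_t)(product >> 32) };
    return result;
}
```

For the largest operands, 0xFFFFFFFF times 0xFFFFFFFF, the model returns lo = 0x00000001 and hi = 0xFFFFFFFE.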

SMULL and SMULLS

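SMULL performs a signed 32-bit x 32-bit multiply and produces a 64-bit result, again with the low word in the first destination register and the high word in the second:

SMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (signed long long result, R1 = low word, R2 = high word)

SMULLS, the flag-setting form, exists in the 32-bit ARM instruction set but has no Thumb encoding, so it is not available on Cortex-M processors.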
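Unlike the 32-bit case, the signed and unsigned long multiplies differ in the high word. The sketch below models SMULL in C; s64_pair and smull_model are hypothetical names, and the result halves are returned as raw 32-bit patterns to avoid implementation-defined shifts of negative values.

```c
#include <stdint.h>

/* Hypothetical holder for the two destination registers (raw bits). */
typedef struct {
    uint32_t lo;  /* bits [31:0] of the signed 64-bit product */
    uint32_t hi;  /* bits [63:32] of the signed 64-bit product */
} s64_pair;

/* Model of SMULL: sign-extend both operands, multiply, split. */
s64_pair smull_model(int32_t rn, int32_t rm)
{
    int64_t product = (int64_t)rn * rm;
    uint64_t bits = (uint64_t)product;   /* well-defined modular conversion */
    s64_pair result = { (uint32_t)bits, (uint32_t)(bits >> 32) };
    return result;
}
```

Multiplying -2 by 3 gives -6, so the model returns lo = 0xFFFFFFFA and hi = 0xFFFFFFFF. An unsigned multiply of the same bit patterns would instead produce hi = 0x00000002, which is why the signed and unsigned forms are separate instructions.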

Multiplying Constants

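When multiplying by a constant that is a power of two, consider using a left shift instead of a multiply. MUL has no immediate form, so the constant must first be loaded into a register. For example:

MOVS R2, #16
MUL R0, R1, R2

Could be replaced with:

LSL R0, R1, #4 // R0 = R1 << 4 = R1 * 16

The LSL (logical shift left) instruction avoids the extra register and the instruction needed to load the constant, and it is typically at least as fast as the multiply.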
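In C the two forms are equivalent for unsigned values, and optimizing compilers usually make this substitution on their own. The sketch below, with hypothetical function names, simply shows that both give the same 32-bit result, including when the product wraps.

```c
#include <stdint.h>

/* Multiply by 16 written as a multiplication. */
uint32_t times16_mul(uint32_t x)
{
    return x * 16u;
}

/* The same operation written as a left shift by 4 (16 = 2^4). */
uint32_t times16_shift(uint32_t x)
{
    return x << 4;
}
```

For x = 0x12345678 both functions return 0x23456780, since the top four bits are shifted out in either case.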

Choosing between MUL and UMULL/SMULL

To summarize, follow these guidelines when choosing between 32-bit and 64-bit multiply in Cortex-M code:

Use MUL when the product is known to fit in 32 bits, especially when performance is critical.
Use UMULL/SMULL when you need the full 64-bit multiplication result.
Use UMULL/SMULL for crypto code or other multi-word arithmetic that needs the high 32 bits of each product.
Use SMULL for signed multiplies and UMULL for unsigned multiplies.
Use shift instructions instead of MUL when multiplying by a power of two.

Proper use of 32-bit versus 64-bit multiply instructions can help optimize Cortex-M code for both performance and precision.

Compiler Intrinsics for Multiply Operations

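Here are some compiler intrinsics that map to the multiply instructions. Exact names, capitalization, and availability depend on the toolchain and headers, and most of these require a core with the DSP extension:

__SMULBB, __SMULBT, __SMULTB, __SMULTT – map to SMULxx instructions (signed 16-bit x 16-bit multiply)
__PKHBT, __PKHTB – pack halfwords, useful for preparing operands for halfword multiplies
__SMLABB, __SMLABT, __SMLATB, __SMLATT – signed 16-bit x 16-bit multiply accumulate
__SMLAD, __SMLADX – signed dual 16-bit multiply, both products added to a 32-bit accumulator
__SMLAL, __SMLALBB, __SMLALBT, __SMLALTB, __SMLALTT – signed multiply accumulate with a 64-bit accumulator
__SMLALD, __SMLALDX – signed dual 16-bit multiply, both products added to a 64-bit accumulator
__SMLAWB, __SMLAWT – signed 32-bit x 16-bit multiply accumulate, using the top 32 bits of the 48-bit product
__SMLSD, __SMLSDX – signed dual 16-bit multiply, difference of the products added to a 32-bit accumulator
__SMLSLD, __SMLSLDX – signed dual 16-bit multiply, difference of the products added to a 64-bit accumulator
__SMMLA, __SMMLAR – signed most significant word multiply accumulate (the R suffix rounds)
__SMMLS, __SMMLSR – signed most significant word multiply subtract (the R suffix rounds)
__SMMUL, __SMMULR – signed most significant word multiply (the R suffix rounds)
__SMUAD, __SMUADX – signed dual multiply add
__SMUSD, __SMUSDX – signed dual multiply subtract

UMULL and SMULL usually need no intrinsic: compilers generate them from a 32-bit x 32-bit multiply whose operands are cast to a 64-bit type.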

Check your compiler documentation for full details on these intrinsics. They can be useful for optimizing multiplies in Cortex-M code.
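To make the dual multiply accumulate concrete, here is a portable C model of the arithmetic behind the SMLAD instruction: the two bottom halfwords are multiplied, the two top halfwords are multiplied, and both products are added to the accumulator. This is a sketch and not the implementation of any vendor intrinsic; smlad_model and its helpers are hypothetical names, and the model ignores the Q (overflow) flag that the real instruction can set.

```c
#include <stdint.h>

/* Sign-extended halfword extraction (assumes two's complement casts). */
static int32_t lo_half(uint32_t r) { return (int16_t)(r & 0xFFFFu); }
static int32_t hi_half(uint32_t r) { return (int16_t)(r >> 16); }

/* Model of SMLAD: acc + x[15:0]*y[15:0] + x[31:16]*y[31:16],
   reduced modulo 2^32. Unsigned addition avoids signed overflow. */
uint32_t smlad_model(uint32_t x, uint32_t y, uint32_t acc)
{
    int32_t p_low  = lo_half(x) * lo_half(y);
    int32_t p_high = hi_half(x) * hi_half(y);
    return (uint32_t)p_low + (uint32_t)p_high + acc;
}
```

With x = 0x00030002, y = 0x00050004, and acc = 10, the result is 2 * 4 + 3 * 5 + 10 = 33. This is the pattern used when two 16-bit samples are packed into each 32-bit word of a filter or dot product.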

Multiply Instruction Timing

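Cycle timings for multiply instructions depend on the core. Typical figures from the Arm technical reference manuals are:

Cortex-M0/M0+: MULS takes 1 cycle with the fast multiplier option, or 32 cycles with the small multiplier option
Cortex-M3: MUL takes 1 cycle, MLA takes 2 cycles, UMULL/SMULL take 3 to 5 cycles depending on the operand values
Cortex-M4: MUL, MLA, SMULxx, and UMULL/SMULL each take 1 cycle

So MUL is fast on every core with a single-cycle multiplier, while the cost of UMULL and SMULL ranges from the same as MUL on Cortex-M4 to several cycles on Cortex-M3. On cores where the long multiplies are slower, this timing difference is another reason to prefer MUL when 32-bit precision is enough.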

Considerations for Smaller Cortex-M Cores

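The smaller Cortex-M cores like Cortex-M0/M0+ implement the ARMv6-M architecture and do not support all the multiply instructions. Key differences include:

Only MULS is available: a 32-bit x 32-bit multiply with a 32-bit result that always updates the condition flags
No MLA instruction
No UMULL or SMULL support, so a 64-bit product must be built in software from several 32-bit multiplies, typically by a compiler runtime routine
No SMULxx or other DSP extension instructions
Most 16-bit Thumb instructions, including MULS, can only use registers R0-R7

So for the smaller cores, keep products within 32 bits where possible, because a 64-bit multiply becomes a multi-instruction sequence or library call. On parts built with the 32-cycle multiplier, also consider shift instructions for constant factors.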
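On a core without UMULL, a full 64-bit product can be assembled from four 16-bit x 16-bit partial products, each of which fits in a 32-bit result. The sketch below shows one way to do this in C. It is an illustration of the schoolbook method, not the routine any particular compiler runtime uses, and umull_from_mul32 is a hypothetical name.

```c
#include <stdint.h>

/* Build a 32 x 32 -> 64 bit unsigned product using only multiplies
   whose results fit in 32 bits (as with MULS on Cortex-M0/M0+). */
uint64_t umull_from_mul32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four partial products; each is at most 0xFFFE0001. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Sum the contributions to bits [31:16]; the sum is below 2^18,
       so it cannot overflow and its upper bits are the carry. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    /* The 64-bit type is used only to return the two words together. */
    return ((uint64_t)hi << 32) | lo;
}
```

The result matches a direct widening multiply, for example 0xFFFFFFFF times 0xFFFFFFFF gives 0xFFFFFFFE00000001. The cost of four multiplies plus the carry handling is the reason to avoid 64-bit products on these cores when the algorithm allows it.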

Multiplying Floating Point Values

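To multiply floating point values, Cortex-M cores with the optional FPU (such as Cortex-M4F and Cortex-M7) provide instructions like VMUL, VMLA, and VFMA. These operate on the floating point registers S0-S31 (single precision) or D0-D15 (double precision), not on the core registers R0-R15. Single precision is supported by every Cortex-M FPU; double precision is available only on some cores, such as Cortex-M7 with the double precision FPU option.

VMUL performs a floating point multiply:

VMUL.F32 S0, S1, S2 // Single precision (32-bit) multiply
VMUL.F64 D0, D1, D2 // Double precision (64-bit) multiply

VMLA performs a floating point multiply accumulate:

VMLA.F32 S0, S1, S2 // S0 = S0 + S1 * S2 (32-bit float)
VMLA.F64 D0, D1, D2 // D0 = D0 + D1 * D2 (64-bit float)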

The floating point multiply instructions are useful when doing digital signal processing, matrix math, 3D math, or other numerically intensive algorithms.
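A dot product is a typical case where the floating point multiply accumulate pattern appears. The C sketch below uses a hypothetical function name, dot_product; whether the compiler emits separate multiply and add instructions or a fused multiply accumulate depends on the target FPU and compiler options, which is an assumption outside this code.

```c
#include <stddef.h>

/* Sum of element-wise products of two float arrays of length n.
   Each loop iteration is one multiply accumulate step. */
float dot_product(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

For a = {1, 2, 3} and b = {4, 5, 6}, the function returns 32. On a core without an FPU the same source still compiles, but each operation becomes a call into a software floating point library.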

Summary

In summary:

Use 32-bit MUL when 32-bit precision is enough
Use 64-bit UMULL/SMULL when you need the full 64-bit result
MUL needs fewer registers and is faster on some cores; UMULL/SMULL keep the full product
Prefer shifts over multiply when the factor is a power of two
Use the FPU instructions VMUL and VMLA for floating point math
On smaller Cortex-M cores, only the 32-bit MULS is available, so keep products within 32 bits where possible

Properly utilizing the ARM Cortex-M multiply instructions can help optimize code for both efficiency and precision in a variety of applications.
