The ARM Cortex-M series of processors supports various multiply instructions that produce 32-bit or 64-bit results. These instructions multiply data values held in registers. Knowing when to use a 32-bit versus a 64-bit multiply can help optimize code for performance and precision.

## Overview of ARM Cortex-M Multiply Instructions

Here is a quick overview of the main multiply instructions in Cortex-M processors:

- MUL: 32-bit x 32-bit multiply, low 32 bits of the product
- MLA: 32-bit x 32-bit multiply with 32-bit accumulate, 32-bit result
- SMULBB, SMULBT, SMULTB, SMULTT: Signed 16-bit x 16-bit multiply, 32-bit result (cores with the DSP extension, such as Cortex-M4 and Cortex-M7)
- UMULL: Unsigned 32-bit x 32-bit multiply, 64-bit result
- SMULL: Signed 32-bit x 32-bit multiply, 64-bit result

The 32-bit multiply instructions like MUL and MLA perform a 32-bit x 32-bit multiply and keep only the low 32 bits of the result. This is the efficient choice when the product is known to fit in 32 bits or when wrap-around modulo 2^32 is acceptable.

The 64-bit multiply instructions like UMULL and SMULL also perform a 32-bit x 32-bit multiply but keep the full 64-bit product. This preserves precision at the cost of a second destination register and, on some cores such as Cortex-M3, extra cycles.

## When to Use 32-bit vs 64-bit Multiply

Choosing between 32-bit and 64-bit multiply depends on the data types and precision needed:

- Use 32-bit multiply when multiplying 32-bit (unsigned int, signed int) values where a 32-bit result is enough.
- Use 64-bit multiply when multiplying 32-bit values whose product needs 64 bits.
- Use 64-bit multiply when the product can exceed 2^32 - 1, or when a later step such as a modulo reduction needs the full product.
- Prefer 32-bit multiply when performance is critical: it needs one destination register instead of two, and on some cores it takes fewer cycles.
- Prefer 64-bit multiply when precision is critical since it keeps the full result.
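The difference between the two result widths can be sketched in C. This is a minimal illustration, not compiler output: the function names `mul_lo32` and `mul_full64` are hypothetical, and which instruction a compiler actually emits depends on the target core and optimization settings.

```c
#include <stdint.h>

/* 32 x 32 -> 32: keeps only the low 32 bits of the product,
   which is the result a MUL instruction leaves in its destination. */
uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return a * b; /* unsigned arithmetic wraps modulo 2^32 */
}

/* 32 x 32 -> 64: widening one operand before the multiply keeps the
   full product. On cores that have UMULL, compilers typically use it
   for this pattern. */
uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

For example, `0x10000 * 0x10000` is 2^32: the 32-bit version returns 0, while the 64-bit version returns `0x100000000`.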

## 32-bit Multiply Instructions

Let’s look at some common 32-bit multiply instructions in more detail:

### MUL

MUL performs a 32-bit x 32-bit multiply and keeps the low 32 bits of the product. The low 32 bits are identical for signed and unsigned operands, so the same instruction serves both. For example: `MUL R0, R1, R2`

This multiplies the values in R1 and R2, discards the upper 32 bits of the 64-bit product, and stores the low 32 bits in R0. R1 and R2 remain unchanged.
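A short C sketch can show why one truncating multiply works for both signed and unsigned inputs. The helper names below are hypothetical; the signed version multiplies in 64 bits only to avoid signed-overflow undefined behaviour in C, not because the hardware does so.

```c
#include <stdint.h>

/* Low 32 bits of the product when the inputs are treated as unsigned. */
uint32_t mul_low_unsigned(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Low 32 bits of the product when the inputs are treated as signed.
   The 64-bit intermediate avoids signed overflow; the cast back to
   uint32_t keeps the low 32 bits. */
uint32_t mul_low_signed(int32_t a, int32_t b)
{
    return (uint32_t)((int64_t)a * (int64_t)b);
}
```

Feeding the same bit patterns to both functions gives the same 32-bit result, which is why the instruction set needs only one MUL.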

### MLA

MLA performs a 32-bit x 32-bit multiply with accumulate. It multiplies two 32-bit values, adds a 32-bit accumulate value, and keeps the low 32 bits of the result. Like MUL, it gives the same result for signed and unsigned operands. For example: `MLA R0, R1, R2, R3`

This multiplies the values in R1 and R2, adds the 32-bit accumulate value in R3, truncates the sum to 32 bits, and stores it in R0.
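As a rough C model of that behaviour (the name `mla_model` is hypothetical and chosen for illustration):

```c
#include <stdint.h>

/* Models MLA Rd, Rn, Rm, Ra: Rd = (Rn * Rm + Ra) mod 2^32.
   All three inputs and the result are 32 bits wide. */
uint32_t mla_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return rn * rm + ra; /* both operations wrap modulo 2^32 */
}
```

In ordinary C code an expression such as `acc + a * b` on 32-bit integers is the pattern a compiler can turn into MLA on cores that have it.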

### SMULBB, SMULBT, SMULTB, SMULTT

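These perform signed 16-bit x 16-bit multiplies and produce a 32-bit result, with no shift applied. The last two letters select the bottom (B, bits [15:0]) or top (T, bits [31:16]) halfword of the first and second operand respectively. They belong to the DSP extension, so they are available on cores such as Cortex-M4 and Cortex-M7 but not on Cortex-M3 or Cortex-M0/M0+. For example:

- `SMULBB R1, R2, R3` computes R1 = R2[15:0] * R3[15:0]
- `SMULBT R1, R2, R3` computes R1 = R2[15:0] * R3[31:16]
- `SMULTB R1, R2, R3` computes R1 = R2[31:16] * R3[15:0]
- `SMULTT R1, R2, R3` computes R1 = R2[31:16] * R3[31:16]

This supports efficient signed multiplies on 16-bit data, including two 16-bit values packed into one register.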
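The halfword selection can be modelled in portable C. This is a sketch of the arithmetic only; the `*_model` names and the `sext16` helper are hypothetical and are not compiler intrinsics.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
static int32_t sext16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* B = bottom halfword (bits [15:0]), T = top halfword (bits [31:16]).
   A signed 16 x 16 product always fits in 32 bits. */
int32_t smulbb_model(uint32_t rn, uint32_t rm) { return sext16(rn) * sext16(rm); }
int32_t smulbt_model(uint32_t rn, uint32_t rm) { return sext16(rn) * sext16(rm >> 16); }
int32_t smultb_model(uint32_t rn, uint32_t rm) { return sext16(rn >> 16) * sext16(rm); }
int32_t smultt_model(uint32_t rn, uint32_t rm) { return sext16(rn >> 16) * sext16(rm >> 16); }
```

With `rn = 0xFFFE0003` (top = -2, bottom = 3) and `rm = 0x0005FFFC` (top = 5, bottom = -4), the four models return -12, 15, 8, and -10.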

## 64-bit Multiply Instructions

Here are some key 64-bit multiply instructions:

### UMULL and UMULLS

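UMULL performs an unsigned 32-bit x 32-bit multiply and produces a 64-bit result in two registers, low word first: `UMULL R1, R2, R3, R4` computes R2:R1 = R3 * R4, where R1 holds the low word and R2 the high word of the unsigned long long result.

UMULLS, the flag-setting form, exists only in the A32 (ARM) instruction set. Cortex-M processors execute Thumb instructions only, and the Thumb encoding of UMULL has no flag-setting variant.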

### SMULL and SMULLS

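SMULL performs a signed 32-bit x 32-bit multiply and produces a 64-bit result in two registers, low word first: `SMULL R1, R2, R3, R4` computes R2:R1 = R3 * R4, where R1 holds the low word and R2 the high word of the signed long long result.

As with UMULL, the flag-setting form SMULLS exists only in the A32 (ARM) instruction set and is not available on Cortex-M.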
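To make the register-pair result concrete, here is a small C model of how a 64-bit product splits into a low and a high word. The `reg_pair` type and the `*_model` function names are hypothetical, introduced only for this sketch.

```c
#include <stdint.h>

/* A 64-bit product as the long multiplies return it: two 32-bit words. */
typedef struct {
    uint32_t lo; /* first destination register, bits [31:0]   */
    uint32_t hi; /* second destination register, bits [63:32] */
} reg_pair;

/* Unsigned 32 x 32 -> 64 multiply. */
reg_pair umull_model(uint32_t rn, uint32_t rm)
{
    uint64_t p = (uint64_t)rn * rm;
    reg_pair r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* Signed 32 x 32 -> 64 multiply. */
reg_pair smull_model(int32_t rn, int32_t rm)
{
    uint64_t p = (uint64_t)((int64_t)rn * rm);
    reg_pair r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}
```

The bit pattern `0xFFFFFFFE` is -2 when signed. Multiplying it by 3 gives the same low word (`0xFFFFFFFA`) in both models, but the high word is `0xFFFFFFFF` for the signed multiply and `2` for the unsigned one. That difference in the upper half is why separate signed and unsigned long multiplies exist.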

## Multiplying Constants

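When multiplying by a constant power of two, consider using a left shift instead of a multiply. MUL has no immediate operand form, so multiplying by 16 needs the constant in a register first, for example: `MOV R2, #16` followed by `MUL R0, R1, R2`

This could be replaced with: `LSL R0, R1, #4` (R0 = R1 << 4 = R1 * 16)

The LSL (logical shift left) instruction saves the extra instruction and register needed to load the constant, so it is often more efficient than a constant multiply.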
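The same idea expressed in C, as a sketch with hypothetical function names. The second function shows how a constant that is not a power of two can be split into shifts and an add. Optimizing compilers usually make this choice themselves, so writing `x * 16` in source code is normally fine.

```c
#include <stdint.h>

/* x * 16 as a single shift, since 16 = 2^4. */
uint32_t times16(uint32_t x)
{
    return x << 4;
}

/* x * 10 as two shifts and an add, since 10 = 8 + 2. */
uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Both functions wrap modulo 2^32 exactly as the corresponding multiplies would.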

## Choosing between MUL and UMULL/SMULL

To summarize, follow these guidelines when choosing between 32-bit and 64-bit multiply in Cortex-M code:

- Use MUL when 32-bit precision is enough for the multiplication result.
- Use UMULL/SMULL when you need 64-bit precision for the multiplication result.
- Use MUL when performance is critical and you only need 32-bit precision.
- Use UMULL/SMULL for crypto code or modular arithmetic where the full 64-bit product is needed before reducing it.
- Use SMULL for signed multiplies and UMULL for unsigned multiplies.
- Use shift instructions instead of MUL when multiplying by a power of two.

Proper use of 32-bit versus 64-bit multiply instructions can help optimize Cortex-M code for both performance and precision.
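Modular multiplication is a case where the full product matters. The sketch below uses a hypothetical helper `mulmod32` and assumes `m` is nonzero. The widened multiply is the kind of operation UMULL provides; the 64-bit remainder is typically handled by a runtime library routine on these cores.

```c
#include <stdint.h>

/* Returns (a * b) mod m for m != 0.
   The product is formed in 64 bits so no high bits are lost before
   the reduction. */
uint32_t mulmod32(uint32_t a, uint32_t b, uint32_t m)
{
    return (uint32_t)(((uint64_t)a * b) % m);
}
```

With `a = b = 0xFFFFFFFF` and `m = 0xFFFFFFFB`, each operand is congruent to 4, so the correct result is 16. Truncating the product to 32 bits first would leave 1 and give the wrong answer.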

## Compiler Intrinsics for Multiply Operations

Here are some compiler intrinsics that map to the multiply instructions:

- __SMULBB, __SMULBT, __SMULTB, __SMULTT – map to SMULxx instructions
- __PKHBT, __PKHTB – pack halfwords, useful for multiplies
- __SMLABB, __SMLABT, __SMLATB, __SMLATT – signed multiply accumulate
- __SMLAD, __SMLADX – signed multiply accumulate dual
- __SMLAL, __SMLALBB, __SMLALBT, __SMLALTB, __SMLALTT – 64-bit signed multiply accumulate
- __SMLALD, __SMLALDX – signed multiply accumulate long dual (64-bit accumulator)
- __SMLAWB, __SMLAWT – signed 32-bit x 16-bit multiply accumulate, keeping the top 32 bits of the 48-bit product
- __SMLSD, __SMLSDX – signed multiply subtract dual
- __SMLSLD, __SMLSLDX – signed multiply subtract dual accumulate long
- __SMMLA, __SMMLAR – signed most significant word multiply accumulate (the R variant rounds)
- __SMMLS, __SMMLSR – signed most significant word multiply subtract (the R variant rounds)
- __SMMUL, __SMMULR – signed most significant word multiply (the R variant rounds)
- __SMUAD, __SMUADX – signed dual multiply add
- __SMUSD, __SMUSDX – signed dual multiply subtract
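- UMULL, SMULL – usually no intrinsic is needed; compilers typically emit these for a widened multiply such as `(uint64_t)a * b`

Check your compiler documentation for full details on these intrinsics. Names and availability vary by toolchain: CMSIS-Core headers use uppercase names such as __SMLAD, while the Arm C Language Extensions (ACLE) use lowercase names such as __smulbb, and not every name above exists in every toolchain. Most of them require a core with the DSP extension.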
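When an intrinsic is not available, for instance in host-side unit tests, a plain C model can stand in for it. The sketch below models the arithmetic of the SMLAD instruction (dual 16-bit multiply with 32-bit accumulate). The name `smlad_model` is hypothetical, this is not the CMSIS implementation, and the Q (overflow) flag that the real instruction can set is not modelled.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
static int32_t sext16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Result = acc + bottom(x) * bottom(y) + top(x) * top(y),
   wrapping modulo 2^32. Overflow detection is not modelled. */
uint32_t smlad_model(uint32_t x, uint32_t y, uint32_t acc)
{
    uint32_t p_bottom = (uint32_t)(sext16(x) * sext16(y));
    uint32_t p_top    = (uint32_t)(sext16(x >> 16) * sext16(y >> 16));
    return acc + p_bottom + p_top;
}
```

For `x = 0x00020003`, `y = 0x00040005`, and `acc = 100`, the model returns 3 * 5 + 2 * 4 + 100 = 123.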

## Multiply Instruction Timing

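Cycle counts depend on the core. Typical figures from the Arm technical reference manuals for Cortex-M3 and Cortex-M4 are:

- MUL – 1 cycle on Cortex-M3 and Cortex-M4
- MLA – 2 cycles on Cortex-M3, 1 cycle on Cortex-M4
- SMULxx – 1 cycle on Cortex-M4 (not available on Cortex-M3)
- UMULL/SMULL – 3 to 5 cycles on Cortex-M3 depending on the operand values, 1 cycle on Cortex-M4

So on Cortex-M3 the 64-bit UMULL and SMULL are noticeably slower than MUL, which is another reason to prefer MUL there when 32-bit precision is enough. On Cortex-M4 the cycle counts are the same, and the remaining cost of a 64-bit multiply is the second destination register it occupies.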

## Considerations for Smaller Cortex-M Cores

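The smaller Cortex-M cores like Cortex-M0/M0+ (Armv6-M) do not support all the multiply instructions. Key differences include:

- Only MULS is available: a 32-bit x 32-bit multiply that keeps the low 32 bits and always sets the flags
- MULS takes either 1 cycle or 32 cycles, depending on which multiplier option the chip vendor selected
- No MLA instruction
- No UMULL or SMULL support, so a 64-bit product has to be built in software from several 32-bit multiplies
- No SMULxx or other DSP extension instructions
- Most instructions can only access the low registers R0-R7

So for the smaller cores, avoid 64-bit products where a 32-bit result is enough, since each one expands into a multi-instruction sequence or a library call. Check whether your device has the single-cycle or the 32-cycle multiplier, and on the slow variant consider shifts and adds for constant factors.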
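One way to build a 64-bit product without a long multiply is to split each operand into 16-bit halves, so that every partial product fits in 32 bits. The sketch below illustrates that technique under a hypothetical name, `umull_from_mul32`; it is not the routine any particular compiler runtime uses.

```c
#include <stdint.h>

/* Unsigned 32 x 32 -> 64 multiply using only 32-bit multiplies. */
uint64_t umull_from_mul32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four 16 x 16 partial products; each is below 2^32. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Sum the terms that land in bits 16..31; bits above 15 of this
       sum are the carry into the high word. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}
```

The four multiplies plus the shifts and adds show why a 64-bit product is comparatively expensive on these cores.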

## Multiplying Floating Point Values

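To multiply floating point values, Cortex-M cores with an FPU (for example Cortex-M4F and Cortex-M7) provide instructions such as VMUL, VMLA, and VFMA. These operate on the FPU registers (S0-S31 for single precision, D0-D15 for double precision) rather than on the integer registers. Every Cortex-M FPU supports single precision; double precision requires a double-precision FPU, which is an option on Cortex-M7 but not on Cortex-M4.

VMUL performs a floating point multiply:

- `VMUL.F32 S0, S1, S2` is a single precision (32-bit) multiply, S0 = S1 * S2
- `VMUL.F64 D0, D1, D2` is a double precision (64-bit) multiply, D0 = D1 * D2

VMLA performs a floating point multiply accumulate, and VFMA is the fused variant that rounds only once:

- `VMLA.F32 S0, S1, S2` is a 32-bit float multiply accumulate, S0 = S0 + S1 * S2
- `VMLA.F64 D0, D1, D2` is a 64-bit float multiply accumulate, D0 = D0 + D1 * D2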

The floating point multiply instructions are useful when doing digital signal processing, matrix math, 3D math, or other numerically intensive algorithms.
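In C, a multiply accumulate loop such as the dot product below is the typical source pattern for these instructions. This is a minimal sketch with a hypothetical function name; whether the compiler emits FPU multiply accumulate instructions or calls a software floating point library depends on the target FPU and the floating point ABI options in use.

```c
#include <stddef.h>

/* Single precision dot product: acc += a[i] * b[i] for each element. */
float dot_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Using `float` rather than `double` keeps the arithmetic in single precision, which matters on cores whose FPU has no double precision support.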

## Summary

In summary:

- Use 32-bit MUL when 32-bit precision is enough
- Use 64-bit UMULL/SMULL when you need the full 64-bit result
- MUL needs one destination register and is faster on some cores; UMULL/SMULL keep the full product
- Prefer shifts over multiply when the factor is a power of two
- Use VMUL and VMLA for floating point math on cores with an FPU
- On Cortex-M0/M0+ only MULS is available, so avoid 64-bit products where possible

Properly utilizing the ARM Cortex-M multiply instructions can help optimize code for both efficiency and precision in a variety of applications.