What are Multiply instructions (32-bit result/64-bit result) in Arm Cortex-M series?

Scott Allen
Last updated: September 17, 2023 1:43 pm

The ARM Cortex-M series of processors supports several multiply instructions that produce either a 32-bit or a 64-bit result from operands held in registers. Knowing when to use each form helps balance performance against precision.

Contents

  • Overview of ARM Cortex-M Multiply Instructions
  • When to Use 32-bit vs 64-bit Multiply
  • 32-bit Multiply Instructions
  • MUL
  • MLA
  • SMULBB, SMULBT, SMULTB, SMULTT
  • 64-bit Multiply Instructions
  • UMULL and UMULLS
  • SMULL and SMULLS
  • Multiplying Constants
  • Choosing between MUL and UMULL/SMULL
  • Compiler Intrinsics for Multiply Operations
  • Multiply Instruction Timing
  • Considerations for Smaller Cortex-M Cores
  • Multiplying Floating Point Values
  • Summary

Overview of ARM Cortex-M Multiply Instructions

Here is a quick overview of the main multiply instructions in Cortex-M processors:

  • MUL: 32-bit x 32-bit multiply, low 32 bits of the product kept
  • MLA: 32-bit x 32-bit multiply with a 32-bit accumulate, 32-bit result
  • SMULBB, SMULBT, SMULTB, SMULTT: Signed 16-bit x 16-bit multiply, 32-bit result (cores with the DSP extension, such as Cortex-M4 and Cortex-M7)
  • UMULL: Unsigned 32-bit x 32-bit multiply, 64-bit result
  • SMULL: Signed 32-bit x 32-bit multiply, 64-bit result

The 32-bit multiply instructions MUL and MLA perform a 32-bit x 32-bit multiply and keep only the low 32 bits of the result. They are the right choice when the product is known to fit in 32 bits or when wraparound modulo 2^32 is acceptable.

The 64-bit multiply instructions UMULL and SMULL also take two 32-bit operands but write the full 64-bit product to a pair of registers. No bits are lost, at the cost of a second destination register and, on some cores such as Cortex-M3, extra cycles.
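
The difference between the two result widths can be sketched in C. This is a minimal illustration, not compiler output: the function names product_low32 and product_full64 are hypothetical, and the mapping to MUL and UMULL is what compilers for Cortex-M3 and later typically choose rather than a guarantee.

```c
#include <stdint.h>

/* Low 32 bits of the product, as MUL leaves them in its destination. */
uint32_t product_low32(uint32_t a, uint32_t b)
{
    return a * b;               /* unsigned arithmetic wraps modulo 2^32 */
}

/* Full 64-bit product, the value UMULL spreads over two registers. */
uint64_t product_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;     /* widen one operand before multiplying */
}
```

With both operands equal to 0x10000, the first function returns 0 because the only set bit of the product is bit 32, while the second returns 0x100000000.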

When to Use 32-bit vs 64-bit Multiply

Choosing between 32-bit and 64-bit multiply depends on the data types and precision needed:

  • Use 32-bit multiply when multiplying 32-bit values (unsigned int, signed int) whose product fits in 32 bits, or when wraparound modulo 2^32 is intended.
  • Use 64-bit multiply when the operands are 32-bit but the product can exceed 32 bits and the upper bits matter.
  • Use 64-bit multiply as the building block for operands wider than 32 bits, or for modular arithmetic whose intermediate products exceed 2^32.
  • Prefer 32-bit multiply when performance is critical, since it needs one destination register instead of two and is never slower than the 64-bit form.
  • Prefer 64-bit multiply when precision is critical, since it keeps the full result.
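
One common case where the upper bits matter is scaling a 32-bit value into a smaller range by keeping only the high word of the product. The sketch below assumes a hypothetical helper name, scale_to_range, and assumes the compiler lowers the widened multiply to UMULL on cores that have it.

```c
#include <stdint.h>

/* Map a 32-bit value x onto the range [0, n) using the high word of x * n. */
uint32_t scale_to_range(uint32_t x, uint32_t n)
{
    uint64_t wide = (uint64_t)x * n;   /* full 64-bit product */
    return (uint32_t)(wide >> 32);     /* high word = floor(x * n / 2^32) */
}
```

A 32-bit multiply could not be used here, because the answer lives entirely in the bits that MUL discards.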

32-bit Multiply Instructions

Let’s look at some common 32-bit multiply instructions in more detail:

MUL

MUL performs a 32-bit x 32-bit multiply and keeps the low 32 bits of the product. Because the low 32 bits are identical for signed and unsigned operands, the same instruction serves both. For example: MUL R0, R1, R2

This multiplies the values in R1 and R2 and stores the low 32 bits of the product in R0; any upper bits are discarded. R1 and R2 remain unchanged.

MLA

MLA performs a 32-bit x 32-bit multiply with accumulate. It multiplies two 32-bit values, adds a third 32-bit value, and keeps the low 32 bits of the sum. As with MUL, the result is the same for signed and unsigned operands. For example: MLA R0, R1, R2, R3

This computes R0 = R3 + (R1 * R2), truncated to 32 bits. R1, R2, and R3 remain unchanged.
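
A multiply accumulate step and the loop it usually appears in can be modelled in C as follows. The names mla32 and dot_u32 are hypothetical, and whether the compiler actually emits MLA for the loop body depends on the target core and optimization settings.

```c
#include <stddef.h>
#include <stdint.h>

/* One MLA-style step: acc + a * b, keeping only the low 32 bits. */
uint32_t mla32(uint32_t a, uint32_t b, uint32_t acc)
{
    return acc + a * b;         /* both operations wrap modulo 2^32 */
}

/* Dot product built from repeated multiply accumulate steps. */
uint32_t dot_u32(const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = mla32(x[i], y[i], acc);
    }
    return acc;
}
```

The accumulator is 32 bits wide, so a long dot product can wrap; a 64-bit accumulator is needed when that is not acceptable.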

SMULBB, SMULBT, SMULTB, SMULTT

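These perform signed 16-bit x 16-bit multiplies and produce a 32-bit result. The last two letters select which halfword of each source register is used: B is the bottom halfword (bits [15:0]) and T is the top halfword (bits [31:16]). No shift is applied to the result. For example:

SMULBB R1, R2, R3 // R1 = R2[15:0] * R3[15:0]
SMULBT R1, R2, R3 // R1 = R2[15:0] * R3[31:16]
SMULTB R1, R2, R3 // R1 = R2[31:16] * R3[15:0]
SMULTT R1, R2, R3 // R1 = R2[31:16] * R3[31:16]

These instructions belong to the DSP extension (Cortex-M4, Cortex-M7, and later cores that implement it), so they are not available on Cortex-M0/M0+ or Cortex-M3. They allow 16-bit values packed two to a register to be multiplied without unpacking them first.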
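
The halfword selection can be modelled in portable C. This is a sketch of the arithmetic only; the function names mirror the instruction mnemonics for readability and are not compiler intrinsics, and the helper sign_extend16 is hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a signed 32-bit value. */
static int32_t sign_extend16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

static int32_t bottom(uint32_t r) { return sign_extend16(r); }        /* bits [15:0]  */
static int32_t top(uint32_t r)    { return sign_extend16(r >> 16); }  /* bits [31:16] */

/* Models of the four halfword multiplies: first letter selects the half of rn,
   second letter selects the half of rm. */
int32_t smulbb(uint32_t rn, uint32_t rm) { return bottom(rn) * bottom(rm); }
int32_t smulbt(uint32_t rn, uint32_t rm) { return bottom(rn) * top(rm); }
int32_t smultb(uint32_t rn, uint32_t rm) { return top(rn) * bottom(rm); }
int32_t smultt(uint32_t rn, uint32_t rm) { return top(rn) * top(rm); }
```

For example, with rn = 0xFFFE0003 (top = -2, bottom = 3) and rm = 0x0004FFFB (top = 4, bottom = -5), the four functions return -15, 12, 10, and -8 respectively.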

64-bit Multiply Instructions

Here are some key 64-bit multiply instructions:

UMULL and UMULLS

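UMULL performs an unsigned 32-bit x 32-bit multiply and produces a 64-bit result. The first destination register receives the low word and the second receives the high word: UMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (unsigned long long result, R1 = low word, R2 = high word)

UMULLS is the flag-setting form. It exists in the A32 (ARM) instruction set but not in the Thumb instruction set that Cortex-M processors execute, so on Cortex-M a separate compare is needed to test the result.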

SMULL and SMULLS

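SMULL performs a signed 32-bit x 32-bit multiply and produces a 64-bit result. As with UMULL, the first destination register receives the low word and the second receives the high word: SMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (signed long long result, R1 = low word, R2 = high word)

SMULLS is the flag-setting form. Like UMULLS, it exists in the A32 (ARM) instruction set but is not available in the Thumb instruction set used by Cortex-M processors.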
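
The following C sketch models how the 64-bit product is split across the two destination registers, and why the signed and unsigned forms are separate instructions. The type reg_pair and the functions umull_model and smull_model are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* The two destination registers of a long multiply. */
typedef struct {
    uint32_t lo;   /* first destination register: bits [31:0]  */
    uint32_t hi;   /* second destination register: bits [63:32] */
} reg_pair;

/* Unsigned 32 x 32 -> 64 multiply, split into low and high words. */
reg_pair umull_model(uint32_t rn, uint32_t rm)
{
    uint64_t p = (uint64_t)rn * rm;
    reg_pair r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* Signed 32 x 32 -> 64 multiply, split into low and high words. */
reg_pair smull_model(int32_t rn, int32_t rm)
{
    int64_t p = (int64_t)rn * rm;
    uint64_t bits = (uint64_t)p;   /* conversion is defined modulo 2^64 */
    reg_pair r = { (uint32_t)bits, (uint32_t)(bits >> 32) };
    return r;
}
```

The bit pattern 0xFFFFFFFF multiplied by 2 gives the same low word (0xFFFFFFFE) in both models, but the high word is 1 when the operand is treated as unsigned and 0xFFFFFFFF when it is treated as -1. Only the high word depends on signedness.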

Multiplying Constants

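When multiplying by a constant that is a power of two, consider using a left shift instead of a multiply. MUL has no immediate form, so the constant must first be loaded into a register. For example: MOVS R2, #16 followed by MUL R0, R1, R2

Could be replaced with: LSL R0, R1, #4 // R0 = R1 << 4 = R1 * 16

The LSL (logical shift left) instruction saves the constant load and the extra register. Optimizing compilers usually make this substitution automatically.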
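
The same idea in C, as a small sketch with hypothetical function names. The second function extends the technique to a constant that is not a power of two by summing two shifts, which is an addition to the case described above rather than something the original example covers.

```c
#include <stdint.h>

/* x * 16 as a single left shift (16 = 2^4). */
uint32_t times16(uint32_t x)
{
    return x << 4;
}

/* x * 10 as two shifts and an add (10 = 8 + 2). */
uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Both functions wrap modulo 2^32 exactly as the equivalent MUL would, so they are drop-in replacements for unsigned operands.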

Choosing between MUL and UMULL/SMULL

To summarize, follow these guidelines when choosing between 32-bit and 64-bit multiply in Cortex-M code:

  • Use MUL when the product fits in 32 bits or only the low 32 bits are needed.
  • Use UMULL/SMULL when the full 64-bit product is needed.
  • Use UMULL as the building block for multi-word arithmetic, such as crypto code, and for modular arithmetic whose intermediate products exceed 2^32.
  • Use SMULL for signed multiplies and UMULL for unsigned multiplies.
  • Use shift instructions instead of MUL when multiplying by a power of two.

Proper use of 32-bit versus 64-bit multiply instructions can help optimize Cortex-M code for both performance and precision.
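
Modular multiplication shows why the choice matters. In the sketch below the function names are hypothetical, and the second function exists only to show the failure mode. Note that the 64-bit remainder is typically computed by a runtime library routine on Cortex-M rather than by a single instruction.

```c
#include <stdint.h>

/* (a * b) mod m computed from the full 64-bit product. m must be non-zero. */
uint32_t mulmod_u32(uint32_t a, uint32_t b, uint32_t m)
{
    uint64_t wide = (uint64_t)a * b;   /* no bits lost before the reduction */
    return (uint32_t)(wide % m);
}

/* Same expression with a 32-bit product: wrong whenever a * b >= 2^32. */
uint32_t mulmod_truncated(uint32_t a, uint32_t b, uint32_t m)
{
    return (a * b) % m;                /* product already reduced modulo 2^32 */
}
```

With a = b = 0x10000 and m = 1000000007, the product is exactly 2^32. The first function returns 294967268, while the truncated version returns 0 because the 32-bit product has already wrapped.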

Compiler Intrinsics for Multiply Operations

Here are some compiler intrinsics that map to the multiply instructions. Most of them target the DSP extension, and their availability and spelling depend on the toolchain: CMSIS headers use uppercase names such as __SMLAD, while the Arm C Language Extensions (ACLE) use lowercase names such as __smlad.

  • __SMULBB, __SMULBT, __SMULTB, __SMULTT – signed 16-bit x 16-bit multiply, map to SMULxx instructions
  • __PKHBT, __PKHTB – pack halfwords, useful for preparing operands for the halfword multiplies
  • __SMLABB, __SMLABT, __SMLATB, __SMLATT – signed 16-bit x 16-bit multiply accumulate
  • __SMLAD, __SMLADX – signed dual multiply, both products added to an accumulator
  • __SMLAL, __SMLALBB, __SMLALBT, __SMLALTB, __SMLALTT – signed multiply accumulate with a 64-bit accumulator
  • __SMLALD, __SMLALDX – signed dual multiply, both products added to a 64-bit accumulator
  • __SMLAWB, __SMLAWT – signed 32-bit x 16-bit multiply accumulate, keeping the top 32 bits of the 48-bit product
  • __SMLSD, __SMLSDX – signed dual multiply, difference of the products added to an accumulator
  • __SMLSLD, __SMLSLDX – signed dual multiply, difference of the products added to a 64-bit accumulator
  • __SMMLA, __SMMLAR – signed most significant word multiply accumulate (the R form rounds)
  • __SMMLS, __SMMLSR – signed most significant word multiply subtract (the R form rounds)
  • __SMMUL, __SMMULR – signed most significant word multiply (the R form rounds)
  • __SMUAD, __SMUADX – signed dual multiply add
  • __SMUSD, __SMUSDX – signed dual multiply subtract
  • UMULL and SMULL – usually need no intrinsic, since compilers emit them when a 64-bit product of two 32-bit operands is written in plain C

Check your compiler documentation for full details on these intrinsics. They can be useful for optimizing multiplies in Cortex-M code.
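
As a hedged sketch of how such an intrinsic might be wrapped, the function below uses the ACLE intrinsic __smlad when the compiler reports 32-bit SIMD support and otherwise falls back to a portable model of the same arithmetic. The wrapper name dual_mac and the helper sign_extend16 are hypothetical. The fallback assumes the sum fits in 32 bits, and it does not model the Q flag that the instruction sets on overflow.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>   /* ACLE header that declares __smlad */
#else
/* Sign-extend the low 16 bits of h to a signed 32-bit value. */
static int32_t sign_extend16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}
#endif

/* acc + bottom(x)*bottom(y) + top(x)*top(y), with signed 16-bit halves. */
int32_t dual_mac(uint32_t x, uint32_t y, int32_t acc)
{
#if defined(__ARM_FEATURE_SIMD32)
    return __smlad((int16x2_t)x, (int16x2_t)y, acc);
#else
    int32_t lo = sign_extend16(x) * sign_extend16(y);
    int32_t hi = sign_extend16(x >> 16) * sign_extend16(y >> 16);
    return acc + lo + hi;   /* assumes no 32-bit overflow */
#endif
}
```

Keeping the intrinsic behind one wrapper means the same source builds for a Cortex-M4 target and for a host-side unit test.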

Multiply Instruction Timing

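Cycle counts for multiply instructions depend on the core. Typical figures from the ARM technical reference manuals are:

  • Cortex-M0/M0+ – MULS takes 1 cycle with the fast multiplier option or 32 cycles with the small iterative multiplier, depending on how the chip vendor configured the core
  • Cortex-M3 – MUL takes 1 cycle, MLA takes 2 cycles, and UMULL/SMULL take 3 to 5 cycles depending on the operand values
  • Cortex-M4 – MUL, MLA, SMULxx, and UMULL/SMULL each take 1 cycle

So MUL is fast on every core that has the single-cycle multiplier, while the cost of UMULL and SMULL ranges from equal to MUL on Cortex-M4 to several times higher on Cortex-M3. On cores where the 64-bit forms are slower, this timing difference is another reason to prefer MUL when 32-bit precision is enough. Check the technical reference manual for your specific core before relying on these figures.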

Considerations for Smaller Cortex-M Cores

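The smaller Cortex-M cores like Cortex-M0/M0+ implement the Armv6-M architecture and do not support all the multiply instructions. Key differences include:

  • Only MULS is available: a 32-bit x 32-bit multiply that keeps the low 32 bits and updates the N and Z flags
  • No MLA instruction, so a multiply accumulate needs a MULS followed by an ADD
  • No UMULL or SMULL support, so a 64-bit product must be assembled in software from 16-bit x 16-bit partial products
  • No SMULxx or other DSP extension instructions
  • Same 16 core registers, but most 16-bit Thumb instructions, including MULS, can only use the low registers R0 to R7

So for the smaller cores, keep products within 32 bits where possible, since a 64-bit product costs several multiplies plus shifts and adds. On parts built with the small iterative multiplier, where MULS takes 32 cycles, consider using shift and add sequences for multiplication by constants as well.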
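
The software construction of a 64-bit product from 16-bit partial products can be sketched as follows. It uses only 32-bit operations, which is the constraint on these cores. The type u64_pair and the function mul32x32_wide are hypothetical names, and in practice the compiler's runtime library provides an equivalent routine automatically when a 64-bit product is written in C.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit words. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Unsigned 32 x 32 -> 64 multiply using four 16 x 16 -> 32 partial products. */
u64_pair mul32x32_wide(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   /* weight 2^0  */
    uint32_t lh = a_lo * b_hi;   /* weight 2^16 */
    uint32_t hl = a_hi * b_lo;   /* weight 2^16 */
    uint32_t hh = a_hi * b_hi;   /* weight 2^32 */

    /* Sum of everything that lands in bits [31:16]; at most 3 * 0xFFFF,
       so it cannot overflow 32 bits. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    u64_pair r;
    r.lo = (mid << 16) | (ll & 0xFFFFu);
    r.hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);   /* carries from mid */
    return r;
}
```

Each partial product is a single MULS, so the whole routine costs four multiplies plus the shifts, masks, and adds that combine them.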

Multiplying Floating Point Values

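To multiply floating point values, ARM Cortex-M cores with the optional FPU (such as Cortex-M4F and Cortex-M7) provide instructions like VMUL and VMLA. These operate on the floating point registers (S0 to S31 for single precision, D0 to D15 for double precision) rather than on the core registers R0 to R15. Single precision is supported by every Cortex-M FPU; double precision is supported only on cores with a double-precision FPU, such as some Cortex-M7 configurations.

VMUL performs a 32-bit or 64-bit float multiply:

VMUL.F32 S0, S1, S2 // Single precision (32-bit) multiply
VMUL.F64 D0, D1, D2 // Double precision (64-bit) multiply

VMLA performs a floating point multiply accumulate:

VMLA.F32 S0, S1, S2 // S0 = S0 + (S1 * S2), 32-bit float multiply accumulate
VMLA.F64 D0, D1, D2 // D0 = D0 + (D1 * D2), 64-bit float multiply accumulate

FMULS and FMULD are the older pre-UAL spellings of VMUL.F32 and VMUL.F64 and may still appear in legacy code.

The floating point multiply instructions are useful when doing digital signal processing, matrix math, 3D math, or other numerically intensive algorithms.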
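
In C, no special syntax is needed to reach the FPU multiply instructions. The sketch below uses a hypothetical function name, dot_f32, and assumes the code is built for a core with an FPU and a hard-float configuration (for example, GCC with -mfpu=fpv4-sp-d16 -mfloat-abi=hard), in which case the compiler typically emits single-precision multiply and multiply accumulate instructions for the loop body. Without an FPU, the same source compiles to software floating point library calls.

```c
#include <stddef.h>

/* Single-precision dot product: one multiply accumulate per element. */
float dot_f32(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * y[i];
    }
    return acc;
}
```

Keeping the data in float rather than double matters on single-precision FPUs, because double arithmetic falls back to software routines there.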

Summary

In summary:

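  • Use 32-bit MUL when only the low 32 bits of the product are needed
  • Use 64-bit UMULL/SMULL when you need the full 64-bit result
  • MUL is never slower than UMULL/SMULL and uses one fewer register; UMULL/SMULL keep the full product
  • Prefer shifts over multiply when the factor is a power of two
  • Use the FPU multiply instructions (VMUL, VMLA) for floating point math on cores with an FPU
  • Keep products within 32 bits on smaller Cortex-M cores, which lack UMULL/SMULL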

Properly utilizing the ARM Cortex-M multiply instructions can help optimize code for both efficiency and precision in a variety of applications.
