What are Multiply instructions (32-bit result/64-bit result) in Arm Cortex-M series?

Scott Allen
Last updated: September 17, 2023 1:43 pm

The ARM Cortex-M series of processors supports several multiply instructions that produce either a 32-bit or a 64-bit result. These instructions perform multiplication directly on values held in registers. Knowing when to use a 32-bit versus a 64-bit multiply helps balance performance, register usage, and precision.

Contents
  • Overview of ARM Cortex-M Multiply Instructions
  • When to Use 32-bit vs 64-bit Multiply
  • 32-bit Multiply Instructions
      • MUL
      • MLA
      • SMULBB, SMULBT, SMULTB, SMULTT
  • 64-bit Multiply Instructions
      • UMULL and UMULLS
      • SMULL and SMULLS
  • Multiplying Constants
  • Choosing between MUL and UMULL/SMULL
  • Compiler Intrinsics for Multiply Operations
  • Multiply Instruction Timing
  • Considerations for Smaller Cortex-M Cores
  • Multiplying Floating Point Values
  • Summary

Overview of ARM Cortex-M Multiply Instructions

Here is a quick overview of the main multiply instructions in Cortex-M processors:

  • MUL: 32-bit x 32-bit multiply, keeps the low 32 bits of the product
  • MLA: Multiply with accumulate, 32-bit operands, 32-bit accumulator, 32-bit result
  • SMULBB, SMULBT, SMULTB, SMULTT: Signed 16-bit x 16-bit multiply, 32-bit result (cores with the DSP extension, such as Cortex-M4 and Cortex-M7)
  • UMULL: Unsigned 32-bit x 32-bit multiply, 64-bit result
  • SMULL: Signed 32-bit x 32-bit multiply, 64-bit result

The 32-bit multiply instructions like MUL and MLA perform a 32-bit x 32-bit multiply and keep only the low 32 bits of the result. This is the right choice when the product is known to fit in 32 bits, or when wrap-around modulo 2^32 is acceptable.

The 64-bit multiply instructions like UMULL and SMULL perform a 32-bit x 32-bit multiply and write the full 64-bit product to a pair of registers. This preserves every bit of the result, at the cost of a second destination register and, on some cores, extra cycles.

When to Use 32-bit vs 64-bit Multiply

Choosing between 32-bit and 64-bit multiply depends on the data types and precision needed:

  • Use 32-bit multiply when multiplying 32-bit values (unsigned int, signed int) whose product is known to fit in 32 bits.
  • Use 64-bit multiply when multiplying 32-bit values whose product can exceed 32 bits and the full result is needed.
  • Use 64-bit multiply when the high word of the product matters, for example in fixed-point scaling or multi-word (big integer) arithmetic.
  • Prefer 32-bit multiply when performance is critical: it needs one destination register instead of two and, on cores such as Cortex-M3, fewer cycles.
  • Prefer 64-bit multiply when precision is critical since it maintains the full result.
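
The difference between the two result widths can be sketched in C. This is a minimal illustration, not code from any particular library: the function names are hypothetical, and which instruction a compiler actually emits depends on the toolchain and target, although a widened multiply like the second function is the usual way to request a UMULL-style operation.

```c
#include <stdint.h>
#include <stdbool.h>

/* 32-bit result: only the low 32 bits of the product survive,
   which is what a MUL-style operation computes. */
static uint32_t mul_u32_low(uint32_t a, uint32_t b)
{
    return a * b;               /* unsigned arithmetic wraps modulo 2^32 */
}

/* 64-bit result: widen one operand first so the multiply is
   carried out in 64 bits and no bits are lost. */
static uint64_t mul_u32_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Hypothetical helper: true when the 32-bit result would drop high bits. */
static bool mul_u32_overflows(uint32_t a, uint32_t b)
{
    return (mul_u32_wide(a, b) >> 32) != 0;
}
```

For instance, 0x10000 * 0x10000 wraps to 0 in the 32-bit version but yields 0x100000000 in the 64-bit version.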

32-bit Multiply Instructions

Let’s look at some common 32-bit multiply instructions in more detail:

MUL

MUL performs a 32-bit x 32-bit multiply and keeps the low 32 bits of the product. Because the low 32 bits are identical for signed and unsigned operands, the same instruction serves both. For example:

    MUL R0, R1, R2 // R0 = (R1 * R2) modulo 2^32

This multiplies the values in R1 and R2, truncates the product to 32 bits, and stores it in R0. R1 and R2 remain unchanged.
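
A truncating multiply does not need separate signed and unsigned forms, because the low 32 bits of the product do not depend on how the operands are interpreted. The sketch below checks this in portable C; the function names are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Low 32 bits of the product, operands interpreted as unsigned. */
static uint32_t mul_low_unsigned(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Low 32 bits of the product, operands interpreted as signed. */
static uint32_t mul_low_signed(int32_t a, int32_t b)
{
    return (uint32_t)((int64_t)a * b);
}
```

Passing the same bit patterns to both functions, for example -3 and its unsigned equivalent 0xFFFFFFFD, gives the same 32-bit result.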

MLA

MLA performs a 32-bit x 32-bit multiply with accumulate. It multiplies two 32-bit values, adds a 32-bit accumulate value, and keeps the low 32 bits of the sum. As with MUL, the result is the same whether the operands are treated as signed or unsigned. For example:

    MLA R0, R1, R2, R3 // R0 = (R3 + R1 * R2) modulo 2^32

This multiplies the values in R1 and R2, adds the 32-bit accumulate value in R3, truncates the result to 32 bits, and stores it in R0. MLA is available on Cortex-M3 and later cores, but not on Cortex-M0/M0+.
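
As a rough C model of the multiply-accumulate operation, the sketch below computes the accumulator plus the product, wrapping modulo 2^32, and uses it in a small dot product. The names mla_model and dot_u32 are hypothetical; a compiler may or may not choose MLA for a loop like this.

```c
#include <stdint.h>

/* Model of MLA Rd, Rn, Rm, Ra: Rd = (Ra + Rn * Rm) modulo 2^32. */
static uint32_t mla_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* Wrapping dot product: one multiply-accumulate step per element. */
static uint32_t dot_u32(const uint32_t *x, const uint32_t *y, unsigned n)
{
    uint32_t acc = 0;
    for (unsigned i = 0; i < n; i++) {
        acc = mla_model(x[i], y[i], acc);
    }
    return acc;
}
```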

SMULBB, SMULBT, SMULTB, SMULTT

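These perform signed 16-bit x 16-bit multiplies and produce a 32-bit result. The two suffix letters select the bottom (B, bits [15:0]) or top (T, bits [31:16]) halfword of the first and second source operand, in that order. For example:

    SMULBB R1, R2, R3 // R1 = R2[15:0]  * R3[15:0]
    SMULBT R1, R2, R3 // R1 = R2[15:0]  * R3[31:16]
    SMULTB R1, R2, R3 // R1 = R2[31:16] * R3[15:0]
    SMULTT R1, R2, R3 // R1 = R2[31:16] * R3[31:16]

This supports efficient signed multiplies on 16-bit data, such as two samples packed into one register. These instructions belong to the DSP extension, so they are present on cores such as Cortex-M4 and Cortex-M7 but not on Cortex-M0/M0+ or Cortex-M3.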
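
The halfword selection can be modelled in portable C, which is handy for checking packed 16-bit arithmetic on a host machine. This is a sketch under the assumption that B means bits [15:0] and T means bits [31:16]; the function names are hypothetical and are not the toolchain intrinsics.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
static int32_t sext16(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

static int32_t bottom(uint32_t r) { return sext16(r); }        /* bits [15:0]  */
static int32_t top(uint32_t r)    { return sext16(r >> 16); }  /* bits [31:16] */

/* First letter selects the half of rn, second letter the half of rm. */
static int32_t smulbb_model(uint32_t rn, uint32_t rm) { return bottom(rn) * bottom(rm); }
static int32_t smulbt_model(uint32_t rn, uint32_t rm) { return bottom(rn) * top(rm); }
static int32_t smultb_model(uint32_t rn, uint32_t rm) { return top(rn) * bottom(rm); }
static int32_t smultt_model(uint32_t rn, uint32_t rm) { return top(rn) * top(rm); }
```

With rn = 0xFFFE0003 (top = -2, bottom = 3) and rm = 0x0005FFFC (top = 5, bottom = -4), the four models give -12, 15, 8 and -10 respectively.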

64-bit Multiply Instructions

Here are some key 64-bit multiply instructions:

UMULL and UMULLS

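UMULL performs an unsigned 32-bit x 32-bit multiply and produces a 64-bit result. The first destination register receives the low word and the second receives the high word:

    UMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (unsigned long long result; R1 = low word, R2 = high word)

The classic ARM (A32) instruction set has a flag-setting UMULLS form, but the Thumb encodings used by Cortex-M do not, so UMULL leaves the condition flags unchanged on these cores.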
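
A C model of the unsigned long multiply makes the register-pair layout explicit. The struct and function names below are hypothetical, introduced only for this sketch; in real code a plain 64-bit multiplication of widened operands is usually enough for the compiler to pick the instruction.

```c
#include <stdint.h>

/* Hypothetical pair type standing in for the two destination registers. */
typedef struct {
    uint32_t lo;    /* low word of the product  */
    uint32_t hi;    /* high word of the product */
} u32_pair;

/* Model of an unsigned 32 x 32 -> 64 multiply split into two words. */
static u32_pair umull_model(uint32_t rn, uint32_t rm)
{
    uint64_t product = (uint64_t)rn * rm;
    u32_pair result = { (uint32_t)product, (uint32_t)(product >> 32) };
    return result;
}
```

The largest case, 0xFFFFFFFF * 0xFFFFFFFF, gives hi = 0xFFFFFFFE and lo = 0x00000001.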

SMULL and SMULLS

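SMULL performs a signed 32-bit x 32-bit multiply and produces a 64-bit result, with the same register order as UMULL:

    SMULL R1, R2, R3, R4 // R2:R1 = R3 * R4 (signed long long result; R1 = low word, R2 = high word)

As with UMULL, a flag-setting SMULLS form exists in the classic ARM (A32) instruction set but not in the Thumb encodings used by Cortex-M.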
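
A common use of the signed 64-bit product is fixed-point arithmetic. The sketch below models the signed long multiply and builds a Q31 multiply on top of it. The names are hypothetical, the right shift of a negative value is assumed to be arithmetic (true for the usual Arm compilers, but implementation-defined in C), and no saturation is applied.

```c
#include <stdint.h>

/* Model of a signed 32 x 32 -> 64 multiply. */
static int64_t smull_model(int32_t rn, int32_t rm)
{
    return (int64_t)rn * rm;
}

/* Q31 multiply: both inputs represent values in [-1, 1) scaled by 2^31.
   Assumes arithmetic right shift; does not saturate the single overflow
   case where both inputs are INT32_MIN. */
static int32_t q31_mul(int32_t a, int32_t b)
{
    return (int32_t)(smull_model(a, b) >> 31);
}
```

For example, 0x40000000 represents 0.5 in Q31, and q31_mul(0x40000000, 0x40000000) returns 0x20000000, which represents 0.25.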

Multiplying Constants

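MUL takes only register operands, so multiplying by a constant normally means loading the constant into a register first. For example:

    MOV R2, #16
    MUL R0, R1, R2 // R0 = R1 * 16

When the constant is a power of two, a left shift gives the same result:

    LSL R0, R1, #4 // R0 = R1 << 4 = R1 * 16

The LSL (logical shift left) instruction avoids the constant load and the extra register. On cores with a single-cycle multiplier the shift itself is no faster than MUL, so the gain is mainly in code size and register pressure. On a Cortex-M0/M0+ built with the small iterative multiplier the saving in cycles is much larger.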
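
The same idea in C, as a minimal sketch with hypothetical function names. The second function shows that a constant which is not a power of two can sometimes be split into a few shifts and adds; whether that beats a multiply depends on the core, and optimizing compilers usually make this choice themselves.

```c
#include <stdint.h>

/* x * 16 as a single left shift (wraps modulo 2^32, like MUL). */
static uint32_t times16(uint32_t x)
{
    return x << 4;
}

/* x * 10 as two shifts and an add: 10 = 8 + 2. */
static uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```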

Choosing between MUL and UMULL/SMULL

To summarize, follow these guidelines when choosing between 32-bit and 64-bit multiply in Cortex-M code:

  • Use MUL when the product fits in 32 bits or wrap-around modulo 2^32 is acceptable.
  • Use UMULL/SMULL when you need the full 64-bit product.
  • Use MUL when performance or register pressure is critical and a 32-bit result is enough.
  • Use UMULL/SMULL for crypto code, big-integer arithmetic, or fixed-point math that needs the high word of the product.
  • Use SMULL for signed multiplies and UMULL for unsigned multiplies.
  • Use shift instructions instead of MUL when multiplying by a power of two.

Proper use of 32-bit versus 64-bit multiply instructions can help optimize Cortex-M code for both performance and precision.

Compiler Intrinsics for Multiply Operations

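Here are some compiler intrinsics that map to the multiply instructions. Names and availability depend on the toolchain: CMSIS-Core headers provide uppercase forms such as __SMUAD and __SMLAD, while the Arm C Language Extensions header arm_acle.h uses lowercase forms such as __smulbb and __smlabb. Most of them require a core with the DSP extension. In the names below, a trailing X means the halfwords of the second operand are exchanged, and a trailing R means the result is rounded.

  • __SMULBB, __SMULBT, __SMULTB, __SMULTT – map to SMULxx instructions
  • __PKHBT, __PKHTB – pack halfwords, useful for preparing operands for 16-bit multiplies
  • __SMLABB, __SMLABT, __SMLATB, __SMLATT – signed 16-bit multiply accumulate
  • __SMLAD, __SMLADX – signed dual multiply add with 32-bit accumulate
  • __SMLAL, __SMLALBB, __SMLALBT, __SMLALTB, __SMLALTT – signed multiply accumulate with a 64-bit accumulator
  • __SMLALD, __SMLALDX – signed dual multiply add with 64-bit accumulate
  • __SMLAWB, __SMLAWT – signed 32-bit x 16-bit multiply accumulate, keeping the top 32 bits of the 48-bit product
  • __SMLSD, __SMLSDX – signed dual multiply subtract with 32-bit accumulate
  • __SMLSLD, __SMLSLDX – signed dual multiply subtract with 64-bit accumulate
  • __SMMLA, __SMMLAR – signed most significant word multiply accumulate
  • __SMMLS, __SMMLSR – signed most significant word multiply subtract
  • __SMMUL, __SMMULR – signed most significant word multiply
  • __SMUAD, __SMUADX – signed dual multiply add
  • __SMUSD, __SMUSDX – signed dual multiply subtract
  • __UMULL – unsigned 64-bit multiply, where the toolchain provides it; compilers usually emit UMULL or SMULL on their own when 32-bit operands are widened to 64 bits before a C multiplication

Check your compiler documentation for full details on these intrinsics. They can be useful for optimizing multiplies in Cortex-M code.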
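
To make the "dual multiply" wording concrete, here is a portable C model of the dual multiply add (SMUAD) and its accumulating form (SMLAD). It is a sketch for host-side testing with hypothetical names, not the intrinsic from any toolchain header, and it does not model the Q flag that the hardware sets on overflow.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of h to a 32-bit signed value. */
static int32_t sign_extend_half(uint32_t h)
{
    return (int32_t)((h & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Model of SMUAD: bottom * bottom + top * top, wrapping modulo 2^32. */
static uint32_t smuad_model(uint32_t x, uint32_t y)
{
    int32_t p_lo = sign_extend_half(x) * sign_extend_half(y);
    int32_t p_hi = sign_extend_half(x >> 16) * sign_extend_half(y >> 16);
    return (uint32_t)p_lo + (uint32_t)p_hi;
}

/* Model of SMLAD: the SMUAD sum plus a 32-bit accumulator. */
static uint32_t smlad_model(uint32_t x, uint32_t y, uint32_t acc)
{
    return acc + smuad_model(x, y);
}
```

With two 16-bit samples packed per word, one call processes two multiply-accumulate steps, which is why these operations are popular in filter and dot-product loops.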

Multiply Instruction Timing

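Cycle counts depend on the core, so treat the figures below as typical values and check the Technical Reference Manual for your part:

  • MUL – 1 cycle on Cortex-M3 and Cortex-M4
  • MLA – 2 cycles on Cortex-M3, 1 cycle on Cortex-M4
  • SMULxx – 1 cycle on Cortex-M4 (not available on Cortex-M3)
  • UMULL/SMULL – 3 to 5 cycles on Cortex-M3, 1 cycle on Cortex-M4

So MUL is fast on both cores, while the cost of UMULL and SMULL varies: noticeably slower on Cortex-M3, single-cycle on Cortex-M4. Even where the cycle counts match, the long multiplies occupy two destination registers. Both points are reasons to prefer MUL when a 32-bit result is enough.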

Considerations for Smaller Cortex-M Cores

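The smaller Cortex-M cores like Cortex-M0/M0+ (Armv6-M) support only one multiply instruction, MULS. Key differences include:

  • No MLA instruction
  • MULS is a 32-bit x 32-bit multiply with a 32-bit result, in the two-operand form MULS Rd, Rn, Rd, and it updates the N and Z flags
  • The multiplier is an implementation option: single-cycle, or a smaller iterative unit that takes 32 cycles
  • No UMULL or SMULL support, so 64-bit products are built in software from partial products (compilers call a runtime library routine)
  • No SMULxx or other DSP-extension multiplies
  • The same sixteen core registers, but most 16-bit Thumb instructions, MULS included, can only address R0 to R7

So for the smaller cores, keep products within 32 bits where possible and avoid 64-bit multiplies in hot loops. If the part uses the iterative multiplier, shifts and adds can replace multiplication by small constants.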
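
When a 64-bit product is needed on a core without a long multiply instruction, it can be assembled from four 16-bit x 16-bit partial products, each of which fits in a 32-bit multiply. The sketch below shows one way to do this in C; umull_soft is a hypothetical name, and a real toolchain's runtime routine may be organised differently.

```c
#include <stdint.h>

/* 32 x 32 -> 64 unsigned multiply using only 32-bit multiplies. */
static uint64_t umull_soft(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four partial products; each fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Middle column: high half of ll plus the low halves of the cross terms. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);

    return ((uint64_t)hi << 32) | lo;
}
```

The cost is four multiplies plus several shifts and adds, which is why 64-bit products are worth avoiding in inner loops on these cores.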

Multiplying Floating Point Values

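To multiply floating point values, ARM Cortex-M cores that include the optional FPU provide floating point multiply and multiply-accumulate instructions. In the unified assembler syntax used for Cortex-M these are written VMUL and VMLA rather than FMUL and FMLA. They operate on the FPU registers (S0 to S31 for single precision, D0 to D15 for double precision), not on the core registers R0 to R15. The Cortex-M4 FPU is single precision only; double precision needs a core with a double-precision FPU option, such as Cortex-M7.

VMUL performs a 32-bit or 64-bit float multiply:

    VMUL.F32 S0, S1, S2 // Single precision (32-bit) multiply: S0 = S1 * S2
    VMUL.F64 D0, D1, D2 // Double precision (64-bit) multiply: D0 = D1 * D2

VMLA performs a floating point multiply accumulate:

    VMLA.F32 S0, S1, S2 // 32-bit float multiply accumulate: S0 = S0 + S1 * S2
    VMLA.F64 D0, D1, D2 // 64-bit float multiply accumulate: D0 = D0 + D1 * D2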

The floating point multiply instructions are useful when doing digital signal processing, matrix math, 3D math, or other numerically intensive algorithms.
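
In C, floating point multiplies are written as ordinary expressions and the compiler selects the FPU instructions when the target has hardware floating point enabled. The dot product below is a minimal sketch with a hypothetical function name; whether the loop body becomes a multiply followed by an add or a single multiply-accumulate instruction depends on the compiler and its options.

```c
/* Single-precision dot product: one multiply and one add per element. */
static float dot_f32(const float *x, const float *y, unsigned n)
{
    float acc = 0.0f;
    for (unsigned i = 0; i < n; i++) {
        acc += x[i] * y[i];
    }
    return acc;
}
```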

Summary

In summary:

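  • Use 32-bit MUL when the product fits in 32 bits
  • Use 64-bit UMULL/SMULL when you need the full 64-bit result
  • MUL needs fewer registers and is faster on some cores; UMULL/SMULL keep the full product
  • Prefer shifts over multiply when the factor is a power of two
  • Use the FPU multiply instructions (written VMUL and VMLA in Cortex-M assembly) for floating point math
  • On Cortex-M0/M0+ only the 32-bit MULS is available, so keep products within 32 bits where possible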

Properly utilizing the ARM Cortex-M multiply instructions can help optimize code for both efficiency and precision in a variety of applications.
