The Cortex-M0 is an ultra-low-power 32-bit ARM processor core designed for microcontroller applications. It is optimized for energy efficiency and minimal silicon area rather than peak performance. Its integer multiplier is a configuration option chosen by the chip vendor: a fast multiplier that completes a 32×32 multiply in a single cycle, or a small iterative multiplier that takes 32 cycles. In both cases only the lower 32 bits of the product are produced.
Cortex-M0 Architecture Overview
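The Cortex-M0 is a 3-stage scalar pipeline processor with a 32-bit ALU and a 32-bit hardware multiplier. It has no multiply-accumulate (MAC) hardware, no saturating arithmetic instructions, and no bit-banding in the core itself; those features appear in larger cores such as the Cortex-M3 and Cortex-M4. Shifts and rotates are separate instructions rather than a barrel shifter applied to a second operand. The core contains no embedded SRAM or tightly coupled memory: flash and SRAM sizes are chosen by the microcontroller vendor and attached through the AHB-Lite bus. The Cortex-M0 implements the ARMv6-M architecture, whose Thumb instruction set consists almost entirely of 16-bit instructions plus a small number of 32-bit ones (BL, MSR, MRS, DSB, DMB and ISB).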
Integer Multiplier
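The integer multiplier in the Cortex-M0 comes in one of two implementations, selected when the chip is designed. The fast multiplier completes a 32×32 multiply in a single cycle. The small multiplier is iterative and takes 32 cycles, trading speed for silicon area. Either way the only multiply instruction is MULS, which takes two 32-bit register operands and writes the lower 32 bits of the product straight into a general-purpose register; there is no dedicated product register and no hardware multiply-accumulate or multiply-subtract. Operands narrower than 32 bits must first be widened with SXTB, SXTH, UXTB or UXTH. With the fast multiplier, back-to-back MULS instructions each take one cycle.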
Multiply Instruction
The MULS instruction multiplies two 32-bit register operands and stores the lower 32 bits of the product in the destination register. On ARMv6-M the syntax is: MULS Rd, Rn, Rm
Where:
- S – Mandatory on the Cortex-M0; the N and Z flags are always updated
- Rd – Destination register for the result; must be the same register as Rm
- Rn – First operand
- Rm – Second operand, overwritten by the result
All registers must be low registers (R0 to R7).
For example: MULS R1, R2, R1 ; R1 = R2 * R1
This multiplies the values in R2 and R1 and stores the lower 32 bits of the product in R1. The N and Z flags are updated to reflect the result, and C and V are left unchanged. The instruction takes 1 cycle with the fast multiplier or 32 cycles with the small multiplier, regardless of the operand values.
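The behaviour of the ARMv6-M multiply (lower 32 bits of the product, N and Z flags updated) can be modelled in a few lines of C. This is only an illustrative sketch: the `muls_result` type and the `muls` function are hypothetical names invented here, and the model assumes a platform where `unsigned int` is 32 bits wide.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of what MULS produces; not a real API. */
typedef struct {
    uint32_t result; /* lower 32 bits of the product */
    bool n;          /* N flag: copy of bit 31 of the result */
    bool z;          /* Z flag: set when the result is zero */
} muls_result;

/* Models MULS Rd, Rn, Rd: Rd = (Rn * Rd) mod 2^32, N and Z updated. */
static muls_result muls(uint32_t rn, uint32_t rdm)
{
    muls_result r;
    r.result = rn * rdm;           /* unsigned arithmetic wraps modulo 2^32 */
    r.n = (r.result >> 31) != 0;
    r.z = (r.result == 0);
    return r;
}
```

Note that `muls(0x10000, 0x10000)` yields a zero result with Z set, because the true product 2^32 does not fit in 32 bits and the upper bits are discarded.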
Signed and Unsigned Behavior
The lower 32 bits of a product are the same whether the operands are regarded as signed or unsigned, so MULS serves both cases. The interpretation is decided by the instructions that consume the result, in particular by the condition code used after a comparison. For example:
    MULS R1, R2, R1 ; R1 = low 32 bits of R2 * R1
    CMP R1, #100 ; Compare R1 against 100
    BHI above ; Unsigned: taken if R1 > 100
This treats R1 as an unsigned 32-bit value. But if we do:
    MULS R1, R2, R1
    CMP R1, #100
    BGT greater ; Signed: taken if R1 > 100
Then R1 is treated as a signed 2’s complement value. CMP sets the flags identically in both sequences. So the same MULS result can be used in both signed and unsigned contexts.
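The reason one multiply result can serve both contexts is that the lower 32 bits of a product do not depend on whether the operands are read as signed or unsigned. The following C sketch illustrates this; the function names are hypothetical, and the signed version uses a 64-bit intermediate purely to avoid undefined behaviour in C.

```c
#include <stdint.h>

/* Lower 32 bits of the product, computed on unsigned values. */
static uint32_t mul_lo_unsigned(uint32_t a, uint32_t b)
{
    return a * b; /* wraps modulo 2^32 */
}

/* Lower 32 bits of the product, computed on signed values. */
static uint32_t mul_lo_signed(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b; /* exact 64-bit product */
    return (uint32_t)wide;                  /* keep bits [31:0] */
}
```

For any pair of bit patterns, both functions return the same value, for instance `mul_lo_signed(-3, 5)` and `mul_lo_unsigned((uint32_t)(-3), 5)`.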
Signed Multiply Behavior
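When the MULS result is used in a signed context, it matches 2’s complement signed multiplication as long as the true product fits in 32 bits. This means:
- Negative numbers are represented in 2’s complement form
- Operands are full 32-bit register values; a narrower signed value must be sign-extended with SXTB or SXTH before the multiply
- The signed result is modulo 2^32, so a product outside the 32-bit signed range wraps silently
MULS has no immediate operand, so a constant must be loaded into a register first. For example:
    LDR R1, =0x80000000 ; R1 = -2147483648
    MULS R1, R2, R1 ; R1 = low 32 bits of R2 * -2147483648
Here the result is 0x80000000 when R2 is odd and 0 when R2 is even, which equals the true product only for R2 = 0 or R2 = 1.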
Overflow Detection
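The MULS instruction does not affect the carry or overflow flags, and a following MOVS does not set them either, so overflow cannot be read from the flags. It has to be detected in software. One approach is to compute the full 64-bit product and check its upper 32 bits: an unsigned multiply overflowed if the upper word is nonzero, and a signed multiply overflowed if the upper word is not the sign extension of bit 31 of the lower word. Since ARMv6-M has no long multiply instruction, the upper word must be assembled from 16-bit partial products, usually by a compiler runtime routine. Another approach is to restrict the operands in advance, for example to 16 bits each, so that the product always fits.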
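Because the ARMv6-M multiply leaves the carry and overflow flags untouched, overflow is normally checked in software. A minimal C sketch of one way to do this widens the product to 64 bits and tests whether it fits. The function names are hypothetical, and on a Cortex-M0 the compiler would be expected to synthesize the 64-bit multiply from several 32-bit multiplies rather than a single instruction.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if a * b does not fit in a signed 32-bit value.
 * *lo receives the wrapped lower 32 bits, as the hardware multiply would. */
static bool smul32_overflows(int32_t a, int32_t b, uint32_t *lo)
{
    int64_t wide = (int64_t)a * (int64_t)b; /* exact product */
    *lo = (uint32_t)wide;
    return wide > INT32_MAX || wide < INT32_MIN;
}

/* Returns true if a * b does not fit in an unsigned 32-bit value. */
static bool umul32_overflows(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a * (uint64_t)b;
    return (wide >> 32) != 0; /* any bit above bit 31 means overflow */
}
```

The signed check is equivalent to asking whether the upper 32 bits of the product are the sign extension of bit 31 of the lower word.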
Multiply-Accumulate
The Cortex-M0 has no multiply-accumulate instruction. MLA belongs to ARMv7-M (Cortex-M3 and later), where its syntax is: MLA Rd, Rn, Rm, Ra
On the Cortex-M0 the same operation takes a MULS followed by an ADDS. For example:
    MULS R2, R3, R2 ; R2 = R3 * R2
    ADDS R1, R2, R4 ; R1 = R2 + R4
This overwrites R2 with the product and takes 2 cycles with the fast multiplier or 33 cycles with the small multiplier. The C and V flags set by ADDS describe only the addition; they do not reveal an overflow in the multiply.
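ARMv6-M does not include MLA, so on a Cortex-M0 a multiply-accumulate is a multiply followed by an add. The C sketch below models that two-step sequence with 32-bit wraparound; `mac32` and `dot32` are hypothetical helper names chosen for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* One multiply-accumulate step: (acc + a * b) mod 2^32. */
static uint32_t mac32(uint32_t acc, uint32_t a, uint32_t b)
{
    uint32_t product = a * b; /* corresponds to MULS */
    return acc + product;     /* corresponds to ADDS */
}

/* Dot product of two n-element vectors, accumulated modulo 2^32. */
static uint32_t dot32(const uint32_t *x, const uint32_t *y, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = mac32(acc, x[i], y[i]);
    }
    return acc;
}
```

Each loop iteration costs at least two arithmetic instructions on the Cortex-M0, in addition to the loads and loop overhead.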
Multiply-Subtract
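The Cortex-M0 likewise has no multiply-subtract instruction. MLS is an ARMv7-M instruction with the syntax: MLS Rd, Rn, Rm, Ra
It computes Rd = Ra - Rn * Rm, that is, the product is subtracted from Ra, and it has no flag-setting form. On the Cortex-M0, use MULS followed by SUBS. For example:
    MULS R2, R3, R2 ; R2 = R3 * R2
    SUBS R1, R4, R2 ; R1 = R4 - R2
This sequence takes 2 cycles with the fast multiplier or 33 cycles with the small multiplier.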
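For reference, the ARMv7-M MLS instruction subtracts the product from the accumulator (Ra - Rn * Rm), not the other way round, and ARMv6-M lacks it entirely. A small C sketch of the equivalent multiply-then-subtract sequence, using the hypothetical name `mls32`, makes the operand order explicit:

```c
#include <stdint.h>

/* Models Rd = Ra - Rn * Rm, modulo 2^32. */
static uint32_t mls32(uint32_t ra, uint32_t rn, uint32_t rm)
{
    uint32_t product = rn * rm; /* corresponds to MULS */
    return ra - product;        /* corresponds to SUBS */
}
```

If the product minus a value is wanted instead, the two operands of the subtraction are simply swapped.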
Long Multiplies
The SMULL and UMULL instructions, which produce 64-bit results, are not part of ARMv6-M and are not available on the Cortex-M0. On ARMv7-M cores their syntax is:
    SMULL RdLo, RdHi, Rn, Rm
    UMULL RdLo, RdHi, Rn, Rm
These multiply Rn and Rm as signed or unsigned 32-bit values, storing the lower 32 bits of the 64-bit result in RdLo and the upper 32 bits in RdHi.
On the Cortex-M0, a 64-bit product has to be built in software: each operand is split into 16-bit halves, the four partial products are computed with MULS, and the results are combined with shifts and additions. Compilers normally do this through a runtime library routine, and it costs many cycles rather than one.
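ARMv6-M has no SMULL or UMULL, so a 64-bit product on the Cortex-M0 is assembled from 32-bit multiplies. The sketch below shows one common way to do it using four 16×16 partial products. It is an illustration under stated assumptions, not the routine any particular compiler runtime uses; `u64_parts`, `umull32` and `smull32` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical pair holding the two result words (RdLo, RdHi). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_parts;

/* Unsigned 32x32 -> 64 multiply using only 32-bit multiplies. */
static u64_parts umull32(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Each partial product of two 16-bit values fits in 32 bits. */
    uint32_t ll = a_lo * b_lo;
    uint32_t lh = a_lo * b_hi;
    uint32_t hl = a_hi * b_lo;
    uint32_t hh = a_hi * b_hi;

    /* Sum of everything that lands in bits [47:16]; cannot overflow. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    u64_parts r;
    r.lo = (ll & 0xFFFFu) | (mid << 16);
    r.hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    return r;
}

/* Signed variant: correct the upper word for negative operands. */
static u64_parts smull32(int32_t a, int32_t b)
{
    u64_parts r = umull32((uint32_t)a, (uint32_t)b);
    if (a < 0) r.hi -= (uint32_t)b;
    if (b < 0) r.hi -= (uint32_t)a;
    return r;
}
```

The signed version relies on the fact that reinterpreting a negative operand as unsigned adds 2^32 to it, which only affects the upper word of the product.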
Summary of Multiply Cycles
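To summarize the multiply cycle counts on the Cortex-M0:
- MULS takes 1 cycle with the fast multiplier, or 32 cycles with the small multiplier
- Multiply-accumulate has no instruction; MULS plus ADDS takes 2 or 33 cycles
- Multiply-subtract has no instruction; MULS plus SUBS takes 2 or 33 cycles
- SMULL is not available; a signed 64-bit product requires a software routine
- UMULL is not available; an unsigned 64-bit product requires a software routine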
With the fast multiplier option, the Cortex-M0 delivers a single-cycle 32-bit multiply, which is adequate for control code and light arithmetic. It has no hardware multiply-accumulate or long multiply, so digital signal processing and other math-intensive workloads are better served by a Cortex-M3 or Cortex-M4 class core.