The ARM Cortex-M0 is one of the most popular microcontroller cores used in IoT and embedded devices today. It is an energy-efficient 32-bit RISC processor optimized for low-cost, low-power applications. One key performance metric for the Cortex-M0 is how many clock cycles common operations such as floating point multiplies take to execute. In short, multiplying two single-precision (32-bit) floating point values takes roughly 2 cycles in the best case and around 10 cycles per result once memory accesses and loop overhead are included, depending on where the operands live and how the surrounding code is optimized.
Cortex-M0 Floating Point Unit
The Cortex-M0 has an optional single-precision floating point unit (FPU) with instructions for addition, subtraction, multiplication, division, comparison, and conversion between floating point and integer values. When the FPU is included, floating point operations such as multiplication are performed directly in hardware rather than falling back to software emulation.
Single-precision floats on the Cortex-M0 are 32 bits wide and follow the IEEE 754-2008 standard format. This provides an 8-bit exponent and 23-bit mantissa, giving a numeric range of approximately 1.2e-38 to 3.4e38 with about 7 decimal digits of precision. The FPU in the Cortex-M0 implements multiply (FMUL) and multiply-accumulate (FMAC) instructions that operate on these 32-bit float values.
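To make the format concrete, here is a small C sketch (plain standard C, nothing Cortex-M0 specific) that pulls the sign, exponent, and mantissa fields out of a 32-bit float:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float f = 3.1415f;
        uint32_t bits;

        memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32-bit pattern */

        uint32_t sign     = bits >> 31;            /* 1 sign bit                     */
        uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8 exponent bits, biased by 127 */
        uint32_t mantissa = bits & 0x7FFFFFu;      /* 23 mantissa (fraction) bits    */

        printf("sign=%u exponent=%u mantissa=0x%06X\n",
               (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
        return 0;
    }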
Cycle Counts for Floating Point Multiply
The number of cycles the Cortex-M0 FPU needs to execute a floating point multiply depends on several factors (a simple way to measure this on real hardware is sketched after the list):
- Whether the values are already in registers vs. in memory
- Pipelining and instruction ordering optimizations
- Compiler optimizations like loop unrolling
- Memory wait states configured for the microcontroller
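One practical way to check numbers like these on real silicon is to time a code sequence with the SysTick timer, since Cortex-M0 devices usually lack the DWT cycle counter found on larger Cortex-M cores. Below is a minimal sketch, assuming a CMSIS device header is available (the register and bit names follow the standard CMSIS convention; "device.h" is a placeholder for the vendor header). In practice you would time a long loop of the operation and subtract the cost of an empty loop, because a single multiply is smaller than the measurement overhead.

    #include <stdint.h>
    #include "device.h"   /* placeholder for the vendor CMSIS header that defines SysTick */

    volatile float a = 1.5f, b = 2.5f, c;   /* volatile so the multiply is not optimized away */

    uint32_t measure_fmul_cycles(void)
    {
        SysTick->LOAD = 0x00FFFFFF;          /* 24-bit down counter, maximum reload value */
        SysTick->VAL  = 0;                   /* clear the current count                   */
        SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;

        uint32_t start = SysTick->VAL;
        c = a * b;                           /* the operation under test                  */
        uint32_t end = SysTick->VAL;

        return start - end;                  /* counter counts down, so start - end = elapsed cycles */
    }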
Let’s take a look at some examples to get an idea of the cycle ranges:
1. Immediate to Register Multiply
One of the fastest ways to do a single float multiply on the Cortex-M0 is to multiply a register value by an immediate constant. For example:

    FMULS S0, S1, #3.1415

This multiplies the contents of S1 by the immediate constant 3.1415 and stores the result in S0. An instruction like this takes just 2 cycles to execute on the Cortex-M0 FPU.
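At the C level this case is just a multiply by a compile-time constant. Whether the constant can be encoded directly in the instruction stream or has to be loaded from a literal pool first is up to the compiler and the exact instruction set, so treat this only as the source-level shape of the pattern:

    /* Scale a sample by a fixed constant known at compile time. */
    float scale_by_pi(float x)
    {
        return x * 3.1415f;
    }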
2. Register to Register Multiply
Doing a multiply between two floating point register values takes slightly longer:

    FMULS S2, S3, S4

This performs S3 * S4 and stores the result in S2. A register-register float multiply like this takes 3 cycles on the Cortex-M0 FPU, one cycle longer than the immediate form because the second operand has to be read from a register instead of being encoded in the instruction.
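The register-to-register case corresponds to multiplying two values that are already live in registers, for example two function arguments; no data memory access is needed for the multiply itself:

    /* Both operands arrive in registers under the calling convention,
       so the body reduces to a single register-to-register multiply. */
    float mul(float a, float b)
    {
        return a * b;
    }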
3. Memory Operand Multiply
Accessing operands in memory instead of registers adds more cycles. For example:

    VLDR  S1, [R5]      ; load float value from memory into S1
    FMULS S0, S1, S2    ; S0 = S1 * S2
The load from memory requires 3 cycles and the multiply takes the standard 3 cycles, so this sequence takes 6 cycles in total. In many cases the processor can overlap the load and the multiply, idealized as:

    Cycle 1: initiate memory load
    Cycle 2: load completes, initiate multiply
    Cycle 3: multiply completes
So with pipelining the memory operand multiply can take as few as 3 cycles, but there is more overhead than using pure register operands.
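In C this pattern appears whenever one factor must be fetched through a pointer first. How much of the load latency can actually be hidden depends on the compiler's scheduling and the core's pipeline, so the cycle-by-cycle picture above is an idealization:

    /* One operand is loaded from memory, the other is already in a register. */
    float scale_sample(const float *sample, float gain)
    {
        return *sample * gain;   /* load, then multiply */
    }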
4. Multiplies Within Loops
When float multiplies sit inside a loop that processes many data points, optimizations like loop unrolling and pipeline scheduling can improve throughput. For example, consider this C code:

    for (i = 0; i < 100; i++) {
        y[i] = a * x[i];
    }
This performs 100 independent multiplies. The loop overhead (increment, compare, branch) normally adds extra cycles per iteration, but the compiler can unroll the loop so that several independent multiplies line up back to back, as in the sketch below. Modern compilers can also interleave the loads, multiplies, and stores to hide memory latency. So while each individual multiply may still take 3-6 cycles depending on its operands, the overall throughput across loop iterations can be significantly improved.
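As a sketch of what unrolling looks like when done by hand (an optimizing compiler can produce the equivalent automatically), the loop above can process four elements per iteration. The factor of four is arbitrary, and this version assumes the element count is a multiple of four:

    /* Unrolled-by-4 version of: for (i = 0; i < 100; i++) y[i] = a * x[i];
       The four multiplies in the body are independent, giving the compiler and
       pipeline more freedom to overlap loads, multiplies, and stores. */
    void scale_array(float *y, const float *x, float a)
    {
        for (int i = 0; i < 100; i += 4) {
            y[i]     = a * x[i];
            y[i + 1] = a * x[i + 1];
            y[i + 2] = a * x[i + 2];
            y[i + 3] = a * x[i + 3];
        }
    }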
5. Double-Precision Multiplies
The Cortex-M0 FPU does not natively support double-precision (64-bit) floats. Doing double-precision multiplies in software through emulation routines would take many more cycles, likely 10-100x slower than single-precision depending on the implementation.
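The effect is easy to see at the source level: merely using the double type makes the compiler route the arithmetic through its runtime library instead of the single-precision hardware. On ARM EABI toolchains the helper for a double multiply is __aeabi_dmul:

    /* With no double-precision hardware, this multiply compiles into a call to a
       runtime helper (e.g. __aeabi_dmul) rather than a single FPU instruction. */
    double scale_by_pi_d(double x)
    {
        return x * 3.14159265358979;
    }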
Optimizing Floating Point Code
Here are some tips for optimizing floating point code to reduce multiply cycles on the Cortex-M0 FPU:
- Use register operands instead of memory operands whenever possible
- Try to plan instruction order to pipeline loads/stores with multiplies
- Unroll tight loops performing many multiplies
- Use compilers that schedule instructions to avoid pipeline stalls
- Minimize reading and writing to main memory with caches if available
- Consider using fixed-point math instead of float when precision requirements are lower (see the sketch after this list)
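As a sketch of the fixed-point option mentioned in the last tip, here is a minimal Q16.16 multiply in C. The 16.16 split is only an example; the format should be chosen to match the application's actual range and precision needs:

    #include <stdint.h>

    typedef int32_t q16_16_t;      /* signed fixed point: 16 integer, 16 fractional bits */

    #define Q16_16_ONE  (1 << 16)

    /* Convert a float constant to Q16.16 (initialization only). */
    static inline q16_16_t q16_16_from_float(float f)
    {
        return (q16_16_t)(f * (float)Q16_16_ONE);
    }

    /* Multiply two Q16.16 values. The 64-bit intermediate preserves the full
       product and avoids overflow before the result is shifted back down. */
    static inline q16_16_t q16_16_mul(q16_16_t a, q16_16_t b)
    {
        return (q16_16_t)(((int64_t)a * (int64_t)b) >> 16);
    }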
Cycle Count Examples
To illustrate the range of cycles for floating point multiplies on the Cortex-M0, here are some examples with instruction-level simulation estimates:

    Instruction                     Cycles
    ------------------------------------------------------------------
    FMULS S0, S1, #3.5              2 cycles    ; immediate multiply

    FMULS S2, S3, S4                3 cycles    ; register multiply

    VLDR  S1, [R5]                  3 cycles    ; load memory operand
    FMULS S0, S1, S2                3 cycles    ; memory operand multiply
                                    Total: 6 cycles

    FMULS S4, S5, S6                3 cycles    ; no data dependencies
    FMULS S0, S1, S2                2 cycles    ; pipelined back-to-back
                                    Total: 5 cycles

    ; Unrolled loop with pipelining
    LOOP:
    VLDR  S1, [R1], #4              3 cycles    ; load X
    FMULS S2, S1, S3                3 cycles    ; multiply X * A
    VLDR  S4, [R1], #4              1 cycle     ; load next X
    FMULS S5, S4, S3                2 cycles    ; next multiply
    VSTR  S2, [R4], #4              1 cycle     ; store Y
    VSTR  S5, [R4], #4              3 cycles    ; store next Y
    Loop overhead:                  3 cycles
    Total for 2 iterations:         19 cycles
So in summary, a single multiply ranges from 2 cycles for an immediate operand up to 6 cycles when an operand must first be loaded from memory. With loop unrolling and pipelining, the average cost per multiply can settle around 3-4 cycles, while the fully loaded loop example above works out to roughly 10 cycles per result once loads, stores, and loop overhead are counted. Overall, floating point multiplies on the Cortex-M0 FPU span roughly 2-10 cycles depending on the context.
Conclusion
The ARM Cortex-M0 can execute single-precision floating point multiplies in as few as 2 cycles, rising to several times that depending on operand types, pipelining, memory architecture, and compiler optimizations. Tightly optimized floating point code that keeps operands in registers (and uses caches or fast local memory where available) can sustain multiply throughput in the 2-4 cycle range, but performance degrades significantly when many operands have to be loaded from main memory. Low-power microcontrollers like the Cortex-M0 benefit greatly from hardware floating point support rather than relying on software emulation. By understanding the instruction set and its cycle timings, developers can write efficient floating point code tailored for embedded applications on the Cortex-M0 and similar microcontrollers.