The ARM Cortex-M0 is an ultra-low-power 32-bit RISC processor core licensed by ARM Holdings. It is aimed at microcontroller applications that require a low gate count, small silicon footprint, and low power consumption. One of the key metrics for evaluating the performance of a RISC processor like the Cortex-M0 is its cycles per instruction (CPI), also called clocks per instruction: the number of clock cycles required, on average, to execute one instruction.
What is Cycles Per Instruction?
Cycles per instruction (CPI) is the average number of clock cycles required to execute an instruction on a processor. It is calculated by dividing the total number of clock cycles needed to complete a program by the total number of instructions the program actually executes (instructions inside a loop count once per iteration).
CPI = Total Clock Cycles / Total Instructions
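A lower CPI indicates higher performance at a given clock frequency, as fewer cycles are needed per instruction. A CPI of 1 means each instruction takes one clock cycle to execute on average. A single-issue core like the ARM Cortex-M0 completes at most one instruction per cycle, so its CPI cannot drop below 1; values below 1 (such as 0.5) are only possible on superscalar cores that issue several instructions per cycle.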
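As a quick illustration of the formula, here is a minimal Python sketch. The function name `cpi` and the sample counts are invented for this example and do not come from any measurement.

```python
def cpi(total_cycles: int, total_instructions: int) -> float:
    """Average clock cycles per executed instruction."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return total_cycles / total_instructions


# Hypothetical run: 1,500 cycles spent on 1,200 executed instructions.
example = cpi(1500, 1200)
print(f"CPI = {example:.2f}")  # prints "CPI = 1.25"
```

The instruction count here is the dynamic count (instructions executed), not the number of instructions in the binary.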
ARM Cortex-M0 CPI
The ARM Cortex-M0 processor has an ideal CPI of 1: most of its instructions complete in a single cycle, so straight-line arithmetic code sustains one instruction per cycle. This gives it a high instruction throughput for a microcontroller core of its size. The simple 3-stage pipeline of the Cortex-M0 is what makes this possible.
The 3 pipeline stages are:
- Fetch – Instruction is fetched from memory
- Decode – Instruction is decoded into control signals
- Execute – Instruction is executed
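The stages overlap: while one instruction is executing, the next is being decoded and the one after that is being fetched. Once the pipeline is full, the Cortex-M0 can therefore complete 1 instruction per cycle in the ideal scenario. Not every instruction fits this pattern, however. According to the Cortex-M0 Technical Reference Manual, a load or store takes 2 cycles and a taken branch takes 3, so the average CPI of real code is above 1.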
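The overlap between stages can be sketched with a small model. This is an idealized illustration, assuming every stage takes exactly one cycle and nothing stalls; the function names are invented for the example.

```python
STAGES = ("Fetch", "Decode", "Execute")


def ideal_pipeline_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Cycles to finish n instructions when each stage takes one cycle and nothing stalls."""
    if n_instructions <= 0:
        return 0
    # The first instruction needs n_stages cycles; every later instruction
    # finishes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


def stage_occupancy(n_instructions: int):
    """For each cycle, list the instruction index (or None) held by each stage."""
    rows = []
    for cycle in range(ideal_pipeline_cycles(n_instructions)):
        row = []
        for stage_index in range(len(STAGES)):
            instr = cycle - stage_index  # instruction i enters stage s in cycle i + s
            row.append(instr if 0 <= instr < n_instructions else None)
        rows.append(row)
    return rows


for n in (1, 10, 1000):
    cycles = ideal_pipeline_cycles(n)
    print(f"{n} instructions: {cycles} cycles, CPI = {cycles / n:.3f}")
```

The two extra cycles are the pipeline fill cost, which is why the CPI only approaches 1 as the instruction count grows.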
How Cortex-M0 Achieves a CPI of 1
Here are some key architectural features that let the Cortex-M0 come close to a CPI of 1:
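- 3 stage pipeline – The short pipeline means instructions move quickly through the stages, and refilling it after a taken branch costs only a few cycles.
- Single cycle execution – Most data-processing instructions in the Cortex-M0 ISA execute in 1 cycle. Loads and stores take 2 cycles, taken branches take 3, and the multiply takes 1 or 32 cycles depending on which multiplier the chip vendor selected.
- No pipeline interlocks – With a single execute stage, results are available to the next instruction without elaborate hazard detection and interlocking logic.
- Low latency memory access – Load/store instructions complete in 2 cycles when memory responds with zero wait states. The Cortex-M0 has no tightly coupled memory; all accesses go over a single AHB-Lite bus interface.
- No branch prediction – The core does not predict branches, and the ARM architecture has no delayed branches. A taken branch refills the pipeline and costs 3 cycles; a conditional branch that is not taken costs 1.
- Branch offsets – Thumb branch ranges are modest: a conditional branch reaches about ±256 bytes, an unconditional B about ±2 KB, and BL about ±16 MB.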
- Hardwired control logic – Decoding is fast and simple with minimal microcode.
By leveraging these microarchitectural features, the Cortex-M0 core keeps the cost of pipeline stalls, bubbles, and flushes small and predictable compared with more complex processors. The streamlined 3-stage pipeline achieves high utilization and throughput.
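To see how per-instruction cycle counts turn into an average CPI, here is a rough sketch. The cycle costs are the figures I recall from the Cortex-M0 Technical Reference Manual for zero-wait-state memory and should be checked against the TRM for a given device. The instruction mix is entirely hypothetical.

```python
# Cycle costs as recalled from the Cortex-M0 TRM, assuming zero-wait-state
# memory. Treat these as assumptions and verify them for your device.
CYCLES = {
    "alu": 1,               # ADDS, SUBS, MOVS, CMP, shifts, ...
    "load_store": 2,        # LDR / STR
    "branch_taken": 3,      # B, or B<cond> when the condition passes
    "branch_not_taken": 1,  # B<cond> when the condition fails
    "call": 4,              # BL
}


def mix_cpi(counts: dict) -> float:
    """Average CPI for a dict mapping instruction kind to executed count."""
    total_instructions = sum(counts.values())
    total_cycles = sum(CYCLES[kind] * n for kind, n in counts.items())
    return total_cycles / total_instructions


# Hypothetical mix of 100 executed instructions (not a measured profile).
mix = {
    "alu": 60,
    "load_store": 25,
    "branch_taken": 10,
    "branch_not_taken": 4,
    "call": 1,
}
print(f"CPI for this mix: {mix_cpi(mix):.2f}")  # prints "CPI for this mix: 1.48"
```

Under these assumptions, only a mix made up entirely of single-cycle instructions reaches a CPI of exactly 1.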
Effects of Cortex-M0 Design on CPI
The Cortex-M0 architecture is optimized to keep CPI close to 1 with very little logic. This comes at the cost of some design tradeoffs:
- No cache – Caches can cause variable latency and pipeline stalls.
- No write buffer – Write buffers can create data hazards.
- No branch prediction – Branch prediction uses speculative execution which increases power.
- No instruction reordering – Reordering causes variable execution times.
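- Deterministic interrupt handling – Interrupts are fully supported through the NVIC, with a fixed entry latency of 16 cycles when memory has zero wait states. A load or store multiple that is interrupted part-way is abandoned and restarted after the handler returns.
By leaving out performance-enhancing techniques such as caches, branch prediction, and instruction reordering, the Cortex-M0 keeps its logic simple and its instruction timing predictable, which suits low-power operation.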
Benchmarking CPI on Cortex-M0
While the ideal CPI is 1 for the Cortex-M0, real-world CPI is higher, by an amount that depends on how many loads, stores, and taken branches the code executes and on memory wait states. Some techniques to measure actual CPI include:
- Running representative code benchmarks and measuring clock cycles taken.
- Using simulation models like ARM Cycle Models to analyze performance in detail.
- Instrumenting assembly code and logging time taken for execution segments.
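- Using the SysTick timer (a 24-bit down-counter, where implemented) or a vendor timer peripheral to profile sections of code. The Cortex-M0 does not provide the DWT cycle counter found on the Cortex-M3 and later cores.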
- Calculating overall CPI based on total cycles and instruction counts.
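Benchmarks such as CoreMark, Dhrystone, Linpack, and Livermore Loops are usually reported as scores per MHz (for example DMIPS/MHz or CoreMark/MHz) rather than as CPI, so a CPI figure has to be derived from measured cycle and instruction counts on the target. An average near 1.03 CPI would be very close to the ideal of 1, but it is only reachable by code dominated by single-cycle instructions; code with many loads, stores, or taken branches measures higher.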
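The arithmetic behind a timer-based measurement can be sketched as follows. This assumes a 24-bit down-counter such as SysTick, clocked from the core clock and reloading to its maximum value 0xFFFFFF, and at most one wrap between the two readings. The readings, overhead, and instruction count below are invented for illustration.

```python
SYSTICK_BITS = 24
SYSTICK_MASK = (1 << SYSTICK_BITS) - 1


def elapsed_cycles(start: int, end: int) -> int:
    """Cycles between two readings of a 24-bit down-counter.

    Assumes the counter reloads to 0xFFFFFF and wraps at most once.
    """
    return (start - end) & SYSTICK_MASK


def measured_cpi(start: int, end: int, overhead_cycles: int, instructions: int) -> float:
    """CPI of a code section, after subtracting the cost of taking the readings."""
    cycles = elapsed_cycles(start, end) - overhead_cycles
    return cycles / instructions


# Hypothetical readings where the counter wrapped between start and end.
start_reading, end_reading = 0x000100, 0xFFFF00
print(elapsed_cycles(start_reading, end_reading))         # prints 512
print(measured_cpi(start_reading, end_reading, 12, 400))  # prints 1.25
```

The overhead value would normally be found by timing an empty section with the same read sequence. The instruction count has to come from a disassembly or simulator trace of the section.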
Optimizing Code for Cortex-M0’s 1 CPI
To fully leverage the 1 CPI capability in Cortex-M0, code can be optimized in certain ways:
- Use simple straight line code instead of branches when possible.
- Allocate variables to registers instead of memory.
- Let the flag-setting arithmetic instruction feed the conditional branch directly, so no separate compare is needed.
- Optimize hot loops for execution speed.
- Place critical code and data in zero-wait-state memory such as on-chip SRAM (the Cortex-M0 itself has no tightly coupled memory).
- Reduce unnecessary memory accesses.
- Inline small critical functions.
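- Where the chip provides a flash prefetch buffer or accelerator, align hot loops to its fetch width (the Cortex-M0 core itself has no instruction cache).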
By reducing taken branches, memory accesses, and wait states through code optimization techniques, code can stay close to the Cortex-M0's ideal CPI of 1 and maximize performance.
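The effect of trading branches for straight-line code can be estimated with a small model of a counted loop. It is a sketch under stated assumptions: single-cycle ALU instructions, 2-cycle loads and stores, a 3-cycle taken branch, a 1-cycle not-taken branch, zero wait states, and a loop that ends with a SUBS and BNE pair. The loop body (2 ALU instructions and 1 memory access) is hypothetical.

```python
# Assumed Cortex-M0 cycle costs with zero-wait-state memory.
ALU, LOAD_STORE, BRANCH_TAKEN, BRANCH_NOT_TAKEN = 1, 2, 3, 1


def loop_cost(iterations: int, unroll: int = 1, body_alu: int = 2, body_mem: int = 1):
    """Return (cycles, instructions) for a counted loop.

    Each pass runs `unroll` copies of the body, then a SUBS and a BNE.
    """
    if iterations <= 0 or unroll <= 0 or iterations % unroll:
        raise ValueError("iterations must be a positive multiple of unroll")
    passes = iterations // unroll
    body_instructions = unroll * (body_alu + body_mem)
    body_cycles = unroll * (body_alu * ALU + body_mem * LOAD_STORE)
    instructions = passes * (body_instructions + 2)  # +2 for SUBS and BNE
    # The BNE is taken on every pass except the last one.
    cycles = passes * (body_cycles + ALU) + (passes - 1) * BRANCH_TAKEN + BRANCH_NOT_TAKEN
    return cycles, instructions


for factor in (1, 4):
    cycles, instructions = loop_cost(64, unroll=factor)
    print(f"unroll x{factor}: {cycles} cycles, {instructions} instructions, "
          f"CPI = {cycles / instructions:.2f}")
```

In this model, unrolling by 4 cuts the cycle count from 510 to 318. Most of the saving comes from executing fewer instructions, so total cycles is the figure to optimize, and CPI alone can mislead.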
Comparisons to Other ARM Cores
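Compared to other ARM processor cores, the Cortex-M0 stands out for its simplicity rather than for having the lowest CPI. The per-core figures usually quoted in such comparisons are Dhrystone scores in DMIPS/MHz, where higher is better, and should not be read as CPI:
- Cortex-M3/M4 – Also a 3-stage pipeline, with branch speculation; about 1.25 DMIPS/MHz.
- Cortex-A5 – In-order, single-issue 8-stage pipeline; about 1.57 DMIPS/MHz.
- Cortex-A9 – Out-of-order, dual-issue pipeline; can complete more than one instruction per cycle, so its CPI can fall below 1.
- Cortex-A15 – Out-of-order, multi-issue pipeline of 15 or more stages; likewise capable of a CPI below 1 on suitable code.
Thus, the Cortex-M0 hits a sweet spot of simplicity and efficiency: a CPI close to 1 from a very small, low-power design. This makes it well-suited for ultra low power embedded applications.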
Conclusion
In summary, the ARM Cortex-M0 processor is designed for a CPI close to 1 to enable highly efficient code execution. Its short 3 stage pipeline, single-cycle data-processing instructions, and lack of microarchitectural complexity let most instructions execute in 1 clock cycle, while loads, stores, and taken branches cost 2 to 3 cycles. Real-world CPI for well-optimized code may approach 1.1 to 1.2 and is higher for load- and branch-heavy code, but it remains predictable. Coding techniques that reduce memory accesses and taken branches help code approach the ideal. Overall, the Cortex-M0's near-1 CPI is a key strength that delivers excellent performance per clock for low power embedded systems.