The ARM Cortex-M0+ is one of the smallest and simplest ARM processor cores, aimed at low-cost, low-power embedded applications. It has no floating-point unit or cache, and its memory protection unit is optional. It is still well suited to basic tasks in an embedded system. One common need is to create short delays or wait states in the code, for example when waiting for an external peripheral or sensor. The core itself provides at most the optional SysTick timer; general-purpose timer peripherals are added by the chip vendor and differ from device to device, so it is often convenient to create delays in software using only the core.
Using NOPs for Short Delays
One simple way to create a short delay is by inserting NOP (no operation) instructions. The processor will waste cycles executing these NOPs, creating the desired delay. For example:
// Delay approximately 10 cycles
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
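This is very straightforward, but it scales poorly: every extra cycle of delay costs another 2-byte instruction, so the maximum delay is limited by the size of the code memory. The duration in time also depends on the CPU frequency, and flash wait states or interrupts can lengthen it. In addition, ARM's documentation does not guarantee that a NOP consumes time, although on the Cortex-M0+ it normally takes one cycle. For very short delays of a few instructions, NOPs can still be useful.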
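To make the dependence on CPU frequency concrete, the arithmetic can be sketched as follows. The helper name nop_delay_ns is hypothetical, and the sketch assumes that each NOP takes exactly one cycle and that the code runs from zero-wait-state memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: duration in nanoseconds of `nops` back-to-back NOPs.
 * Assumes one cycle per NOP and no memory wait states. */
uint32_t nop_delay_ns(uint32_t nops, uint32_t cpu_hz) {
    assert(cpu_hz > 0);
    /* 64-bit intermediate so that nops * 1e9 cannot overflow */
    return (uint32_t)(((uint64_t)nops * 1000000000u) / cpu_hz);
}
```

Under these assumptions, the ten NOPs above last 10 microseconds at 1 MHz (nop_delay_ns(10, 1000000) gives 10000) but only about 208 nanoseconds at 48 MHz.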
Busy Loop Delays
A more flexible approach is to execute a busy loop for the desired number of cycles. This allows creating longer delays, up to billions of cycles if needed. Here is an example busy loop using a volatile counter variable:
// Delay for n loop iterations (each iteration takes several cycles)
void delay(int n) {
    volatile int i;
    for (i = 0; i < n; i++);
}
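This simple loop iterates n times, wasting cycles, before continuing program execution. The volatile keyword forces the compiler to perform every read and write of the counter, so the loop cannot be optimized away. Each iteration takes several cycles (the counter is loaded, incremented, stored, and compared), so the total delay is n times the per-iteration cost, not n cycles. The time also depends on the CPU frequency – at 1 MHz, 1000 cycles take 1 millisecond.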
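As a rough sketch of how a time in microseconds maps to a loop count, the conversion below rounds up so the delay is never shorter than requested. The function name delay_iterations_for_us is hypothetical, and cycles_per_iter is an assumed figure that has to be measured for the actual compiler, optimization level, and device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: loop iterations needed for at least `us` microseconds.
 * `cycles_per_iter` is an assumed, measured cost of one loop iteration. */
uint32_t delay_iterations_for_us(uint32_t us, uint32_t cpu_hz,
                                 uint32_t cycles_per_iter) {
    assert(cycles_per_iter > 0);
    /* Total cycles for the requested time, rounded up */
    uint64_t cycles = ((uint64_t)us * cpu_hz + 999999u) / 1000000u;
    /* Iterations covering that many cycles, rounded up */
    uint64_t iters = (cycles + cycles_per_iter - 1u) / cycles_per_iter;
    /* Saturate instead of wrapping if the result does not fit in 32 bits */
    return iters > UINT32_MAX ? UINT32_MAX : (uint32_t)iters;
}
```

For instance, if one iteration were measured at 10 cycles, a 1 millisecond delay at 1 MHz would need delay_iterations_for_us(1000, 1000000, 10) iterations, which is 100 rather than 1000.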
Optimizing Busy Loop Delays
The busy loop method can be improved in several ways:
- Use an unsigned integer for the counter – this increases the maximum delay
- Load the counter once and count down to zero – keeps the per-iteration overhead low
- Use nested loops to increase maximum delay
- Put several NOPs in the loop body – the loop overhead becomes a smaller share of each iteration
Here is an optimized busy loop approach:
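// Higher max delay, reduced overhead
#include <stdint.h>

void delay(uint32_t n) {
    // Init counter
    uint32_t i = n;
    // Outer loop
    while (i > 0) {
        // Loop body: four NOPs plus the decrement
        asm volatile(
            ".syntax unified\n\t"   // SUBS is the unified-syntax spelling
            "nop\n\t"
            "nop\n\t"
            "nop\n\t"
            "nop\n\t"
            "subs %0, #1\n\t"       // decrement; this also updates the flags
            : "+l" (i)              // low register (r0-r7), as Thumb-1 requires
            :
            : "cc"                  // tell the compiler the flags change
        );
    }
}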
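This delays for n loop iterations, not n CPU cycles. Each iteration costs roughly 8 cycles on the Cortex-M0+ (four NOPs, the subtract, and the compiler-generated compare and branch), although the exact figure depends on the compiler output. With the 32-bit counter, the maximum is about 4.29 billion iterations; 4.29 billion cycles alone would be about 71 minutes at 1 MHz, and at 8 cycles per iteration the maximum delay is roughly 9.5 hours. The NOPs in the loop body make the loop overhead a smaller fraction of each iteration.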
Cycle-Accurate Delays
The previous busy loop methods count loop iterations rather than CPU clock cycles, and they do not account for the cycles each iteration actually uses. For example, one iteration may take 5 cycles instead of the assumed 1 cycle. This means the delay will be 5 times longer than expected.
To create truly cycle-accurate delays, we need to measure and compensate for the overhead cycles used by the loop logic itself. This can be done by calibrating the delay loop on the target system:
- Initialize loop counter and start timer
- Execute busy loop for n iterations
- Stop timer and check elapsed cycles
- Calculate overhead cycles per loop iteration
- Compensate delays using calculated overhead
Here is example code to perform this calibration process:
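#include <stdint.h>

// Measured cycles per loop iteration, set by calibrate_delay()
static uint32_t cycles_per_iteration;

// Calibrate delay loop
// start_cycle_count() and stop_cycle_count() are placeholders for a
// device-specific cycle counter; they are not defined here
void calibrate_delay(void) {
    // Known number of loop iterations
    const uint32_t n = 1000;
    // Start cycle counter
    start_cycle_count();
    // Execute test loop
    for (uint32_t i = 0; i < n; i++) {
        asm volatile("nop");
    }
    // Stop cycle counter
    uint32_t elapsed = stop_cycle_count();
    // Cycles per iteration, loop overhead included
    cycles_per_iteration = elapsed / n;
}

// Delay for approximately the requested number of cycles
// Call calibrate_delay() first; the compiler must generate the same
// loop here as in the calibration for the figure to apply
void delay_cycles(uint32_t cycles) {
    uint32_t n = cycles / cycles_per_iteration;
    for (uint32_t i = 0; i < n; i++) {
        asm volatile("nop");
    }
}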
By measuring the actual elapsed cycles for a fixed number of loop iterations, we can calculate how many cycles each iteration really takes, loop overhead included. A requested delay in cycles is then divided by this figure to obtain the number of iterations, providing higher accuracy.
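The calibration code relies on a cycle counter that the text does not define. One candidate on the Cortex-M0+ is the optional SysTick timer, a 24-bit down-counter, in which case the elapsed count has to be computed with wraparound in mind. The sketch below shows only that arithmetic; the function names are hypothetical, and it assumes the reload value is 0xFFFFFF and that fewer than 2^24 cycles pass between the two reads.

```c
#include <assert.h>
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* SysTick counts down in 24 bits */

/* Cycles elapsed between two reads of the down-counter.
 * Assumes a reload value of 0xFFFFFF and fewer than 2^24 elapsed cycles. */
uint32_t systick_elapsed(uint32_t start, uint32_t end) {
    return (start - end) & SYSTICK_MASK;
}

/* Cycles per loop iteration, rounded to nearest rather than truncated. */
uint32_t cycles_per_iteration(uint32_t elapsed, uint32_t iterations) {
    assert(iterations > 0);
    return (elapsed + iterations / 2u) / iterations;
}
```

Because the counter runs downward, the start value is the larger one in the normal case. The mask makes the subtraction come out right even when the counter passes through zero and reloads between the two reads.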
Delay Routine in Assembly
For the tightest control over timing, the delay loop can be hand-coded in assembly language. This removes the compiler's influence on the generated loop, so the number of cycles per iteration is fixed and known.
Here is an example delay routine in ARM Thumb assembly:
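    .syntax unified
    .thumb
    .global delay
    .thumb_func
delay:
    // Counter in r0; must be at least 1 (0 wraps to 2^32 iterations)
loop:
    nop
    subs r0, #1     // decrement and set the flags
    bne loop        // repeat until r0 reaches zero
    bx lr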
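The delay parameter is passed in register r0, as the ARM procedure call standard specifies for the first argument. The loop is very tight, with only 1 NOP instruction inside it, so each iteration takes a small, fixed number of cycles (a taken branch costs 2 cycles on the Cortex-M0+). This makes the timing predictable as long as interrupts and memory wait states do not intervene. Maximum delay depends on counter size – 32-bit allows over 4 billion iterations.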
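Since the routine takes a loop count rather than a cycle count, the caller has to convert one into the other. A minimal sketch of that conversion follows. The name delay_count_for_cycles is hypothetical, and both cycles_per_iter and fixed_cycles (call, return, and setup cost) are assumed values that should come from the core's instruction timing table or from measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: loop count to pass in r0 for a target cycle budget.
 * `cycles_per_iter` and `fixed_cycles` are assumed, device-specific figures.
 * The result is never 0, because a count of 0 would wrap around and run
 * 2^32 iterations in a decrement-then-test loop. */
uint32_t delay_count_for_cycles(uint32_t target_cycles,
                                uint32_t cycles_per_iter,
                                uint32_t fixed_cycles) {
    assert(cycles_per_iter > 0);
    if (target_cycles <= fixed_cycles) {
        return 1u;  /* shortest delay the routine can produce */
    }
    uint64_t remaining = (uint64_t)target_cycles - fixed_cycles;
    /* Round up so the delay is at least the requested length */
    uint64_t n = (remaining + cycles_per_iter - 1u) / cycles_per_iter;
    return (uint32_t)n;  /* 1 <= n <= remaining, so it fits in 32 bits */
}
```

With an assumed 4 cycles per iteration and 4 cycles of fixed cost, a 1000-cycle target gives a count of 249. Requests shorter than the fixed cost are clamped to a single iteration.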
In summary, several techniques exist for creating delays on the Cortex M0+:
- NOP instructions for very short delays
- Busy loop written in C, adjustable duration
- Busy loop in assembly for high performance
- Calibrating loops for maximum accuracy
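Delays allow simple integration of wait states into code flow for peripherals, sensors, or other external events. With care, the resolution approaches a single clock cycle, which is about 21 nanoseconds at 48 MHz, provided that interrupts and memory wait states are accounted for.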