How to delay an ARM Cortex M0+ for n cycles, without a timer?

The ARM Cortex-M0+ is one of the smallest and simplest ARM processor cores, aimed at low-cost and low-power embedded applications. It has no floating point unit or cache, and its memory protection unit is optional. However, it is well suited to basic tasks in an embedded system. One common need is to create short delays or wait states in the code, for example when waiting for an external peripheral or sensor. The only timer defined by the core is the optional SysTick, which may be absent or already in use, and other timer peripherals are vendor-specific, so it is useful to know how to create delays without one.

Contents

Using NOPs for Short Delays
Busy Loop Delays
Optimizing Busy Loop Delays
Cycle-Accurate Delays
Delay Routine in Assembly

Using NOPs for Short Delays

One simple way to create a short delay is by inserting NOP (no operation) instructions. The processor will waste cycles executing these NOPs, creating the desired delay. For example:


// Delay approximately 10 cycles 
NOP
NOP
NOP
NOP  
NOP
NOP
NOP
NOP
NOP
NOP

This is very straightforward, but has limited utility: every cycle of delay costs one instruction of code memory, so long delays are impractical. The delay is a fixed number of cycles, so its duration in time depends on the CPU frequency, and flash wait states or interrupts can lengthen it. But for very short delays of a few cycles, NOPs can be useful.

Busy Loop Delays

A more flexible approach is to execute a busy loop for the desired number of cycles. This allows creating longer delays, up to billions of cycles if needed. Here is an example busy loop using a volatile counter variable:


// Busy-wait for n loop iterations (each iteration takes several cycles)
void delay(int n) {
  volatile int i;
  for (i = 0; i < n; i++);
}

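This simple loop iterates n times, wasting cycles, before continuing program execution. The volatile keyword tells the compiler not to optimize away the loop counter. It also forces the counter to be read from and written to memory on every pass, so each iteration costs several cycles and n iterations take noticeably more than n cycles. The duration in time depends on the CPU frequency: at 1 MHz, 1000 cycles take approximately 1 millisecond.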
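The relationship between time and cycles can be sketched as a pair of small helpers. This is a minimal illustration, not part of any vendor library: the names us_to_cycles and cycle_time_ns are hypothetical, and cpu_hz is assumed to be the actual core clock frequency in Hz.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: convert a delay in microseconds to CPU cycles,
// rounding up so the delay is never shorter than requested.
// cpu_hz is assumed to be the core clock frequency in Hz.
uint32_t us_to_cycles(uint32_t us, uint32_t cpu_hz) {
  assert(cpu_hz > 0);
  // Use a 64-bit intermediate: us * cpu_hz overflows 32 bits quickly.
  uint64_t cycles = ((uint64_t)us * cpu_hz + 999999u) / 1000000u;
  // Saturate instead of wrapping if the result does not fit in 32 bits.
  return cycles > UINT32_MAX ? UINT32_MAX : (uint32_t)cycles;
}

// Hypothetical helper: duration of one CPU cycle in nanoseconds,
// rounded to the nearest nanosecond.
uint32_t cycle_time_ns(uint32_t cpu_hz) {
  assert(cpu_hz > 0);
  return (uint32_t)((1000000000ull + cpu_hz / 2u) / cpu_hz);
}
```

For example, us_to_cycles(1000, 1000000) gives 1000 cycles for 1 ms at 1 MHz. The result is a cycle count, not a loop iteration count; it still has to be divided by the cost of one loop iteration.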

Optimizing Busy Loop Delays

The busy loop method can be improved in several ways:

Use an unsigned integer for the counter – this increases the maximum delay
Load the counter once, before the loop, and count down to zero – avoids comparing against n on each iteration
Use nested loops to increase maximum delay
Unroll the loop body (several NOPs per iteration) so the decrement and branch are a smaller share of each iteration

Here is an optimized busy loop approach:


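// Higher max delay, reduced overhead
// Assumes GCC or Clang inline assembly targeting Thumb
#include <stdint.h>

void delay(uint32_t n) {
  // Init counter
  uint32_t i = n;

  // Count down to zero
  while (i > 0) {
    // Unrolled body: four NOPs per decrement
    asm volatile(
      ".syntax unified\n\t"
      "nop\n\t"
      "nop\n\t"
      "nop\n\t"
      "nop\n\t"
      "subs %0, %0, #1\n\t"
      : "+l" (i)   // low register, as Thumb requires for SUBS
      :
      : "cc"       // SUBS updates the condition flags
    );
  }
}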

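This runs n loop iterations, not n CPU cycles. Each iteration executes four NOPs, the decrement, and the compiler-generated loop test and branch, so it takes several cycles (about seven or eight on a Cortex-M0+, depending on the generated code). With the 32-bit counter, the maximum single delay is about 4.29 billion iterations; even at one cycle per iteration that would be 71 minutes at 1 MHz, and the real figure is several times longer. Unrolling the loop body makes the decrement and branch a smaller share of each iteration.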

Cycle-Accurate Delays

The previous busy loop methods count loop iterations, not CPU clock cycles. They do not account for the actual cycles used by each iteration. For example, the loop may take 5 cycles per iteration instead of the assumed 1 cycle. This means the delay will be 5x longer than expected.

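To create truly cycle-accurate delays, we need to measure and compensate for the cycles used by the loop logic itself. This can be done by calibrating the delay loop once on the target system against a cycle reference, such as SysTick or a peripheral timer if one is available. After calibration, the delays themselves need no timer: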

Initialize loop counter and start timer
Execute busy loop for n iterations
Stop timer and check elapsed cycles
Calculate overhead cycles per loop iteration
Compensate delays using calculated overhead

Here is example code to perform this calibration process:


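// Calibrate delay loop
// start_cycle_count() and stop_cycle_count() are placeholders for a
// platform-specific cycle reference (for example SysTick, if present).
#include <stdint.h>

void start_cycle_count(void);
uint32_t stop_cycle_count(void);

// Measured cost of one loop iteration, in CPU cycles
static uint32_t cycles_per_iteration = 1;

void calibrate_delay(void) {

  // Known n loop iterations
  const uint32_t n = 1000;

  // Start cycle counter
  start_cycle_count();

  // Execute test loop
  for (uint32_t i = 0; i < n; i++) {
    asm volatile(
      "nop\n\t"
    );
  }

  // Stop cycle counter
  uint32_t elapsed = stop_cycle_count();

  // Cycles per loop iteration, rounded to nearest and never zero
  cycles_per_iteration = (elapsed + n / 2) / n;
  if (cycles_per_iteration == 0) {
    cycles_per_iteration = 1;
  }
}

// Convert a requested delay in cycles to loop iterations
uint32_t delay_iterations(uint32_t delay_cycles) {
  return delay_cycles / cycles_per_iteration;
}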

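By measuring the actual elapsed cycles for a fixed number of loop iterations, we can calculate how many cycles each iteration costs. A requested delay in cycles is then divided by this figure to get the number of iterations to run, which gives higher accuracy. The loop that is measured must be the same loop that is later used for delays, compiled with the same options, or the calibration does not apply.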
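A single measurement mixes the one-time cost of starting and leaving the loop with the per-iteration cost. One possible refinement, sketched here under the assumption that the delay follows a linear model (cycles = fixed + per_iter × iterations), is to measure at two different iteration counts and separate the two terms. The type delay_model_t and both function names are hypothetical, chosen for this illustration.

```c
#include <assert.h>
#include <stdint.h>

// Assumed linear cost model: cycles = fixed + per_iter * iterations.
typedef struct {
  uint32_t fixed;     // call, setup and return cycles, paid once
  uint32_t per_iter;  // cycles for one pass through the loop body
} delay_model_t;

// Fit the model from two measurements taken on the target.
// n1 < n2 are iteration counts; c1 and c2 are the measured cycle totals.
delay_model_t fit_delay_model(uint32_t n1, uint32_t c1,
                              uint32_t n2, uint32_t c2) {
  assert(n2 > n1 && c2 > c1);
  delay_model_t m;
  m.per_iter = (c2 - c1) / (n2 - n1);
  assert(m.per_iter > 0);
  uint32_t loop_cycles = m.per_iter * n1;
  // Whatever the loop itself does not explain is fixed overhead.
  m.fixed = c1 > loop_cycles ? c1 - loop_cycles : 0;
  return m;
}

// Iterations needed so that the delay lasts at least target_cycles.
// Requests shorter than the fixed overhead get the minimum of one iteration.
uint32_t iterations_for_cycles(delay_model_t m, uint32_t target_cycles) {
  assert(m.per_iter > 0);
  if (target_cycles <= m.fixed) {
    return 1;
  }
  uint32_t loop_cycles = target_cycles - m.fixed;
  // Round up without risking overflow in the addition.
  return loop_cycles / m.per_iter + (loop_cycles % m.per_iter != 0);
}
```

With measurements of 510 cycles for 100 iterations and 5010 cycles for 1000 iterations, the fit gives 5 cycles per iteration and 10 cycles of fixed overhead, so a 1000-cycle delay needs 198 iterations. Rounding up means the delay is never shorter than requested, which is usually the safer choice when waiting on hardware.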

Delay Routine in Assembly

For ultimate performance and flexibility, the delay loop can be hand-coded in assembly language. This allows full control over the loop behavior and overhead.

Here is an example delay routine in ARM Thumb assembly:


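.syntax unified
.thumb
.global delay

.thumb_func
delay:
  @ Iteration count in r0; return immediately if it is zero
  cmp r0, #0
  beq done

loop:
  nop
  subs r0, r0, #1   @ decrement and set the flags
  bne loop          @ repeat until r0 reaches zero

done:
  bx lr

The iteration count is passed in register r0, the first argument register in the ARM calling convention. The loop is very tight: one NOP, a flag-setting SUBS, and the branch back, so no separate compare is needed. On a Cortex-M0+ running from zero-wait-state memory this is 4 cycles per iteration (the taken branch costs 2), plus a few cycles for the call and return, which makes the delay easy to predict. Maximum delay depends on counter size: a 32-bit counter allows over 4 billion iterations.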

In summary, several techniques exist for creating delays on the Cortex M0+:

NOP instructions for very short delays
Busy loop written in C, adjustable duration
Busy loop in assembly for high performance
Calibrating loops for maximum accuracy

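Delays allow simple integration of wait states into code flow for peripherals, sensors, or other external events. With care, the resolution can approach a single clock cycle, which is tens of nanoseconds at typical clock speeds (about 21 ns at 48 MHz). This holds only if interrupts are disabled or their effect is tolerated, since an interrupt taken during the loop lengthens the delay.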
