
Workarounds for GNU-ARM Compiler Inefficiencies on Cortex-M0/M1

Andrew Irwin
Last updated: September 17, 2023 2:20 am

The GNU ARM compiler (arm-none-eabi-gcc) is a widely used toolchain for compiling code for ARM Cortex-M0 and Cortex-M1 microcontrollers. However, the compiler is not always able to generate optimal code for these resource-constrained chips. This can lead to inefficient code that takes up more flash memory, runs slower, and consumes extra power. Fortunately, with some workarounds during coding and compilation, many of these inefficiencies can be avoided.

Contents
  • Loop Unrolling
  • Function Inlining
  • Optimized Library Functions
  • Efficient Data Access
  • Memory Regions for Switch Statements
  • Stack Usage
  • Initialization of Large Static Arrays
  • Alignment of Buffers
  • Use of Built-in Data-Manipulation Instructions
  • Tight Assembly Loops
  • Reduce Code Size with Thumb-1
  • Compiler Optimizations
  • Summary

Loop Unrolling

One common issue is that the compiler does not always unroll small loops. Unrolling a loop reduces the number of branch instructions executed, avoiding pipeline stalls, which can significantly speed up hot loops. GCC does not apply this optimization reliably at the default optimization levels. From GCC 8 onward, an unroll pragma placed directly before the loop forces it:

#pragma GCC unroll 16
for(i = 0; i < 16; i++) {
   ...
}

Be careful not to unroll loops too much, as it can bloat code size. Test different unroll amounts to find the best performance vs. size trade-off.
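On older toolchains without the pragma, loops can be unrolled by hand. A minimal sketch, assuming the buffer length is a multiple of four; the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum a buffer whose length is a multiple of 4, unrolled by 4
 * so only one branch executes per four elements. */
uint32_t sum_unrolled(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i += 4) {
        acc += buf[i];
        acc += buf[i + 1];
        acc += buf[i + 2];
        acc += buf[i + 3];
    }
    return acc;
}
```

A residual loop would be needed for lengths that are not a multiple of the unroll factor.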

Function Inlining

Inlining small functions removes function call overhead. But the compiler heuristics are sometimes too conservative, missing inlining opportunities. For important functions, use attributes to force inlining:

static inline __attribute__((always_inline)) void myFunction(void) {
   ...
}

This inlines the function even when the compiler deems it too large. Again, balance the benefits against potential code size increase.
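The payoff is largest for tiny functions called from hot paths, such as flag tests and register accessors. A minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical status-flag helper: always_inline removes the
 * call/return overhead for this one-line test. */
static inline __attribute__((always_inline))
int flag_is_set(uint32_t reg, uint32_t mask)
{
    return (reg & mask) != 0;
}
```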

Optimized Library Functions

The smallest C libraries trade speed for size: newlib-nano (selected with --specs=nano.specs) favors a compact, typically byte-oriented memcpy, while full newlib ships an ARM-tuned version that copies words using load/store-multiple (LDM/STM) instructions. If bulk copies are hot, link the full library, or supply a word-oriented copy on the critical paths:

#include <string.h>
...
memcpy(dst, src, n); /* resolved by the C library's tuned version */

The word-oriented implementations accelerate bulk copies significantly.
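When alignment is guaranteed, a hand-rolled word copy captures most of the benefit. A minimal sketch, assuming both pointers are 4-byte aligned and the length is a multiple of four (an unaligned word access hard-faults on these cores); the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time copy. Caller must guarantee 4-byte alignment
 * and a byte count that is a multiple of 4. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++)
        dst[i] = src[i];  /* one LDR + one STR per 4 bytes */
}
```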

Efficient Data Access

Cortex-M0/M1 support byte, halfword, and word loads and stores, but no unaligned accesses. Match the access type to the data so the compiler can emit a single LDRB/LDRH instead of a wider load followed by shifting and masking:

uint8_t array[128];
...
uint8_t b = array[i]; // Not uint32_t

Writing single bytes or halfwords with STRB/STRH likewise avoids a read-modify-write of the surrounding word. Accessing data at its natural size is efficient on these cores.
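The difference can be seen by comparing the two access styles; both yield the same value, but the direct byte access compiles to a single instruction. A minimal sketch (little-endian layout assumed, as on Cortex-M defaults):

```c
#include <stdint.h>

/* Direct byte access: one LDRB on Cortex-M. */
uint8_t byte_direct(const uint8_t *p, int i)
{
    return p[i];
}

/* Extracting the same byte from a word load needs a shift and mask. */
uint8_t byte_from_word(uint32_t w, int i)
{
    return (uint8_t)(w >> (8 * i));  /* little-endian byte i */
}
```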

Memory Regions for Switch Statements

Large, dense switch statements compile to jump tables. On Thumb-1, GCC calls compact helpers from libgcc (__gnu_thumb1_case_uqi and relatives) that index a table of byte or halfword offsets instead of full 32-bit addresses, but only when the case values are contiguous. Keep case ranges dense, and place large constant lookup tables in flash by declaring them const (optionally in a named section):

__attribute__((section(".rodata.jumptable")))
static const uint32_t values[] = { 0x13, 0x37, ... };

switch(x) {   /* x in a small contiguous range: table-friendly */
   ...
}

This keeps the table compact with 8- or 16-bit offsets. Sparse case values instead force a chain of compares or a table of 32-bit pointers, increasing size.
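The same size-for-speed reasoning applies when case values are sparse: a small const table indexed with a bounds check often beats the compare chain the compiler would otherwise emit. A minimal sketch with hypothetical table contents:

```c
#include <stdint.h>

/* Sparse switch rewritten as a dense const lookup table in flash:
 * one bounds check plus one indexed load replaces a compare chain.
 * Table values are illustrative. */
static const uint8_t prio_map[8] = { 0, 3, 3, 1, 1, 1, 2, 2 };

uint8_t irq_priority(uint8_t irq)
{
    return (irq < 8) ? prio_map[irq] : 0;  /* 0 = default priority */
}
```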

Stack Usage

Cortex-M0/M1 systems typically have only a few KB of RAM, so stacks are correspondingly small. Avoid recursive functions and large stack variables; where possible, allocate larger data statically or on the heap instead. Stack overflows are difficult to debug on Cortex-M, since many of these parts have no MPU to guard the stack boundary.

// Avoid: 1 KB local lands on the stack
void process(void) {
   uint8_t stackArray[1024];
   ...
}

// Prefer: fixed allocation, accounted for at link time
static uint8_t staticArray[1024];

Use stack checking and watermark features offered by various RTOS and debug tools. This helps catch issues early.
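The watermark technique can be sketched portably: fill the stack region with a known pattern at boot, then later count the surviving pattern bytes to find the worst-case depth. The region parameter below is a stand-in for the real stack area defined by the linker script:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_WATERMARK 0xAAu

/* Count leading pattern bytes still intact. Because the stack grows
 * downward, untouched watermark bytes sit at the low end of the
 * region; (size - unused) is the high-water mark. */
size_t stack_unused_bytes(const uint8_t *region, size_t size)
{
    size_t n = 0;
    while (n < size && region[n] == STACK_WATERMARK)
        n++;
    return n;
}
```

At boot, memset the region to STACK_WATERMARK before the scheduler or main loop starts; query it periodically or at shutdown.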

Initialization of Large Static Arrays

Large static arrays should be initialized like:

static uint32_t values[] = {0x13, 0x37, ... };

The compiler stores the initial values compactly in flash, and the startup code copies them to RAM before main() runs. If the values never change, declare the array const so it stays in flash and needs no RAM copy at all. Avoid initializing in code:

static uint32_t values[1024];

void init() {
  for(int i = 0; i < 1024; i++) {
     values[i] = i; 
  }
}

The loop version runs at startup, costing time and energy, and the generated code is often larger than the packed initializer data it replaces.
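For tables that never change, a const declaration keeps the data in .rodata (flash) with no startup copy at all. A minimal sketch with illustrative values:

```c
#include <stdint.h>

/* const moves the table into .rodata: zero RAM used,
 * zero startup copy. Values are illustrative. */
static const uint32_t crc_seed[4] = { 0x13, 0x37, 0xBE, 0xEF };

uint32_t seed_at(int i)
{
    return crc_seed[i];  /* read directly from flash */
}
```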

Alignment of Buffers

Use word-aligned buffers where possible:

uint8_t buffer[128] __attribute__((aligned(4)));

This allows the buffer to be processed word-at-a-time and handed to DMA engines that require word-aligned addresses, without the hard fault that an unaligned word access causes on these cores. (Cortex-M0/M1 have no data cache, so cache-line alignment is not a concern; word alignment is what matters.)
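The alignment can be verified at runtime during bring-up. A minimal sketch using the same GCC aligned attribute:

```c
#include <stdint.h>

/* Word-aligned byte buffer (aligned is a GCC extension); safe to
 * hand to a word-wise copy or a DMA engine expecting alignment. */
static uint8_t dma_buffer[128] __attribute__((aligned(4)));

int buffer_is_word_aligned(void)
{
    return ((uintptr_t)dma_buffer & 3u) == 0;
}
```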

Use of Built-in Data-Manipulation Instructions

Cortex-M0/M0+/M1 have no DSP extension, but they do provide single-cycle data-manipulation instructions such as REV, REV16, and REVSH for byte-order reversal. CMSIS intrinsics expose them directly:

uint32_t swapped = __REV(value); // Reverse byte order (endianness swap)

This uses a single instruction instead of a sequence of shifts and ORs.
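For host-portable code, GCC's __builtin_bswap32 expresses the same operation and is lowered to a single REV when targeting Cortex-M:

```c
#include <stdint.h>

/* Portable byte-order reversal: GCC compiles this to one REV
 * instruction on Cortex-M, and to the host's equivalent elsewhere. */
uint32_t rev32(uint32_t v)
{
    return __builtin_bswap32(v);
}
```

Useful for converting between little-endian memory layout and big-endian network byte order.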

Tight Assembly Loops

For critical loops, drop to assembly to optimize further than the compiler. The sketch below runs sixteen multiply-accumulate steps. Note the Thumb-1 constraints: MULS must write one of its source registers, most instructions can only use the low registers (GCC constraint "l"), and the flag-setting SUBS is placed last so that BNE tests the loop counter rather than the result of ADDS:

uint32_t acc = x, m = y, cnt, tmp;
asm volatile(
   ".syntax unified            \n"
   "movs  %[cnt], #16          \n"
   "1:                         \n"
   "movs  %[tmp], %[acc]       \n" /* tmp = acc                    */
   "muls  %[tmp], %[m], %[tmp] \n" /* tmp *= m (MULS Rdm, Rn, Rdm) */
   "adds  %[acc], %[tmp]       \n" /* acc += tmp                   */
   "subs  %[cnt], #1           \n" /* sets the flags BNE tests     */
   "bne   1b                   \n"
   : [acc] "+l" (acc), [cnt] "=&l" (cnt), [tmp] "=&l" (tmp)
   : [m] "l" (m)
   : "cc");

This achieves higher performance but is not portable. Use only when needed.

Reduce Code Size with Thumb-1

Cortex-M0/M1 implement only the Thumb-1 (ARMv6-M) instruction set, so there is no Thumb-2 to fall back to; the 16-bit encodings already give high code density. To shrink code further, compile with -Os and let the linker discard unreferenced code and data:

arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os \
   -ffunction-sections -fdata-sections -c main.c
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Wl,--gc-sections \
   main.o -o main.elf

But beware of Thumb-1 limitations such as short branch ranges and the restriction of most instructions to registers r0-r7.

Compiler Optimizations

Aggressive optimizations can sometimes improve performance but increase size. Benchmark with options like:

-O3 
-funroll-loops
-finline-functions 
-fsingle-precision-constant
-fno-math-errno

Measure both speed and size after each change: aggressive settings can inflate code beyond the flash budget, and flags like -fno-math-errno relax standard-library semantics, so verify behavior as well as benchmarks.

Summary

The Cortex-M0 and Cortex-M1 are optimized for low-power IoT endpoints. But getting the most from them requires working around compiler inefficiencies during coding and compilation. Techniques like manual loop unrolling, memory layout optimizations, and using tight assembly can help boost performance, reduce code size, and lower power consumption. With attention to details like this, developers can squeeze every last drop of efficiency from these microcontroller workhorses.
