The GNU ARM compiler (arm-none-eabi-gcc) is a widely used toolchain for compiling code for ARM Cortex-M0 and Cortex-M1 microcontrollers. However, the compiler is not always able to generate optimal code for these resource-constrained chips. This can lead to inefficient code that takes up more flash memory, runs slower, and consumes extra power. Fortunately, with some workarounds during coding and compilation, many of these inefficiencies can be avoided.
Loop Unrolling
One common issue is that the compiler does not unroll small loops at the default optimization levels. Unrolling a loop reduces the number of branch instructions executed, avoiding the pipeline refill a taken branch costs on these cores. This can significantly speed up execution. To request unrolling of a specific loop, use GCC's unroll pragma (available since GCC 8) immediately before the loop:
#pragma GCC unroll 16
for(i = 0; i < 16; i++) {
    ...
}
Be careful not to unroll loops too much, as it can bloat code size. Test different unroll amounts to find the best performance vs. size trade-off.
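For example, a partial unroll factor often captures most of the speed-up at a fraction of the size cost. A minimal sketch; the factor of 4, the function name, and the data buffer are purely illustrative:
#include <stdint.h>

/* 'data' is a hypothetical 64-word input buffer */
uint32_t sum_block(const uint32_t data[64])
{
    uint32_t sum = 0;

    /* Unroll by 4: four additions per backward branch instead of one */
    #pragma GCC unroll 4
    for (int i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}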
Function Inlining
Inlining small functions removes function call overhead. But the compiler heuristics are sometimes too conservative, missing inlining opportunities. For important functions, use attributes to force inlining:
static inline __attribute__((always_inline)) void myFunction(void) {
...
}
This inlines the function even when the compiler deems it too large. Again, balance the benefits against potential code size increase.
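The opposite attribute is just as useful when tuning for size: explicitly keeping cold code (error handling, rarely taken paths) out of line stops the optimizer from duplicating it at every call site. A small sketch; the function name is hypothetical:
/* Cold path: called rarely, so call overhead is irrelevant */
static __attribute__((noinline)) void handleError(int code)
{
    /* ... log the code, reset peripherals, etc. ... */
    (void)code;
}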
Optimized Library Functions
The default C library build, in particular the size-optimized newlib-nano selected with --specs=nano.specs, may implement functions like memcpy as a plain byte-at-a-time loop, leaving the Cortex-M0/M1 word and load/store-multiple (LDM/STM) instructions unused. For hot paths, link against the speed-optimized library variant instead, or supply a copy routine tuned for Cortex-M. Word-oriented copies move four bytes per memory access and accelerate small copies significantly.
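As a minimal sketch, assuming both buffers are word aligned and the length is a multiple of four (fall back to memcpy otherwise), a loop like this compiles to word loads and stores, and with enough free registers GCC will often batch them into LDM/STM; the function name is illustrative:
#include <stddef.h>
#include <stdint.h>

/* Word-at-a-time copy: both pointers must be 4-byte aligned and
 * n_bytes a multiple of 4; otherwise use plain memcpy(). */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes / 4u; i++) {
        dst[i] = src[i];
    }
}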
Efficient Data Access
Cortex-M0/M1 can load and store bytes, halfwords, and words in single instructions (LDRB/LDRH/LDR and their store counterparts), but pulling a byte out of a larger variable, or updating part of one, costs extra shift, mask, or read-modify-write instructions. So access data using the smallest type that matches its use:
uint8_t array[128];
...
uint8_t b = array[i]; // Not uint32_t
This avoids having to do read-modify-write for parts of larger variables. Accessing single bytes or halfwords is efficient.
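The cost is most visible on writes: updating a byte that lives inside a word needs a read-modify-write sequence, while a plain byte store is a single STRB. The packed/bytes variables below are hypothetical:
#include <stdint.h>

uint32_t packed;        /* four values packed into one word  */
uint8_t  bytes[4];      /* the same values as separate bytes */

void store_low_byte(uint8_t newByte)
{
    /* Part of a word: load, mask, OR, store (read-modify-write) */
    packed = (packed & ~0xFFu) | newByte;

    /* A byte variable: a single STRB instruction */
    bytes[0] = newByte;
}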
Memory Regions for Switch Statements
Large switch statements can compile into long compare-and-branch chains or into tables of 32-bit code addresses. When the case values are dense and each case just selects a value, an explicit const lookup table is usually smaller and faster, and placing it in a dedicated section lets the linker keep it in flash next to the code that uses it:
__attribute__((section(".jumptable")))
static const uint32_t values[] = { 0x13, 0x37, 0xAB, 0xCD };

uint32_t lookup(uint32_t x) {
    if (x < sizeof(values) / sizeof(values[0])) {
        return values[x];   /* replaces switch (x) { case 0: ... } */
    }
    return 0;               /* default case */
}
The table stays in read-only flash, needs no RAM copy, and avoids the table of 32-bit addresses a large switch might otherwise produce. Note that GCC's own Thumb-1 switch dispatch already uses compact 8- or 16-bit offset tables when case values are dense, so measure the size before and after.
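For the section placement to take effect, the linker script has to mention it. A minimal sketch, assuming a typical GNU ld script with a FLASH memory region (region and section names vary between vendor scripts):
/* Linker script excerpt: keep .jumptable in flash alongside the code */
.jumptable :
{
    KEEP(*(.jumptable))
} > FLASH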
Stack Usage
Cortex-M0/M1 systems typically have only a few KB of RAM, so stacks are correspondingly small. Avoid recursive functions and large local variables. Where possible, allocate larger data statically, or on the heap if the design uses one, instead. Stack overflows are difficult to debug on Cortex-M.
// Avoid
uint8_t stackArray[1024];
// Prefer
static uint8_t staticArray[1024];
Use stack checking and watermark features offered by various RTOS and debug tools. This helps catch issues early.
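FreeRTOS, for example, reports the minimum free stack a task has ever had (its high water mark); checking it periodically during development catches tasks running close to overflow. A sketch, assuming INCLUDE_uxTaskGetStackHighWaterMark is enabled in FreeRTOSConfig.h and the threshold of 32 words is an arbitrary example:
#include "FreeRTOS.h"
#include "task.h"

void checkStack(void)
{
    /* Remaining stack for the calling task, in words, at its lowest point */
    UBaseType_t freeWords = uxTaskGetStackHighWaterMark(NULL);

    if (freeWords < 32) {
        /* Dangerously close to overflow: log, assert, or widen the stack */
    }
}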
Initialization of Large Static Arrays
Large static arrays should be initialized like:
static uint32_t values[] = {0x13, 0x37, ... };
The initial values are stored in flash and the C startup code copies them to RAM in one tight loop before main() runs. Avoid initializing by hand at runtime:
static uint32_t values[1024];
void init() {
for(int i = 0; i < 1024; i++) {
values[i] = i;
}
}
The hand-written loop runs on every boot as ordinary application code and is typically slower than the bulk copy the startup code performs, costing both time and energy.
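When the contents never change at run time, go one step further and declare the array const: with the usual linker script it then lives entirely in flash, needs no RAM, and costs nothing at startup:
#include <stdint.h>

/* Stays in flash (.rodata); no RAM copy and no startup work required */
static const uint32_t lut[] = { 0x13, 0x37, 0x42, 0x99 };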
Alignment of Buffers
Use word aligned buffers where possible:
uint8_t buffer[128] __attribute__((aligned(4)));
A uint32_t array is already naturally word aligned; the attribute matters for byte and halfword buffers like the one above, which can then also be accessed a word or halfword at a time without triggering the Cortex-M0/M1 unaligned-access fault. Many DMA controllers likewise require, or perform best with, word-aligned buffers; cache-line alignment is not a concern because these cores have no data cache.
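As an illustration, an aligned byte buffer can be processed four bytes at a time through a uint32_t pointer; without the alignment attribute this could fault on Cortex-M0/M1. The buffer and function names are hypothetical, and type-punning like this should be paired with -fno-strict-aliasing (or a memcpy) to stay fully standard-conformant:
#include <stdint.h>

uint8_t rx_buffer[128] __attribute__((aligned(4)));

uint32_t sum_rx_words(void)
{
    /* Safe only because rx_buffer is 4-byte aligned (see caveat above) */
    const uint32_t *words = (const uint32_t *)rx_buffer;
    uint32_t sum = 0;

    for (unsigned i = 0; i < sizeof(rx_buffer) / 4u; i++) {
        sum += words[i];   /* one LDR per four bytes instead of four LDRBs */
    }
    return sum;
}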
Data-Manipulation Intrinsics
Cortex-M0/M0+/M1 do not implement the DSP extension, but the ARMv6-M core instruction set still includes handy data-manipulation instructions such as REV, REV16, and REVSH for byte reordering. Use the CMSIS intrinsics to reach them without inline assembly:
uint32_t reversed = __REV(value); // Reverse byte order (endianness swap)
This uses a single instruction instead of multiple shifts/rotates.
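A typical use is converting between the core's little-endian layout and big-endian (network) byte order; the helper names below are illustrative, and cmsis_gcc.h is normally pulled in via the vendor device header:
#include <stdint.h>
#include "cmsis_gcc.h"   /* usually included indirectly by the device header */

/* Convert a 32-bit value between network (big-endian) and host byte order */
static inline uint32_t swap32(uint32_t v)
{
    return __REV(v);     /* single REV instruction */
}

/* Swap the bytes within each halfword of a packed pair of 16-bit values */
static inline uint32_t swap_halfword_bytes(uint32_t v)
{
    return __REV16(v);   /* single REV16 instruction */
}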
Tight Assembly Loops
For critical loops, drop to assembly to optimize further than the compiler:
uint32_t acc = 3, mult = 5;        /* example operands; any uint32_t values work */
asm volatile(
    ".syntax unified        \n"
    "    movs  r2, #16      \n"    /* loop counter                          */
    "1:                     \n"
    "    movs  r3, %0       \n"    /* r3 = acc                              */
    "    muls  r3, %1, r3   \n"    /* r3 = mult * r3 (Thumb-1 MULS form)    */
    "    adds  %0, %0, r3   \n"    /* acc += r3                             */
    "    subs  r2, r2, #1   \n"    /* decrement counter last: sets flags    */
    "    bne   1b           \n"    /* branch while counter != 0             */
    : "+l" (acc), "+l" (mult)      /* low registers, each kept distinct     */
    :
    : "r2", "r3", "cc"             /* scratch registers and condition flags */
);
This achieves higher performance but is not portable. Use only when needed.
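For reference, the loop above computes the same result as this portable C (acc and mult as in the snippet above); keeping such a C fallback next to the assembly documents the intent and eases porting:
/* Portable equivalent of the assembly loop above */
for (int i = 0; i < 16; i++) {
    acc += acc * mult;
}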
Reduce Code Size with Thumb
Cortex-M0 and Cortex-M1 execute only the Thumb instruction set (the 16-bit ARMv6-M encodings plus a handful of 32-bit instructions), so every function is already compiled with -mthumb and there is no per-function ISA choice to make, as there is on larger ARM/Thumb-capable cores. Where code is shared with such cores, a function can be forced into the denser Thumb encoding explicitly:
__attribute__((target("thumb")))
void func() {
...
}
Keep in mind that the 16-bit encodings have limitations such as shorter branch ranges and access to fewer registers, so measure the size and speed impact rather than assuming a win.
Compiler Optimizations
Aggressive optimizations can sometimes improve performance but increase size. Benchmark with options like:
-O3
-funroll-loops
-finline-functions
-fsingle-precision-constant
-fno-math-errno
These options trade code size, and in the case of -fno-math-errno and -fsingle-precision-constant some strict C semantics, for speed. They do not make the compiler emit invalid instructions, but they can expose latent undefined behavior in the source and grow the binary noticeably, so when flash is the constraint, benchmark against -Os as well.
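A typical starting command line, spelling out the core, the ISA, and the size-related options (file names, the linker script, and the -O level are placeholders to tune per project; --specs=nano.specs selects the size-optimized C library discussed earlier, so drop it when you want the faster memcpy):
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -O2 -funroll-loops \
    -ffunction-sections -fdata-sections \
    -Wl,--gc-sections --specs=nano.specs \
    -T linker_script.ld -o app.elf main.c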
Summary
The Cortex-M0 and Cortex-M1 are optimized for low-power IoT endpoints. But getting the most from them requires working around compiler inefficiencies during coding and compilation. Techniques like manual loop unrolling, memory layout optimizations, and using tight assembly can help boost performance, reduce code size, and lower power consumption. With attention to details like this, developers can squeeze every last drop of efficiency from these microcontroller workhorses.