The Cortex-M0 and Cortex-M1 are two of ARM’s smallest cores: the M0 targets low-cost microcontrollers and the M1 targets FPGA implementations. Both implement the ARMv6-M architecture, whose instruction set is 16-bit Thumb plus a handful of 32-bit Thumb-2 instructions (such as BL, MSR, and MRS). With low power consumption and a small silicon footprint, they suit cost-sensitive and power-constrained embedded systems. However, writing efficient code for these cores requires an understanding of their architecture and of the toolchain used to build software for them. This article provides guidance on using the GNU ARM Embedded Toolchain, specifically the GNU compiler (gcc) and assembler (gas), to generate optimized code for Cortex-M0/M1.
Code Size Optimization
One of the primary goals of optimization for microcontroller applications is reducing code size. The cores themselves contain no flash; devices built around the Cortex-M0/M1 typically offer on the order of 32KB to 256KB, and some parts have less, so it’s important to minimize the size of the compiled program. Here are some techniques to reduce code size with gcc/gas:
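- Use the -Os compiler flag to optimize for size rather than speed. It enables the -O2 optimizations except those that tend to increase code size.
- Select the target with -mcpu=cortex-m0 (or -mcpu=cortex-m1) together with -mthumb. These cores execute only Thumb code, so there is no ARM-mode alternative to fall back on.
- Enable link-time optimization (LTO) with -flto to allow optimizations across translation units.
- Choose data types deliberately. On these cores int and long are both 32 bits, so swapping one for the other saves nothing. Narrow types such as uint8_t or uint16_t save space in arrays and structures, while 32-bit types usually give the smallest code for local variables and loop counters because no extension instructions are needed.
- Declare functions and variables static where possible to limit their scope, which lets the compiler inline or discard them.
- Build with -ffunction-sections and -fdata-sections, and link with --gc-sections, to allow unused code and data to be removed at link time.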
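Two of the points above, compact stored data and file-local helpers, can be sketched as follows. This is a minimal illustration rather than a definitive pattern: sensor_sample_t, checksum_step, and samples_checksum are hypothetical names, and the build line in the comment assumes the arm-none-eabi-gcc driver and a hypothetical file name.

```c
/* Assumed build line (hypothetical file name):
 *   arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -flto \
 *       -ffunction-sections -fdata-sections -c samples.c
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow fields keep stored data small: 4 bytes per sample instead of 12. */
typedef struct {
    uint16_t raw;      /* ADC reading */
    uint8_t  channel;
    uint8_t  flags;
} sensor_sample_t;

/* Compile-time check that the layout is as compact as intended. */
static_assert(sizeof(sensor_sample_t) == 4, "sensor_sample_t should fit in one word");

/* static: visible only in this file, so the compiler may inline it
 * and drop the out-of-line copy. */
static uint32_t checksum_step(uint32_t acc, const sensor_sample_t *s)
{
    return acc + s->raw + s->channel + s->flags;
}

/* Locals and counters stay word-sized to avoid extra extension instructions. */
uint32_t samples_checksum(const sensor_sample_t *buf, size_t count)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < count; i++) {
        acc = checksum_step(acc, &buf[i]);
    }
    return acc;
}
```

Comparing the size output of arm-none-eabi-size before and after such a change is the simplest way to confirm that it actually helped.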
Compiler Optimizations
The gcc compiler offers many flags to enable different optimizations that can improve performance and reduce size. Some key options to consider for Cortex-M0/M1 include:
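- -fno-exceptions – Disables C++ exception handling support to reduce code size.
- -fno-rtti – Disables C++ runtime type information.
- -ffreestanding – Compiles for a freestanding environment, in which the standard library may be absent and the program does not necessarily start at main().
- -fno-toplevel-reorder – Keeps top-level functions, variables, and asm statements in source order. This is mainly useful when code depends on that ordering; it also prevents removal of unreferenced static variables, so it can increase size.
- -fno-strict-aliasing – Disables type-based alias analysis. This makes code that casts between incompatible pointer types safe, at the cost of some optimization opportunities; it does not make code faster.
- -fno-common – Places uninitialized globals in .bss rather than emitting them as common symbols, so duplicate definitions are reported at link time. This is the default since GCC 10.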
Profile-guided optimization (PGO) can also improve performance by optimizing hot code paths. This involves compiling with -fprofile-generate, running representative workloads to generate profiling data, and recompiling with -fprofile-use to apply the optimizations. On a bare-metal target the instrumented build needs some way to store and retrieve the profile data, which usually requires extra runtime support.
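The -fno-strict-aliasing option is usually needed only when code reinterprets memory through incompatible pointer casts. The sketch below shows one way to avoid such casts so that strict aliasing can stay enabled. float_to_bits is a hypothetical helper, and the example assumes float is a 32-bit IEEE 754 type, as it is on Cortex-M targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Undefined under strict aliasing:  return *(uint32_t *)&value;
 * A memcpy of a small fixed size is well defined, and GCC normally
 * expands it inline instead of calling the library routine. */
uint32_t float_to_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```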
Assembly Coding Techniques
While C/C++ are the predominant languages used, hand-coding certain functions in assembly can provide size and speed benefits. Some techniques to write optimal assembly code for Cortex-M0/M1 include:
- Manually assign registers and minimize spills to memory.
- Keep branches short and infrequent. ARMv6-M has no IT instruction, so the only conditionally executed instruction is the conditional branch.
- Flatten and simplify nested loops.
- Unroll small loops to reduce overhead.
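- Prefer shifts, adds, and masks to division. These cores have no hardware divide instruction, so every division becomes a library call, and MULS can take anywhere from one cycle to more than thirty depending on the multiplier the implementation includes.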
- Assign variables to registers instead of memory where possible.
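- Keep hot values in the low registers r0-r7. There is no ARM instruction set on these cores, and most 16-bit Thumb instructions can address only the low registers, so using high registers costs extra moves.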
In favorable cases, well-coded assembly functions can run 2-3x faster and be up to 50% smaller than compiled C code, but such gains are not typical and should be confirmed by measurement. Out-performing the compiler requires significant experience, so apply hand-written assembly only to the most performance-critical code sections.
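Because these cores have no hardware divide, one concrete case worth handling by hand is replacing division and modulo by a power of two with shifts and masks. The sketch below is written in C rather than assembly so the intent stays visible; RING_SIZE, ring_next, and average8 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u  /* hypothetical buffer size; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0u, "RING_SIZE must be a power of two");

/* Advance a ring-buffer index. The mask replaces (index + 1) % RING_SIZE. */
uint32_t ring_next(uint32_t index)
{
    return (index + 1u) & (RING_SIZE - 1u);
}

/* Average of eight unsigned samples: a right shift by 3 replaces / 8. */
uint32_t average8(const uint16_t samples[8])
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < 8u; i++) {
        sum += samples[i];
    }
    return sum >> 3;
}
```

GCC already makes this substitution when the operands are unsigned and the divisor is a compile-time constant power of two. Writing it explicitly matters most when the size would otherwise be a run-time value.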
Linker Optimization
The linker plays a critical role in optimizing final executable size and performance. Key strategies include:
- Place routines and data accessed by interrupts in separate sections using gcc attributes to allow better placement in memory.
- Use --gc-sections to remove unused code and data sections (passed through gcc as -Wl,--gc-sections).
- Use --print-gc-sections to list the sections that were removed.
- Leave most code executing in place from flash. Copy a routine to RAM only when flash wait states make it measurably slower, since RAM is usually the scarcer resource.
- Declare constants such as strings and lookup tables const so they are placed in a read-only section in flash, separate from code and from RAM data.
Checking the map file after linking is important to verify unused code elimination and ideal placement of functions/data.
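Placing interrupt-related code in its own section can be sketched with GCC's section attribute. The section name .isr_hot, the ISR_HOT macro, and the handler below are hypothetical, and the linker script must contain a matching rule for the placement to have any effect. The attribute is applied only when compiling for ARM so that the file still builds on a host.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arm__)
/* Hypothetical section name; needs a matching rule in the linker script. */
#define ISR_HOT __attribute__((section(".isr_hot")))
#else
#define ISR_HOT /* no special placement on the host */
#endif

/* Shared with the interrupt handler, hence volatile. */
static volatile uint32_t tick_count;

/* Hypothetical timer interrupt handler, grouped with other hot code. */
ISR_HOT void timer_tick_handler(void)
{
    tick_count++;
}

uint32_t ticks_elapsed(void)
{
    return tick_count;
}
```

After linking, the map file should show timer_tick_handler inside the region assigned to .isr_hot.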
Startup Code Optimization
The startup code executed before main() performs early system initialization such as stack setup and data/bss initialization. Keeping this code tight shortens the time from reset to the application. Useful techniques include:
- Set the initial stack pointer (the first vector table entry) to the top of RAM so the stack has the most room to grow downward.
- Copy initialized variables (.data) from flash to RAM a word at a time.
- Set the uninitialized data section (.bss) to zero using a word-wise clear loop.
- Enable any flash prefetch buffer or accelerator the device provides as early as possible. The Cortex-M0/M1 cores themselves have no caches.
- Point unused vector table entries at a single shared default handler to save space.
Optimized startup code helps reduce latency to get into main application code.
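The .data copy and .bss clear steps can be sketched in C as word-wise loops. The helper names and the linker symbols mentioned in the comment (_sidata, _sdata, _edata, _sbss, _ebss) are assumptions that follow a common naming convention; the actual names depend on the linker script in use.

```c
#include <assert.h>
#include <stdint.h>

/* Copy words from src into [dst, dst_end). Used for .data: flash -> RAM. */
void startup_copy_words(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Zero the words in [dst, dst_end). Used for .bss. */
void startup_zero_words(uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end) {
        *dst++ = 0u;
    }
}

/* In a reset handler these would typically be called as follows
 * (symbol names assumed, defined by the linker script):
 *
 *   extern uint32_t _sidata, _sdata, _edata, _sbss, _ebss;
 *   startup_copy_words(&_sdata, &_edata, &_sidata);
 *   startup_zero_words(&_sbss, &_ebss);
 *   main();
 */
```

Word-wise loops assume the linker script aligns the start and end of both sections to 4 bytes.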
Function Inlining
Inlining small functions by inserting the function body at the call site eliminates call/return overhead. This reduces execution time but usually increases code size. At -Os, GCC’s inliner is conservative and inlines mainly when doing so is not expected to grow the code. Useful inlining techniques include:
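- Mark small, critical routines static inline. The inline keyword is only a hint, so the compiler may still decline.
- Use -finline-limit to adjust the maximum size of functions considered for inlining.
- Keep -finline-functions-called-once enabled (the default when optimizing) so static functions with a single caller are inlined. The -fno-inline-functions-called-once form turns this off.
- Force inlining of selected functions with __attribute__((always_inline)).
- Rewrite as a macro if the function body is a simple expression, bearing in mind that macro arguments may be evaluated more than once.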
Inlining judiciously balances the tradeoff between performance and code size.
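The attribute-based controls can be sketched as follows. The function names are hypothetical, and always_inline and noinline are GCC/Clang extensions rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Small and hot: force inlining even at -Os. bit must be below 32. */
static inline __attribute__((always_inline)) uint32_t set_bit(uint32_t word, uint32_t bit)
{
    return word | (1u << bit);
}

/* Larger or rarely used: keep a single out-of-line copy to save flash. */
__attribute__((noinline)) uint32_t count_set_bits(uint32_t word)
{
    uint32_t count = 0;
    while (word != 0u) {
        word &= word - 1u;  /* clear the lowest set bit */
        count++;
    }
    return count;
}

uint32_t flags_example(void)
{
    uint32_t flags = 0;
    flags = set_bit(flags, 0);
    flags = set_bit(flags, 5);
    return count_set_bits(flags);
}
```

Disassembling flags_example should show no call to set_bit and one call to count_set_bits.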
Loop Optimizations
Loops are common causes of poor performance in embedded programs. Optimizing loop performance on Cortex-M0/M1 can provide significant speedups. Useful loop optimizations include:
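- Unroll small constant loops using -funroll-loops when speed matters more than size, since unrolling increases code size.
- Flatten nested loops into single loops.
- Qualify pointer parameters with restrict where the buffers cannot overlap, so the compiler can keep values in registers across iterations.
- Use -fno-unroll-loops to prevent unrolling where code size matters more.
- Hoist index calculations, pointer loads, and bounds checks out of the loop body.
- Align hot loop entry points to 4-byte boundaries. Thumb instructions are always 2-byte aligned, but instruction fetches are typically 32 bits wide, so a word-aligned loop start can save a fetch.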
With loops often dominating execution time, optimized looping constructs can result in major performance gains for many embedded applications.
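Hoisting invariants and simplifying loop control can be sketched as below. scale_cfg_t and scale_into are hypothetical names; restrict is standard C99 and promises the compiler that the two buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t gain;
    uint32_t offset;
} scale_cfg_t;

/* dst[i] = src[i] * gain + offset for count elements. */
void scale_into(uint32_t *restrict dst, const uint32_t *restrict src,
                size_t count, const scale_cfg_t *cfg)
{
    /* Hoist loop-invariant loads out of the loop body. */
    const uint32_t gain = cfg->gain;
    const uint32_t offset = cfg->offset;

    /* Count down to zero so the compiler can branch on the flags set by
     * the decrement instead of comparing against a separate limit. */
    for (; count != 0u; count--) {
        *dst++ = (*src++ * gain) + offset;
    }
}
```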
Memory Optimizations
Optimizing usage of limited Cortex-M0/M1 memory resources can enable larger and better-performing programs:
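- Let the compiler allocate registers. GCC largely ignores the register keyword when optimizing, so keeping hot variables local and few in number is more effective.
- Declare large lookup tables static const so they are placed in flash rather than RAM.
- Keep data naturally aligned. ARMv6-M faults on unaligned word and halfword accesses, and GCC already defaults to -mno-unaligned-access for these cores, so the flag rarely needs to be given explicitly.
- Split code and data into separate memory regions.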
- Allocate buffers and queues statically where possible to reduce heap fragmentation.
Careful use of limited RAM and flash memory through selective allocation into regions helps avoid out-of-memory errors and stack or heap overflows at run time. These cores have no MMU, so there is no virtual memory to fall back on.
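Static allocation can be sketched with a fixed-size block pool that stands in for malloc when all buffers are the same size, together with a static const table that stays in flash. All names and sizes here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* static const: placed in read-only memory (flash) and costs no RAM. */
static const uint8_t bit_reverse_nibble[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse the bit order of a byte using two table lookups. */
uint8_t reverse_byte(uint8_t value)
{
    return (uint8_t)((bit_reverse_nibble[value & 0x0Fu] << 4) |
                     bit_reverse_nibble[value >> 4]);
}

#define POOL_BLOCKS     4u   /* hypothetical sizes */
#define POOL_BLOCK_SIZE 32u

/* All storage is reserved at link time, so there is no heap to fragment. */
static uint32_t pool_storage[POOL_BLOCKS][POOL_BLOCK_SIZE / 4u];  /* word aligned */
static uint8_t  pool_in_use[POOL_BLOCKS];

/* Return the first free block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    for (uint32_t i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_in_use[i]) {
            pool_in_use[i] = 1u;
            return pool_storage[i];
        }
    }
    return NULL;
}

/* Release a block previously returned by pool_alloc. */
void pool_free(void *block)
{
    for (uint32_t i = 0; i < POOL_BLOCKS; i++) {
        if (block == (void *)pool_storage[i]) {
            assert(pool_in_use[i]);  /* catch double frees in debug builds */
            pool_in_use[i] = 0u;
            return;
        }
    }
}
```

The pool is not interrupt-safe as written; sharing it with an interrupt handler would need the accesses to be protected.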
Debugging Optimized Code
Heavy compiler optimizations can complicate debugging by reordering and merging code and by keeping variables only in registers. Useful techniques to debug optimized code include:
- Build debug versions with -Og, which keeps only the optimizations that do not interfere with debugging.
- Step through code at the assembly level using the debugger's disassembly view.
- Enable debug symbols (-g) and lower the optimization level only in the modules being debugged.
- Use debugger data breakpoints (watchpoints) to monitor variable values.
- View the disassembly interleaved with the original C source, for example with objdump -S.
With careful use of debug project configurations and assembly-level single-stepping, even highly optimized code can be debugged effectively.
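Lowering optimization for only the code under investigation can also be done per function. The sketch below uses GCC's optimize attribute, which GCC documents as intended for debugging rather than production use. The DEBUG_O0 macro and the function name are hypothetical, and the attribute is skipped on compilers that do not support it.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: compile the marked function as if -O0 were given. */
#define DEBUG_O0 __attribute__((optimize("O0")))
#else
#define DEBUG_O0
#endif

/* Function under investigation: locals stay visible and single-stepping
 * follows source order, while the rest of the file keeps its usual level. */
DEBUG_O0 uint32_t checksum16(const uint8_t *data, uint32_t length)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < length; i++) {
        sum += data[i];
    }
    return sum & 0xFFFFu;
}
```

Removing the macro once the bug is found restores the normal optimization level without touching the build system.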
Conclusion
The Cortex-M0 and M1 offer compelling efficiency for microcontroller applications, but developers must use the full capabilities of the GNU ARM toolchain to generate well-optimized code for them. The techniques outlined in this article can yield significant improvements in code size, speed, memory usage, and power efficiency. Striking the right balance between optimization and debuggability relies on a thorough understanding of the hardware architecture as well as careful use of the compiler, assembler, and linker, with every change confirmed by measurement on the target.