Efficient Code Generation with GNU-ARM for Cortex-M0/M1

The Cortex-M0 and Cortex-M1 are two of ARM’s most widely used cores for microcontroller and FPGA applications. With their low power consumption, small silicon footprint, and compact ARMv6-M instruction set (16-bit Thumb plus a handful of 32-bit instructions such as BL), they are well suited to cost-sensitive and power-constrained embedded systems. However, writing efficient code for these cores requires an understanding of their architecture and the toolchain used to build software for them. This article provides guidance on using the GNU ARM Embedded Toolchain, specifically the GNU compiler (gcc) and assembler (gas), to generate optimized code for Cortex-M0/M1.

Contents

Code Size Optimization
Compiler Optimizations
Assembly Coding Techniques
Linker Optimization
Startup Code Optimization
Function Inlining
Loop Optimizations
Memory Optimizations
Debugging Optimized Code
Conclusion

Code Size Optimization

One of the primary goals of optimization for microcontroller applications is reducing code size. Microcontrollers built around the Cortex-M0/M1 typically have only 32KB to 256KB of flash memory, so it’s important to minimize the size of the compiled program. Here are some techniques to reduce code size with gcc/gas:

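Use the -Os compiler flag to optimize for size rather than speed. This enables the -O2 optimizations except those that tend to increase code size.
Select the core with -mcpu=cortex-m0 (or -mcpu=cortex-m1) together with -mthumb. These cores execute only Thumb instructions, which have higher code density than the classic 32-bit ARM instruction set.
Enable link-time optimization (LTO) with -flto to allow optimizations across translation units.
Use the narrowest suitable fixed-width types (uint8_t, uint16_t) for data stored in arrays and structures. Note that int and long are both 32 bits on ARM, so choosing between them saves nothing; avoid 64-bit long long arithmetic, which expands into multi-instruction sequences or library calls.
Declare functions and variables static where possible so the compiler can inline or discard them.
Build with -ffunction-sections and -fdata-sections, and link with --gc-sections, to allow unused code removal at link time.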
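To make the data-type and static advice concrete, here is a minimal sketch. The record layout and the names sample_wide, sample_narrow, and log_bytes are hypothetical, chosen only for illustration; the sizes in the comments assume a 32-bit int, as on ARM.

```c
#include <stdint.h>

/* Hypothetical sensor record using default-width types: 16 bytes. */
struct sample_wide {
    unsigned int timestamp;   /* 32 bits on ARM */
    unsigned int channel;     /* only needs values 0..15 */
    unsigned int flags;       /* only needs 8 bits */
    unsigned int value;       /* 12-bit ADC reading */
};

/* Same information with the narrowest suitable types: 8 bytes.
 * Members are ordered largest first so no padding is inserted. */
struct sample_narrow {
    uint32_t timestamp;
    uint16_t value;
    uint8_t  channel;
    uint8_t  flags;
};

/* static: visible only in this file, so the compiler is free to
 * inline it and drop the standalone copy. */
static uint32_t log_bytes(uint32_t n_records)
{
    return n_records * (uint32_t)sizeof(struct sample_narrow);
}
```

A typical size-oriented build of such a file might look like arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os -ffunction-sections -fdata-sections, with -Wl,--gc-sections added at link time. Narrow types pay off for stored data; for local variables and loop counters a plain 32-bit type is often cheaper, because narrower values may need extra extension instructions.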

Compiler Optimizations

The gcc compiler offers many flags to enable different optimizations that can improve performance and reduce size. Some key options to consider for Cortex-M0/M1 include:

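-fno-exceptions – Disables C++ exception handling, removing unwind tables and handler code (C++ only).
-fno-rtti – Disables C++ runtime type information (C++ only).
-ffreestanding – Tells the compiler to assume a freestanding environment, in which the standard library may be absent and main() is not necessarily the entry point.
-fno-toplevel-reorder – Keeps top-level functions, variables, and asm statements in source order. This helps only code that depends on that order; it also prevents removal of unreferenced static variables, so it is usually best left off.
-fno-strict-aliasing – Disables type-based alias analysis. This makes code that type-puns through pointer casts safe, at the cost of some optimization opportunities.
-fno-common – Places uninitialized globals in .bss rather than emitting them as common symbols, so duplicate definitions are caught at link time. This is the default since GCC 10.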
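The aliasing option is easiest to understand through an example. The sketch below shows a type pun that stays well defined with strict aliasing left on, so -fno-strict-aliasing is not needed for it; float_bits is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Reading a float through a uint32_t pointer, as in
 *     uint32_t bits = *(uint32_t *)&f;
 * is undefined behavior under strict aliasing, which GCC enables
 * at -O2 and -Os. Copying the bytes is well defined at any level. */
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* usually compiled to a plain move */
    return bits;
}
```

Code written this way can keep type-based alias analysis enabled, which leaves the optimizer more freedom than building the whole project with -fno-strict-aliasing.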

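Profile-guided optimization (PGO) can also improve performance by optimizing hot code paths. This involves compiling with -fprofile-generate, running representative workloads to generate profiling data, and recompiling with -fprofile-use to apply optimizations. On a bare-metal target this takes more effort than on a hosted system, because the instrumented build needs runtime support to collect the profile and a way to transfer it off the device.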

Assembly Coding Techniques

While C/C++ are the predominant languages used, hand-coding certain functions in assembly can provide size and speed benefits. Some techniques to write optimal assembly code for Cortex-M0/M1 include:

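Manually assign registers and minimize spills to memory.
Minimize taken branches. The ARMv6-M instruction set has no IT instruction, so conditional execution is limited to conditional branches; short branchless sequences built from shifts and masks can replace simple if/else code.
Flatten and simplify nested loops.
Unroll small loops to reduce overhead.
Avoid division. These cores have no hardware divide instruction, so division is performed by a library routine; use shifts or multiplication by a reciprocal for constant divisors. On the Cortex-M0, MULS takes either 1 or 32 cycles depending on which multiplier the chip implements.
Prefer the low registers r0-r7, which most 16-bit Thumb instructions are limited to, and small immediates that fit the 16-bit encodings.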

In favorable cases, well-coded assembly functions can be 2-3x faster and up to 50% smaller than compiled C code, although the gain depends heavily on the routine. It takes significant experience to outperform the compiler, so apply this only to the most performance-critical code sections.
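Many of these ideas can be tried in C before committing to assembly. As one illustration, the sketch below replaces a division by 10 with a multiply and a shift. The helper name div10_u16 is hypothetical, and the constant is valid only for 16-bit inputs.

```c
#include <stdint.h>

/* Divide a 16-bit value by 10 without a divide instruction or a
 * library call. 0xCCCD / 2^19 is slightly above 1/10; the rounding
 * error is small enough that the result is exact for every 16-bit
 * input, and the 32-bit product cannot overflow for x <= 0xFFFF. */
static uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}
```

Whether this beats the compiler's own output depends on the multiplier fitted to the chip, so it is worth checking the generated code and measuring on the target before adopting it.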

Linker Optimization

The linker plays a critical role in optimizing final executable size and performance. Key strategies include:

Place routines and data accessed by interrupts in separate sections using gcc’s section attribute so the linker script can position them in memory.
Use --gc-sections to remove unused sections. This is most effective together with -ffunction-sections and -fdata-sections.
Use --print-gc-sections to list the sections that were removed.
Run most code in place from flash to save RAM, and copy only the most time-critical routines to RAM when flash wait states limit their speed.
Keep read-only constants like strings in flash by declaring them const, which places them in .rodata rather than .data.

Checking the map file after linking is important to verify unused code elimination and ideal placement of functions/data.
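The section placement described above can be sketched as follows. The section name .isr_text, the macro ISR_SECTION, and the function isr_lookup are all hypothetical; the linker script has to contain a matching output rule for the section. The attribute is applied only when building for ARM so that the sketch also compiles on a host machine.

```c
#include <stdint.h>

/* ".isr_text" is an assumed section name that the linker script
 * must place explicitly; it is not a standard section. */
#if defined(__arm__)
#define ISR_SECTION __attribute__((section(".isr_text"), noinline))
#else
#define ISR_SECTION
#endif

/* const data goes to .rodata, which normally stays in flash. */
static const uint8_t gamma_table[4] = { 0, 16, 64, 255 };

/* Small routine intended to be called from an interrupt handler. */
ISR_SECTION uint8_t isr_lookup(uint8_t index)
{
    return gamma_table[index & 3u];
}
```

After linking, the map file should show isr_lookup inside the chosen section and gamma_table inside .rodata.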

Startup Code Optimization

The startup code executed before main() performs early system initialization such as stack setup and data/bss initialization. Tightly optimizing this code shortens the time between reset and the start of the application. Useful techniques include:

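Set the initial stack pointer, which the core loads from the first vector table entry at reset, to the top of RAM to maximize stack space.
Copy initialized variables (.data) from flash into RAM using a word-at-a-time copy loop.
Set the uninitialized data section (.bss) to zero using a block clear loop.
Enable the flash prefetch buffer or accelerator early if the microcontroller provides one. The Cortex-M0/M1 cores themselves have no caches.
Keep the vector table compact and point unused entries at a single shared default handler.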

Optimized startup code helps reduce latency to get into main application code.
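The data copy and .bss clear steps can be sketched in portable C. In real startup code the pointers would come from linker script symbols (names such as _sidata, _sdata, _edata, _sbss, and _ebss are common but vary between projects, so treat them as assumptions); here they are plain parameters so the loops can be exercised on any machine.

```c
#include <stdint.h>

/* Copy words from src into [dst, dst_end). In startup code, src is the
 * load address of .data in flash and dst is its run address in RAM. */
static void init_data(uint32_t *dst, const uint32_t *dst_end,
                      const uint32_t *src)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Zero the words in [start, end), as done for .bss. */
static void zero_bss(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}
```

Both loops move one 32-bit word per iteration, which assumes the linker script aligns the start and end of each section to 4 bytes.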

Function Inlining

Inlining small functions by inserting the function body at the call site eliminates call/return overhead. This reduces execution time but usually increases code size. At -Os, GCC’s inliner is conservative and inlines a function mainly when doing so is not expected to grow the code. Useful inlining techniques include:

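Mark critical small routines with the inline keyword. This is only a hint that the compiler may ignore.
Use -finline-limit=n to adjust the maximum size of functions considered for inlining.
Use -finline-functions-called-once, which is enabled by default when optimizing, to inline static functions that have a single caller; -fno-inline-functions-called-once disables this.
Force inlining selectively using __attribute__((always_inline)).
Rewrite as a macro if the function body is a simple expression, taking care with arguments that have side effects.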

Inlining judiciously balances the tradeoff between performance and code size.
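A small sketch of selective inlining with GCC attributes follows. The function names swap_halves and checksum are hypothetical.

```c
#include <stdint.h>

/* Forced inline: the body is expanded at every call site, even at -Os. */
static inline __attribute__((always_inline)) uint32_t swap_halves(uint32_t x)
{
    return (x << 16) | (x >> 16);
}

/* noinline keeps a single shared copy of this larger routine. */
static __attribute__((noinline)) uint32_t checksum(const uint32_t *p,
                                                   uint32_t n)
{
    uint32_t sum = 0;
    while (n--) {
        sum += swap_halves(*p++);  /* no call overhead inside the loop */
    }
    return sum;
}
```

Comparing the output of arm-none-eabi-size with and without the attributes is a quick way to see what each choice costs in flash.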

Loop Optimizations

Loops are common causes of poor performance in embedded programs. Optimizing loop performance on Cortex-M0/M1 can provide significant speedups. Useful loop optimizations include:

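Unroll small constant-count loops using -funroll-loops where speed matters more than size.
Flatten nested loops into single loops.
Qualify pointers with restrict where they cannot alias so the compiler can keep loaded values in registers, and count down to zero so the flag-setting SUBS instruction can feed the loop branch directly.
Use -fno-unroll-loops to prevent unrolling where code size matters more.
Hoist loop-invariant indexes, pointers, and bounds checks out of the loop body.
Consider aligning hot loops to 4-byte boundaries with -falign-loops=4. Thumb instructions are always 2-byte aligned, but the core fetches instructions as 32-bit words, so word alignment may save a fetch.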

With loops often dominating execution time, optimized looping constructs can result in major performance gains for many embedded applications.
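As an illustration of loop control, the sketch below uses restrict-qualified pointers and a counter that runs down to zero. The function name scale and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[i] * gain (truncated to 16 bits) for n elements.
 * restrict promises that dst and src do not overlap; the counter runs
 * down to zero so the loop test is a comparison against zero. */
static void scale(uint16_t *restrict dst, const uint16_t *restrict src,
                  size_t n, uint16_t gain)
{
    while (n != 0) {
        /* widen to 32 bits first to avoid signed overflow in the product */
        *dst++ = (uint16_t)((uint32_t)(*src++) * gain);
        n--;
    }
}
```

The restrict qualifiers are a promise made by the caller; passing overlapping buffers to this function would be undefined behavior.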

Memory Optimizations

Optimizing usage of limited Cortex-M0/M1 memory resources can enable larger and better-performing programs:

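Keep frequently accessed values in local variables so the compiler can hold them in registers. The register keyword is only a hint, and GCC largely ignores it when optimizing.
Declare large lookup tables static const to place them in flash rather than RAM.
Keep all accesses naturally aligned. ARMv6-M does not support unaligned accesses (they cause a fault), and GCC already defaults to -mno-unaligned-access for these cores.
Split code and data into separate memory regions.
Allocate buffers and queues statically where possible to reduce heap fragmentation.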

Careful use of limited RAM and flash memory through selective allocation into regions helps avoid running out of memory at run time; these cores have no virtual memory to fall back on.
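Static allocation of a queue might look like the following sketch. All names (QUEUE_SIZE, queue_put, queue_get) are hypothetical, and the code ignores concurrency between interrupt handlers and the main loop, which a real driver would have to address.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8u  /* power of two: wrap-around is a mask, not a division */

/* Statically allocated: placed in .bss, no heap involved. */
static uint8_t  queue_buf[QUEUE_SIZE];
static uint32_t queue_head;  /* total bytes written */
static uint32_t queue_tail;  /* total bytes read */

static bool queue_put(uint8_t byte)
{
    if (queue_head - queue_tail == QUEUE_SIZE) {
        return false;  /* full */
    }
    queue_buf[queue_head++ & (QUEUE_SIZE - 1u)] = byte;
    return true;
}

static bool queue_get(uint8_t *out)
{
    if (queue_head == queue_tail) {
        return false;  /* empty */
    }
    *out = queue_buf[queue_tail++ & (QUEUE_SIZE - 1u)];
    return true;
}
```

Because the buffer size is fixed at build time, its RAM cost shows up in the map file instead of surfacing as an allocation failure at run time.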

Debugging Optimized Code

Heavy compiler optimizations can complicate debugging by reordering code segments. Useful techniques to debug optimized code include:

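Build debug versions with -Og, which applies only optimizations that do not interfere with debugging.
Step through code at assembly level using debugger views.
Enable debug symbols (-g) everywhere and lower the optimization level only in the modules being debugged.
Use debugger data breakpoints (watchpoints) to monitor variable values.
View interleaved source and disassembly, for example with objdump -S, to relate optimized instructions back to the original C source.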

With careful use of debug project configurations and assembly-level single-stepping, even highly optimized code can be debugged effectively.
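Disabling optimization can also be done at a finer grain than a whole module. The sketch below uses GCC's optimize function attribute, which the GCC documentation recommends for debugging only. The macro DEBUG_NO_OPT and the function average are hypothetical names, and the macro expands to nothing on other compilers.

```c
#include <stdint.h>

/* GCC-specific: compile just this function at -O0 so locals stay
 * visible in the debugger and stepping follows the source order. */
#if defined(__GNUC__) && !defined(__clang__)
#define DEBUG_NO_OPT __attribute__((optimize("O0")))
#else
#define DEBUG_NO_OPT
#endif

DEBUG_NO_OPT uint32_t average(const uint16_t *samples, uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    return (n != 0u) ? sum / n : 0u;
}
```

Once the problem is found, the macro should be removed again so the function is built with the same options as the rest of the project.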

Conclusion

The Cortex-M0 and M1 offer compelling performance and efficiency for microcontroller applications. However, developers must leverage the full capabilities of the GNU ARM toolchain to generate truly optimized code for these devices. Following the techniques outlined in this article, significant improvements in code size, speed, memory usage, and power efficiency can be achieved. Striking the right balance between optimizations and debuggability relies on thorough understanding of the hardware architecture as well as creative use of the compiler, assembler, and linker. Mastering these GNU ARM tools helps developers craft elegant embedded software solutions that maximize the potential of the Cortex-M0/M1 cores.
