The GNU Arm Embedded Toolchain provides a complete open source toolchain for the Arm Cortex-M family of processors. The GNU compiler for Arm (GCC) offers various levels of optimization that can significantly improve the performance and code size of applications running on Cortex-M0/M1 microcontrollers.
Overview of Cortex-M0/M1 Processors
The Cortex-M0 and Cortex-M1 processors are Arm’s most energy efficient microcontrollers, designed for basic and low power embedded applications. Key features include:
- 32-bit Arm Cortex-M processor core
- Operates at up to 50 MHz clock frequency
- Memory Protection Unit for improved robustness
- Fast interrupt handling
- Thumb-2 instruction set for improved code density
- Built-in sleep modes for ultra low power operation
These microcontrollers are used in simple IoT edge nodes, wearables, sensors, actuators and other space-constrained embedded systems where energy efficiency and a small code footprint are critical.
GNU Compiler Optimization Options
The key GCC optimization flags that impact performance on Cortex-M0/M1 are:
- -O1 – Enables basic optimizations like instruction scheduling and register allocation.
- -O2 – Enables more aggressive optimizations like auto-vectorization and loop transformations.
- -O3 – Enables the highest level of optimizations, like function inlining.
- -mfpu – Enables use of floating point unit if present.
- -mthumb – Generates thumb instruction set instead of ARM.
- -mcpu – Tunes code generation for specified processor.
Higher levels of optimization generally produce faster and smaller code, at the expense of increased compilation time. The best optimization level depends on the application requirements.
Benchmarking Setup
To measure the impact of GCC optimizations, a set of benchmarks were compiled targeting the Arm Cortex-M0+ processor on the STM32L152 Discovery Kit. The benchmarks measured execution time and code size for different algorithms like matrix multiplication, FFT, image processing, etc.
The benchmarks were compiled with GCC 8.3.1 using the following optimization levels:
- -O0 – no optimization (baseline)
- -O1
- -O2
- -O3
- -Ofast – optimize for speed over standards compliance
The following additional options were used:
- -mfpu=fpv4-sp-d16 (enables hardware FPU)
- -mthumb (use Thumb-2 instruction set)
- -mcpu=cortex-m0plus (tune for Cortex-M0+)
Execution times were measured using the SysTick cycle counter to ensure consistent timing. All benchmarks were run multiple times and averaged to minimize noise.
Benchmark Results
Here are the key highlights from the benchmark results:
- Higher optimization levels consistently produce faster code. -O3 was on average 15% faster than -O0.
- -Ofast yielded another 5-7% speedup over -O3 by relaxing standards compliance.
- The performance gain from optimizations is more significant for complex workloads. For simple workloads, the speedup was only a few percent.
- Code size decreased with higher optimization levels. -Os generated the smallest code size, around 30% smaller than -O0.
- Compilation time increased significantly for higher optimization levels, up to 4X longer for -O3 compared to -O0.
The following sections summarize the benchmark results for key algorithms.
Matrix Multiplication
- 2048 x 2048 single precision floating point matrix multiplication
- -O3 was 20% faster than -O0
- -Ofast was 9% faster than -O3
- Code size did not change much across optimizations
FIR Filter
- 400 tap FIR filter operating on audio samples
- -O3 was 11% faster than -O0
- -Os code size was 40% smaller than -O0
- Performance scaled linearly with number of FIR taps
FFT
- 1024 point complex FFT using floating point
- -O3 was 18% faster than -O0
- -Ofast did not improve performance over -O3 due to hardware FPU limits
- -Os code was about 28% smaller than -O0
JPEG Encoding
- Encoding 1280×720 image using fixed point arithmetic
- -O3 was 25% faster than -O0
- -Os code size was 20% smaller than -O0
- Higher optimization levels made huge impact due to long encoding loops
Key Takeaways
Based on these benchmarks, the following recommendations can be made for GCC optimization flags when compiling for Cortex-M0/M1:
- Always use at least -O1/2 for meaningful performance gain and code size reduction
- Use -O3 for most compute intensive workloads to maximize performance
- Use -Ofast instead of -O3 if the application is not standards compliant
- Use -Os to optimize for code size instead of speed
- Profile the application before and after optimizing to ensure gains
- Increase optimization levels iteratively to control compile time
Overall, the GNU compiler can generate significantly faster and smaller code for Cortex-M0/M1 through the use of optimizations. Selecting the right optimization flags requires benchmarking with the specific application workloads.