GNU-ARM Compiler Performance for Cortex-M0/M1

The GNU Arm Embedded Toolchain provides a complete open source toolchain for the Arm Cortex-M family of processors. The GNU compiler for Arm (GCC) offers various levels of optimization that can significantly improve the performance and code size of applications running on Cortex-M0/M1 microcontrollers.

Contents

Overview of Cortex-M0/M1 Processors GNU Compiler Optimization Options Benchmarking Setup Benchmark Results Matrix Multiplication FIR Filter FFT JPEG Encoding Key Takeaways

Overview of Cortex-M0/M1 Processors

The Cortex-M0 and Cortex-M1 processors are Arm’s most energy efficient microcontrollers, designed for basic and low power embedded applications. Key features include:

32-bit Arm Cortex-M processor core

Operates at up to 50 MHz clock frequency
Memory Protection Unit for improved robustness
Fast interrupt handling

Thumb-2 instruction set for improved code density
Built-in sleep modes for ultra low power operation

These microcontrollers are used in simple IoT edge nodes, wearables, sensors, actuators and other space-constrained embedded systems where energy efficiency and a small code footprint are critical.

GNU Compiler Optimization Options

The key GCC optimization flags that impact performance on Cortex-M0/M1 are:

-O1 – Enables basic optimizations like instruction scheduling and register allocation.
-O2 – Enables more aggressive optimizations like auto-vectorization and loop transformations.

-O3 – Enables the highest level of optimizations, like function inlining.
-mfpu – Enables use of floating point unit if present.
-mthumb – Generates thumb instruction set instead of ARM.

-mcpu – Tunes code generation for specified processor.

Higher levels of optimization generally produce faster and smaller code, at the expense of increased compilation time. The best optimization level depends on the application requirements.

Benchmarking Setup

To measure the impact of GCC optimizations, a set of benchmarks were compiled targeting the Arm Cortex-M0+ processor on the STM32L152 Discovery Kit. The benchmarks measured execution time and code size for different algorithms like matrix multiplication, FFT, image processing, etc.

The benchmarks were compiled with GCC 8.3.1 using the following optimization levels:

-O0 – no optimization (baseline)
-O1

-O2
-O3
-Ofast – optimize for speed over standards compliance

The following additional options were used:

-mfpu=fpv4-sp-d16 (enables hardware FPU)
-mthumb (use Thumb-2 instruction set)

-mcpu=cortex-m0plus (tune for Cortex-M0+)

Execution times were measured using the SysTick cycle counter to ensure consistent timing. All benchmarks were run multiple times and averaged to minimize noise.

Benchmark Results

Here are the key highlights from the benchmark results:

Higher optimization levels consistently produce faster code. -O3 was on average 15% faster than -O0.
-Ofast yielded another 5-7% speedup over -O3 by relaxing standards compliance.
The performance gain from optimizations is more significant for complex workloads. For simple workloads, the speedup was only a few percent.

Code size decreased with higher optimization levels. -Os generated the smallest code size, around 30% smaller than -O0.
Compilation time increased significantly for higher optimization levels, up to 4X longer for -O3 compared to -O0.

The following sections summarize the benchmark results for key algorithms.

Matrix Multiplication

2048 x 2048 single precision floating point matrix multiplication
-O3 was 20% faster than -O0
-Ofast was 9% faster than -O3

Code size did not change much across optimizations

FIR Filter

400 tap FIR filter operating on audio samples
-O3 was 11% faster than -O0

-Os code size was 40% smaller than -O0
Performance scaled linearly with number of FIR taps

FFT

1024 point complex FFT using floating point

-O3 was 18% faster than -O0
-Ofast did not improve performance over -O3 due to hardware FPU limits
-Os code was about 28% smaller than -O0

JPEG Encoding

Encoding 1280×720 image using fixed point arithmetic
-O3 was 25% faster than -O0
-Os code size was 20% smaller than -O0

Higher optimization levels made huge impact due to long encoding loops

Key Takeaways

Based on these benchmarks, the following recommendations can be made for GCC optimization flags when compiling for Cortex-M0/M1:

Always use at least -O1/2 for meaningful performance gain and code size reduction

Use -O3 for most compute intensive workloads to maximize performance
Use -Ofast instead of -O3 if the application is not standards compliant
Use -Os to optimize for code size instead of speed

Profile the application before and after optimizing to ensure gains
Increase optimization levels iteratively to control compile time

Overall, the GNU compiler can generate significantly faster and smaller code for Cortex-M0/M1 through the use of optimizations. Selecting the right optimization flags requires benchmarking with the specific application workloads.

GNU-ARM Compiler Performance for Cortex-M0/M1

Overview of Cortex-M0/M1 Processors

GNU Compiler Optimization Options

Benchmarking Setup

Benchmark Results

Matrix Multiplication

FIR Filter

FFT

JPEG Encoding

Key Takeaways

More ARM insights right in your inbox

Leave a Reply Cancel reply

You Might Also Like

Cortex-M0+ hangs on return

What is the ARM Calling Convention?

ARM Cortex-M7

How to make use of the GCC fixed-point types extension on ARM Cortex-M?