
Workarounds for GNU-ARM Compiler Inefficiencies on Cortex-M0/M1

Andrew Irwin
Last updated: September 17, 2023 2:20 am

The GNU ARM compiler (arm-none-eabi-gcc) is a widely used toolchain for compiling code for ARM Cortex-M0 and Cortex-M1 microcontrollers. However, the compiler is not always able to generate optimal code for these resource-constrained chips. This can lead to inefficient code that takes up more flash memory, runs slower, and consumes extra power. Fortunately, with some workarounds during coding and compilation, many of these inefficiencies can be avoided.

Contents
  • Loop Unrolling
  • Function Inlining
  • Optimized Library Functions
  • Efficient Data Access
  • Memory Regions for Switch Statements
  • Stack Usage
  • Initialization of Large Static Arrays
  • Alignment of Buffers
  • Use of Built-in Data-Manipulation Instructions
  • Tight Assembly Loops
  • Reduce Code Size with Thumb-1
  • Compiler Optimizations
  • Summary

Loop Unrolling

One common issue is that the compiler does not always unroll small loops. Unrolling a loop reduces the number of branch instructions executed, avoiding pipeline stalls, which can significantly speed up hot loops. GCC does not apply this optimization reliably at the default optimization levels. From GCC 8 onward, an unroll pragma placed directly before the loop forces it:

#pragma GCC unroll 16
for(i = 0; i < 16; i++) {
   ...
}

Be careful not to unroll loops too much, as it can bloat code size. Test different unroll amounts to find the best performance vs. size trade-off.
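On older toolchains without the pragma, loops can be unrolled by hand. A minimal sketch, assuming the buffer length is a multiple of four; the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum a buffer whose length is a multiple of 4, unrolled by 4
 * so only one branch executes per four elements. */
uint32_t sum_unrolled(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i += 4) {
        acc += buf[i];
        acc += buf[i + 1];
        acc += buf[i + 2];
        acc += buf[i + 3];
    }
    return acc;
}
```

A residual loop would be needed for lengths that are not a multiple of the unroll factor.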

Function Inlining

Inlining small functions removes function call overhead. But the compiler heuristics are sometimes too conservative, missing inlining opportunities. For important functions, use attributes to force inlining:

static inline __attribute__((always_inline)) void myFunction(void) {
   ...
}

This inlines the function even when the compiler deems it too large. Again, balance the benefits against potential code size increase.
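The payoff is largest for tiny functions called from hot paths, such as flag tests and register accessors. A minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Hypothetical status-flag helper: always_inline removes the
 * call/return overhead for this one-line test. */
static inline __attribute__((always_inline))
int flag_is_set(uint32_t reg, uint32_t mask)
{
    return (reg & mask) != 0;
}
```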

Optimized Library Functions

The smallest C libraries trade speed for size: newlib-nano (selected with --specs=nano.specs) favors a compact, typically byte-oriented memcpy, while full newlib ships an ARM-tuned version that copies words using load/store-multiple (LDM/STM) instructions. If bulk copies are hot, link the full library, or supply a word-oriented copy on the critical paths:

#include <string.h>
...
memcpy(dst, src, n); /* resolved by the C library's tuned version */

The word-oriented implementations accelerate bulk copies significantly.
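When alignment is guaranteed, a hand-rolled word copy captures most of the benefit. A minimal sketch, assuming both pointers are 4-byte aligned and the length is a multiple of four (an unaligned word access hard-faults on these cores); the function name is illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Word-at-a-time copy. Caller must guarantee 4-byte alignment
 * and a byte count that is a multiple of 4. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++)
        dst[i] = src[i];  /* one LDR + one STR per 4 bytes */
}
```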

Efficient Data Access

Cortex-M0/M1 support byte, halfword, and word loads and stores, but no unaligned accesses. Match the access type to the data so the compiler can emit a single LDRB/LDRH instead of a wider load followed by shifting and masking:

uint8_t array[128];
...
uint8_t b = array[i]; // Not uint32_t

Writing single bytes or halfwords with STRB/STRH likewise avoids a read-modify-write of the surrounding word. Accessing data at its natural size is efficient on these cores.
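The difference can be seen by comparing the two access styles; both yield the same value, but the direct byte access compiles to a single instruction. A minimal sketch (little-endian layout assumed, as on Cortex-M defaults):

```c
#include <stdint.h>

/* Direct byte access: one LDRB on Cortex-M. */
uint8_t byte_direct(const uint8_t *p, int i)
{
    return p[i];
}

/* Extracting the same byte from a word load needs a shift and mask. */
uint8_t byte_from_word(uint32_t w, int i)
{
    return (uint8_t)(w >> (8 * i));  /* little-endian byte i */
}
```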

Memory Regions for Switch Statements

Large, dense switch statements compile to jump tables. On Thumb-1, GCC calls compact helpers from libgcc (__gnu_thumb1_case_uqi and relatives) that index a table of byte or halfword offsets instead of full 32-bit addresses, but only when the case values are contiguous. Keep case ranges dense, and place large constant lookup tables in flash by declaring them const (optionally in a named section):

__attribute__((section(".rodata.jumptable")))
static const uint32_t values[] = { 0x13, 0x37, ... };

switch(x) {   /* x in a small contiguous range: table-friendly */
   ...
}

This keeps the table compact with 8- or 16-bit offsets. Sparse case values instead force a chain of compares or a table of 32-bit pointers, increasing size.
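The same size-for-speed reasoning applies when case values are sparse: a small const table indexed with a bounds check often beats the compare chain the compiler would otherwise emit. A minimal sketch with hypothetical table contents:

```c
#include <stdint.h>

/* Sparse switch rewritten as a dense const lookup table in flash:
 * one bounds check plus one indexed load replaces a compare chain.
 * Table values are illustrative. */
static const uint8_t prio_map[8] = { 0, 3, 3, 1, 1, 1, 2, 2 };

uint8_t irq_priority(uint8_t irq)
{
    return (irq < 8) ? prio_map[irq] : 0;  /* 0 = default priority */
}
```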

Stack Usage

Cortex-M0/M1 systems typically have only a few KB of RAM, so stacks are correspondingly small. Avoid recursive functions and large stack variables; where possible, allocate larger data statically or on the heap instead. Stack overflows are difficult to debug on Cortex-M, since many of these parts have no MPU to guard the stack boundary.

// Avoid: 1 KB local lands on the stack
void process(void) {
   uint8_t stackArray[1024];
   ...
}

// Prefer: fixed allocation, accounted for at link time
static uint8_t staticArray[1024];

Use stack checking and watermark features offered by various RTOS and debug tools. This helps catch issues early.
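The watermark technique can be sketched portably: fill the stack region with a known pattern at boot, then later count the surviving pattern bytes to find the worst-case depth. The region parameter below is a stand-in for the real stack area defined by the linker script:

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_WATERMARK 0xAAu

/* Count leading pattern bytes still intact. Because the stack grows
 * downward, untouched watermark bytes sit at the low end of the
 * region; (size - unused) is the high-water mark. */
size_t stack_unused_bytes(const uint8_t *region, size_t size)
{
    size_t n = 0;
    while (n < size && region[n] == STACK_WATERMARK)
        n++;
    return n;
}
```

At boot, memset the region to STACK_WATERMARK before the scheduler or main loop starts; query it periodically or at shutdown.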

Initialization of Large Static Arrays

Large static arrays should be initialized like:

static uint32_t values[] = {0x13, 0x37, ... };

The compiler stores the initial values compactly in flash, and the startup code copies them to RAM before main() runs. If the values never change, declare the array const so it stays in flash and needs no RAM copy at all. Avoid initializing in code:

static uint32_t values[1024];

void init() {
  for(int i = 0; i < 1024; i++) {
     values[i] = i; 
  }
}

The loop version runs at startup, costing time and energy, and the generated code is often larger than the packed initializer data it replaces.
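For tables that never change, a const declaration keeps the data in .rodata (flash) with no startup copy at all. A minimal sketch with illustrative values:

```c
#include <stdint.h>

/* const moves the table into .rodata: zero RAM used,
 * zero startup copy. Values are illustrative. */
static const uint32_t crc_seed[4] = { 0x13, 0x37, 0xBE, 0xEF };

uint32_t seed_at(int i)
{
    return crc_seed[i];  /* read directly from flash */
}
```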

Alignment of Buffers

Use word-aligned buffers where possible:

uint8_t buffer[128] __attribute__((aligned(4)));

This allows the buffer to be processed word-at-a-time and handed to DMA engines that require word-aligned addresses, without the hard fault that an unaligned word access causes on these cores. (Cortex-M0/M1 have no data cache, so cache-line alignment is not a concern; word alignment is what matters.)
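The alignment can be verified at runtime during bring-up. A minimal sketch using the same GCC aligned attribute:

```c
#include <stdint.h>

/* Word-aligned byte buffer (aligned is a GCC extension); safe to
 * hand to a word-wise copy or a DMA engine expecting alignment. */
static uint8_t dma_buffer[128] __attribute__((aligned(4)));

int buffer_is_word_aligned(void)
{
    return ((uintptr_t)dma_buffer & 3u) == 0;
}
```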

Use of Built-in Data-Manipulation Instructions

Cortex-M0/M0+/M1 have no DSP extension, but they do provide single-cycle data-manipulation instructions such as REV, REV16, and REVSH for byte-order reversal. CMSIS intrinsics expose them directly:

uint32_t swapped = __REV(value); // Reverse byte order (endianness swap)

This uses a single instruction instead of a sequence of shifts and ORs.
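For host-portable code, GCC's __builtin_bswap32 expresses the same operation and is lowered to a single REV when targeting Cortex-M:

```c
#include <stdint.h>

/* Portable byte-order reversal: GCC compiles this to one REV
 * instruction on Cortex-M, and to the host's equivalent elsewhere. */
uint32_t rev32(uint32_t v)
{
    return __builtin_bswap32(v);
}
```

Useful for converting between little-endian memory layout and big-endian network byte order.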

Tight Assembly Loops

For critical loops, drop to assembly to optimize further than the compiler. The sketch below runs sixteen multiply-accumulate steps. Note the Thumb-1 constraints: MULS must write one of its source registers, most instructions can only use the low registers (GCC constraint "l"), and the flag-setting SUBS is placed last so that BNE tests the loop counter rather than the result of ADDS:

uint32_t acc = x, m = y, cnt, tmp;
asm volatile(
   ".syntax unified            \n"
   "movs  %[cnt], #16          \n"
   "1:                         \n"
   "movs  %[tmp], %[acc]       \n" /* tmp = acc                    */
   "muls  %[tmp], %[m], %[tmp] \n" /* tmp *= m (MULS Rdm, Rn, Rdm) */
   "adds  %[acc], %[tmp]       \n" /* acc += tmp                   */
   "subs  %[cnt], #1           \n" /* sets the flags BNE tests     */
   "bne   1b                   \n"
   : [acc] "+l" (acc), [cnt] "=&l" (cnt), [tmp] "=&l" (tmp)
   : [m] "l" (m)
   : "cc");

This achieves higher performance but is not portable. Use only when needed.

Reduce Code Size with Thumb-1

Cortex-M0/M1 implement only the Thumb-1 (ARMv6-M) instruction set, so there is no Thumb-2 to fall back to; the 16-bit encodings already give high code density. To shrink code further, compile with -Os and let the linker discard unreferenced code and data:

arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Os \
   -ffunction-sections -fdata-sections -c main.c
arm-none-eabi-gcc -mcpu=cortex-m0 -mthumb -Wl,--gc-sections \
   main.o -o main.elf

But beware of Thumb-1 limitations such as short branch ranges and the restriction of most instructions to registers r0-r7.

Compiler Optimizations

Aggressive optimizations can sometimes improve performance but increase size. Benchmark with options like:

-O3 
-funroll-loops
-finline-functions 
-fsingle-precision-constant
-fno-math-errno

Measure both speed and size after each change: aggressive settings can inflate code beyond the flash budget, and flags like -fno-math-errno relax standard-library semantics, so verify behavior as well as benchmarks.

Summary

The Cortex-M0 and Cortex-M1 are optimized for low-power IoT endpoints. But getting the most from them requires working around compiler inefficiencies during coding and compilation. Techniques like manual loop unrolling, memory layout optimizations, and using tight assembly can help boost performance, reduce code size, and lower power consumption. With attention to details like this, developers can squeeze every last drop of efficiency from these microcontroller workhorses.
