Reorganising C code to be optimal for Thumb-1 Instruction-Set with Cortex M0+

Neil Salmon
Last updated: October 5, 2023 9:58 am

Cortex-M0+ processors utilize the Thumb-1 instruction set which is optimized for code density rather than performance. While compact code size is advantageous for microcontrollers with limited memory, it can result in suboptimal execution speed. By reorganizing C code in certain ways, we can optimize it to make better use of the Thumb-1 instruction set and achieve faster execution on Cortex-M0+ cores.

Contents

  • Understanding Thumb-1 Instruction Set
  • Optimizing Function Calls
  • Efficient Looping Constructs
  • Optimized Expression Evaluation
  • Efficient Data Access
  • Leverage the Available Thumb-2 Instructions
  • Optimize Code for Size or Speed
  • Profiling the Code
  • Leveraging Compiler Intrinsics
  • Conclusion

Understanding Thumb-1 Instruction Set

The Thumb-1 instruction set used by Cortex-M0+ has 16-bit encodings, compared with the 32-bit encodings of ARM instructions. This gives higher code density, since most Thumb instructions take half the space. However, Thumb-1 lacks many of the more powerful ARM instructions and has limitations such as:

  • Most instructions can only access the eight low registers R0-R7, whereas ARM state offers 13 general-purpose registers
  • No inline barrel shifter: shifts cannot be folded into other data-processing instructions and must be issued separately
  • Limited operand and immediate choices for arithmetic and logical instructions
  • No conditional execution of general instructions; only conditional branches are available

Due to these constraints, certain coding practices that work well on ARM cores will be suboptimal on Thumb-1. The key is to understand these limitations and adopt techniques that generate efficient Thumb-1 assembly.

Optimizing Function Calls

Function calls carry overhead on the Cortex-M0+ because the link register (LR) and any callee-saved registers a function uses must be pushed to and popped from the stack. Some ways to optimize function calls are:

  • Minimize call depth by flattening code where possible
  • Use inline functions instead of separate calls for small, frequently used routines
  • Keep argument counts to four or fewer so parameters are passed in registers rather than on the stack
  • Structure calls so the compiler can apply tail-call optimization and avoid extra stack manipulation

For example, a recursive factorial function can be rewritten iteratively to remove the per-call overhead; unnecessary call depth and stack traffic are eliminated.
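
A minimal sketch of that rewrite (function names are illustrative):

#include <stdint.h>

/* Recursive version: every call pushes LR (and any live registers) onto the stack. */
uint32_t factorial_recursive(uint32_t n)
{
    return (n <= 1u) ? 1u : n * factorial_recursive(n - 1u);
}

/* Iterative version: no call overhead, the whole loop runs in registers. */
uint32_t factorial_iterative(uint32_t n)
{
    uint32_t result = 1u;
    while (n > 1u) {
        result *= n;
        n--;
    }
    return result;
}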

Efficient Looping Constructs

Complex loops with multiple branches have high overhead on Thumb-1. We can optimize loops by:

  • Unrolling small loops to reduce branch overhead
  • Using loop counters that count down to zero, so the decrementing subtract already sets the flags and no separate compare is needed
  • Minimizing the work done in the loop's termination check
  • Replacing branches inside the loop body with branch-free arithmetic where possible

For example, a loop that increments an index and compares it against an upper limit can be rewritten to count a loop variable down from the number of iterations to zero. This removes the separate compare instruction from every iteration.
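
A sketch of both forms, summing an array:

#include <stdint.h>

/* Count-up form: each iteration needs an increment, a compare and a branch. */
uint32_t sum_up(const uint32_t *data, uint32_t len)
{
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Count-down form: the SUBS that decrements the counter already sets the
   flags, so the loop branch needs no separate CMP. */
uint32_t sum_down(const uint32_t *data, uint32_t len)
{
    uint32_t sum = 0u;
    while (len--) {
        sum += *data++;
    }
    return sum;
}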

Optimized Expression Evaluation

Complex arithmetic and logical expressions can be optimized by:

  • Using equivalent expression with fewer instructions
  • Precomputing invariants outside expression
  • Reordering commutative operators
  • Minimizing loss of precision in floating point ops

For example, a * b + a * c can be factored as a * (b + c), replacing two multiplications with one. Floating-point additions can also be ordered to minimize loss of precision, which deserves extra care on Cortex-M0+ since floating point is implemented in software.
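
A small illustration of the factoring, using unsigned arithmetic so the two forms are exactly equivalent:

#include <stdint.h>

/* Two multiplies and one add. */
uint32_t scaled_sum(uint32_t a, uint32_t b, uint32_t c)
{
    return a * b + a * c;
}

/* Factored form: one multiply and one add, i.e. one fewer MULS instruction. */
uint32_t scaled_sum_factored(uint32_t a, uint32_t b, uint32_t c)
{
    return a * (b + c);
}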

Efficient Data Access

Data access patterns can be optimized for Thumb-1 as follows:

  • Group consecutive data items together so they can be moved with fewer loads and stores
  • Access arrays sequentially so the compiler can merge transfers into LDM/STM multiple-register instructions
  • Minimize stack accesses by keeping frequently used data in registers
  • Pack small fields into words and extract them with shift-and-mask operations instead of separate byte accesses

For example, four byte-sized variables can be packed into a single word and fetched with one LDR instead of four LDRB instructions. Sequential accesses also let the compiler combine transfers into LDM/STM instructions with address writeback, reducing the instruction count in copy loops.
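
A sketch of word-at-a-time access, assuming both buffers are 4-byte aligned so the word loads and stores are legal:

#include <stdint.h>

/* Byte-by-byte copy: one LDRB and one STRB per byte. */
void copy_bytes(uint8_t *dst, const uint8_t *src, uint32_t nwords)
{
    for (uint32_t i = 0u; i < nwords * 4u; i++) {
        dst[i] = src[i];
    }
}

/* Word-at-a-time copy: one LDR and one STR move four bytes, and the
   sequential pointer walk keeps the addressing simple for the compiler. */
void copy_words(uint32_t *dst, const uint32_t *src, uint32_t nwords)
{
    while (nwords--) {
        *dst++ = *src++;
    }
}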

Leverage the Available Thumb-2 Instructions

The Cortex-M0+ implements the ARMv6-M architecture, which is almost entirely 16-bit Thumb-1 instructions plus a small number of 32-bit Thumb-2 instructions:

  • BL for long-range subroutine calls
  • DMB, DSB and ISB memory and instruction barriers
  • MRS and MSR for reading and writing the special registers

Note that CBZ/CBNZ compare-and-branch and IT conditional blocks belong to ARMv7-M (Cortex-M3/M4 and later) and are not available on Cortex-M0+. Within the 16-bit set itself, prefer the denser encodings where the compiler allows them: the three-register forms of ADDS and SUBS avoid a separate MOV, and LDR/STR with a register offset avoids a separate address calculation.

Optimize Code for Size or Speed

Various C compiler options like -Os and -O2 affect code generation priorities. For Cortex-M0+:

  • -Os optimizes for size but can hurt performance
  • -O2 optimizes for speed at the cost of size
  • -Oz, where the toolchain supports it (for example Clang and recent GCC), optimizes even more aggressively for size than -Os

Profile and benchmark the code at different optimization levels to determine the best fit. Aggressive speed optimization, such as heavy loop unrolling, can inflate the image beyond the small flash found on typical Cortex-M0+ parts.
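
One common pattern is to build the whole project with -Os and opt individual hot functions into a higher level. A sketch using the GCC-specific optimize attribute (the function name is illustrative; other toolchains use pragmas or separate translation units):

#include <stdint.h>

/* The rest of the file is compiled with -Os; this hot routine asks GCC
   to apply -O2 locally. */
__attribute__((optimize("O2")))
uint32_t hot_checksum(const uint32_t *data, uint32_t len)
{
    uint32_t sum = 0u;
    while (len--) {
        sum += *data++;
    }
    return sum;
}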

Profiling the Code

The Cortex-M0+ does not provide the DWT cycle counter found on larger Cortex-M cores, so profiling relies on the SysTick timer or a vendor timer peripheral. Metrics worth collecting include:

  • Cycle counts: execution time of hot functions, measured with a free-running timer
  • Instruction counts: inefficient sequences, spotted in the compiler's disassembly listing
  • Branch-heavy regions: every taken branch costs extra cycles to refill the short Cortex-M0+ pipeline
  • Code density: the size of the generated code, read from the linker map file

After making optimizations, verify improvements by comparing before and after metrics. This helps make data-driven optimization decisions.
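
A sketch of timer-based measurement with the CMSIS SysTick registers, assuming a CMSIS device header is available and SysTick is not already claimed by an RTOS:

#include "device.h"          /* hypothetical CMSIS device header */

/* Measure the cycles taken by func(), which must run for fewer than
   2^24 cycles so the 24-bit SysTick counter does not wrap. */
static uint32_t measure_cycles(void (*func)(void))
{
    SysTick->LOAD = 0x00FFFFFFu;                 /* maximum 24-bit reload     */
    SysTick->VAL  = 0u;                          /* clear the current count   */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk   /* count processor clocks    */
                  | SysTick_CTRL_ENABLE_Msk;

    uint32_t start = SysTick->VAL;
    func();                                      /* code under measurement    */
    uint32_t end = SysTick->VAL;

    SysTick->CTRL = 0u;                          /* stop the counter          */
    return start - end;                          /* SysTick counts down       */
}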

Leveraging Compiler Intrinsics

Compiler intrinsics that map directly to specific instructions can be used to optimize critical code sections. Useful intrinsics for Thumb-1 code include:

  • __nop(): Inserts a no-operation instruction
  • __rev(): Reverses the byte order of a value
  • __ror(): Rotates a value right
  • __svc(): Generates a supervisor call exception

Intrinsics give finer control over the generated assembly for time- or space-critical code. However, use them in moderation: the exact spellings vary between compilers (CMSIS-Core provides portable equivalents such as __NOP, __REV and __ROR), which limits portability.
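
A short sketch using the CMSIS-Core spellings (__REV, __ROR), again assuming a CMSIS device header:

#include "device.h"          /* hypothetical CMSIS device header */

/* Byte-swap a big-endian word from a protocol buffer with a single REV
   instruction instead of four shifts and ORs. */
uint32_t read_be32(uint32_t raw)
{
    return __REV(raw);
}

/* Rotate right by 8 bits using a short ROR-based sequence. */
uint32_t rotate_by_byte(uint32_t value)
{
    return __ROR(value, 8u);
}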

Conclusion

With these Cortex-M0+ specific limitations in mind, reorganizing C code using the techniques discussed can provide significant performance gains; substantial speed-ups are often achievable, depending on how the original code was written. The trade-off is usually an increase in code size, which may be acceptable for performance-critical code. To summarize, optimizing C code for Thumb-1 means using the 16-bit instruction set efficiently, minimizing branches and function calls, simplifying expressions, optimizing data access, choosing compiler options deliberately, and verifying the results by profiling.
