Cortex-M0+ processors utilize the Thumb-1 instruction set which is optimized for code density rather than performance. While compact code size is advantageous for microcontrollers with limited memory, it can result in suboptimal execution speed. By reorganizing C code in certain ways, we can optimize it to make better use of the Thumb-1 instruction set and achieve faster execution on Cortex-M0+ cores.
Understanding Thumb-1 Instruction Set
The Thumb-1 instruction set used by the Cortex-M0+ uses 16-bit encodings, compared to the 32-bit encodings of ARM instructions. This gives higher code density, as Thumb instructions take half the space. However, Thumb-1 lacks many of the more powerful ARM instructions and has limitations such as:
- Most instructions can only access the 8 low registers (R0-R7), versus the 13 general-purpose registers (R0-R12) usable in ARM state
- No inline barrel shifter: a shifted operand needs a separate shift instruction rather than being folded into the ALU operation
- Mostly two-operand forms for arithmetic/logical instructions, so the destination must double as a source
- No conditional execution of ordinary instructions; only branches can be conditional
Due to these constraints, certain coding practices that work well on ARM cores will be suboptimal on Thumb-1. The key is to understand these limitations and adopt techniques that generate efficient Thumb-1 assembly.
Optimizing Function Calls
Function calls carry overhead because the link register (LR) must be pushed and popped across nested calls, and argument setup competes for the scarce low registers. Some ways to optimize function calls are:
- Minimize call depth by flattening code where practical
- Mark small, hot functions static inline so the compiler can eliminate the call entirely
- Keep to four or fewer word-sized parameters so the AAPCS passes them in R0-R3 rather than on the stack
- Structure returns as tail calls so the compiler can replace the call with a branch and avoid stack manipulation
For example, a recursive factorial function can be rewritten iteratively, eliminating the per-call stack frame and LR save/restore entirely.
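A minimal sketch of that rewrite: the recursive form pushes LR and spills its argument on every call, while the iterative form keeps everything in low registers with no stack traffic inside the loop.

```c
#include <stdint.h>

/* Recursive form: every call pushes LR and builds a stack frame. */
uint32_t fact_recursive(uint32_t n)
{
    return (n <= 1u) ? 1u : n * fact_recursive(n - 1u);
}

/* Iterative form: no calls, no stack frames; the down-counting loop
 * lets the decrement's flag update serve as the loop test. */
uint32_t fact_iterative(uint32_t n)
{
    uint32_t result = 1u;
    while (n > 1u) {
        result *= n;
        n--;
    }
    return result;
}
```

Both return the same values; the difference is entirely in call and stack overhead.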
Efficient Looping Constructs
Complex loops with multiple branches have high overhead on Thumb-1. We can optimize loops by:
- Unrolling small loops to reduce branch overhead, weighing the code-size cost
- Counting the loop variable down to zero, so the decrement's flag update replaces a separate compare on each iteration
- Moving work out of the loop-termination check
- Replacing data-dependent branches in the loop body with branchless arithmetic where practical, since Thumb-1 has no conditionally executed instructions
For example, a loop that increments an index and compares it against an upper limit can be rewritten to decrement a counter from the iteration count to zero. The SUBS that updates the counter also sets the condition flags, so no separate CMP is needed per iteration.
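A sketch of the down-counting pattern, summing an array. The `while (count--)` form is one common way to get the compiler to emit a decrement that doubles as the loop test:

```c
#include <stdint.h>

/* Down-counting loop: the decrement of count sets the condition
 * flags, so the loop branch needs no separate compare against an
 * upper limit each iteration. */
uint32_t sum_u32(const uint32_t *data, uint32_t count)
{
    uint32_t sum = 0;
    while (count--) {
        sum += *data++;
    }
    return sum;
}
```

Whether the compiler actually emits the flag-setting form depends on optimization level; checking the disassembly is the only way to be sure.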
Optimized Expression Evaluation
Complex arithmetic and logical expressions can be optimized by:
- Using equivalent expression with fewer instructions
- Precomputing invariants outside expression
- Reordering commutative operators
- Being aware that reordering floating point operations changes rounding, so accuracy is part of the trade-off
For example, x*a + x*b can be factored as x*(a + b), replacing one multiply with an add. For integers the rewrite is exact; for floating point, which is entirely software on the Cortex-M0+, it can change the result, which is why compilers only reassociate under flags such as -ffast-math.
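The factoring example as code. Saving a multiply matters most on Cortex-M0+ parts built with the small 32-cycle multiplier option (the core is configurable between a single-cycle and a 32-cycle multiplier):

```c
#include <stdint.h>

/* Two multiplies and one add. */
uint32_t weighted_naive(uint32_t x, uint32_t a, uint32_t b)
{
    return x * a + x * b;
}

/* One add and one multiply: same result for integer types,
 * one fewer MULS instruction. */
uint32_t weighted_factored(uint32_t x, uint32_t a, uint32_t b)
{
    return x * (a + b);
}
```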
Efficient Data Access
Data access patterns can be optimized for Thumb-1 as follows:
- Group consecutive data items to avoid multiple loads/stores
- Copy blocks of data with multi-register LDM/STM transfers (e.g. via memcpy or struct assignment), which move several words per instruction and update the base register automatically
- Minimize stack access by keeping data in registers
- Pack adjacent byte fields into a word and extract them with shift-and-mask operations instead of issuing several byte loads
For example, four byte variables can be grouped into a single 32-bit word accessed with one LDR instead of four LDRB instructions. LDM/STM transfers with base-register writeback similarly cut the instruction count in block-copy loops (Thumb-1 has no post-increment addressing mode for single loads and stores).
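A sketch of the packing idea, assuming four 8-bit fields that are usually read and written together. The whole word moves with one load/store, and individual fields come out with shift and mask:

```c
#include <stdint.h>

/* Four 8-bit fields packed into one word: the struct is moved with
 * a single LDR/STR instead of four LDRB/STRB instructions. */
struct flags_packed {
    uint32_t word;
};

/* Shift-and-mask extraction from the already-loaded word. */
uint8_t get_field(const struct flags_packed *f, unsigned idx)
{
    return (uint8_t)(f->word >> (idx * 8u));
}

void set_field(struct flags_packed *f, unsigned idx, uint8_t v)
{
    uint32_t shift = idx * 8u;
    f->word = (f->word & ~(0xFFu << shift)) | ((uint32_t)v << shift);
}
```

This only pays off when the fields are accessed together; for isolated single-byte accesses, plain LDRB is already one instruction.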
Know Which Thumb-2 Instructions Are Available
The Cortex-M0+ implements ARMv6-M, which is almost entirely the 16-bit Thumb-1 set; only a handful of 32-bit Thumb-2 encodings are supported:
- BL for long-range calls with link
- DMB, DSB and ISB barriers
- MRS and MSR for special-register access
Notably absent are CBZ/CBNZ (compare and branch on zero), IT conditional blocks, and the wide three-register data-processing forms; these require ARMv7-M (Cortex-M3 and later). Code or hand-written assembly ported from larger cores must avoid those encodings, and the compiler must be given the right target (e.g. -mcpu=cortex-m0plus) so it does not emit them.
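One way to handle the architecture split in portable code is a preprocessor guard. The sketch below assumes GCC/Clang, which predefine __ARM_ARCH_7M__ / __ARM_ARCH_7EM__ for ARMv7-M targets; ARMv7-M has a CLZ instruction while ARMv6-M does not, so the Cortex-M0+ path falls back to plain shifts and masks:

```c
#include <stdint.h>

/* Count leading zeros with an architecture-dependent implementation.
 * On ARMv7-M and up, __builtin_clz maps to the single CLZ instruction;
 * on ARMv6-M (Cortex-M0+) and on host builds, a Thumb-1 friendly
 * fallback of shifts and masks is used instead. */
uint32_t count_leading_zeros(uint32_t x)
{
    if (x == 0u)
        return 32u;                     /* __builtin_clz(0) is undefined */
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
    return (uint32_t)__builtin_clz(x);  /* single CLZ instruction */
#else
    uint32_t n = 0;
    if (!(x & 0xFFFF0000u)) { n += 16u; x <<= 16; }
    if (!(x & 0xFF000000u)) { n += 8u;  x <<= 8;  }
    if (!(x & 0xF0000000u)) { n += 4u;  x <<= 4;  }
    if (!(x & 0xC0000000u)) { n += 2u;  x <<= 2;  }
    if (!(x & 0x80000000u)) { n += 1u; }
    return n;
#endif
}
```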
Optimize Code for Size or Speed
Various C compiler options like -Os and -O2 affect code generation priorities. For Cortex-M0+:
- -Os optimizes for size, which often performs well on the M0+ because flash wait states penalize larger code
- -O2 optimizes for speed at some cost in size
- -Oz (Clang, and recent GCC) optimizes even more aggressively for size than -Os
Profile and benchmark code with different optimization levels to determine the best fit. Note that the Cortex-M0+ has no cache, so speed-oriented transformations that inflate code (such as heavy unrolling) can instead lose time to flash wait states.
Profiling with Performance Counters
The Cortex-M0+ performance counters can be used to profile and identify optimization opportunities:
- Cycle count: Checks execution time of functions
- Instruction count: Highlights inefficient instruction sequences
- Stall cycles: Indicates pipeline stalls due to branches, memory access etc.
- Code density: Measures compactness of generated code
After making optimizations, verify improvements by comparing before and after metrics. This helps make data-driven optimization decisions.
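A sketch of the SysTick arithmetic, assuming the usual CMSIS setup where the raw counter value is read from SysTick->VAL before and after the measured section. SysTick counts down and wraps at 24 bits, so elapsed ticks is the masked difference, valid while the section is shorter than one counter period:

```c
#include <stdint.h>

#define TICK_MASK 0x00FFFFFFu  /* SysTick is a 24-bit counter */

/* Elapsed ticks between two raw readings of a 24-bit DOWN-counter.
 * Because the counter decrements, elapsed = start - end; the mask
 * makes the subtraction correct across a single wrap (assuming the
 * reload value is the full 0xFFFFFF). */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (start - end) & TICK_MASK;
}
```

On target this would be used as `start = SysTick->VAL; /* code */ end = SysTick->VAL;` with interrupts ideally masked during the measurement.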
Leveraging Compiler Intrinsics
Compiler intrinsics for specific instructions can be used to optimize critical code sections. CMSIS-Core provides, among others:
- __NOP(): inserts a no-operation instruction
- __REV(), __REV16(), __REVSH(): reverse byte order (endianness swaps)
- __ROR(): rotate a value right
- __disable_irq()/__enable_irq(): interrupt masking via CPSID/CPSIE
Supervisor calls are toolchain-specific: Arm Compiler's __svc keyword, or an inline-asm svc instruction under GCC. Intrinsics give finer control over the generated assembly for time- and space-critical code, but use them sparingly, as they reduce portability across compilers.
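For host-side unit testing of code that uses such intrinsics on target, plain-C equivalents are handy. These sketches mirror the documented semantics of CMSIS __REV (byte-order reversal) and __ROR (rotate right):

```c
#include <stdint.h>

/* Byte-order reversal, equivalent to the REV instruction / __REV(). */
uint32_t rev32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Rotate right, equivalent to __ROR(); the shift amount is reduced
 * modulo 32, and a zero shift is a no-op (avoiding the undefined
 * behavior of shifting by 32 in C). */
uint32_t ror32(uint32_t x, uint32_t n)
{
    n &= 31u;
    if (n == 0u)
        return x;
    return (x >> n) | (x << (32u - n));
}
```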
Conclusion
With the Cortex-M0+ specific limitations in mind, reorganizing C code using the techniques discussed can yield substantial speedups on hot paths, with how much depending on how branch- and call-heavy the original code was. The usual trade-off is an increase in code size, which may be acceptable for performance-critical routines. To summarize, optimizing C code for Thumb-1 means working with the 16-bit instruction set rather than against it: minimizing branches and function call overhead, simplifying expressions, optimizing data access, choosing appropriate compiler options, and verifying every change by measurement.