Cortex-M0+ processors utilize the Thumb-1 instruction set which is optimized for code density rather than performance. While compact code size is advantageous for microcontrollers with limited memory, it can result in suboptimal execution speed. By reorganizing C code in certain ways, we can optimize it to make better use of the Thumb-1 instruction set and achieve faster execution on Cortex-M0+ cores.
Understanding Thumb-1 Instruction Set
The Thumb-1 instruction set used by the Cortex-M0+ uses 16-bit encodings, compared to the 32-bit encodings of ARM instructions. This gives higher code density, as Thumb instructions take half the space. However, Thumb-1 lacks many of the more powerful ARM instructions and has limitations such as:
- Most instructions can only access the 8 low registers (R0-R7), versus the 13 general-purpose registers (R0-R12) usable in ARM state
- No inline barrel shifter: a shifted operand needs a separate shift instruction rather than being folded into the ALU operation
- Mostly two-operand forms for arithmetic/logical instructions, so the destination must double as a source
- No conditional execution of ordinary instructions; only branches can be conditional
Due to these constraints, certain coding practices that work well on ARM cores will be suboptimal on Thumb-1. The key is to understand these limitations and adopt techniques that generate efficient Thumb-1 assembly.
Optimizing Function Calls
Function calls carry overhead because the link register (LR) must be pushed and popped across nested calls, and argument setup competes for the scarce low registers. Some ways to optimize function calls are:
- Minimize call depth by flattening code where practical
- Mark small, hot functions static inline so the compiler can eliminate the call entirely
- Keep to four or fewer word-sized parameters so the AAPCS passes them in R0-R3 rather than on the stack
- Structure returns as tail calls so the compiler can replace the call with a branch and avoid stack manipulation
For example, a recursive factorial function can be rewritten iteratively, eliminating the per-call stack frame and LR save/restore entirely.
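A minimal sketch of that rewrite: the recursive form pushes LR and spills its argument on every call, while the iterative form keeps everything in low registers with no stack traffic inside the loop.

```c
#include <stdint.h>

/* Recursive form: every call pushes LR and builds a stack frame. */
uint32_t fact_recursive(uint32_t n)
{
    return (n <= 1u) ? 1u : n * fact_recursive(n - 1u);
}

/* Iterative form: no calls, no stack frames; the down-counting loop
 * lets the decrement's flag update serve as the loop test. */
uint32_t fact_iterative(uint32_t n)
{
    uint32_t result = 1u;
    while (n > 1u) {
        result *= n;
        n--;
    }
    return result;
}
```

Both return the same values; the difference is entirely in call and stack overhead.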
Efficient Looping Constructs
Complex loops with multiple branches have high overhead on Thumb-1. We can optimize loops by:
- Unrolling small loops to reduce branch overhead, weighing the code-size cost
- Counting the loop variable down to zero, so the decrement's flag update replaces a separate compare on each iteration
- Moving work out of the loop-termination check
- Replacing data-dependent branches in the loop body with branchless arithmetic where practical, since Thumb-1 has no conditionally executed instructions
For example, a loop that increments an index and compares it against an upper limit can be rewritten to decrement a counter from the iteration count to zero. The SUBS that updates the counter also sets the condition flags, so no separate CMP is needed per iteration.
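A sketch of the down-counting pattern, summing an array. The `while (count--)` form is one common way to get the compiler to emit a decrement that doubles as the loop test:

```c
#include <stdint.h>

/* Down-counting loop: the decrement of count sets the condition
 * flags, so the loop branch needs no separate compare against an
 * upper limit each iteration. */
uint32_t sum_u32(const uint32_t *data, uint32_t count)
{
    uint32_t sum = 0;
    while (count--) {
        sum += *data++;
    }
    return sum;
}
```

Whether the compiler actually emits the flag-setting form depends on optimization level; checking the disassembly is the only way to be sure.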
Optimized Expression Evaluation
Complex arithmetic and logical expressions can be optimized by:
- Using equivalent expression with fewer instructions
- Precomputing invariants outside expression
- Reordering commutative operators
- Being aware that reordering floating point operations changes rounding, so accuracy is part of the trade-off
For example, x*a + x*b can be factored as x*(a + b), replacing one multiply with an add. For integers the rewrite is exact; for floating point, which is entirely software on the Cortex-M0+, it can change the result, which is why compilers only reassociate under flags such as -ffast-math.
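The factoring example as code. Saving a multiply matters most on Cortex-M0+ parts built with the small 32-cycle multiplier option (the core is configurable between a single-cycle and a 32-cycle multiplier):

```c
#include <stdint.h>

/* Two multiplies and one add. */
uint32_t weighted_naive(uint32_t x, uint32_t a, uint32_t b)
{
    return x * a + x * b;
}

/* One add and one multiply: same result for integer types,
 * one fewer MULS instruction. */
uint32_t weighted_factored(uint32_t x, uint32_t a, uint32_t b)
{
    return x * (a + b);
}
```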
Efficient Data Access
Data access patterns can be optimized for Thumb-1 as follows:
- Group consecutive data items to avoid multiple loads/stores
- Copy blocks of data with multi-register LDM/STM transfers (e.g. via memcpy or struct assignment), which move several words per instruction and update the base register automatically
- Minimize stack access by keeping data in registers
- Pack adjacent byte fields into a word and extract them with shift-and-mask operations instead of issuing several byte loads
For example, four byte variables can be grouped into a single 32-bit word accessed with one LDR instead of four LDRB instructions. LDM/STM transfers with base-register writeback similarly cut the instruction count in block-copy loops (Thumb-1 has no post-increment addressing mode for single loads and stores).
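A sketch of the packing idea, assuming four 8-bit fields that are usually read and written together. The whole word moves with one load/store, and individual fields come out with shift and mask:

```c
#include <stdint.h>

/* Four 8-bit fields packed into one word: the struct is moved with
 * a single LDR/STR instead of four LDRB/STRB instructions. */
struct flags_packed {
    uint32_t word;
};

/* Shift-and-mask extraction from the already-loaded word. */
uint8_t get_field(const struct flags_packed *f, unsigned idx)
{
    return (uint8_t)(f->word >> (idx * 8u));
}

void set_field(struct flags_packed *f, unsigned idx, uint8_t v)
{
    uint32_t shift = idx * 8u;
    f->word = (f->word & ~(0xFFu << shift)) | ((uint32_t)v << shift);
}
```

This only pays off when the fields are accessed together; for isolated single-byte accesses, plain LDRB is already one instruction.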
Know Which Thumb-2 Instructions Are Available
The Cortex-M0+ implements ARMv6-M, which is almost entirely the 16-bit Thumb-1 set; only a handful of 32-bit Thumb-2 encodings are supported:
- BL for long-range calls with link
- DMB, DSB and ISB barriers
- MRS and MSR for special-register access
Notably absent are CBZ/CBNZ (compare and branch on zero), IT conditional blocks, and the wide three-register data-processing forms; these require ARMv7-M (Cortex-M3 and later). Code or hand-written assembly ported from larger cores must avoid those encodings, and the compiler must be given the right target (e.g. -mcpu=cortex-m0plus) so it does not emit them.
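One way to handle the architecture split in portable code is a preprocessor guard. The sketch below assumes GCC/Clang, which predefine __ARM_ARCH_7M__ / __ARM_ARCH_7EM__ for ARMv7-M targets; ARMv7-M has a CLZ instruction while ARMv6-M does not, so the Cortex-M0+ path falls back to plain shifts and masks:

```c
#include <stdint.h>

/* Count leading zeros with an architecture-dependent implementation.
 * On ARMv7-M and up, __builtin_clz maps to the single CLZ instruction;
 * on ARMv6-M (Cortex-M0+) and on host builds, a Thumb-1 friendly
 * fallback of shifts and masks is used instead. */
uint32_t count_leading_zeros(uint32_t x)
{
    if (x == 0u)
        return 32u;                     /* __builtin_clz(0) is undefined */
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
    return (uint32_t)__builtin_clz(x);  /* single CLZ instruction */
#else
    uint32_t n = 0;
    if (!(x & 0xFFFF0000u)) { n += 16u; x <<= 16; }
    if (!(x & 0xFF000000u)) { n += 8u;  x <<= 8;  }
    if (!(x & 0xF0000000u)) { n += 4u;  x <<= 4;  }
    if (!(x & 0xC0000000u)) { n += 2u;  x <<= 2;  }
    if (!(x & 0x80000000u)) { n += 1u; }
    return n;
#endif
}
```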
Optimize Code for Size or Speed
Various C compiler options like -Os and -O2 affect code generation priorities. For Cortex-M0+:
- -Os optimizes for size, which often performs well on the M0+ because flash wait states penalize larger code
- -O2 optimizes for speed at some cost in size
- -Oz (Clang, and recent GCC) optimizes even more aggressively for size than -Os
Profile and benchmark code with different optimization levels to determine the best fit. Note that the Cortex-M0+ has no cache, so speed-oriented transformations that inflate code (such as heavy unrolling) can instead lose time to flash wait states.
Profiling with Performance Counters
The Cortex-M0+ performance counters can be used to profile and identify optimization opportunities:
- Cycle count: Checks execution time of functions
- Instruction count: Highlights inefficient instruction sequences
- Stall cycles: Indicates pipeline stalls due to branches, memory access etc.
- Code density: Measures compactness of generated code
After making optimizations, verify improvements by comparing before and after metrics. This helps make data-driven optimization decisions.
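A sketch of the SysTick arithmetic, assuming the usual CMSIS setup where the raw counter value is read from SysTick->VAL before and after the measured section. SysTick counts down and wraps at 24 bits, so elapsed ticks is the masked difference, valid while the section is shorter than one counter period:

```c
#include <stdint.h>

#define TICK_MASK 0x00FFFFFFu  /* SysTick is a 24-bit counter */

/* Elapsed ticks between two raw readings of a 24-bit DOWN-counter.
 * Because the counter decrements, elapsed = start - end; the mask
 * makes the subtraction correct across a single wrap (assuming the
 * reload value is the full 0xFFFFFF). */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (start - end) & TICK_MASK;
}
```

On target this would be used as `start = SysTick->VAL; /* code */ end = SysTick->VAL;` with interrupts ideally masked during the measurement.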
Leveraging Compiler Intrinsics
Compiler intrinsics for specific instructions can be used to optimize critical code sections. CMSIS-Core provides, among others:
- __NOP(): inserts a no-operation instruction
- __REV(), __REV16(), __REVSH(): reverse byte order (endianness swaps)
- __ROR(): rotate a value right
- __disable_irq()/__enable_irq(): interrupt masking via CPSID/CPSIE
Supervisor calls are toolchain-specific: Arm Compiler's __svc keyword, or an inline-asm svc instruction under GCC. Intrinsics give finer control over the generated assembly for time- and space-critical code, but use them sparingly, as they reduce portability across compilers.
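For host-side unit testing of code that uses such intrinsics on target, plain-C equivalents are handy. These sketches mirror the documented semantics of CMSIS __REV (byte-order reversal) and __ROR (rotate right):

```c
#include <stdint.h>

/* Byte-order reversal, equivalent to the REV instruction / __REV(). */
uint32_t rev32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Rotate right, equivalent to __ROR(); the shift amount is reduced
 * modulo 32, and a zero shift is a no-op (avoiding the undefined
 * behavior of shifting by 32 in C). */
uint32_t ror32(uint32_t x, uint32_t n)
{
    n &= 31u;
    if (n == 0u)
        return x;
    return (x >> n) | (x << (32u - n));
}
```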
Conclusion
With the Cortex-M0+ specific limitations in mind, reorganizing C code using the techniques discussed can yield substantial speedups on hot paths, with how much depending on how branch- and call-heavy the original code was. The usual trade-off is an increase in code size, which may be acceptable for performance-critical routines. To summarize, optimizing C code for Thumb-1 means working with the 16-bit instruction set rather than against it: minimizing branches and function call overhead, simplifying expressions, optimizing data access, choosing appropriate compiler options, and verifying every change by measurement.