When programming for the Cortex-M4 chip, developers have a choice between using compiler intrinsics or handwritten assembly language for implementing mathematical functions. There are advantages and disadvantages to both approaches that should be considered when deciding which implementation to use for a particular application.
Using Compiler Intrinsics
Compiler intrinsics are functions provided by the compiler that map directly to a single machine instruction. They allow you to use C/C++ code while still taking advantage of optimized assembly instructions. Here are some of the main pros of using intrinsics for math functions on Cortex-M4:
- Easier to write and maintain – Intrinsics can be used like regular C/C++ functions. Code using intrinsics is easier to read and doesn’t require assembly language knowledge.
- Compiler optimization – The compiler can inline and optimize around intrinsics, whereas a separate assembly routine is opaque to the optimizer.
- Re-targetable code – Code using intrinsics is not tied to a specific microarchitecture. This makes it more portable across Cortex-M variants.
- Parameter type safety – The compiler type-checks intrinsic arguments; assembly performs no such checking.
- Easier debugging – Debugging C/C++ is generally easier than debugging raw assembly code.
Here is an example of using the __SMLAD intrinsic, which maps to the SMLAD instruction. This performs two signed 16-bit multiplies on packed halfwords and adds both products to a 32-bit accumulator:

int32_t acc = __SMLAD(x, y, acc);
Using Handwritten Assembly
Writing math functions directly in assembly language allows for very precise control over the machine code generated. Some key benefits of handwritten assembly on Cortex-M4 include:
- Maximize performance – Hand-tuned assembly can optimize whole routines for the microarchitecture, while each intrinsic still maps one-to-one to a single instruction.
- Minimize size – Assembly can eliminate overhead from function calls and utilize processor registers efficiently.
- Utilize special instructions – Some instructions may have no intrinsic equivalent in a given toolchain; assembly can use any instruction directly.
- Fine tune code – More control over instruction ordering, pipelining, and branching to optimize performance.
- Direct register access – Assembly controls register allocation directly, whereas intrinsics leave allocation to the compiler.
Here is an example multiply-accumulate function in ARM assembly. Per the AAPCS, the arguments arrive in r0–r2 and the result is returned in r0:

mul_accumulate:
    SMMLA r0, r0, r1, r2   ; r0 = r2 + high 32 bits of (r0 * r1)
    BX    lr
Guidelines for Intrinsic vs Assembly Usage
Here are some guidelines on when it may be preferable to use intrinsics or handwritten assembly for math functions on Cortex-M4:
- Use intrinsics by default – For most functions, intrinsics provide a good balance of code maintainability and performance.
- Use assembly for performance critical code – Functions executed very frequently or in tight loops are good assembly candidates.
- Use assembly for small/simple functions – Small functions like multiply-accumulate are easily optimized in assembly.
- Use intrinsics for complex functions – Larger math functions with error handling and corner cases are often better in intrinsics.
- Use assembly to minimize code size – Assembly allows removing overhead to create very compact code for size constrained applications.
- Use intrinsics for portable code – If code needs to be reused across Cortex-M variants, intrinsics improve portability.
When deciding between the two options, carefully consider the requirements of your specific application. Weigh the benefits of potential performance gains from assembly vs the improved maintainability from intrinsics. Prototype and benchmark both versions when possible.
Intrinsics for Common Math Operations
Here are some commonly used ARM Cortex-M4 intrinsics for math operations and their equivalent instructions:
Multiply and Multiply-Accumulate
- __SMULBB – Signed multiply of the bottom 16 bits of each operand, 32-bit result (SMULBB)
- __SMULBT – Signed multiply, bottom 16 bits × top 16 bits, 32-bit result (SMULBT)
- __SMULTB – Signed multiply, top 16 bits × bottom 16 bits, 32-bit result (SMULTB)
- __SMULTT – Signed multiply of the top 16 bits of each operand, 32-bit result (SMULTT)
- __SMLAD – Dual signed 16×16 multiply with 32-bit accumulate (SMLAD)
- __SMLADX – Dual signed 16×16 multiply-accumulate with halfword exchange (SMLADX)
- __SMLALD – Dual signed 16×16 multiply with 64-bit accumulate (SMLALD)
- __SMLALDX – Dual signed 16×16 multiply, 64-bit accumulate, with halfword exchange (SMLALDX)
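Since these intrinsics only exist on ARM targets, a host-side reference model of their semantics is useful for unit testing fixed point code off-target. Here is a sketch assuming the CMSIS/ACLE packed-halfword semantics (function names are illustrative, not real intrinsics):

```c
#include <stdint.h>

/* Reference model of SMLAD semantics: each 32-bit operand holds two
 * packed signed 16-bit halfwords; both 16x16 products are summed into
 * the 32-bit accumulator. */
int32_t smlad_ref(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFF);   /* bottom halfword of x */
    int16_t xh = (int16_t)(x >> 16);      /* top halfword of x    */
    int16_t yl = (int16_t)(y & 0xFFFF);
    int16_t yh = (int16_t)(y >> 16);
    return acc + (int32_t)xl * yl + (int32_t)xh * yh;
}

/* Reference model of SMULBB semantics: multiply the bottom halfwords
 * of each operand, producing a full signed 32-bit result. */
int32_t smulbb_ref(uint32_t x, uint32_t y)
{
    return (int32_t)(int16_t)(x & 0xFFFF) * (int16_t)(y & 0xFFFF);
}
```

Models like these let the same fixed point algorithm be validated on a desktop build before cross-compiling for the target.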
Saturating Arithmetic
- __SSAT – Signed saturate to a selectable bit width (SSAT)
- __USAT – Unsigned saturate (USAT)
- __QADD – Saturating add (QADD)
- __QSUB – Saturating subtract (QSUB)
- __QDADD – Saturating double and add (QDADD)
- __QDSUB – Saturating double and subtract (QDSUB)
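Saturation clamps a result to the representable range instead of wrapping around, which is usually the right behavior for signal data. A host-side sketch of the __SSAT and __QADD semantics (assuming the CMSIS-style definitions; names are illustrative):

```c
#include <stdint.h>

/* SSAT semantics: clamp x to the range of a signed `bits`-bit integer
 * (bits in 1..32). */
int32_t ssat_ref(int32_t x, unsigned bits)
{
    int32_t max = (int32_t)((1u << (bits - 1)) - 1);
    int32_t min = -max - 1;
    if (x > max) return max;
    if (x < min) return min;
    return x;
}

/* QADD semantics: signed add saturated to the full 32-bit range.
 * Widening to 64 bits makes overflow detection straightforward. */
int32_t qadd_ref(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}
```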
Extend and Rotate
- __SXTB16 – Sign extend two 8-bit values to two 16-bit values (SXTB16)
- __SXTH – Sign extend 16 bit to 32 bit (SXTH)
- __SXTB – Sign extend 8 bit to 32 bit (SXTB)
- __UXTB16 – Zero extend two 8-bit values to two 16-bit values (UXTB16)
- __UXTH – Zero extend 16 bit to 32 bit (UXTH)
- __UXTB – Zero extend 8 bit to 32 bit (UXTB)
- __ROR – Rotate right (ROR)
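The dual-byte extend is easy to get wrong, so a host-side model helps clarify it: SXTB16 sign-extends bytes 0 and 2 of the input into two 16-bit fields. A sketch of that and of the rotate (function names are illustrative, assuming the CMSIS semantics):

```c
#include <stdint.h>

/* SXTB16 semantics: sign-extend byte 0 into the bottom 16-bit field
 * and byte 2 into the top 16-bit field. */
uint32_t sxtb16_ref(uint32_t x)
{
    uint32_t lo = (uint32_t)(uint16_t)(int16_t)(int8_t)(x & 0xFF);
    uint32_t hi = (uint32_t)(uint16_t)(int16_t)(int8_t)((x >> 16) & 0xFF);
    return (hi << 16) | lo;
}

/* ROR semantics: rotate a 32-bit value right by n bits. */
uint32_t ror_ref(uint32_t x, unsigned n)
{
    n &= 31u;                      /* rotate amount is taken mod 32 */
    return n ? ((x >> n) | (x << (32u - n))) : x;
}
```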
Divide
- __SDIV – Signed divide (SDIV)
- __UDIV – Unsigned divide (UDIV)
Note that intrinsic support for divide varies by toolchain; since SDIV and UDIV are native Cortex-M4 instructions, compilers also emit them automatically for the C division operators.
Utilizing the Cortex-M4 DSP Extension
The Cortex-M4 implements the ARMv7E-M DSP extension, a set of digital signal processing instructions not present on the Cortex-M3. These DSP instructions include:
- Saturating arithmetic – QADD8, QSUB8, QADD16, QSUB16, etc.
- Dual 16-bit multiply with 32-bit accumulate – SMLAD, SMLADX
- Dual 16-bit MACs with 64-bit accumulate – SMLALD, SMLALDX
- Dual 16-bit multiply – SMUAD, SMUADX
- Bit field clear and insert – BFC, BFI
- Reverse bits – RBIT
The DSP extension provides significant performance improvements for algorithms dominated by 16-bit operations. To use these instructions from C/C++, compiler intrinsics are the most convenient option since they map directly to the DSP instructions. For example:

int32_t acc = __SMLAD(x, y, acc); // dual 16×16 multiply with 32-bit accumulate
The compiler will output the optimal DSP SMLAD instruction when this intrinsic is used. Handwritten assembly can utilize the DSP instructions as well, but intrinsics integrate cleanly with C/C++ code.
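As a sketch of how __SMLAD is typically used, here is a Q15 dot product with a portable fallback so the routine also builds on non-DSP hosts. The __ARM_FEATURE_DSP guard and arm_acle.h header follow ACLE conventions; verify the exact names against your toolchain:

```c
#include <stdint.h>
#include <stddef.h>

#if defined(__ARM_FEATURE_DSP)
#include <arm_acle.h>                    /* ACLE DSP intrinsics */
#define SMLAD(x, y, a) __smlad((x), (y), (a))
#else
/* Portable fallback with the same packed-halfword semantics. */
static int32_t SMLAD(uint32_t x, uint32_t y, int32_t acc)
{
    acc += (int32_t)(int16_t)(x & 0xFFFF) * (int16_t)(y & 0xFFFF);
    acc += (int32_t)(int16_t)(x >> 16)    * (int16_t)(y >> 16);
    return acc;
}
#endif

/* Q15 dot product: pack two samples per operand so each SMLAD
 * performs two multiply-accumulates. */
int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {          /* two samples per step */
        uint32_t pa = (uint16_t)a[i] | ((uint32_t)(uint16_t)a[i + 1] << 16);
        uint32_t pb = (uint16_t)b[i] | ((uint32_t)(uint16_t)b[i + 1] << 16);
        acc = SMLAD(pa, pb, acc);
    }
    if (i < n)                           /* odd-length tail */
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

On a Cortex-M4 build the inner loop compiles down to a load, pack, and one SMLAD per pair of samples; on other targets the fallback keeps the code testable.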
Accessing SIMD Instructions
The Cortex-M4 does not contain dedicated SIMD execution units. However, it can benefit from some SIMD processing by using its core 32-bit registers to perform parallel 16-bit operations.
For example, two 16-bit values can be packed into a 32-bit register. An intrinsic like __SMLAD can then perform two 16×16 multiply-accumulates in a single cycle by operating on the packed data.
Compiler intrinsics like __PKHBT allow efficiently packing 16-bit data into registers:

int32_t xy = __PKHBT(x, y, 16); // pack x (bottom) and y (top) into xy
acc = __SMLAD(xy, zw, acc);
This technique can provide some SIMD-style speedups on the Cortex-M4, despite lacking dedicated vector hardware. Assembly can also be used to manually pack data, but intrinsics help integrate with C/C++ code.
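The PKHBT packing itself is simple to model on a host (assuming the CMSIS-style __PKHBT(bottom, top, shift) semantics: the result takes the bottom halfword of the first operand and the top halfword of the left-shifted second operand; the function name is illustrative):

```c
#include <stdint.h>

/* PKHBT semantics: result[15:0] = x[15:0],
 * result[31:16] = (y << shift)[31:16]. */
uint32_t pkhbt_ref(uint32_t x, uint32_t y, unsigned shift)
{
    uint32_t shifted = y << shift;
    return (x & 0x0000FFFFu) | (shifted & 0xFFFF0000u);
}
```

With shift = 16, the bottom halfwords of two variables end up packed side by side, ready for a dual 16-bit operation.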
Using Fixed Point Math vs Floating Point
The floating point unit on the Cortex-M4 is optional: Cortex-M4F parts include a single-precision FPU, while parts without it must emulate floating point entirely in software, and double precision is emulated on both. Software floating point routines are far slower than fixed point math executed directly in hardware.

For most applications, implementing math using fixed point rather than floating point is recommended. The compiler intrinsics and DSP instructions operate on Q15 and Q31 formatted fixed point data, which provides good numerical range and precision without software float overhead.
For example, the __SMLALD intrinsic multiply-accumulates two pairs of Q15 values into a 64-bit accumulator, so large dynamic range is maintained over long accumulation runs.
If floating point is absolutely required, the best approach is to minimize usage of it. Use fixed point wherever possible, and only convert to float for infrequent operations. Functionality like trigonometric and transcendental functions can often be implemented using fixed point lookup tables and polynomial approximations.
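A minimal sketch of Q15 arithmetic under these conventions: Q15 stores values in [-1, 1) scaled by 32768, a Q15 multiply needs a 32-bit intermediate product and a 15-bit renormalizing shift, and conversions from float should saturate at the format limits (helper names are illustrative):

```c
#include <stdint.h>

typedef int16_t q15_t;

/* Convert float to Q15, saturating at the format limits.
 * Intended only for infrequent conversions at the edges of the
 * fixed point pipeline. */
q15_t q15_from_float(float f)
{
    int32_t v = (int32_t)(f * 32768.0f);
    if (v >  32767) v =  32767;
    if (v < -32768) v = -32768;
    return (q15_t)v;
}

/* Q15 multiply: 16x16 -> 32-bit product, then shift right 15
 * to return to Q15 scaling. */
q15_t q15_mul(q15_t a, q15_t b)
{
    return (q15_t)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 × 0.5 in Q15 is 16384 × 16384, which renormalizes to 8192, i.e. 0.25.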
Typical Floating Point Routines
When floating point is needed, here are some common routines generally required:
- Float to fixed point conversion
- Fixed point to float conversion
- Float addition
- Float subtraction
- Float multiplication
- Float division
- Float square root
- Sine, cosine, tangent calculations
- Power functions
- Logarithms and exponentiation
These routines can be provided by software libraries such as CMSIS-DSP or newlib, or written by the developer in C or assembly. Either way, their usage should be minimized for maximum efficiency.
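As an example of replacing one of these float routines with fixed point, here is a sketch of sine via a small lookup table with linear interpolation, the approach suggested earlier. Table granularity and input convention are illustrative; the input is a Q15 fraction of a quarter turn (0..32767 maps to 0..π/2):

```c
#include <stdint.h>

/* sin(i * pi/32) in Q15, precomputed offline (values approximate). */
static const int16_t sin_tab[17] = {
        0,  3212,  6393,  9512, 12540, 15446, 18205, 20787,
    23170, 25330, 27245, 28898, 30273, 31357, 32138, 32610, 32767
};

/* Fixed point sine over a quarter turn: pick the table interval,
 * then linearly interpolate within it using the Q11 remainder. */
int16_t sin_q15_quarter(uint16_t x)      /* x in 0..32767 */
{
    unsigned idx  = x >> 11;             /* 16 intervals of 2048 */
    unsigned frac = x & 0x7FF;           /* position inside interval */
    int32_t a = sin_tab[idx];
    int32_t b = sin_tab[idx + 1];
    return (int16_t)(a + (((b - a) * (int32_t)frac) >> 11));
}
```

Full-circle sine follows by exploiting symmetry (reflect and negate per quadrant); accuracy can be tuned by table size or by adding a polynomial correction term.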
Recommendations for Math Intensive Code
For applications that are very math intensive, like digital signal processing, here are some recommendations to optimize performance on Cortex-M4:
- Use fixed point vs floating point wherever possible
- Use Q15 or Q31 formats to match DSP intrinsics
- Take advantage of parallel 16-bit operations
- Manually pack/unpack data to maximize parallelism
- Minimize memory accesses – reuse data in registers
- Unroll tight loops for pipeline efficiency
- Utilize DSP extensions if available
- Write assembly routines for key inner loop functions
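The loop-unrolling recommendation above can be sketched as follows: processing four Q15 samples per iteration reduces branch overhead and gives the pipeline independent work (the factor of 4 is illustrative; profile to choose):

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of squares of Q15 samples, unrolled by 4 with a scalar
 * remainder loop for lengths that are not a multiple of 4. */
int32_t sum_sq_q15(const int16_t *x, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)x[i]     * x[i];
        acc += (int32_t)x[i + 1] * x[i + 1];
        acc += (int32_t)x[i + 2] * x[i + 2];
        acc += (int32_t)x[i + 3] * x[i + 3];
    }
    for (; i < n; i++)                   /* remainder */
        acc += (int32_t)x[i] * x[i];
    return acc;
}
```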
Profiling tools are also valuable for identifying optimization opportunities. They can pinpoint costly functions that would benefit most from manual assembly optimization.
Conclusion
For math functions on Cortex-M4, compiler intrinsics provide a good middle ground between pure C/C++ and handwritten assembly. Intrinsics give access to optimized ARM instructions while maintaining code clarity and portability.
However, for the most performance-critical segments, handwritten assembly can extract maximum efficiency from the hardware. The Cortex-M4's DSP instruction set provides a robust set of math capabilities that can rival much larger processors when used well.
Understanding the strengths of both intrinsics and assembly allows choosing the right implementation for each module of a Cortex-M4 application. With a mix of portable intrinsics and optimized inner loop math routines, powerful DSP performance is attainable on the Cortex-M4 processor.