When to Use Intrinsics vs Assembler for Math Functions on Cortex-M4?

When programming for the Cortex-M4 processor, developers can implement mathematical functions with either compiler intrinsics or handwritten assembly language. Each approach has advantages and disadvantages that should be weighed for the particular application.

Contents

Using Compiler Intrinsics
Using Handwritten Assembly
Guidelines for Intrinsic vs Assembly Usage
Intrinsics for Common Math Operations
Multiply and Multiply-Accumulate
Saturating Arithmetic
Shift and Rotate
Divide
Utilizing the Cortex-M4 DSP Extension
Accessing SIMD Instructions
Using Fixed Point Math vs Floating Point
Typical Floating Point Routines
Recommendations for Math Intensive Code
Conclusion

Using Compiler Intrinsics

Compiler intrinsics are functions provided by the compiler (or by a vendor header such as CMSIS) that typically map to a single machine instruction. They let you stay in C/C++ while still using specific instructions the compiler would not necessarily select on its own. Here are some of the main pros of using intrinsics for math functions on Cortex-M4:

Easier to write and maintain – Intrinsics can be used like regular C/C++ functions. Code using intrinsics is easier to read and doesn’t require assembly language knowledge.
Compiler optimization – The compiler can inline intrinsics, allocate registers and schedule the surrounding code around them, whereas a block of assembly is opaque to the optimizer.
Re-targetable code – Code using intrinsics is not tied to one register layout or pipeline, so it carries over to other cores that implement the same instructions (for example, the DSP intrinsics also work on Cortex-M7, but not on Cortex-M3 or Cortex-M0).
Parameter type safety – The compiler can check intrinsic argument types, whereas assembly has no such checking.
Easier debugging – Debugging C/C++ is generally easier than debugging raw assembly code.

Here is an example of using the __SMLAD intrinsic, which maps to the SMLAD instruction. It treats x and y as two packed signed 16-bit values each, multiplies the corresponding halves, and adds both products to a 32-bit accumulator:

    acc = __SMLAD(x, y, acc);
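To make the arithmetic behind SMLAD concrete, here is a minimal portable sketch of what the instruction computes. The names smlad_ref and pack16 are hypothetical helpers invented for this illustration, not part of CMSIS or any compiler; on a real Cortex-M4 build the intrinsic itself would be used instead.

```c
#include <stdint.h>

/* Hypothetical helper: pack two signed 16-bit values into one 32-bit word,
   `lo` in bits 15:0 and `hi` in bits 31:16. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Portable reference model of SMLAD (illustration only):
   acc + x.lo * y.lo + x.hi * y.hi, with all halves treated as signed. */
static uint32_t smlad_ref(uint32_t x, uint32_t y, uint32_t acc)
{
    int32_t x_lo = (int16_t)(x & 0xFFFFu);
    int32_t x_hi = (int16_t)(x >> 16);
    int32_t y_lo = (int16_t)(y & 0xFFFFu);
    int32_t y_hi = (int16_t)(y >> 16);

    /* The hardware result wraps modulo 2^32; unsigned addition reproduces
       that wrap without signed-overflow undefined behaviour. */
    return acc + (uint32_t)(x_lo * y_lo) + (uint32_t)(x_hi * y_hi);
}
```

For instance, with x holding the pair (3, -2) and y holding (4, 5), the model returns acc + 3*4 + (-2)*5, that is acc + 2.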

Using Handwritten Assembly

Writing math functions directly in assembly language allows for very precise control over the machine code generated. Some key benefits of handwritten assembly on Cortex-M4 include:

Maximize performance – Assembly can be tuned for the microarchitecture as a whole, whereas intrinsics only select individual instructions and leave the rest to the compiler.
Minimize size – Assembly can eliminate overhead from function calls and utilize processor registers efficiently.
Utilize special instructions – Some instructions or instruction variants may not be exposed as intrinsics by a particular toolchain.
Fine tune code – More control over instruction ordering, pipelining, branching, etc. to optimize performance.
Access to registers directly – Assembly chooses registers explicitly, whereas with intrinsics the compiler performs register allocation.

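Here is an example multiply-accumulate function in ARM assembly. It uses SMMLA, which adds the most significant 32 bits of a signed 32 x 32 product to an accumulator. Following the standard calling convention, the two operands arrive in r0 and r1, the accumulator in r2, and the result is returned in r0:

    mul_accumulate:
        SMMLA r0, r0, r1, r2    ; r0 = r2 + ((r0 * r1) >> 32)
        BX    lr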
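For comparison, the effect of an SMMLA-based routine can be written in plain C. The sketch below is a reference model under the assumption that the compiler uses an arithmetic right shift for negative values (true of the mainstream ARM toolchains); the name smmla_ref is made up for this example.

```c
#include <stdint.h>

/* Portable reference model of SMMLA (illustration only):
   acc + top 32 bits of the signed 64-bit product a * b. */
static int32_t smmla_ref(int32_t a, int32_t b, int32_t acc)
{
    int64_t product = (int64_t)a * (int64_t)b;

    /* Assumes >> on a negative int64_t is an arithmetic shift. */
    int32_t high = (int32_t)(product >> 32);

    /* Add in unsigned arithmetic so the result wraps like the hardware. */
    return (int32_t)((uint32_t)acc + (uint32_t)high);
}
```

Note that when a and b are Q31 values, the high word of the product is in Q30 format, so a Q31 result needs one further left shift.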

Guidelines for Intrinsic vs Assembly Usage

Here are some guidelines on when it may be preferable to use intrinsics or handwritten assembly for math functions on Cortex-M4:

Use intrinsics by default – For most functions, intrinsics provide a good balance of code maintainability and performance.
Use assembly for performance critical code – Functions executed very frequently or in tight loops are good assembly candidates.
Use assembly for small/simple functions – Small functions like multiply-accumulate are easily optimized in assembly.
Use intrinsics for complex functions – Larger math functions with error handling and corner cases are usually better written in C/C++ with intrinsics at the hot spots.
Use assembly to minimize code size – Assembly allows removing overhead to create very compact code for size constrained applications.
Use intrinsics for portable code – If code needs to be reused across Cortex-M variants that share the same instructions, intrinsics improve portability.

When deciding between the two options, carefully consider the requirements of your specific application. Weigh the benefits of potential performance gains from assembly vs the improved maintainability from intrinsics. Prototype and benchmark both versions when possible.

Intrinsics for Common Math Operations

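Here are some commonly used ARM Cortex-M4 intrinsics for math operations and their equivalent instructions. The names below follow the uppercase CMSIS style; exact spelling and availability vary by toolchain (ACLE, for example, spells the halfword multiplies __smulbb and so on), so check your compiler's headers.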

Multiply and Multiply-Accumulate

__SMULBB – Signed multiply, bottom halfword x bottom halfword, 32-bit result (SMULBB)
__SMULBT – Signed multiply, bottom halfword x top halfword, 32-bit result (SMULBT)
__SMULTB – Signed multiply, top halfword x bottom halfword, 32-bit result (SMULTB)
__SMULTT – Signed multiply, top halfword x top halfword, 32-bit result (SMULTT)
__SMLAD – Dual signed 16 x 16 multiply, both products added to a 32-bit accumulator (SMLAD)
__SMLADX – As __SMLAD, with the halfwords of the second operand exchanged (SMLADX)
__SMLALD – Dual signed 16 x 16 multiply, both products added to a 64-bit accumulator (SMLALD)
__SMLALDX – As __SMLALD, with the halfwords of the second operand exchanged (SMLALDX)
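The B and T suffixes select which 16-bit half of each 32-bit operand takes part in the multiply. A small portable sketch of the four halfword multiplies follows; the *_ref names and the bottom16/top16 helpers are hypothetical, written only to show the selection rule.

```c
#include <stdint.h>

/* Hypothetical helpers: extract a signed 16-bit half of a 32-bit word. */
static int32_t bottom16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int32_t top16(uint32_t w)    { return (int16_t)(w >> 16); }

/* Reference models (illustration only). The first letter after SMUL picks
   the half of x, the second picks the half of y. A 16 x 16 signed product
   always fits in 32 bits, so no overflow handling is needed. */
static int32_t smulbb_ref(uint32_t x, uint32_t y) { return bottom16(x) * bottom16(y); }
static int32_t smulbt_ref(uint32_t x, uint32_t y) { return bottom16(x) * top16(y); }
static int32_t smultb_ref(uint32_t x, uint32_t y) { return top16(x) * bottom16(y); }
static int32_t smultt_ref(uint32_t x, uint32_t y) { return top16(x) * top16(y); }
```

With x = 0x0003FFFE (top 3, bottom -2) and y = 0x00050007 (top 5, bottom 7), the four models return -14, -10, 21 and 15 respectively.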

Saturating Arithmetic

__SSAT – Saturate to a chosen signed bit width (SSAT)
__USAT – Saturate to a chosen unsigned bit width (USAT)
__QADD – Saturating 32-bit add (QADD)
__QSUB – Saturating 32-bit subtract (QSUB)
__QDADD – Saturating double and add (QDADD)
__QDSUB – Saturating double and subtract (QDSUB)
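Saturation means that a result outside the representable range is clamped to the nearest limit instead of wrapping around. The following sketch models QADD and SSAT in portable C; qadd_ref and ssat_ref are illustrative names, and ssat_ref assumes a bit width between 1 and 32.

```c
#include <stdint.h>

/* Reference model of QADD (illustration only): 32-bit signed add that
   clamps to INT32_MIN / INT32_MAX instead of wrapping. */
static int32_t qadd_ref(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Reference model of SSAT (illustration only): clamp `value` to the range
   of a signed integer that is `bits` wide. Assumes 1 <= bits <= 32. */
static int32_t ssat_ref(int32_t value, unsigned bits)
{
    int64_t max = ((int64_t)1 << (bits - 1u)) - 1;
    int64_t min = -max - 1;
    if (value > max) return (int32_t)max;
    if (value < min) return (int32_t)min;
    return value;
}
```

A common use is narrowing a 32-bit intermediate result to a 16-bit sample: ssat_ref(40000, 16) yields 32767 rather than a wrapped negative value.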

Shift and Rotate

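__SXTB16 – Sign extend two bytes (bits 7:0 and 23:16) to two 16-bit halfwords (SXTB16)
__SXTH – Sign extend 16 bit to 32 bit (SXTH)
__SXTB – Sign extend 8 bit to 32 bit (SXTB)
__UXTB16 – Zero extend two bytes (bits 7:0 and 23:16) to two 16-bit halfwords (UXTB16)
__UXTH – Zero extend 16 bit to 32 bit (UXTH)
__UXTB – Zero extend 8 bit to 32 bit (UXTB)
__ROR – Rotate right (ROR)
__RRX – Rotate right with extend, shifting through the carry flag (RRX)

The single-value extends rarely need an intrinsic: an ordinary C cast such as (int16_t)x normally compiles to SXTH or SXTB.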
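The dual-byte extend and the rotate are the two operations in this group that plain C does not express directly. As an illustration only, here are portable models of SXTB16 and ROR; the names sxtb16_ref and ror_ref are invented for this sketch.

```c
#include <stdint.h>

/* Reference model of SXTB16 (illustration only): sign extend byte 0 and
   byte 2 of x into the low and high halfwords of the result. Bytes 1 and 3
   of the input are ignored. */
static uint32_t sxtb16_ref(uint32_t x)
{
    uint32_t lo = (uint16_t)(int16_t)(int8_t)(x & 0xFFu);
    uint32_t hi = (uint16_t)(int16_t)(int8_t)((x >> 16) & 0xFFu);
    return lo | (hi << 16);
}

/* Reference model of ROR (illustration only): rotate x right by n bits.
   The n == 0 case is handled separately to avoid a 32-bit shift. */
static uint32_t ror_ref(uint32_t x, unsigned n)
{
    n &= 31u;
    if (n == 0u) {
        return x;
    }
    return (x >> n) | (x << (32u - n));
}
```

Combining the two, sxtb16_ref(ror_ref(x, 8)) extracts bytes 1 and 3 instead of bytes 0 and 2, which is how four packed 8-bit samples are typically widened into two words of 16-bit samples.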

Divide

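__SDIV – Signed divide (SDIV)
__UDIV – Unsigned divide (UDIV)

The Cortex-M4 has hardware integer divide, and the compiler emits SDIV or UDIV for the ordinary C / operator, so a dedicated intrinsic is rarely needed even where a toolchain provides one.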

Utilizing the Cortex-M4 DSP Extension

The Cortex-M4 implements the ARMv7E-M architecture, so the DSP extension is always present (it is the floating point unit that is optional). These DSP instructions include:

Saturating arithmetic – QADD, QSUB, QADD8, QSUB8, QADD16, etc.
Dual 16-bit multiply with 32-bit accumulate – SMLAD, SMLADX
Dual 16-bit multiply with 64-bit accumulate – SMLALD, SMLALDX
Dual 16-bit multiply – SMUAD, SMUADX

Bit field instructions (BFC, BFI, UBFX, SBFX) and bit reversal (RBIT) are also useful in DSP code, but they belong to the base ARMv7-M instruction set and are available on Cortex-M3 as well.

The DSP extension provides significant performance improvements for DSP algorithms involving lots of 16-bit operations. To utilize these instructions from C/C++, compiler intrinsics are the best option since they map directly to the DSP instructions. For example:

    acc = __SMLAD(x, y, acc);    // dual 16×16 multiply, both products added to the 32-bit acc

The compiler will output the optimal DSP SMLAD instruction when this intrinsic is used. Handwritten assembly can utilize the DSP instructions as well, but intrinsics integrate cleanly with C/C++ code.
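A typical place where SMLAD pays off is a dot product over 16-bit samples, since each call consumes two samples from each input. The sketch below shows the loop structure using a portable stand-in for the instruction so that it runs on any host; smlad_ref, load_pair and dot_q15 are hypothetical names, and on a Cortex-M4 build the stand-in would be swapped for the real intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for SMLAD (illustration only):
   acc + x.lo * y.lo + x.hi * y.hi on packed signed 16-bit halves. */
static uint32_t smlad_ref(uint32_t x, uint32_t y, uint32_t acc)
{
    int32_t x_lo = (int16_t)(x & 0xFFFFu), x_hi = (int16_t)(x >> 16);
    int32_t y_lo = (int16_t)(y & 0xFFFFu), y_hi = (int16_t)(y >> 16);
    return acc + (uint32_t)(x_lo * y_lo) + (uint32_t)(x_hi * y_hi);
}

/* Pack two adjacent 16-bit samples into one word, p[0] in the low half. */
static uint32_t load_pair(const int16_t *p)
{
    return (uint32_t)(uint16_t)p[0] | ((uint32_t)(uint16_t)p[1] << 16);
}

/* Dot product of two 16-bit arrays, two samples per SMLAD step.
   The 32-bit accumulator wraps on overflow, as the instruction does. */
static int32_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    uint32_t acc = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc = smlad_ref(load_pair(&a[i]), load_pair(&b[i]), acc);
    }
    if (i < n) {                      /* odd-length tail: one plain MAC */
        acc += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    }
    return (int32_t)acc;
}
```

The 32-bit accumulator is the limiting factor here: for long vectors of full-scale samples, a 64-bit accumulating variant (SMLALD) avoids overflow at the cost of a second register.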

Accessing SIMD Instructions

The Cortex-M4 does not contain dedicated SIMD execution units. However, it can benefit from some SIMD processing by using its core 32-bit registers to perform parallel 16-bit operations.

For example, two 16-bit values can be packed into a 32-bit register. An intrinsic like __SMLAD can then perform two 16×16 multiply-accumulates in a single cycle by operating on the packed data.

Compiler intrinsics like __PKHBT allow efficiently packing 16-bit data into registers:

    uint32_t xy = __PKHBT(x, y, 16);    // bottom half of x in bits 15:0, bottom half of y in bits 31:16
    acc = __SMLAD(xy, zw, acc);         // zw is a second pair packed the same way

This technique can provide some SIMD-style speedups on the Cortex-M4, despite lacking dedicated vector hardware. Assembly can also be used to manually pack data, but intrinsics help integrate with C/C++ code.
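The packing step can be modelled in portable C to show exactly which bits end up where. In this sketch pkhbt_ref mirrors the PKHBT operation with a left shift on the second operand, and pack_pair is a hypothetical convenience wrapper for two signed 16-bit samples.

```c
#include <stdint.h>

/* Reference model of PKHBT with a left shift (illustration only):
   bits 15:0 come from x, bits 31:16 come from (y << shift).
   Assumes 0 <= shift <= 31. */
static uint32_t pkhbt_ref(uint32_t x, uint32_t y, unsigned shift)
{
    return (x & 0x0000FFFFu) | ((y << shift) & 0xFFFF0000u);
}

/* Hypothetical wrapper: pack two signed 16-bit samples into one word,
   `lo` in the bottom half and `hi` in the top half. */
static uint32_t pack_pair(int16_t lo, int16_t hi)
{
    return pkhbt_ref((uint32_t)(int32_t)lo, (uint32_t)(int32_t)hi, 16u);
}
```

When the samples already sit next to each other in memory as an int16_t array, a single 32-bit load of two adjacent elements achieves the same layout without any packing instruction, so explicit packing is mainly needed for values computed in separate registers.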

Using Fixed Point Math vs Floating Point

The Cortex-M4 is available with an optional single-precision floating point unit (the FPU-equipped variant is often called Cortex-M4F). On parts without the FPU, every floating point operation is emulated by software library routines, and double-precision arithmetic is emulated even when the FPU is present. These software floating point routines are very slow compared to fixed point math done in hardware.

Where floating point would have to be emulated, implementing math in fixed point is recommended. The DSP instructions and their intrinsics work well with Q15 and Q31 formatted fixed point data, which gives good numerical range and precision without the cost of software floating point.

For example, the __SMLAD intrinsic performs two Q15 x Q15 multiplies and adds both products to a 32-bit accumulator. When more headroom is needed, __SMLALD accumulates into a 64-bit integer instead, and Q31 data is typically handled with the 32 x 32 multiply instructions such as SMMLA or SMLAL.

If floating point is required on such a part, the best approach is to minimize its use. Use fixed point wherever possible, and only convert to float for infrequent operations. Functionality like trigonometric and transcendental functions can often be implemented using fixed point lookup tables and polynomial approximations.
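As a concrete instance of fixed point arithmetic, here is a minimal sketch of a Q15 multiply, where a 16-bit integer q represents the value q / 32768. The function name q15_mul is made up for this example, and the code assumes the compiler uses an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Multiply two Q15 values and return a Q15 result (illustration only).
   Q15 x Q15 gives a Q30 product; adding 2^14 rounds to nearest before the
   shift back down to Q15. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;          /* Q30 */
    int32_t rounded = (product + (1 << 14)) >> 15;      /* Q15, arithmetic shift assumed */

    /* Only -1.0 * -1.0 can exceed the Q15 range, so saturate that case. */
    if (rounded > INT16_MAX) {
        rounded = INT16_MAX;
    }
    return (int16_t)rounded;
}
```

For example, 16384 represents 0.5 in Q15, and q15_mul(16384, 16384) gives 8192, which represents 0.25.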

Typical Floating Point Routines

When floating point is needed, here are some common routines generally required:

Float to fixed point conversion
Fixed point to float conversion
Float addition
Float subtraction
Float multiplication
Float division
Float square root
Sine, cosine, tangent calculations
Power functions
Logarithms and exponentiation

These routines can come from the compiler runtime and C library (for example newlib), from CMSIS-DSP, or from developer-written assembly or C code. When they run in software rather than on an FPU, their use should be minimized for maximum efficiency.
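The two conversion routines at the top of the list are simple enough to sketch directly. The version below converts between float and Q15 with rounding and saturation; float_to_q15 and q15_to_float are illustrative names chosen for this example, and mapping NaN to zero is an arbitrary choice made here.

```c
#include <stdint.h>

/* Convert a float in roughly [-1.0, 1.0) to Q15 (illustration only).
   Out-of-range inputs saturate; NaN is mapped to 0 by assumption. */
static int16_t float_to_q15(float x)
{
    if (x != x) {                         /* NaN compares unequal to itself */
        return 0;
    }

    float scaled = x * 32768.0f;
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;   /* round half away from zero */

    if (scaled >= 32767.0f) {
        return INT16_MAX;
    }
    if (scaled <= -32768.0f) {
        return INT16_MIN;
    }
    return (int16_t)scaled;               /* truncation completes the rounding */
}

/* Convert a Q15 value back to float; exact, since 32768 is a power of two. */
static float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}
```

Keeping such conversions at the edges of the signal path, for example once per configuration change rather than once per sample, follows the advice above to minimize floating point use.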

Recommendations for Math Intensive Code

For applications that are very math intensive, like digital signal processing, here are some recommendations to optimize performance on Cortex-M4:

Use fixed point rather than software floating point wherever possible
Use Q15 or Q31 formats to match DSP intrinsics
Take advantage of parallel 16-bit operations
Manually pack/unpack data to maximize parallelism
Minimize memory accesses – reuse data in registers
Unroll tight loops for pipeline efficiency
Use the DSP instructions, which are always present on Cortex-M4 but should be checked for on other Cortex-M targets
Write assembly routines for key inner loop functions

Profiling tools are also valuable for identifying optimization opportunities. They can pinpoint costly functions that would benefit most from manual assembly optimization.

Conclusion

For math functions on Cortex-M4, compiler intrinsics provide a good middle ground between pure C/C++ and handwritten assembly. Intrinsics give access to optimized ARM instructions while maintaining code clarity and portability.

However, for the most performance critical software segments, writing assembly routines can extract maximum efficiency from the hardware. The Cortex-M4 and its DSP instructions provide a robust set of math capabilities, and carefully written assembly lets a small microcontroller handle signal processing workloads that would otherwise call for a larger processor.

Understanding the strengths of both intrinsics and assembly allows choosing the right implementation for each module of a Cortex-M4 application. With a mix of portable intrinsics and optimized inner loop math routines, powerful DSP performance is attainable on the Cortex-M4 processor.
