ARM Neon is a Single Instruction Multiple Data (SIMD) architecture extension for ARM Cortex-A series processors. It provides vector processing capabilities that allow operations to be performed on multiple data elements concurrently, greatly improving performance for multimedia and signal processing workloads.
Neon intrinsics allow developers to directly access the Neon vector instructions from C and C++ code. Intrinsics are functions that map to specific machine instructions. Using Neon intrinsics can provide performance improvements over regular C/C++ code by executing vector operations instead of scalar operations.
Overview of Neon Intrinsics
The main advantages of using Neon intrinsics are:
- Improved performance through parallelism and pipelining
- Access to advanced SIMD instructions not available in regular C/C++
- Ability to operate on multiple data elements in a single instruction
- Compatible with C/C++ code for ease of integration
- Portable across different ARM architectures that support Neon
ARM provides an extensive set of Neon intrinsics that cover various vector data types like floats, integers, and polynomials. The intrinsics provide operations like arithmetic, logical, load/store, comparison, conversion, permutation, etc.
Neon supports 128-bit wide vector registers that hold multiple smaller data elements, for example four 32-bit floats, four 32-bit ints, or sixteen 8-bit ints. This allows operations to work on multiple elements concurrently.
Using Neon Intrinsics in Code
To use Neon intrinsics in C/C++ code, the arm_neon.h header file needs to be included. This contains declarations for all the Neon intrinsic functions.
The first step is to define Neon vector variables to hold the vector operands. Vector types such as float32x4_t, int32x4_t, uint8x16_t etc. can be used depending on the element type and width required. The type name encodes the layout: float32x4_t is four 32-bit floats packed into a 128-bit register.
float32x4_t va, vb, vc; //128-bit vectors holding 4 floats each
Neon intrinsic functions can then be called to perform operations on these vector variables:
vc = vaddq_f32(va, vb); // add va and vb vectors
This will execute a SIMD add operation on 4 floats concurrently by using the Neon vector unit. Various other intrinsics like vmulq_f32 (multiply), vsubq_f32 (subtract), vmaxq_f32 (maximum) etc. are available.
Vector data needs to be loaded from memory into the Neon registers before operations can be performed. The vld1q_f32 intrinsic can load a 128-bit float32x4_t vector from a memory address:
float32_t in_data[4]; // source data in memory
float32x4_t va = vld1q_f32(in_data); // load 4 floats from memory into a vector register
Results can be stored back to memory using the vst1q_f32 intrinsic:
float32_t out_data[4]; // destination buffer in memory
vst1q_f32(out_data, vc); // store vector register back to memory
By using Neon intrinsics to vectorize key loops and algorithms, significant speedups can be obtained compared to scalar code.
The Neon intrinsics can be categorized into several groups based on functionality:
Load and Store Intrinsics
Used to move data between Neon vector registers and memory. E.g. vld1q_f32, vst1q_f32
Arithmetic Intrinsics
Perform arithmetic operations like add, subtract, multiply, divide on vector registers. E.g. vaddq_f32, vmulq_f32
Comparison Intrinsics
Compare vector registers element-wise based on less-than, greater-than etc. E.g. vcltq_f32, vcgtq_f32
Logical Intrinsics
Perform bitwise logical operations like AND, OR, XOR, NOT on vector registers. E.g. vandq_u32, vorrq_u32
Permutation Intrinsics
Ways to rearrange and swap vector elements within registers. E.g. vrev64q_f32, vzipq_f32
Conversion Intrinsics
Convert between different vector data types like int to float. E.g. vcvtq_f32_s32, vcvtq_s32_f32
Table Lookup Intrinsics
Lookup vector elements in a table or map. E.g. vtbl1_s8, vqtbl1q_s8
Special Value Intrinsics
Ways to build vectors from scalar values, such as broadcasting a constant into every lane or setting a single lane. E.g. vdupq_n_f32, vsetq_lane_f32
There are many more Neon intrinsics available to provide optimized implementations of common functions.
Integration with ARM Compiler
The ARM Compiler toolchain supports auto-vectorization features that can convert scalar C/C++ loops into Neon vector instructions automatically. This allows compiling code to make use of Neon without directly writing intrinsics.
For example, the following simple loop:
for (int i=0; i<LEN; i++) c[i] = a[i] + b[i];
Can be auto-vectorized by enabling -O3 optimization in the ARM Compiler. The loop will be converted to use Neon instructions, so the load, add, and store operations work on vector registers.
The compiler auto-vectorization may not work in all cases, especially if the loop contains dependencies or function calls. Intrinsics still allow finer control over vectorization.
Here are some guidelines on using Neon intrinsics effectively:
- Identify key loops or code sections that dominate runtime – these are good candidates for vectorization.
- Check dependencies between loop iterations and resolve if needed.
- Load data using Neon load intrinsics like vld1q_f32 before the loop.
- Use separate vector variables for each intermediate result.
- Minimize extraneous operations inside the loop.
- Use store intrinsics like vst1q_f32 to save results after the loop.
- May need to peel loops or handle partial vector loads at ends.
- Browse documentation and examples for the wide range of available intrinsics.
With appropriately vectorized code, Neon intrinsics can often provide 2-4X performance speedups compared to scalar code. Measure performance before and after to validate gains.
Neon intrinsics provide direct access to the SIMD capabilities of ARM processors, allowing operations on multiple data elements concurrently. Intrinsics map C/C++ functions to underlying Neon instructions. Significant performance gains can be obtained by vectorizing key loops, especially in multimedia and signal processing. ARM also provides auto-vectorization in its compiler toolchain. Following best practices for utilizing intrinsics can help unlock the full benefits of Neon.