ARM Neon is a Single Instruction Multiple Data (SIMD) architecture extension for ARM Cortex-A series processors. It provides vector processing capabilities that allow operations to be performed on multiple data elements concurrently, greatly improving performance for multimedia and signal processing workloads.
Neon intrinsics allow developers to directly access the Neon vector instructions from C and C++ code. Intrinsics are functions that map to specific machine instructions. Using Neon intrinsics can provide performance improvements over regular C/C++ code by executing vector operations instead of scalar operations.
Overview of Neon Intrinsics
The main advantages of using Neon intrinsics are:
- Improved performance through parallelism and pipelining
- Access to advanced SIMD instructions not available in regular C/C++
- Ability to operate on multiple data elements in a single instruction
- Compatible with C/C++ code for ease of integration
- Portable across different ARM architectures that support Neon
ARM provides an extensive set of Neon intrinsics that cover various vector data types like floats, integers, and polynomials. The intrinsics provide operations like arithmetic, logical, load/store, comparison, conversion, permutation, etc.
Neon provides 128-bit wide vector registers, each of which can hold multiple smaller data elements, such as four 32-bit integers or four 32-bit floats. This allows operations to work on multiple elements concurrently.
Using Neon Intrinsics in Code
To use Neon intrinsics in C/C++ code, the arm_neon.h header file needs to be included. This contains declarations for all the Neon intrinsic functions.
#include <arm_neon.h>
The first step is to define Neon vector variables to hold the vector operands. Various vector types like float32x4_t, int32x4_t, uint8x16_t etc. can be used depending on the element type and width required.
float32x4_t va, vb, vc; //128-bit vectors holding 4 floats each
Neon intrinsic functions can then be called to perform operations on these vector variables:
vc = vaddq_f32(va, vb); // add va and vb vectors
This will execute a SIMD add operation on 4 floats concurrently by using the Neon vector unit. Various other intrinsics like vmulq_f32 (multiply), vsubq_f32 (subtract), vmaxq_f32 (maximum) etc. are available.
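Each of these follows the same call pattern as the addition above, for example:

vc = vmulq_f32(va, vb); // multiply corresponding lanes of va and vb
vc = vmaxq_f32(va, vb); // take the per-lane maximum of va and vb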
Vector data needs to be loaded from memory into the Neon registers before operations can be performed. The vld1q_f32 intrinsic can load a 128-bit float32x4_t vector from a memory address:
float32_t in_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
//load data from memory into vector register
float32x4_t va = vld1q_f32(in_data);
Results can be stored back to memory using the vst1q_f32 intrinsic:
float32_t out_data[4];
//store vector register back to memory
vst1q_f32(out_data, vc);
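Putting these pieces together, a minimal complete program looks like the following (the sample data values and the main wrapper are illustrative):

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    float32_t a_data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float32_t b_data[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float32_t out_data[4];

    float32x4_t va = vld1q_f32(a_data);  // load 4 floats into a vector register
    float32x4_t vb = vld1q_f32(b_data);
    float32x4_t vc = vaddq_f32(va, vb);  // add all 4 lanes in one instruction
    vst1q_f32(out_data, vc);             // store the result back to memory

    for (int i = 0; i < 4; i++)
        printf("%f\n", out_data[i]);     // prints 11, 22, 33, 44
    return 0;
}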
By using Neon intrinsics to vectorize key loops and algorithms, significant speedups can be obtained compared to scalar code.
Intrinsic Categories
The Neon intrinsics can be categorized into several groups based on functionality:
Load and Store Intrinsics
Used to move data between Neon vector registers and memory. E.g. vld1q_f32, vst1q_f32
Arithmetic Intrinsics
Perform arithmetic operations like add, subtract, multiply, and (on AArch64) divide on vector registers. E.g. vaddq_f32, vmulq_f32
Comparison Intrinsics
Compare vector registers element-wise (less-than, greater-than, etc.). Each lane of the result is set to all ones if the comparison is true and all zeros otherwise. E.g. vcltq_f32, vcgtq_f32
Logical Intrinsics
Perform logical operations like AND, OR, XOR, NOT on vector registers. E.g. vandq_u32, vorrq_u32
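As a sketch of how the comparison and logical categories combine: comparisons produce per-lane masks in an unsigned vector type, which the logical intrinsics can then operate on:

// va, vb, vc are float32x4_t vectors as before
uint32x4_t lt_mask = vcltq_f32(va, vb);            // 0xFFFFFFFF in lanes where va < vb
uint32x4_t gt_mask = vcgtq_f32(va, vc);            // 0xFFFFFFFF in lanes where va > vc
uint32x4_t both    = vandq_u32(lt_mask, gt_mask);  // lanes where both conditions hold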
Permutation Intrinsics
Ways to rearrange and swap vector elements within registers. E.g. vrev64q_f32, vzipq_f32
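For instance, vzipq_f32 interleaves two vectors and returns them as a float32x4x2_t pair, while vrev64q_f32 reverses the elements within each 64-bit half of a vector:

float32x4x2_t zipped = vzipq_f32(va, vb); // zipped.val[0] = {a0,b0,a1,b1}, zipped.val[1] = {a2,b2,a3,b3}
float32x4_t swapped = vrev64q_f32(va);    // {a1,a0,a3,a2}: pairs swapped within each 64-bit half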
Conversion Intrinsics
Convert between different vector data types, such as int to float. E.g. vcvtq_f32_s32, vcvtq_s32_f32
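A small sketch; note that the float-to-integer direction truncates toward zero:

int32x4_t vi = vdupq_n_s32(3);       // {3, 3, 3, 3}
float32x4_t vf = vcvtq_f32_s32(vi);  // {3.0, 3.0, 3.0, 3.0}
int32x4_t vt = vcvtq_s32_f32(vf);    // back to {3, 3, 3, 3}; any fractional part is truncated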
Table Lookup Intrinsics
Look up vector elements by index from a table held in vector registers. E.g. vtbl1_s8, vqtbl1q_s8
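For example, vtbl1_s8 selects bytes from a 64-bit table vector by index (out-of-range indices produce zero); one illustrative use is reversing the byte order:

int8_t bytes[8] = {10, 20, 30, 40, 50, 60, 70, 80};
int8_t indices[8] = {7, 6, 5, 4, 3, 2, 1, 0};  // reversed lane order
int8x8_t table = vld1_s8(bytes);
int8x8_t idx = vld1_s8(indices);
int8x8_t reversed = vtbl1_s8(table, idx);      // {80, 70, 60, 50, 40, 30, 20, 10}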
Special Value Intrinsics
Broadcast a scalar value to all lanes or set individual lanes of a vector. E.g. vdupq_n_f32, vsetq_lane_f32
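For example:

float32x4_t zeros = vdupq_n_f32(0.0f);  // {0, 0, 0, 0}: broadcast a scalar to every lane
float32x4_t v = vdupq_n_f32(1.0f);      // {1, 1, 1, 1}
v = vsetq_lane_f32(5.0f, v, 2);         // set lane 2 only: {1, 1, 5, 1}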
There are many more Neon intrinsics available to provide optimized implementations of common functions.
Integration with ARM Compiler
The ARM Compiler toolchain supports auto-vectorization, which can convert scalar C/C++ loops into Neon vector instructions automatically. This allows compiled code to make use of Neon without directly writing intrinsics.
For example, the following simple loop:
for (int i=0; i<LEN; i++)
c[i] = a[i] + b[i];
This loop can be auto-vectorized by enabling -O3 optimization in the ARM Compiler: the loop is converted to Neon vector instructions, and the load, add, and store operations then work on vector registers.
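As a hedged sketch (the exact flags depend on the toolchain; the -mcpu value here is illustrative), a GCC- or Clang-style driver can compile with vectorization enabled:

gcc -O3 -mcpu=cortex-a72 -S vec_add.c

Inspecting the generated vec_add.s for Neon instructions confirms whether the loop was vectorized.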
Compiler auto-vectorization may not work in all cases, especially if the loop contains cross-iteration dependencies or function calls. Intrinsics still allow finer control over vectorization.
Usage Guidelines
Here are some guidelines on using Neon intrinsics effectively:
- Identify key loops or code sections that dominate runtime – these are good candidates for vectorization.
- Check dependencies between loop iterations and resolve if needed.
- Load data using Neon load intrinsics like vld1q_f32 before the loop.
- Use separate vector variables for each intermediate result.
- Minimize extraneous operations inside the loop.
- Use store intrinsics like vst1q_f32 to save results after the loop.
- Peel loops or fall back to scalar code for leftover elements when the iteration count is not a multiple of the vector width (see the sketch after this list).
- Browse documentation and examples for the wide range of available intrinsics.
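As a concrete illustration of several of these guidelines, here is a sketch of the earlier addition loop vectorized by hand, with a scalar tail for the leftover elements (the function name and signature are illustrative):

#include <arm_neon.h>

void add_arrays(const float32_t *a, const float32_t *b, float32_t *c, int len) {
    int i = 0;
    // main vectorized loop: process 4 floats per iteration
    for (; i <= len - 4; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        vst1q_f32(&c[i], vaddq_f32(va, vb));
    }
    // scalar tail: handle the remaining len % 4 elements
    for (; i < len; i++)
        c[i] = a[i] + b[i];
}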
With appropriately vectorized code, Neon intrinsics can often provide 2-4X performance speedups compared to scalar code. Measure performance before and after to validate gains.
Conclusion
Neon intrinsics provide direct access to the SIMD capabilities of ARM processors, allowing operations on multiple data elements concurrently. Intrinsics map C/C++ functions to underlying Neon instructions. Significant performance gains can be obtained by vectorizing key loops, especially in multimedia and signal processing. ARM also provides auto-vectorization in its compiler toolchain. Following best practices for utilizing intrinsics can help unlock the full benefits of Neon.