ARM Neon Intrinsics

Eileen David
Last updated: September 13, 2023 2:33 am

ARM Neon is a Single Instruction Multiple Data (SIMD) architecture extension for ARM Cortex-A series processors. It provides vector processing capabilities that allow operations to be performed on multiple data elements concurrently, greatly improving performance for multimedia and signal processing workloads.

Contents
  • Overview of Neon Intrinsics
  • Using Neon Intrinsics in Code
  • Intrinsic Categories
  • Load and Store Intrinsics
  • Arithmetic Intrinsics
  • Comparison Intrinsics
  • Logical Intrinsics
  • Permutation Intrinsics
  • Conversion Intrinsics
  • Table Lookup Intrinsics
  • Special Value Intrinsics
  • Integration with ARM Compiler
  • Usage Guidelines
  • Conclusion

Neon intrinsics allow developers to directly access the Neon vector instructions from C and C++ code. Intrinsics are functions that map to specific machine instructions. Using Neon intrinsics can provide performance improvements over regular C/C++ code by executing vector operations instead of scalar operations.

Overview of Neon Intrinsics

The main advantages of using Neon intrinsics are:

  • Improved performance through parallelism and pipelining
  • Access to advanced SIMD instructions not available in regular C/C++
  • Ability to operate on multiple data elements in a single instruction
  • Compatible with C/C++ code for ease of integration
  • Portable across different ARM architectures that support Neon

ARM provides an extensive set of Neon intrinsics that cover various vector data types like floats, integers, and polynomials. The intrinsics provide operations like arithmetic, logical, load/store, comparison, conversion, permutation, etc.

Neon supports 128-bit wide vector registers that hold multiple smaller data elements, such as four 32-bit integers, four 32-bit floats, or sixteen 8-bit values. This allows a single instruction to operate on all elements concurrently.

Using Neon Intrinsics in Code

To use Neon intrinsics in C/C++ code, the arm_neon.h header file needs to be included. This contains declarations for all the Neon intrinsic functions.


#include <arm_neon.h> 

The first step is to define Neon vector variables to hold the vector operands. Various vector types like float32x4_t, int32x4_t, uint8x16_t etc. can be used depending on the element type and width required.


float32x4_t va, vb, vc; //128-bit vectors holding 4 floats each

Neon intrinsic functions can then be called to perform operations on these vector variables:


vc = vaddq_f32(va, vb); // add va and vb vectors

This will execute a SIMD add operation on 4 floats concurrently by using the Neon vector unit. Various other intrinsics like vmulq_f32 (multiply), vsubq_f32 (subtract), vmaxq_f32 (maximum) etc. are available.

Vector data needs to be loaded from memory into the Neon registers before operations can be performed. The vld1q_f32 intrinsic can load a 128-bit float32x4_t vector from a memory address:

  
float32_t in_data[4];
//load data from memory into vector register
float32x4_t va = vld1q_f32(in_data); 

Results can be stored back to memory using the vst1q_f32 intrinsic:


float32_t out_data[4];
//store vector register back to memory
vst1q_f32(out_data, vc);

By using Neon intrinsics to vectorize key loops and algorithms, significant speedups can be obtained compared to scalar code.

Intrinsic Categories

The Neon intrinsics can be categorized into several groups based on functionality:

Load and Store Intrinsics

Used to move data between Neon vector registers and memory. E.g. vld1q_f32, vst1q_f32

Arithmetic Intrinsics

Perform arithmetic operations like add, subtract, multiply, and multiply-accumulate on vector registers (a full vector divide is only available on AArch64). E.g. vaddq_f32, vmulq_f32

Comparison Intrinsics

Compare vector registers element-wise (less-than, greater-than, equal, etc.), producing per-lane all-ones or all-zeros bit masks. E.g. vcltq_f32, vcgtq_f32

Logical Intrinsics

Perform logical operations like AND, OR, XOR, NOT on vector registers. E.g. vandq_u32, vorrq_u32

Permutation Intrinsics

Rearrange, reverse, and interleave vector elements within registers. E.g. vrev64q_f32, vzipq_f32

Conversion Intrinsics

Convert between different vector data types like int to float. E.g. vcvtq_f32_s32, vcvtq_s32_f32

Table Lookup Intrinsics

Use vector elements as indices into a table of bytes held in registers. E.g. vtbl1_s8, vqtbl1q_s8

Special Value Intrinsics

Broadcast a scalar value into all lanes of a vector, or set individual lanes. E.g. vdupq_n_f32, vsetq_lane_f32

There are many more Neon intrinsics available to provide optimized implementations of common functions.

Integration with ARM Compiler

The ARM Compiler toolchain supports auto-vectorization features that can convert scalar C/C++ loops into Neon instructions automatically. This allows compiled code to make use of Neon without directly writing intrinsics.

For example, the following simple loop:


for (int i = 0; i < LEN; i++)
    c[i] = a[i] + b[i];

Can be auto-vectorized by enabling -O3 optimization in the ARM Compiler. The loop will be converted to Neon instructions, and the load, add, and store operations will work on vector registers.

The compiler auto-vectorization may not work in all cases, especially if the loop contains dependencies or function calls. Intrinsics still allow finer control over vectorization.
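One common obstacle is pointer aliasing: without extra information, the compiler must assume the output array may overlap the inputs and conservatively skip vectorization. A hedged sketch of using the C99 restrict qualifier to rule that out:

```c
// restrict promises the compiler that a, b, and c do not overlap,
// which removes the aliasing hazard and lets -O3 vectorize the loop.
void add_restrict(const float *restrict a, const float *restrict b,
                  float *restrict c, int len)
{
    for (int i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}
```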

Usage Guidelines

Here are some guidelines on using Neon intrinsics effectively:

  • Identify key loops or code sections that dominate runtime – these are good candidates for vectorization.
  • Check dependencies between loop iterations and resolve if needed.
  • Load data using Neon load intrinsics like vld1q_f32 before the loop.
  • Use separate vector variables for each intermediate result.
  • Minimize extraneous operations inside the loop.
  • Use store intrinsics like vst1q_f32 to save results after the loop.
  • Peel loops or handle partial vectors when the length is not a multiple of the vector width.
  • Browse documentation and examples for the wide range of available intrinsics.

With appropriately vectorized code, Neon intrinsics can often provide 2-4X performance speedups compared to scalar code. Measure performance before and after to validate gains.

Conclusion

Neon intrinsics provide direct access to the SIMD capabilities of ARM processors, allowing operations on multiple data elements concurrently. Intrinsics map C/C++ functions to underlying Neon instructions. Significant performance gains can be obtained by vectorizing key loops, especially in multimedia and signal processing. ARM also provides auto-vectorization in its compiler toolchain. Following best practices for utilizing intrinsics can help unlock the full benefits of Neon.
