ARM-based processors have long included SIMD instructions to improve performance for multimedia and signal processing workloads. Two key SIMD instruction sets used in ARM processors are NEON and MVE (the M-Profile Vector Extension, marketed as Helium). While both provide SIMD capabilities, there are some key differences between the two.
Overview of NEON
NEON is a SIMD instruction set that has been included in ARM Cortex-A series processors since the Cortex-A8, the first Armv7-A core, announced in 2005. It operates on 64-bit and 128-bit vectors, allowing operations on multiple data elements concurrently. NEON supports common data types including integers, floating-point numbers, and polynomials.
Some key capabilities and features of NEON include:
- 64-bit and 128-bit wide SIMD processing
- Support for 8-, 16-, 32-, and 64-bit integer and single-precision (32-bit) floating-point data types, with half- and double-precision floating point added in later architecture versions
- Specialized instructions for audio and video processing, 3D graphics, speech recognition, and image processing
- Fused multiply-add instructions for better performance and precision
- Saturating arithmetic allowing overflow values to be clamped to max/min values
NEON is implemented as a coprocessor-style extension with its own register file. In 32-bit state this consists of 32 64-bit doubleword (D) registers that alias 16 128-bit quadword (Q) registers; in AArch64 the file widens to 32 128-bit registers. NEON instructions operate on the vectors held in these registers.
Overview of MVE
MVE (the M-Profile Vector Extension, branded Helium) is a more recent SIMD instruction set introduced in the Armv8.1-M architecture for microcontrollers. It was first implemented in the Cortex-M55, with later cores such as the Cortex-M85 also supporting it.
MVE focuses on enhancing performance for machine learning workloads on microcontrollers, providing specialized instructions for common vector and matrix operations used in ML. Some key features of MVE include:
- Vector instructions for 8-, 16-, and 32-bit integer operations
- Multiply-accumulate-across-vector instructions (e.g. VMLADAV) for dot products and other linear algebra primitives
- Vector by scalar operations to multiply vectors by a scalar value
- Saturating arithmetic like NEON
- Permute instructions to rearrange vector elements
- Interleaving and gather/scatter load/store instructions to optimize memory access patterns
Rather than introducing a new register file, MVE reuses the existing floating-point extension register file, viewing it as eight 128-bit vector registers (Q0-Q7) that MVE instructions operate on.
Key Differences Between NEON and MVE
While both NEON and MVE provide SIMD capabilities to ARM processors, there are some notable differences between the two architectures:
Target Workloads
NEON is designed as a general purpose SIMD engine for accelerating a wide range of media, signal processing, and computational workloads. MVE is more specialized, focused on accelerating machine learning workloads on microcontrollers.
Supported Data Types
NEON supports a wider range of integer and floating-point data types, including 8-, 16-, 32-, and 64-bit integers and 32-bit single-precision floats. MVE focuses on lower-precision types — 8-, 16-, and 32-bit integers plus half- and single-precision floats — which suit the quantized models commonly used in embedded machine learning.
Vector Length
Both extensions use 128-bit vector registers. NEON instructions can additionally operate on 64-bit doubleword vectors, while MVE vectors are fixed at 128 bits, processed internally as 32-bit "beats".
Hardware Implementation
NEON implementations rely on specialized 64-bit and 128-bit vector registers and execution units. MVE is designed so that a 128-bit operation can be split into four 32-bit beats and executed over as many as four cycles, allowing the narrow datapaths of small microcontrollers to support it without a wide SIMD unit.
Target Processors
NEON is designed for high performance application processors like the Cortex-A series. MVE targets lower power microcontrollers like Cortex-M series chips.
Instruction Set
While both support common SIMD instructions, NEON has a much larger and richer set of instructions optimized for media processing. MVE instructions are more focused on machine learning primitives.
Matrix Operations
MVE accelerates matrix kernels through its dot-product style multiply-accumulate instructions combined with interleaving and gather/scatter loads and stores. NEON likewise builds matrix operations from general SIMD instructions, with dedicated matrix-multiply instructions only arriving in later A-profile architecture extensions.
MVE Variants and Evolution
MVE is specified in two configurations rather than as a single monolithic extension:
- MVE-I, covering integer and fixed-point vector operations
- MVE-F, which adds half- and single-precision floating-point vector operations
On the applications-processor side, NEON's successor for scalable vector workloads is SVE/SVE2, which supports implementation-defined vector lengths from 128 up to 2048 bits.
Use Cases
Given their different strengths, NEON and MVE tend to be used in different domains:
NEON Use Cases
- Image, audio and video processing
- Speech recognition
- Computer vision
- Scientific computing
- 3D graphics
- Gaming
- High performance computing
MVE Use Cases
- TinyML applications like keyword spotting
- Anomaly detection
- Predictive maintenance
- Industrial IoT
- Autonomous robots
- Smart home devices
Programming and Compiler Support
Both NEON and MVE are supported by Arm's compilers, such as Arm Compiler 6 (armclang), as well as GCC and Clang. These toolchains provide auto-vectorization capabilities to automatically vectorize code using NEON/MVE, as well as intrinsics to allow explicit SIMD programming.
For NEON, additional support is provided by:
- GCC’s ARM NEON intrinsics
- Clang’s NEON vector types and intrinsics
- C++ SIMD libraries such as libsimdpp
For MVE, the Arm C Language Extension (ACLE) provides C intrinsics that map to MVE instructions. ACLE is supported by Arm Compiler 6 and the GNU Arm Embedded Toolchain.
Performance Comparison
Some key performance differences between NEON and MVE include:
- NEON delivers higher peak compute performance: application-class cores pair its 128-bit vectors with wider execution units, deeper pipelines, and far higher clock speeds than microcontrollers.
- However, MVE provides better energy efficiency and performance per watt suited for power constrained devices.
- MVE's interleaving and gather/scatter load/store instructions help make the most of the limited memory bandwidth available in microcontroller systems.
- For machine learning workloads, Arm cites large uplifts for MVE-enabled cores; for example, up to 15x ML and 5x DSP performance for the Cortex-M55 compared with earlier Cortex-M cores.
So while NEON has higher absolute performance, MVE is optimized to accelerate machine learning workloads on microcontrollers efficiently.
Conclusion
In summary, the key differences between NEON and MVE are:
- NEON is a general purpose SIMD engine while MVE is optimized for ML workloads.
- NEON supports a wider range of data types while MVE focuses on low precision integers.
- NEON has a much larger instruction set while MVE instructions target ML primitives.
- NEON is designed for application processors while MVE targets microcontrollers.
So NEON and MVE complement each other, with NEON handling high-performance media workloads and MVE accelerating machine learning and DSP on embedded devices. Both continue to evolve alongside newer extensions such as SVE2, driving improved performance and efficiency across a diverse range of ARM-based systems.