Helium vector instructions are a set of SIMD instructions introduced with the Armv8.1-M architecture and first implemented in the Arm Cortex-M55. They provide significant performance improvements for signal processing, machine learning, and digital signal control applications. Each 128-bit vector holds sixteen 8-bit, eight 16-bit, or four 32-bit elements, so a single instruction operates on up to 16 values. The Cortex-M55 processes a vector in two 64-bit beats, which works out to as many as eight 8-bit or four 16-bit operations per clock cycle. This allows developers to achieve much higher performance for workloads involving vector math, matrix operations, FFTs, convolutions, and other computational tasks on Cortex-M series microcontrollers.
Overview of Helium Vector Extension
The Arm Helium technology, formally the M-profile Vector Extension (MVE), is a vector extension for the Cortex-M processor family. It provides a set of 128-bit wide vector registers and associated SIMD instructions that operate on these registers.
Key features of Helium include:
- Eight 128-bit vector registers, each holding 16 x 8-bit, 8 x 16-bit, or 4 x 32-bit integer lanes
- Lane-wise arithmetic and logical operations such as addition, subtraction, multiplication, and shifting
- Vector load/store instructions for efficient data transfer
- Dot product instructions for ML workloads
- Intrinsic functions for C programmers
Helium is implemented as an optional extension within the Armv8.1-M architecture. Earlier cores such as the Cortex-M33 implement Armv8-M and do not support it. The first microcontroller core to support Helium is the Cortex-M55.
Compared to the earlier DSP extension, which packs two 16-bit or four 8-bit values into a 32-bit general-purpose register, Helium offers much higher parallelism (up to sixteen 8-bit lanes per instruction) and a dedicated file of eight 128-bit vector registers. This dramatically boosts performance on key workloads.
Helium Vector Registers
The Helium extension provides eight 128-bit wide vector registers named Q0-Q7, which share storage with the floating-point register file. Each register can be treated as:
- Sixteen 8-bit lanes
- Eight 16-bit lanes
- Four 32-bit lanes
The lane width is not a property of the register. Each instruction selects it through its data type suffix, so the same register can be read as 8-bit lanes by one instruction and as 16-bit lanes by the next. The registers provide the operands for the Helium SIMD instructions.
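The lane views can be pictured with a small host-side model. This is only a sketch: `qreg_model_t`, `lanes_for_width`, and `add_u16_lanes` are hypothetical names made up for illustration and are not part of `arm_mve.h`. It shows that one 16-byte register can be read at different lane widths, and that a lane-wise add does not carry from one lane into the next.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register (not an arm_mve.h type).
   The three members overlay the same 16 bytes. */
typedef union {
    uint8_t  b[16];  /* sixteen 8-bit lanes */
    uint16_t h[8];   /* eight 16-bit lanes  */
    uint32_t w[4];   /* four 32-bit lanes   */
} qreg_model_t;

/* Number of lanes in a 128-bit register for a lane width of 8, 16 or 32 bits. */
static unsigned lanes_for_width(unsigned element_bits)
{
    return 128u / element_bits;
}

/* Lane-wise 16-bit add: each lane wraps on its own, with no carry
   propagating into the neighbouring lane. */
static qreg_model_t add_u16_lanes(qreg_model_t a, qreg_model_t b)
{
    qreg_model_t r;
    for (unsigned i = 0; i < 8; i++) {
        r.h[i] = (uint16_t)(a.h[i] + b.h[i]);
    }
    return r;
}
```

Adding 0xFFFF and 1 in lane 0 wraps that lane to 0 and leaves lane 1 unaffected, which is the difference between a vector add and a single 128-bit add.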
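In addition, there is a single-bit sticky saturation flag (the QC bit in FPSCR) that is set when a saturating arithmetic instruction clamps a result. The flag records saturation rather than controlling it: saturation is chosen per instruction, for example VQADD instead of VADD, and clamps results to the data type's range instead of letting them overflow and wrap around.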
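The difference between wrapping and saturating arithmetic is easiest to see on a single lane. The sketch below is a scalar model, not Helium code: `add_wrap_s8`, `add_sat_s8`, and `sat_flag_model` are hypothetical names, and the flag is modelled here as a sticky boolean that is set on clamping and never cleared by the add itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a sticky saturation flag: set on clamping,
   cleared only if the caller resets it. */
static bool sat_flag_model = false;

/* Wrapping add on one signed 8-bit lane (modulo 256).
   The final conversion to int8_t assumes the usual two's complement
   behaviour, which is implementation-defined before C23. */
static int8_t add_wrap_s8(int8_t a, int8_t b)
{
    return (int8_t)(uint8_t)((uint8_t)a + (uint8_t)b);
}

/* Saturating add on one signed 8-bit lane: clamps to [-128, 127]
   and records that clamping happened. */
static int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) { sat_flag_model = true; return INT8_MAX; }
    if (sum < INT8_MIN) { sat_flag_model = true; return INT8_MIN; }
    return (int8_t)sum;
}
```

With both inputs at 100, the wrapping version yields -56 while the saturating version yields 127 and sets the flag. For audio or image data the clamped value is usually the less damaging error.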
Helium Instruction Set
Helium provides a range of vector instructions that operate on the vector registers. Key instruction categories include:
- Arithmetic – Add, subtract, multiply, absolute difference etc. Supports saturation option.
- Logical – Bitwise AND, OR, XOR, NOT etc.
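- Shift – Logical and arithmetic shift left/right by an immediate or register amount
- Dot Product – Multiply-accumulate across the lanes of two vector registers (the VMLADAV family)
- Load/Store – Load or store a vector register from contiguous memory
- Gather/Scatter – Load or store lanes at per-lane offsets, which covers table lookups in memory
- Interleave/De-interleave – VLD2/VLD4 and VST2/VST4 rearrange data during load and store, for example when splitting interleaved channels or transposing a matrix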
By combining these instructions, most common vector and matrix operations can be implemented efficiently. The intrinsics provide higher level access to these instructions from C code.
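As an illustration of combining categories, a 16-bit dot product needs only vector loads and a multiply-accumulate across lanes. The following is a minimal sketch, not a tuned kernel. The function name `dot_s16` is hypothetical. The Helium path uses the `vld1q_s16` and `vmladavaq_s16` intrinsics from `arm_mve.h` and is compiled only when the compiler reports integer MVE support; otherwise the scalar loop does all the work, so the same source also builds on a host machine.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#endif

/* Dot product of two int16_t arrays of length n (hypothetical helper).
   The 32-bit accumulator is assumed to be large enough for the inputs;
   long or large-valued vectors would need a 64-bit accumulator. */
int32_t dot_s16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
    /* Helium path: eight 16-bit lanes per iteration. */
    for (; i + 8 <= n; i += 8) {
        int16x8_t va = vld1q_s16(&a[i]);
        int16x8_t vb = vld1q_s16(&b[i]);
        acc = vmladavaq_s16(acc, va, vb);  /* acc += sum of va[k] * vb[k] */
    }
#endif

    /* Scalar path: handles the remainder, or everything without Helium. */
    for (; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

Keeping the scalar loop as both fallback and tail handler is one simple design choice among several; it keeps the example short and makes the result easy to check against the vector path.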
Benefits of Helium
Here are some of the major benefits provided by the Helium vector extension to Cortex-M processors:
- Higher Performance – Up to 16 lane operations per vector instruction improves throughput for parallel workloads
- Power Efficiency – Better utilization of core resources reduces energy per operation
- Easy to use – Intrinsic functions integrate seamlessly with C/C++ code
- Small Code Size – Compact ISA implementation suitable for MCUs
- Scalable – The beat-based execution model lets each M-profile implementation choose how much of a vector it processes per cycle
In particular, Helium enables acceleration of:
- Digital signal processing algorithms (filtering, FFTs etc.)
- Computer vision and image processing
- Machine learning inference using neural networks
- Sensor fusion in IoT and edge devices
- Control algorithms and predictive maintenance
- Any workload involving vector/matrix math
This allows Cortex-M cores to achieve much higher throughput on these workloads while maintaining low cost and power efficiency.
Using Helium in C Code
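To use the Helium instructions in C/C++ code, Arm provides a set of intrinsic functions, declared in arm_mve.h, that map directly to the Helium ISA. Intrinsic names end in q plus a type suffix, such as vaddq_u8. Some examples are:
- vhaddq – Halving add (adds corresponding lanes and halves each result; the sum across lanes is vaddvq)
- vaddq – Vector add
- vld1q – Vector load
- vst1q – Vector store
- vld2q – De-interleaving load of two vectors
- vmaxq – Element-wise vector maximum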
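Here is a simple example for vector addition:
    #include <arm_mve.h>
    /* Adds sixteen 8-bit elements; a, b and res must each point to at least 16 bytes. */
    void add_vectors(uint8_t *res, const uint8_t *a, const uint8_t *b)
    {
        uint8x16_t va = vld1q_u8(a);       /* load 16 lanes from a */
        uint8x16_t vb = vld1q_u8(b);       /* load 16 lanes from b */
        uint8x16_t vc = vaddq_u8(va, vb);  /* lane-wise add, wraps on overflow */
        vst1q_u8(res, vc);                 /* store 16 lanes to res */
    }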
This loads sixteen 8-bit integers from each input, adds them lane by lane, and stores the sixteen results. The compiler handles the details of mapping each intrinsic to the Helium ISA, including register allocation.
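Real buffers are rarely exactly one vector long. One way to extend the example to an arbitrary length is sketched below, under a few assumptions: `add_u8_arrays` is a hypothetical name, the Helium path is compiled only when the compiler defines `__ARM_FEATURE_MVE` with the integer bit set, and leftover elements are handled by a plain scalar loop rather than by any Helium-specific mechanism.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
#include <arm_mve.h>
#endif

/* Element-wise res[i] = a[i] + b[i] for n bytes, wrapping modulo 256
   (hypothetical helper). */
void add_u8_arrays(uint8_t *res, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
    /* Helium path: process full blocks of sixteen 8-bit lanes. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(&a[i]);
        uint8x16_t vb = vld1q_u8(&b[i]);
        vst1q_u8(&res[i], vaddq_u8(va, vb));
    }
#endif

    /* Scalar path: the remaining n % 16 elements, or all of them
       when the code is built without Helium. */
    for (; i < n; i++) {
        res[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Because both paths wrap modulo 256, the output is identical whether or not the vector loop is compiled in, which makes it straightforward to validate on a host before running on target.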
Arm also provides Helium-optimized implementations of common functions like matrix multiply, FIR filters, softmax etc. in its CMSIS-DSP and CMSIS-NN libraries. These can be used to implement complex algorithms quickly without writing intrinsics directly.
Processor Support
The Helium vector extension first shipped in the Cortex-M55 processor, announced in 2020. Cortex-M55 is the first implementation of the Armv8.1-M architecture.
Cortex-M55 is a new core design rather than a Cortex-M33 with an accelerator attached. Its Helium unit is part of the processor itself and provides significant speedups for workloads optimized with Helium intrinsics. For heavier machine learning workloads, the core can also be paired with a separate NPU such as the Arm Ethos-U55.
Helium has since been adopted further across the M-profile roadmap: the later Cortex-M85 and Cortex-M52 cores also support it.
Helium is enabled through the Armv8.1-M architecture. So any Armv8.1-M compatible core can implement Helium extensions in the future.
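Since Helium is optional, portable code usually checks at compile time whether it is available. A minimal sketch, assuming a compiler that follows the Arm C Language Extensions, where `__ARM_FEATURE_MVE` has bit 0 set for integer MVE and bit 1 set for floating-point MVE. The function name `helium_build_level` is hypothetical.

```c
/* Reports how the current translation unit was built (hypothetical helper):
   0 = no Helium, 1 = integer Helium only, 2 = integer and floating-point. */
int helium_build_level(void)
{
#if defined(__ARM_FEATURE_MVE) && ((__ARM_FEATURE_MVE & 3) == 3)
    return 2;
#elif defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 1)
    return 1;
#else
    return 0;
#endif
}
```

The macro reflects the compiler's target options, not the silicon the binary eventually runs on. With GCC or Clang it is typically set by targeting a Helium-capable core, for example with `-mcpu=cortex-m55`.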
Conclusion
In summary, Helium vector instructions provide SIMD parallel processing capabilities to Cortex-M series processors, unlocking much higher performance and efficiency. The combination of compact ISA, easy programming through intrinsics, and scalability across the M-profile family enables new applications in signal processing, computer vision, control systems and machine learning.
As Helium gets adopted in more microcontrollers, it will become a key differentiating feature for the Cortex-M processors compared to competing architectures. The ability to accelerate advanced algorithms involving vector math while maintaining deterministic real-time performance allows Arm to target a wide range of embedded applications with Cortex-M series cores.