The ARM Cortex-M55 is the latest and most advanced processor in ARM’s Cortex-M series of embedded, IoT and MCU-focused processor cores. The Cortex-M55 builds upon the previous generation Cortex-M33 processor and brings new capabilities and performance specifically aimed at AI and ML workloads in embedded and edge devices.
Overview and Target Applications
The Cortex-M55 is designed for use in AI-enabled embedded and IoT applications where low power and high efficiency are critical. This includes areas such as:
- Industrial automation and robotics
- Automotive advanced driver assistance systems (ADAS) and autonomous vehicles
- Smart homes/buildings/cities
- Wearables and hearables
- Retail analytics and surveillance
The Cortex-M55 aims to bring new levels of machine learning capability to resource-constrained edge devices, enabling more responsive and intelligent behavior without having to rely solely on the cloud. Its specialized microarchitecture is designed to deliver up to 15x higher ML performance and up to 5x higher digital signal processing performance compared with earlier Cortex-M processors such as the Cortex-M33.
Key Features
Some of the key features and capabilities of the ARM Cortex-M55 processor include:
- Helium Technology – ARM's new M-Profile Vector Extension (MVE), a 128-bit SIMD instruction set extension designed for highly parallel workloads such as ML and DSP. It delivers significant gains on vectorized math operations.
- DSP Extension – Enhancements to the digital signal processing (DSP) instruction set for improved scalar math performance.
- M55 Memory System – Optimized system architecture with tightly coupled memory (TCM) to maximize data throughput for ML workloads.
- Enhanced MPU – Added memory protection unit (MPU) capabilities for improved software isolation and security.
- TrustZone – ARM's hardware-based security technology for Cortex-M devices, providing isolation between secure and non-secure software partitions.
- Floating Point Unit – Supports half, single and double precision floating point calculations.
- DSP+FP Architectural Pairing – Allows floating point and DSP instructions to be issued simultaneously for improved scalar math performance.
- Wake-up Interrupt Controller (WIC) – Allows the core to be clock-gated or powered down in deep sleep states while still waking quickly when an interrupt arrives.
- System Error Correction – Error correcting codes (ECC) detect and correct single-bit errors in memories and bus transfers, improving reliability.
- Enhanced Debug – Updates to the Embedded Trace Macrocell (ETM) and Micro Trace Buffer (MTB) for more effective debugging.
Microarchitecture
The Cortex-M55 implements a dual-issue superscalar pipeline alongside its vector processing capabilities. This enables simultaneous issuing of certain instruction types, including:
- Issuing a Helium (MVE) vector instruction with a scalar ALU instruction
- Issuing a DSP multiply with a scalar ALU operation
- Issuing a scalar ALU op with a scalar ALU op
- Issuing a scalar ALU with a load/store
- Issuing a DSP multiply with a load/store
The microarchitecture incorporates branch prediction and prefetching techniques to optimize instruction throughput. A 2-way instruction cache helps keep code execution fed, while a 2-way data cache enables fast data access.
The M55 can dynamically adapt between high-performance modes and lightweight, power-optimized modes depending on the workload. Multiple low-power states are available to gate clocks and remove power from unused sections of the chip.
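In firmware this typically comes down to a couple of CMSIS calls. The sketch below (assuming the generic CMSIS device header ARMCM55.h; the actual header name depends on the silicon vendor) puts the core into a WIC-backed deep sleep until the next interrupt:

```c
#include "ARMCM55.h"   // generic CMSIS device header; vendor packages use their own name

// Enter deep sleep until the next enabled interrupt. With the WIC configured,
// the core's clocks can be gated (or its power removed) while asleep, and the
// pending interrupt brings it back to active mode.
static void enter_deep_sleep(void)
{
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;   // request deep sleep rather than normal sleep
    __DSB();                             // ensure outstanding memory accesses have completed
    __WFI();                             // wait for interrupt; execution resumes here on wake-up
}
```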
Helium Technology
The headline feature of the Cortex-M55 is the new Helium vector processing technology. The key components of Helium include:
- Vector ALUs – SIMD execution units that can perform mathematical vector operations on up to 128 bits per cycle.
- Vector Register File – Holds vector operands and results during processing.
- Vector Memory Load/Store Units – Transfers vector data between main memory and the registers.
- Permutation Unit – Allows re-ordering of vector data elements for flexibility.
- Reduction Unit – Accumulates partial vector results.
This vector architecture is designed to accelerate ML workloads by enabling more parallel execution on the types of math found in neural networks and signal processing algorithms.
Helium supports 8-, 16- and 32-bit integer formats as well as 16-bit (half precision) and 32-bit (single precision) floating point formats for vectors. Special widening and narrowing instructions allow smaller integer types to be efficiently promoted to and demoted from larger element widths.
The Helium extension provides a comprehensive set of instructions for ML acceleration, including:
- Vector arithmetic (add, subtract, multiply, shift, compare, etc.)
- Vector load and store (aligned/unaligned, with optional post-increment)
- Vector reduction (sum, minimum, maximum, etc.)
- Vector shuffling/permutation
- Vector comparison and thresholding
- Vector multiplication with scalar
- Vector widening and narrowing
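To give a flavor of how these instructions are used from C, here is a minimal sketch of an int8 dot product, a core neural network primitive, written with Helium intrinsics from arm_mve.h. It assumes a compiler targeting the Cortex-M55 with MVE enabled (for example -mcpu=cortex-m55); the function name and structure are illustrative rather than taken from any ARM library.

```c
#include <arm_mve.h>
#include <stdint.h>

// Illustrative int8 dot product using Helium (MVE) intrinsics.
// Each iteration processes up to 16 lanes; tail predication (vctp8q)
// masks off lanes beyond the end of the arrays on the final pass.
int32_t dot_product_s8(const int8_t *a, const int8_t *b, int32_t n)
{
    int32_t acc = 0;
    while (n > 0) {
        mve_pred16_t p = vctp8q((uint32_t)n);   // predicate covering the remaining elements
        int8x16_t va = vldrbq_z_s8(a, p);       // predicated loads: inactive lanes read as zero
        int8x16_t vb = vldrbq_z_s8(b, p);
        acc = vmladavaq_p_s8(acc, va, vb, p);   // multiply active lanes and accumulate into acc
        a += 16;
        b += 16;
        n -= 16;
    }
    return acc;
}
```

In practice most developers will rely on the compiler's auto-vectorizer or the CMSIS libraries to generate this kind of code, reserving hand-written intrinsics for performance-critical inner loops.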
DSP and Floating Point
Alongside Helium, the Cortex-M55 maintains and improves ARM’s DSP and floating point capabilities for Cortex-M class processors. This allows non-vector math to also benefit from greater parallelism and throughput.
The DSP extension provides single-cycle 16×16 and 32×32 bit multiplies with 32-bit and 64-bit accumulation respectively. The proven Thumb DSP instructions introduced in ARMv7E-M are carried forward, along with the enhancements added in the Cortex-M33 generation.
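As a rough illustration of what this buys, the sketch below (assuming the CMSIS core headers, which expose the __SMLAD intrinsic on DSP-capable cores; the surrounding function is hypothetical) performs a Q15 multiply-accumulate where each instruction does two 16×16 multiplies plus the accumulation:

```c
#include <stdint.h>
#include "cmsis_compiler.h"   // CMSIS core header exposing __SMLAD on DSP-capable cores

// Illustrative dual 16x16 multiply-accumulate. Each 32-bit word packs two
// consecutive int16 samples, so one __SMLAD performs two multiplies plus the
// 32-bit accumulation. Assumes n is even and the buffers are 4-byte aligned.
int32_t mac_q15(const int16_t *x, const int16_t *y, int32_t n, int32_t acc)
{
    const uint32_t *px = (const uint32_t *)x;
    const uint32_t *py = (const uint32_t *)y;
    for (int32_t i = 0; i < n / 2; i++) {
        acc = (int32_t)__SMLAD(px[i], py[i], (uint32_t)acc);  // acc += x[2i]*y[2i] + x[2i+1]*y[2i+1]
    }
    return acc;
}
```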
The floating point unit (FPU) has been upgraded to allow simultaneous issue and execution of scalar DSP and floating point instructions – a unique feature called DSP+FP architectural pairing. This boosts performance for algorithms using both types of math.
The FPU supports half precision (16-bit), single precision (32-bit) and double precision (64-bit) scalar operations. Vector floating point is provided by the Helium extension rather than by the Advanced SIMD (NEON) unit found in Cortex-A processors, which Cortex-M cores do not include.
Performance
ARM claims the Cortex-M55 delivers up to 15x better AI performance than previous Cortex-M class processors like the Cortex-M33 and M4. Exact gains will depend on workload, but on key ML benchmark tests it has shown:
- 5-15x higher recurrent neural network performance
- 10-15x faster large convolutional neural networks
- 6-8x faster small convolutional neural networks
- 5-20x better deep neural network performance
The dual-issue pipeline enables up to 30% better scalar processing performance compared to the Cortex-M33. ARM also describes the M55 as its most energy-efficient Cortex-M design to date, targeting the best performance within a given power budget.
Overall, the advances in the Cortex-M55 promise to enable more localized ML inferencing directly on low power embedded devices rather than relying on the cloud.
Development Tools and Software
To support developers working with the Cortex-M55, ARM offers an enhanced CMSIS-NN software library for neural network workloads. This provides over 100 kernel functions to maximize Helium utilization.
The Helium-optimized CMSIS-DSP library complements this with signal processing building blocks such as filters, transforms and matrix math, so complete ML and DSP pipelines can run efficiently on Cortex-M processors.
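Application code calls these libraries the same way on the M55 as on earlier Cortex-M parts; a Helium-optimized build simply executes faster vector code paths underneath. A small usage sketch (the wrapper function is hypothetical, but arm_dot_prod_f32 is a standard CMSIS-DSP routine):

```c
#include "arm_math.h"   // CMSIS-DSP; build the library with MVE support for Helium acceleration

// Compute the dot product of two float32 vectors using CMSIS-DSP.
// On a Cortex-M55, the Helium-optimized build of arm_dot_prod_f32
// processes multiple vector lanes per loop iteration.
float32_t example_dot(const float32_t *a, const float32_t *b, uint32_t len)
{
    float32_t result;
    arm_dot_prod_f32(a, b, len, &result);
    return result;
}
```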
Development tools include compiler support in ARM Compiler 6, Keil MDK toolkit and IAR Embedded Workbench. Debug and trace capabilities are enabled through ARM CoreSight debug and trace IP.
To simplify software development across the Cortex-M series, the M55 remains software-compatible with previous generations: code written for the Cortex-M33 or Cortex-M4 runs without modification, which helps accelerate migration to the new architecture.
Licensing and Availability
The Cortex-M55 processor is available for licensing now directly from ARM. Lead partners and early access customers include NXP Semiconductors, STMicroelectronics and Silicon Labs.
NXP plans to use the M55 in a range of automotive, industrial and IoT applications. STMicroelectronics will combine Helium with their AI accelerator hardware for smart embedded systems. Silicon Labs is developing solutions for battery-powered IoT endpoints.
Expect ARM Cortex-M55 processor IP to start appearing in commercial chips and products over the next year or so as new edge AI capabilities get deployed across a diverse range of markets.