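The short answer is yes, with one qualification: the Arm Cortex-M4 processor core supports an FPU (Floating Point Unit), but the FPU is an optional part of the design that the chip vendor chooses to include. Parts that include it are often referred to as Cortex-M4F. Where present, the FPU is a major upgrade over previous Cortex-M cores, allowing much higher performance on floating point arithmetic operations.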
Overview of Arm Cortex-M4
The Arm Cortex-M4 is a 32-bit processor core designed for microcontroller applications. It is part of Arm’s Cortex-M series of cores, which are optimized for embedded and IoT use cases where power efficiency, performance per MHz, and small silicon area are critical.
The Cortex-M4 builds on the previous M3 core. Its main features are:
- Optional floating point unit (FPU) – hardware support for single-precision floating point arithmetic
- Digital signal processing (DSP) instructions – integer SIMD and saturating instructions to accelerate DSP algorithms
- Optional memory protection unit (MPU), as on the Cortex-M3 – improves software reliability and security
- Low power features – reduce energy consumption in active and sleep modes
In terms of performance, Arm quotes 1.25 DMIPS/MHz for the Cortex-M4. This is the same figure quoted for the older Cortex-M3, so general-purpose integer performance is essentially unchanged. The gains come from the FPU and DSP instructions, which give a major increase in performance on floating point and signal processing tasks.
Cortex-M4 Floating Point Unit
The key feature addition in Cortex-M4 is the optional single-precision floating point unit (FPU). The FPU can perform basic arithmetic operations (addition, subtraction, multiplication, multiply-accumulate, division, and square root) on 32-bit floating point values conforming to the IEEE 754 standard.
Having hardware support for floating point math is very significant for microcontroller applications. Previously, floating point workloads had to run entirely in software library routines, which are much slower and less power efficient than a hardware FPU. The usual workaround was fixed-point arithmetic, which is more complex to program.
Some examples of how the Cortex-M4 FPU accelerates floating point workloads:
- Digital signal processing – filtering, transforms, audio processing etc.
- Sensor fusion algorithms – combining data from multiple sensors to track position, motion etc.
- Control systems – PID controllers, feedback control loops
- Mathematical modeling – simulations, numerical analysis
- Computer vision – image processing, recognition algorithms
Essentially any application domain that involves processing sensor data or doing mathematical modeling computations will benefit greatly from having the dedicated floating point hardware in Cortex-M4.
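As an illustration of the kind of inner loop these workloads share, here is a minimal sketch of one FIR filter step in single precision. The function name fir_step and its argument layout are assumptions made for this example, not part of any Arm library. On a build with hardware floating point enabled, a compiler can map the loop body to a single-precision multiply-accumulate instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR filter step: returns the sum of coeffs[k] * history[k].
 * history[0] is the newest sample, history[ntaps - 1] the oldest. */
float fir_step(const float *coeffs, const float *history, size_t ntaps)
{
    assert(coeffs != NULL && history != NULL);

    float acc = 0.0f;                   /* float literal keeps the math in single precision */
    for (size_t k = 0; k < ntaps; ++k) {
        acc += coeffs[k] * history[k];  /* one multiply-accumulate per tap */
    }
    return acc;
}
```

Calling fir_step with four coefficients of 0.25f produces a four-sample moving average of the history buffer.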
FPU Architecture
The Cortex-M4 FPU implements the FPv4-SP floating point extension to the Armv7E-M architecture. It provides the full set of single-precision instructions defined by that extension.
Key details on the FPU architecture include:
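- 32 x 32-bit registers for floating point data (S0-S31), also addressable as 16 x 64-bit registers (D0-D15) for loads and stores
- Single-cycle add, subtract, and multiply; multiply-accumulate takes about 3 cycles, and divide and square root about 14 cycles each
- Hardware support for NaN (Not a Number) and denormalized values, with optional flush-to-zero and default NaN modes
- Configurable rounding modes: nearest, +inf, -inf, zero
- Hardware divide and square root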
- Uses IEEE 754 format for floating point data
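The rounding mode is selected through the RMode field in bits [23:22] of the FPSCR status and control register. The following is a minimal sketch of how that field could be updated. The helper fpscr_with_rounding and the RMODE_* names are hypothetical and introduced only for illustration. The function works on a plain value so that it can also be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* FPSCR.RMode occupies bits [23:22]. */
#define FPSCR_RMODE_SHIFT 22u
#define FPSCR_RMODE_MASK  (0x3u << FPSCR_RMODE_SHIFT)

/* Hypothetical names for the four RMode encodings. */
enum fpscr_rmode {
    RMODE_NEAREST   = 0u,  /* 0b00: round to nearest (reset default) */
    RMODE_PLUS_INF  = 1u,  /* 0b01: round towards plus infinity */
    RMODE_MINUS_INF = 2u,  /* 0b10: round towards minus infinity */
    RMODE_ZERO      = 3u   /* 0b11: round towards zero */
};

/* Returns fpscr with only the RMode field replaced by mode. */
uint32_t fpscr_with_rounding(uint32_t fpscr, enum fpscr_rmode mode)
{
    assert((unsigned)mode <= 3u);
    return (fpscr & ~FPSCR_RMODE_MASK) | ((uint32_t)mode << FPSCR_RMODE_SHIFT);
}
```

On a target using CMSIS-Core, the current register value would typically be read with __get_FPSCR() and the result written back with __set_FPSCR().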
The FPU is tightly coupled to the processor pipeline rather than attached as a separate coprocessor. The core still issues one instruction at a time, but long-latency operations such as divide and square root can complete in the background while subsequent integer instructions execute. This allows floating point workloads to be accelerated with minimal impact on total application performance.
One limitation is that the Cortex-M4 FPU only supports single-precision (32-bit) floating point. There is no hardware double-precision (64-bit) support, so arithmetic on double values falls back to software library routines. This is a reasonable tradeoff to limit silicon area and complexity for a microcontroller application.
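A practical consequence in C is that an unsuffixed constant such as 0.5 has type double, which promotes the whole expression to double precision. The sketch below contrasts the two spellings. The function names are made up for this example, and whether a given compiler actually emits a software double-precision call for the first form depends on its optimizer. GCC's -Wdouble-promotion warning can flag such expressions.

```c
#include <assert.h>

/* 0.5 is a double constant, so x is promoted and the multiply is
 * specified in double precision, which a single-precision-only FPU
 * cannot execute in hardware. */
float halve_via_double(float x)
{
    return (float)(x * 0.5);
}

/* 0.5f keeps the whole expression in single precision. */
float halve_single(float x)
{
    return x * 0.5f;
}

/* The expression types make the difference visible at compile time (C11). */
static_assert(sizeof(1.0f * 0.5) == sizeof(double), "float * double yields double");
static_assert(sizeof(1.0f * 0.5f) == sizeof(float), "float * float yields float");
```

Both functions return the same value for inputs like 3.0f. The difference is in the type of the intermediate arithmetic.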
Programming the FPU
To utilize the FPU in Cortex-M4, developers have a few ways to reach the floating point instructions:
- Ordinary C/C++ code – when the compiler is set to target hardware floating point (for example GCC with -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 and -mfloat-abi=hard or softfp), arithmetic on float values compiles directly to FPU instructions. Compiler intrinsics and built-ins cover operations that have no C operator, without writing raw assembly.
- DSP instructions – the Cortex-M4 also adds DSP instructions that speed up common signal processing operations. These operate on integer and fixed-point data, so they complement the FPU rather than provide access to it.
- Assembly language – writing assembly code using the FPv4-SP floating point instruction set. More complex but allows full control.
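Here is a simple C code snippet showing how the FPU could be used to add two floating point values. No special intrinsic is needed: in a hardware floating point build, the compiler can emit the FPU add instruction (VADD.F32) for the ordinary + operator.

    float a = 1.5f;
    float b = 2.3f;
    float sum;
    sum = a + b; /* maps to an FPU add instruction in a hardware floating point build */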
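Whether floating point C code like this ends up as FPU instructions or as calls into a software library depends on the build configuration. As a hedged sketch, the Arm C Language Extensions define the macro __ARM_FP, in which bit 2 (0x4) indicates hardware single-precision support. The function name below is hypothetical.

```c
/* Returns 1 when the compiler reports hardware single-precision floating
 * point through the ACLE macro __ARM_FP, and 0 otherwise (including on
 * non-Arm hosts, where the macro is not defined). */
int targets_hw_single_precision(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x4)
    return 1;
#else
    return 0;
#endif
}
```

A check like this can be useful as a build-time sanity test in projects that share code between Cortex-M3 and Cortex-M4 targets.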
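In a hardware floating point build, the compiler handles allocating values to the floating point registers and preserving application state across function calls. The FPU itself is disabled at reset, however, so startup code must enable it by granting access to coprocessors CP10 and CP11 in the CPACR register before the first floating point instruction executes. Vendor-supplied startup code often does this already.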
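The enable step amounts to setting bits [23:20] of CPACR, which is located at address 0xE000ED88 (SCB->CPACR in CMSIS-Core). The following is a minimal sketch. The function name fpu_grant_access is an assumption for this example, and the register is passed in as a pointer so the logic can also be exercised off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CP10 and CP11 access fields occupy CPACR bits [23:20]; 0b11 in each
 * two-bit field means full access. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

/* Grants full access to the FPU coprocessors, leaving other bits unchanged.
 * On a real Cortex-M4, cpacr would be (volatile uint32_t *)0xE000ED88. */
void fpu_grant_access(volatile uint32_t *cpacr)
{
    assert(cpacr != NULL);
    *cpacr |= CPACR_CP10_CP11_FULL;
    /* On hardware, DSB and ISB barriers should follow this write so that
     * it takes effect before the first floating point instruction. */
}
```

Executing a floating point instruction while access is still denied raises a fault, which is a common symptom of a missing enable step.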
Performance and Benchmarking
Compared to Cortex-M3, Arm claims the Cortex-M4 FPU enables up to 10x higher performance on floating point workloads. Realized performance gain depends heavily on the application and mix of integer vs floating point instructions.
For example, one benchmark test showed Cortex-M4 completing a floating point matrix multiplication benchmark 10 times faster than Cortex-M3. This benchmark is an ideal case to demonstrate the potential of the FPU.
Some examples comparing Cortex-M3 to M4 performance:
- FIR filter – 4x faster on M4
- FFT transform – 6x faster on M4
- Matrix multiplication – 10x faster on M4
- JPEG encoding – 2x faster on M4
Actual performance improvement on real applications depends on many factors, but in general the FPU can provide large speedups on floating point code compared with software floating point on the Cortex-M3.
Use Cases
Here are some examples of product categories that use Cortex-M4 and benefit from the integrated FPU:
- Wearables – health tracking and fitness wearables performing sensor fusion and data analysis.
- Industrial – motor control systems, power conversion, robotics applications.
- Automotive – digital signal processing, sensor processing, analytics in advanced driver assist systems (ADAS).
- Audio – digital effects, audio codec hardware, synthesizers.
- Scientific – instrumentation, data acquisition, signal analysis.
The FPU has allowed Cortex-M4 microcontrollers to be used for increasingly sophisticated embedded applications involving some signal processing or math modeling while maintaining low power consumption and cost.
Conclusion
The addition of a hardware floating point unit to the Cortex-M4 core represents a major evolutionary step for Arm’s Cortex-M series. The FPU unlocks much higher performance levels for floating point arithmetic and provides enormous benefits across a wide range of microcontroller application domains.
Using the Cortex-M4 FPU in place of software floating point libraries, or of integer-only alternatives like the Cortex-M3, leads to faster and lower-power implementations of floating point algorithms for computationally intensive workloads.
Overall, the FPU makes the Cortex-M4 an ideal choice for products that require floating point math while conforming to tight microcontroller constraints on cost, power, and silicon area.