The ARM Floating Point Unit (FPU) provides hardware support for calculations using floating point numbers. The FPU instruction set allows ARM processors to perform mathematical operations efficiently on single precision and double precision floating point values.
Overview of ARM FPU
The ARM FPU is an optional extension to the ARM instruction set architecture. It provides hardware acceleration for floating point arithmetic, which improves performance compared to doing the computations in software. The FPU operates concurrently with the ARM integer processing pipeline, allowing floating point and integer instructions to execute simultaneously.
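There have been several generations of ARM FPU designs over the years. Every generation supports single precision (32-bit) arithmetic, and most implementations also support double precision (64-bit). Later versions mainly added registers and instructions:
- VFP (Vector Floating Point), later known as VFPv1 – Original design, now obsolete
- VFPv2 – Single and double precision, 16 doubleword registers
- VFPv3 – Up to 32 doubleword registers, constant-load and fixed-point conversion instructions, ARMv7 architecture
- VFPv4 – Adds fused multiply-add and half precision conversion, ARMv7 architecture
- FPv5 – Floating point extension for ARMv7E-M and ARMv8-M microcontroller cores; ARMv8-A cores instead combine floating point with Advanced SIMD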
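The FPU registers are separate from the ARM general purpose registers. VFPv2 and the D16 variants of later versions provide 16 double precision registers (d0-d15); the D32 variants of VFPv3 and later provide 32 (d0-d31). The 32 single precision registers (s0-s31) are not additional storage: they are views of d0-d15, with register d(n) holding s(2n) in its low half and s(2n+1) in its high half. Writing a double precision register therefore overwrites the two single precision registers that share its storage.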
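The overlap between single and double precision registers can be illustrated with a small software model. This is only a sketch: the type `vfp_bank_model` and the helpers `vfp_read_d` and `vfp_write_d` are hypothetical names invented for illustration, not part of any ARM library, and the model covers only the 16 doubleword registers that have single precision views.

```c
#include <stdint.h>

/* Hypothetical software model of the register bank (illustration only):
   32 single precision words s0-s31 that back the doubleword views d0-d15. */
typedef struct {
    uint32_t s[32];
} vfp_bank_model;

/* Read doubleword register d<n> (n in 0..15): s<2n> supplies the low
   32 bits and s<2n+1> supplies the high 32 bits. */
static uint64_t vfp_read_d(const vfp_bank_model *bank, unsigned n)
{
    return ((uint64_t)bank->s[2 * n + 1] << 32) | bank->s[2 * n];
}

/* Write doubleword register d<n>; this overwrites s<2n> and s<2n+1>. */
static void vfp_write_d(vfp_bank_model *bank, unsigned n, uint64_t value)
{
    bank->s[2 * n]     = (uint32_t)(value & 0xFFFFFFFFu);
    bank->s[2 * n + 1] = (uint32_t)(value >> 32);
}
```

In this model, writing d1 changes s2 and s3, and a later write to s2 is visible the next time d1 is read.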
FPU Data Types
The ARM FPU supports the following floating point data types:
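- Single precision (32-bit) – Uses the IEEE 754 single precision format. Occupies one single precision register (s0-s31).
- Double precision (64-bit) – Uses the IEEE 754 double precision format. Occupies one double precision register, which is the same storage as two consecutive single precision registers.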
Floating point values are stored in the FPU registers in the IEEE 754 binary layout, which is composed of three fields:
- Sign bit – 1 bit determining positive or negative value.
- Exponent – 8 bits for single precision (bias 127), 11 bits for double precision (bias 1023).
- Mantissa – 23 stored fraction bits for single precision, 52 for double. Normal values have an implicit leading 1 that is not stored.
This layout lets a fixed number of bits cover a wide range of magnitudes, at the cost of precision that shrinks in absolute terms as values grow.
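To make the field layout concrete, the following sketch splits a value into its sign, exponent and fraction fields in C. The struct and function names (`f32_fields`, `f32_decode`, `f64_fields`, `f64_decode`) are illustrative assumptions, and the code assumes that `float` and `double` are the IEEE 754 32-bit and 64-bit formats, which holds on ARM targets with an FPU.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 stored fraction bits */
} f32_fields;

typedef struct {
    uint64_t sign;      /* 1 bit */
    uint64_t exponent;  /* 11 bits, stored with a bias of 1023 */
    uint64_t fraction;  /* 52 stored fraction bits */
} f64_fields;

/* Split a single precision value into its three bit fields. */
static f32_fields f32_decode(float value)
{
    uint32_t bits;
    f32_fields f;
    memcpy(&bits, &value, sizeof bits);   /* copy the raw bit pattern */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

/* Split a double precision value into its three bit fields. */
static f64_fields f64_decode(double value)
{
    uint64_t bits;
    f64_fields f;
    memcpy(&bits, &value, sizeof bits);
    f.sign     = bits >> 63;
    f.exponent = (bits >> 52) & 0x7FFu;
    f.fraction = bits & 0xFFFFFFFFFFFFFULL;
    return f;
}
```

For example, -6.5 is -1.625 x 2^2, so its single precision encoding has sign 1, stored exponent 129 (2 + 127) and fraction 0x500000.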
FPU Instructions
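The ARM FPU instructions can be grouped into several categories. The names below are the original (pre-UAL) VFP mnemonics, shown without the S or D suffix that selects single or double precision (for example FADDS and FADDD). In the unified assembler language the same operations are written with a V prefix and a type suffix, such as VADD.F32 and VADD.F64.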
Data Transfer
Move data between memory, FPU registers and ARM registers:
- FLDMX – Load multiple FPU registers from memory
- FSTMX – Store multiple FPU registers to memory
- FMSR – Move ARM register to single precision FPU register
- FMRS – Move single precision FPU register to ARM register
Arithmetic
Basic arithmetic operations:
- FADD – Floating point add
- FSUB – Floating point subtract
- FMUL – Floating point multiply
- FDIV – Floating point divide
- FSQRT – Floating point square root
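In C these operations are normally reached through ordinary arithmetic operators rather than written by hand. The sketch below uses only add, subtract, multiply and divide; when built for a hard-float target (for example with GCC options such as `-mfpu=vfpv3 -mfloat-abi=hard`), a compiler would typically translate each operator into the corresponding FPU instruction. The function names are illustrative, and `newton_sqrt` is a software stand-in that shows what the square root instruction replaces, not how the hardware computes it.

```c
/* Squared distance between two points: uses subtract, multiply and add. */
static float distance_squared(float ax, float ay, float bx, float by)
{
    float dx = bx - ax;
    float dy = by - ay;
    return dx * dx + dy * dy;
}

/* Approximate sqrt(x) with Newton's method, using only add, multiply and
   divide. Returns 0 for non-positive input. The iteration count is
   generous enough for finite single precision inputs. */
static float newton_sqrt(float x)
{
    float guess;
    int i;

    if (x <= 0.0f)
        return 0.0f;

    guess = (x > 1.0f) ? x : 1.0f;   /* start at or above the true root */
    for (i = 0; i < 100; ++i)
        guess = 0.5f * (guess + x / guess);
    return guess;
}
```

A single hardware square root instruction replaces the whole loop in `newton_sqrt`, which is one reason a dedicated instruction is worth having.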
Comparison
Compare floating point values:
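- FCMP – Floating point compare; raises Invalid Operation only for a signaling NaN operand
- FCMPE – Floating point compare that raises Invalid Operation for any NaN operand
- FCMPZ – Floating point compare with zero
- FCMPEZ – Floating point compare with zero that raises Invalid Operation for any NaN operand
These set the N, Z, C and V flags in the FPU status register (FPSCR). The flags must be copied to the ARM status flags (FMSTAT in the pre-UAL syntax) before conditional instructions can test them.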
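A comparison has four possible outcomes, because a NaN operand makes the result unordered. The sketch below models the flag pattern a VFP compare leaves in the N, Z, C and V bits for each outcome. The function name `fcmp_nzcv` is hypothetical, and the code is a portable model of the result rather than an interface to the hardware.

```c
/* Model the N, Z, C, V flags produced by a floating point compare,
   returned as a 4-bit value with N in bit 3 and V in bit 0:
     a == b     -> 0110 (0x6)
     a <  b     -> 1000 (0x8)
     a >  b     -> 0010 (0x2)
     unordered  -> 0011 (0x3), when either operand is NaN */
static unsigned fcmp_nzcv(float a, float b)
{
    if (a != a || b != b)   /* only NaN compares unequal to itself */
        return 0x3u;
    if (a == b)
        return 0x6u;
    if (a < b)
        return 0x8u;
    return 0x2u;
}
```

Because the unordered case sets C and V but clears Z, conditional code that follows a compare needs to choose condition codes that treat NaN operands the way the algorithm expects.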
Conversion
Convert between data types:
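- FTOSI – Floating point to signed integer, using the rounding mode selected in the FPSCR
- FTOUI – Floating point to unsigned integer, using the rounding mode selected in the FPSCR
- FSITO – Signed integer to floating point
- FUITO – Unsigned integer to floating point
- FTOSIZ – Floating point to signed integer, rounding toward zero
- FTOUIZ – Floating point to unsigned integer, rounding toward zero
As with the other instructions, an S or D suffix selects the floating point precision, so FTOSID is the double precision form of FTOSI.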
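The effect of these conversions can be seen from C, where a cast from floating point to integer always rounds toward zero and a cast from integer to floating point rounds to the nearest representable value. The helper names below are illustrative assumptions. Inputs outside the range of `int32_t` are not handled, since such a cast is undefined in C.

```c
#include <stdint.h>

/* Floating point to signed integer. A C cast discards the fractional
   part, which is rounding toward zero. The input must fit in int32_t. */
static int32_t f32_to_s32_trunc(float x)
{
    return (int32_t)x;
}

/* Signed integer to floating point. Single precision has a 24-bit
   significand, so integers above 2^24 may not be represented exactly. */
static float s32_to_f32(int32_t x)
{
    return (float)x;
}
```

For example, 2.7 and -2.7 convert to 2 and -2, and the integer 16777217 (2^24 + 1) converts to 16777216.0 because the nearest single precision values are 16777216 and 16777218.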
Status and Control
Manage FPU status flags and control modes:
- FMXR – Move ARM register to an FPU system register such as the FPSCR
- FMRX – Move an FPU system register such as the FPSCR to an ARM register
- FMSTAT – Copy the FPSCR comparison flags to the ARM status flags
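The main register these instructions access is the FPSCR, which holds the comparison flags, the mode controls and the cumulative exception flags. The sketch below decodes a few of its fields from a 32-bit value. The macro and function names are assumptions made for illustration, and the code operates on a plain integer rather than reading the real register, which would require target-specific inline assembly.

```c
#include <stdint.h>

/* Selected FPSCR bit positions. */
#define FPSCR_NZCV_SHIFT   28u          /* N, Z, C, V in bits 31..28 */
#define FPSCR_DN           (1u << 25)   /* default NaN mode */
#define FPSCR_FZ           (1u << 24)   /* flush-to-zero mode */
#define FPSCR_RMODE_SHIFT  22u          /* rounding mode in bits 23..22 */
#define FPSCR_IOC          (1u << 0)    /* cumulative invalid operation */
#define FPSCR_DZC          (1u << 1)    /* cumulative division by zero */

/* Extract the comparison flags as a 4-bit value (N is bit 3). */
static unsigned fpscr_nzcv(uint32_t fpscr)
{
    return (fpscr >> FPSCR_NZCV_SHIFT) & 0xFu;
}

/* Extract the rounding mode: 0 nearest, 1 toward plus infinity,
   2 toward minus infinity, 3 toward zero. */
static unsigned fpscr_rmode(uint32_t fpscr)
{
    return (fpscr >> FPSCR_RMODE_SHIFT) & 0x3u;
}

/* Return a copy of the value with flush-to-zero and default NaN set. */
static uint32_t fpscr_enable_fast_modes(uint32_t fpscr)
{
    return fpscr | FPSCR_FZ | FPSCR_DN;
}
```

On real hardware the usual pattern is read, modify, write: move the FPSCR to an ARM register, change only the bits of interest, and move the result back.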
Programming with the FPU
Here are some key aspects to keep in mind when coding with the ARM FPU:
- The FPU can operate in parallel with the integer pipeline for optimal performance.
- Plan data transfers to minimize stalls – load data before it is needed.
- Maximize throughput by scheduling FPU and integer instructions together.
- Pay attention to data dependencies and pipeline stalls.
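- Remember that FPU comparisons set flags in the FPSCR, which must be copied to the ARM status flags before conditional code can test them.
- Consider flush-to-zero and default NaN modes when strict IEEE 754 behavior is not required; they avoid slow handling of denormal and NaN values at the cost of full compliance.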
- Allocate variables to appropriate precision to balance performance and precision.
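The precision trade-off in the last point can be shown with a short experiment: adding a small increment to a larger value has no effect once the increment falls below the precision of the format. The function names are illustrative. The `volatile` qualifier is there to force the sum to be rounded to the stated type, in case a compiler evaluates the expression at higher precision.

```c
/* Return 1 if adding small to big leaves big unchanged in single
   precision, meaning the increment was lost to rounding. */
static int f32_absorbs(float big, float small)
{
    volatile float sum = big + small;
    return sum == big;
}

/* The same check in double precision. */
static int f64_absorbs(double big, double small)
{
    volatile double sum = big + small;
    return sum == big;
}
```

With these helpers, adding 1e-8 to 1.0 is lost in single precision, which resolves about 7 decimal digits, but survives in double precision, which resolves about 16. Accumulators and sums of many small terms are typical places where the wider type is worth its cost.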
Proper use of the FPU can provide large performance gains over software floating point emulation for floating point intensive code. Applications such as 3D graphics, scientific computing, statistics, and digital signal processing benefit most from hardware accelerated floating point arithmetic.
ARM FPU Architectures
There have been several generations of ARM FPU implementations over time. Key enhancements include:
VFP (Vector Floating Point)
- Initial ARM FPU design, introduced with the ARMv5 architecture.
- Provided basic floating point support, including short vector operations.
- 32 x 32-bit single precision registers.
- Pipelined for high throughput.
- Implemented as a coprocessor for early ARM10 family cores; now obsolete and not used in Cortex-A series processors.
VFPv2
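- Optional extension for the ARMv5TE and ARMv6 architectures, used with ARM9, ARM10 and ARM11 family cores.
- Single and double precision arithmetic.
- 16 x 64-bit double precision registers (d0-d15).
- The same storage is also accessible as 32 x 32-bit single precision registers (s0-s31).
- Improved pipelining and multi-processing.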
VFPv3 / VFPv4
- Evolutionary improvements over VFPv2.
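- VFPv3 adds instructions for loading floating point constants and for converting to and from fixed point.
- VFPv4 adds fused multiply-add and half precision conversion.
- Up to 32 doubleword registers (d0-d31), shared with the Advanced SIMD (NEON) unit when it is present.
- VFPv3 is found in Cortex-A8 and Cortex-A9; VFPv4 is found in cores such as Cortex-A5, Cortex-A7 and Cortex-A15.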
FPv5
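- Floating point extension for M-profile microcontroller cores such as Cortex-M7 and Cortex-M33.
- Adds IEEE 754-2008 operations such as minNum/maxNum and rounding to integer with an explicit rounding mode.
- ARMv8-A cores (Cortex-A35, A53, A55 and newer 64-bit cores) provide a comparable instruction set as part of their combined Advanced SIMD and floating point unit, which improves performance for scalar and SIMD code.
- Cryptography instructions on ARMv8-A are a separate optional extension tied to Advanced SIMD, not part of the FPU itself.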
Each FPU generation expanded the capabilities and performance of floating point computation on ARM chips. The evolution continues as ARM adds new instructions and capabilities to support emerging workloads.
Summary
The ARM floating point unit provides hardware acceleration for mathematical calculations using single and double precision floating point values. Its dedicated registers and pipelined execution improve performance substantially over software floating point emulation on integer-only cores. Proper use of the FPU instruction set and data types can greatly speed up code involving complex math, 3D graphics, signal processing, and scientific computations.