Single-precision (SP) floating-point instructions in Arm Cortex-M processors refer to operations that process 32-bit floating-point data according to the IEEE 754 standard. These instructions allow Cortex-M CPUs to efficiently perform mathematical calculations on fractional values needed for applications like digital signal processing, 3D graphics, and scientific computing.
IEEE 754 Single-Precision Format
The IEEE 754 standard defines the 32-bit single-precision floating-point format used by SP instructions in Cortex-M. This format dedicates 1 bit to the sign, 8 bits to the exponent, and 23 bits to the fraction. The sign bit indicates whether the number is positive or negative. The exponent field is stored with a bias of 127 and gives the power of two by which the significand is scaled. For normal numbers the significand has an implicit leading 1, so the 23 stored fraction bits yield 24 bits of precision, a relative granularity of 2^-23 (roughly 7 decimal digits). Overall, SP provides a reasonable balance of range and precision for many embedded applications.
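The field layout described above can be made concrete with a small decoding routine. This is a minimal sketch in portable C; the names `sp_fields_t` and `sp_decode` are hypothetical helpers introduced only for illustration, and the code assumes `float` is the IEEE 754 32-bit format (true for Cortex-M toolchains).

```c
#include <stdint.h>
#include <string.h>

/* Raw fields of an IEEE 754 single-precision value (illustrative type). */
typedef struct {
    uint32_t sign;      /* 1 bit: 0 = positive, 1 = negative */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits; normal numbers have an implicit leading 1 */
} sp_fields_t;

/* Split a float into its raw fields. memcpy avoids strict-aliasing issues. */
static sp_fields_t sp_decode(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);

    sp_fields_t f;
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, -6.25 equals -1.5625 x 2^2, so `sp_decode(-6.25f)` yields sign 1, exponent 129 (2 + 127), and fraction 0x480000 (0.5625 x 2^23).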
SP Floating-Point Registers
Cortex-M CPUs with the Floating Point Extension include 32 dedicated 32-bit single-precision registers, labeled S0-S31, to hold operands for SP calculations. For example, a fractional value can be loaded into S0, then an SP add instruction can read S0 and S1, add their contents, and store the result to S2. The same register bank can also be viewed as sixteen 64-bit registers labeled D0-D15, each overlaying a pair of S registers; Cortex-M processors with a double-precision (DP) FPU use this view for operations on 64-bit doubles.
SP Floating-Point Instructions
Here are some common single-precision floating-point instructions supported by Cortex-M CPUs:
- Add/subtract: VADD, VSUB add or subtract two SP values.
- Multiply: VMUL multiplies two SP values.
- Divide: VDIV divides one SP value by another.
- Square root: VSQRT calculates the square root of an SP value.
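- Compare: VCMP compares two SP values and sets the condition flags in FPSCR; VMRS copies them to the core's APSR for conditional execution.
- Convert: VCVT converts between SP and integer or fixed-point formats, and between SP and DP on cores with a double-precision FPU.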
- Negate: VNEG flips the sign bit of an SP value.
- Move: VMOV transfers SP values between FP registers, core registers, and immediates; VLDR and VSTR transfer them to and from memory.
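In practice these instructions are rarely written by hand; a compiler emits them from ordinary C arithmetic on `float`. The sketch below pairs small C functions (hypothetical names) with the instruction a hard-float Cortex-M compiler would typically be expected to generate for each. The mapping is an expectation rather than a guarantee, so the disassembly should be checked for a given toolchain and optimization level.

```c
#include <stdint.h>

/* Instruction names in the comments are what a compiler targeting a
 * Cortex-M FPU would commonly emit; this is an assumption to verify
 * against the actual disassembly, not a guaranteed mapping. */

static float   sp_add(float a, float b)  { return a + b; }      /* VADD.F32 */
static float   sp_sub(float a, float b)  { return a - b; }      /* VSUB.F32 */
static float   sp_mul(float a, float b)  { return a * b; }      /* VMUL.F32 */
static float   sp_div(float a, float b)  { return a / b; }      /* VDIV.F32 */
static float   sp_neg(float a)           { return -a; }         /* VNEG.F32 */
static int     sp_less(float a, float b) { return a < b; }      /* VCMP/VCMPE.F32 then VMRS */
static int32_t sp_to_int(float a)        { return (int32_t)a; } /* VCVT.S32.F32, truncates toward zero */
static float   sp_from_int(int32_t i)    { return (float)i; }   /* VCVT.F32.S32 */
```

Keeping every operand and constant as `float` (for example `2.0f` rather than `2.0`) helps the compiler stay within these SP instructions instead of promoting to double.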
SP Floating-Point Status Registers
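The FPSCR (Floating-Point Status and Control Register) records exception conditions that may arise during SP calculations, such as overflow, underflow, divide-by-zero, and invalid operation, as cumulative flag bits that stay set until software clears them. It also holds the comparison flags written by VCMP and the control bits for rounding mode, flush-to-zero, and default-NaN behavior. The FPDSCR register holds the default control values applied to a new floating-point context. These registers assist with handling edge cases when doing FP math.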
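Checking the exception flags amounts to testing individual bits of the FPSCR value. The following sketch uses the architectural bit positions of the cumulative flags; the macro and function names are hypothetical, and the functions take the register value as a parameter so the logic can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cumulative exception flag bits in FPSCR. */
#define FPSCR_IOC (1u << 0)  /* invalid operation */
#define FPSCR_DZC (1u << 1)  /* divide by zero */
#define FPSCR_OFC (1u << 2)  /* overflow */
#define FPSCR_UFC (1u << 3)  /* underflow */
#define FPSCR_IXC (1u << 4)  /* inexact result */
#define FPSCR_IDC (1u << 7)  /* input denormal */

#define FPSCR_ALL_FLAGS \
    (FPSCR_IOC | FPSCR_DZC | FPSCR_OFC | FPSCR_UFC | FPSCR_IXC | FPSCR_IDC)

/* True if a flag that usually signals a real numerical problem is set
 * (invalid operation, divide by zero, or overflow). Which flags count
 * as "errors" is an application choice assumed here for illustration. */
static bool fpscr_has_error(uint32_t fpscr)
{
    return (fpscr & (FPSCR_IOC | FPSCR_DZC | FPSCR_OFC)) != 0u;
}

/* Return the FPSCR value with all cumulative flags cleared,
 * leaving condition flags and control bits untouched. */
static uint32_t fpscr_clear_flags(uint32_t fpscr)
{
    return fpscr & ~FPSCR_ALL_FLAGS;
}
```

On a target using CMSIS-Core, the value would typically be read with `__get_FPSCR()` and written back with `__set_FPSCR()`.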
SP Floating-Point Context Saving
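The FPCCR (Floating-Point Context Control Register) controls how the floating-point context (registers and status) is saved and restored when exceptions occur. On CPUs with the Floating Point Extension, hardware automatically reserves stack space for S0-S15 and FPSCR on exception entry when the ASPEN bit is set, and with LSPEN set it defers the actual register writes until the handler first uses the FPU (lazy stacking). The callee-saved registers S16-S31 are not stacked by hardware, so an RTOS must save and restore them in software when switching between threads that use the FPU. Cortex-M0/M0+ lack the FP Extension and therefore have no floating-point context to save.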
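As a rough illustration of the FPCCR settings involved, the sketch below sets the ASPEN (bit 31) and LSPEN (bit 30) control bits. The function name is hypothetical, and the register is passed in as a pointer so the bit manipulation can be tested without hardware; on a real device the pointer would be the FPCCR address shown in the macro.

```c
#include <stdint.h>

#define FPCCR_ADDR   0xE000EF34u   /* FPCCR location in the System Control Space */
#define FPCCR_ASPEN  (1u << 31)    /* automatic FP state preservation enable */
#define FPCCR_LSPEN  (1u << 30)    /* lazy FP state preservation enable */

/* Enable automatic and lazy stacking of the FP context.
 * 'fpccr' points at the FPCCR register (or a stand-in variable in tests). */
static void fp_enable_lazy_stacking(volatile uint32_t *fpccr)
{
    *fpccr |= FPCCR_ASPEN | FPCCR_LSPEN;
}
```

On target this would be called as `fp_enable_lazy_stacking((volatile uint32_t *)FPCCR_ADDR);`. On many implementations both bits are already set out of reset, so the call mainly documents the intended configuration.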
Enabling SP Floating-Point Support
Using SP floating-point on Cortex-M requires:
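- Processor with Floating Point Extension – Cortex-M4, M7, M33, etc., where the FPU is an optional feature chosen by the chip vendor (not M0/M0+).
- Compiler with appropriate flags/options to generate FP code, for example -mfpu=fpv4-sp-d16 -mfloat-abi=hard with GCC for a Cortex-M4.
- Linker input (runtime and math libraries) built for the same floating-point ABI as the application.
- Startup code to enable the Floating Point Extension by granting access to coprocessors CP10 and CP11 in the CPACR register.
This is often handled automatically by MCU vendor SDKs. The FP extension is disabled at reset, and executing an FP instruction before enabling it raises a UsageFault, so code needs to enable it explicitly.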
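The enable step in startup code is a single read-modify-write of CPACR followed by barriers. This is a minimal sketch with a hypothetical function name; the register is passed as a pointer so the logic can run off-target, and the barrier instructions are compiled only when building for an M-profile core.

```c
#include <stdint.h>

#define CPACR_ADDR            0xE000ED88u    /* Coprocessor Access Control Register */
#define CPACR_CP10_CP11_FULL  (0xFu << 20)   /* full access to CP10 and CP11 (the FPU) */

/* Grant full access to the FPU coprocessors.
 * 'cpacr' points at CPACR (or a stand-in variable in tests). */
static void fpu_enable(volatile uint32_t *cpacr)
{
    *cpacr |= CPACR_CP10_CP11_FULL;

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
    /* Make sure the write takes effect before any FP instruction runs. */
    __asm volatile ("dsb\n\tisb" ::: "memory");
#endif
}
```

On target this would be called early in the reset handler as `fpu_enable((volatile uint32_t *)CPACR_ADDR);`, before any code that might use floating-point registers. Projects using CMSIS-Core usually write the equivalent `SCB->CPACR |= (0xFUL << 20);`.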
Advantages of SP Floating-Point
Benefits of single-precision floating-point support include:
- Efficient fractional math for DSP, graphics, games, etc.
- Larger dynamic range than fixed-point.
- Avoid manual scaling/saturation arithmetic.
- Support for NaN, Inf, denormals per IEEE 754.
- Consistency across toolchains that support IEEE 754.
Limitations of SP vs DP Floating-Point
Tradeoffs versus double-precision (64-bit) to consider:
- Less precision – 24 significand bits (about 7 decimal digits) vs 53 bits (about 15 to 16 digits).
- Smaller exponent range – 8 bits vs 11 bits.
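- Larger rounding errors that accumulate faster over long computations.
- Reduced dynamic range – magnitudes up to roughly 10^38 vs 10^308.
- Greater potential for overflow/underflow in intermediate results.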
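For applications requiring high precision, double-precision has lower quantization noise, but SP is generally sufficient for most embedded use cases while reducing storage and memory bandwidth. On cores with an SP-only FPU, double-precision math also falls back to much slower software routines.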
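The precision difference is easy to observe at 2^24, the point where consecutive integers stop being representable in SP. The sketch below (hypothetical function names) adds 1 to 16777216 in both formats; `volatile` is used only to keep the compiler from folding the arithmetic away at build time.

```c
/* Returns 1 if adding 1.0 to 2^24 is lost to rounding in single precision. */
static int sp_absorbs_one(void)
{
    volatile float big = 16777216.0f;   /* 2^24 */
    volatile float sum = big + 1.0f;    /* 16777217 is not representable in SP */
    return sum == big;
}

/* Same experiment in double precision, which has 53 significand bits. */
static int dp_absorbs_one(void)
{
    volatile double big = 16777216.0;
    volatile double sum = big + 1.0;
    return sum == big;
}
```

Here `sp_absorbs_one()` returns 1 because the exact sum rounds back to 2^24, while `dp_absorbs_one()` returns 0. The same effect appears when a long-running accumulator grows large relative to the increments being added to it.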
Use Cases for SP Floating-Point on Cortex-M
Example applications that benefit from SP floating-point on Cortex-M processors:
- Digital signal processing (DSP): Audio processing, filters, transforms, speech recognition.
- Computer vision: Image processing, neural networks, video analytics.
- Motion control: Motor control algorithms, sensor fusion.
- Scientific computing: Numerical algorithms, simulations, modeling.
- Graphics: 3D rendering, image blending, gaming physics.
- Analytics: Statistics, data analysis, machine learning inferencing.
Almost any application that processes fractional values and needs a wide dynamic range benefits from SP floating-point over fixed-point arithmetic.
Optimizing SP Floating-Point Performance
Tips for maximizing single-precision throughput on Cortex-M processors:
- Use multiply-accumulate instructions such as VFMA (fused) or VMLA when possible.
- Optimize memory access patterns.
- Maximize parallelism and pipeline usage.
- Take latency of instructions into account.
- Use caches effectively to avoid stalls on cores that have them (e.g. Cortex-M7).
- Profile and check for bottlenecks.
With careful coding and compiler optimizations, sustained rates on the order of tens to hundreds of millions of SP floating-point operations per second (MFLOPS) can be achieved, depending on the core and its clock frequency.
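Several of these tips can be combined in a dot product, a common inner loop in DSP code. The sketch below (hypothetical function name) uses four independent accumulators so that consecutive multiply-accumulate operations do not wait on each other's results; whether the compiler emits VFMA or VMLA for each update, and how much the unrolling helps, depends on the toolchain and core and should be confirmed by profiling.

```c
#include <stddef.h>

/* Dot product of two float arrays with four independent accumulators.
 * Independent accumulators shorten the dependency chain between
 * successive multiply-accumulates, which helps hide instruction latency. */
static float sp_dot(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    /* Tail loop: remaining 0 to 3 elements. */
    for (; i < n; ++i) {
        acc0 += a[i] * b[i];
    }

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Splitting the sum across accumulators changes the order of additions, so results can differ in the last bits from a simple sequential loop; that is usually acceptable in signal processing but worth knowing when comparing against a reference.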
Alternatives to SP Hardware Floating-Point
For Cortex-M cores without the FP extension (M0/M0+), floating-point math options include:
- Fixed-point math libraries – efficient but limited dynamic range.
- Software FP emulation – slow but provides portability.
- External FPU co-processors – add cost but offload FP work.
- Code generation tools – automate FP to fixed-point conversion.
None of these matches the combination of performance and convenience offered by dedicated SP floating-point hardware, but they provide workable alternatives for simpler use cases.
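To give a sense of what the fixed-point alternative involves, here is a minimal sketch of Q15 arithmetic, where a 16-bit integer represents a value in the range [-1, 1) scaled by 2^15. The type and function names are defined locally for illustration and are not taken from any particular library; the code assumes that right-shifting a negative `int32_t` is an arithmetic shift, which holds on common compilers but is implementation-defined in C.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* value = raw / 32768, range [-1, 1) */

/* Convert a float to Q15 with saturation (truncates toward zero). */
static q15_t q15_from_float(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f)  return 32767;
    if (scaled <= -32768.0f) return -32768;
    return (q15_t)scaled;
}

/* Multiply two Q15 values. The 32-bit product is in Q30, so it is
 * shifted right by 15 to return to Q15, then saturated. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) >> 15;
    if (p > 32767) {
        p = 32767;       /* only -1 * -1 can exceed the range */
    }
    return (q15_t)p;
}
```

The manual scaling and saturation visible here is exactly the bookkeeping that hardware floating-point removes, at the price of requiring an FPU.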
Conclusion
Single-precision floating-point support is a key feature of higher-end Cortex-M processors like the Cortex-M4 and M7. SP instructions enable efficient fractional math while avoiding the complexity of fixed-point implementations. Applications from DSP to computer vision benefit from access to dynamic range, normalized numbers, and hardware acceleration. With careful coding, impressive floating-point performance can be achieved to enable sophisticated algorithms and numeric processing on embedded Arm cores.