Yes, most ARM Cortex-M processors can perform division in hardware. The Cortex-M0 and Cortex-M0+ (ARMv6-M) have no hardware divider, but the Cortex-M3 and the other ARMv7-M and ARMv8-M cores provide the integer division instructions SDIV and UDIV. These divide 32-bit values in a few cycles (2 to 12 on the Cortex-M3 and Cortex-M4, depending on the operands) rather than the tens of cycles a software routine needs.
Hardware division support in Cortex-M processors
The Cortex-M series is ARM’s line of 32-bit processor cores for microcontrollers, aimed at embedded and IoT applications. The Cortex-M3, announced in 2004, was the first Cortex-M design, and it included integer division instructions backed by a hardware divider from the start. The Cortex-M0, introduced in 2009, is a smaller and simpler core focused on cost-sensitive applications. To reduce chip area and complexity, it omits the hardware divider. Integer division on the Cortex-M0 has to be done in software, typically by a shift-and-subtract library routine, which is much slower than a hardware divide.
The Cortex-M0+, released in 2012, also omits the hardware divider to keep area and power low. Applications that need faster division can use the Cortex-M3 or another ARMv7-M core, where a 32-bit divide takes 2 to 12 cycles depending on the operands, a large improvement over the software-based division of the Cortex-M0/M0+.
The Cortex-M4 and Cortex-M7, like the Cortex-M3, include a hardware divider and support the integer divide instructions SDIV and UDIV, as do ARMv8-M cores such as the Cortex-M23 and Cortex-M33. Division on these cores is not single-cycle: the divider uses early termination, so latency depends on the operand values. It is still fast enough that integer division is far more practical on these cores than on microcontrollers lacking hardware division.
Division instruction support
The ARMv7-M architecture used by the Cortex-M3, Cortex-M4 and Cortex-M7 provides two integer division instructions. The ARMv6-M architecture used by the Cortex-M0/M0+ provides none:
- SDIV: Signed divide of two 32-bit integers, giving a 32-bit quotient.
- UDIV: Unsigned divide of two 32-bit integers, giving a 32-bit quotient.
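There is no instruction that divides a 64-bit value by a 32-bit value, and no remainder instruction. 64-bit division is handled by a runtime library routine (such as __aeabi_uldivmod under the ARM EABI), and a remainder is typically computed from the quotient with a multiply-subtract (MLS).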
The compiler emits these instructions when it targets a core with a hardware divider, for example with -mcpu=cortex-m3 in GCC or Clang. Any C code using the normal / or % operators then uses the hardware divider automatically. When the target is a Cortex-M0/M0+, the same operators compile to calls into a runtime library routine such as __aeabi_idiv or __aeabi_uidiv. Inline assembly can also use SDIV and UDIV directly.
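As a minimal sketch of what this looks like in source, the hypothetical helper below computes a quotient and remainder with the plain C operators. The names udivmod_result, udivmod_u32 and has_hw_divide are made up for illustration. The __ARM_FEATURE_IDIV macro is the ACLE feature-test macro for hardware integer divide. On a core with a divider, compilers typically emit one UDIV plus a multiply-subtract for the remainder, though the exact sequence depends on the compiler and optimization level.

```c
#include <stdint.h>

/* Hypothetical result type for illustration: quotient and remainder. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod_result;

/* Returns 1 when the compiler reports hardware integer divide support
   through the ACLE macro __ARM_FEATURE_IDIV, and 0 otherwise
   (for example when targeting a Cortex-M0 or a non-ARM host). */
static int has_hw_divide(void)
{
#if defined(__ARM_FEATURE_IDIV)
    return 1;
#else
    return 0;
#endif
}

/* Plain C division. The caller must ensure d is non-zero.
   With a hardware divider the compiler can use UDIV for the quotient and
   derive the remainder as n - quot * d; without one it calls a library
   routine instead. */
static udivmod_result udivmod_u32(uint32_t n, uint32_t d)
{
    udivmod_result r;
    r.quot = n / d;
    r.rem  = n % d;
    return r;
}
```

Inspecting the generated assembly (for example with the compiler's -S option) is the reliable way to confirm which path a given build takes.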
Division speed and performance
With hardware division support, Cortex-M processors perform integer divides in a few cycles, with the exact count depending on the operands. This is a large speed boost over the software routines used on the Cortex-M0 and Cortex-M0+. Some approximate figures:
- Cortex-M0/M0+: no hardware divider. A software 32-bit division typically takes tens of cycles and can take more than a hundred, depending on the library routine and the operands.
- Cortex-M3: 32-bit division takes 2 to 12 cycles with the hardware divider.
- Cortex-M4: 32-bit division takes 2 to 12 cycles with the hardware divider.
- Cortex-M7: division is also a multi-cycle, operand-dependent operation. ARM does not publish a per-instruction timing table for this core, so measure on the target if the exact figure matters.
The hardware divider does not change the clock rate or the latency of other instructions. But by reducing divide latency from dozens of cycles to a handful, it makes previously expensive division operations far more viable to use in code. Applications that perform many integer divides see a large performance benefit on cores with a hardware divider.
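To make the cost of software division concrete, here is a minimal sketch of a restoring shift-and-subtract divider in C. It is illustrative only: the names soft_div_result and soft_udiv32 are hypothetical, and real runtime libraries use tuned routines rather than this loop. It produces one quotient bit per iteration, so it always runs 32 iterations of several instructions each.

```c
#include <stdint.h>

/* Hypothetical result type for illustration. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} soft_div_result;

/* Restoring (shift-and-subtract) unsigned division.
   Processes the dividend from the most significant bit down, producing
   one quotient bit per iteration. The divisor d must be non-zero. */
static soft_div_result soft_udiv32(uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    uint32_t r = 0;

    for (int i = 31; i >= 0; i--) {
        /* Remember the bit that is about to be shifted out of r, so the
           comparison stays correct for divisors larger than 2^31. */
        uint32_t carry = r >> 31;

        /* Bring down the next dividend bit. */
        r = (r << 1) | ((n >> i) & 1u);

        /* If the partial remainder is at least d, subtract and set the bit. */
        if (carry || r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }

    soft_div_result out = { q, r };
    return out;
}
```

A hardware divider performs the same kind of iteration in dedicated logic, which is why it is faster without being free.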
However, division is still one of the slowest integer instructions on these cores. A divide occupies the pipeline for several cycles, so code that issues many divides back to back, or that needs each result immediately, pays that latency every time. For divide-heavy inner loops, it helps to replace division by a known constant with a multiply and shift, or to hoist a division by a loop-invariant value out of the loop. But for most applications, the hardware divider offers plenty of performance.
Division hardware implementation
ARM does not publish the internal design of the Cortex-M dividers, so the points below combine what is documented with what is typical of small iterative dividers, not a confirmed description of the implementation:
- Iterative quotient generation – A multi-cycle divider produces a few quotient bits per cycle by shifting and subtracting. Radix-2 and higher-radix SRT (Sweeney, Robertson, Tocher) schemes are common designs of this kind, and higher-radix variants use a small lookup table to select quotient digits.
- Early termination – The Cortex-M3 and Cortex-M4 documentation states that divides terminate early based on the number of leading ones and zeros in the operands, which is why the cycle count varies from 2 to 12.
- Not Booth encoding – Booth encoding is a technique for multipliers, not dividers, and should not be expected in the divide path.
- Operands in core registers – SDIV and UDIV read their operands from, and write their result to, the general-purpose registers, so no separate operand registers need to be loaded.
This holds for newer cores such as the Cortex-M33 as well. Overall, the hardware divider trades a few cycles of latency for a modest cost in chip area and power consumption.
Division by constant values
One optimization Cortex-M compilers can apply is transforming divisions by constant divisors into faster operations. For example, an unsigned divide by a power of 2 is just a right shift, while a signed divide by a power of 2 needs a small adjustment so that negative values round toward zero as C requires. Dividing by other constants can be transformed into a multiply by a precomputed reciprocal followed by a shift. These transformations speed up code whose divisors are known at compile time.
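The transformations above can be sketched by hand as follows. This is roughly what an optimizing compiler generates on its own for constant divisors, so writing it manually is rarely necessary. The function names are hypothetical. The constant 0xCCCCCCCD is ceil(2^35 / 10), a widely used reciprocal for unsigned 32-bit division by 10.

```c
#include <stdint.h>

/* Unsigned divide by 10 without a divide instruction.
   Multiply by ceil(2^35 / 10) = 0xCCCCCCCD in 64-bit arithmetic, then
   shift right by 35. This matches x / 10 for every 32-bit x. */
static uint32_t div10_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Unsigned divide by a power of two is a plain right shift. */
static uint32_t div8_u32(uint32_t x)
{
    return x >> 3;
}

/* Signed divide by 8, rounding toward zero as the C / operator does.
   A bare arithmetic shift would round toward negative infinity, so
   negative inputs are biased by (8 - 1) first.
   Assumption: right-shifting a negative int32_t is an arithmetic shift.
   This is implementation-defined in C, but it is what GCC and Clang do. */
static int32_t div8_s32(int32_t x)
{
    int32_t bias = (x < 0) ? 7 : 0;
    return (x + bias) >> 3;
}
```

The 64-bit product in div10_u32 maps onto a single long multiply (UMULL) on ARMv7-M cores. ARMv6-M has no long multiply, so the trick is less attractive on the Cortex-M0/M0+.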
The multiply-by-reciprocal approach relies on the long multiply instructions UMULL and SMULL, which ARMv7-M cores provide and which produce a full 64-bit product faster than a divide completes. Libraries such as CMSIS-DSP also provide fixed-point math routines tuned for Cortex-M. Check the documentation of the library version you use to see which division helpers it offers.
Hardware versus software division tradeoffs
The inclusion of hardware divider units in Cortex-M3 and later processors provides a large speed boost for division. But there are some tradeoffs to consider:
- Chip area – Hardware dividers require more gates which increases chip size and cost.
- Power consumption – Additional hardware units can increase power draw.
- Suitability – Some very cost-sensitive applications may not require fast division.
So the Cortex-M0/M0+ omit the divider to optimize for area and power efficiency in simple embedded applications. But most applications benefit from the faster divide performance of the Cortex-M cores that have hardware division support.
Enabling optimal division performance
To leverage the ARM divider hardware and achieve best division performance on Cortex-M processors:
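- Use a core with a hardware divider: Cortex-M3, Cortex-M4, Cortex-M7, or an ARMv8-M core such as Cortex-M23 or Cortex-M33.
- Use the native / and % operators, and make sure the compiler targets the right core so that they compile to SDIV and UDIV rather than library calls.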
- Avoid long dependency chains and pipeline stalls where possible.
- Optimize constant divisors into shifts and multiplies.
- Consider CMSIS-DSP for optimized fixed-point math routines.
- Hoist divisions by loop-invariant values out of hot loops, or replace them with a multiply by a precomputed reciprocal, if very high throughput is needed.
With proper coding practices and compiler setup, developers can take advantage of the fast hardware integer division supported on the Cortex-M3 and later cores in most embedded and IoT applications.