Adding a MULH Instruction to the Cortex-M0+ for Performance

Adding a hardware multiplier unit and MULH instruction to the Cortex-M0+ can significantly improve performance for applications that perform many multiplications on 16-bit values. While software multiplication is possible on the Cortex-M0+, it is much slower than using a hardware multiplier. The MULH instruction allows retrieving the upper 16 bits of a 32-bit multiply result, enabling 16×16->32-bit multiplications directly in the core pipeline.

Contents

Background on the Cortex-M0+
Benefits of Adding a Hardware Multiplier
Tradeoffs of Adding a Multiplier
Implementation of a Hardware Multiplier and MULH
Software and Compiler Impact
Performance Estimates
Verification Approach
Power Analysis
Area Analysis
Schedule and Risks
Cost Analysis
Conclusion

Background on the Cortex-M0+

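The Cortex-M0+ is a 32-bit ARM processor optimized for low-cost, low-power embedded applications. It has a 2-stage pipeline, shorter than the 3-stage pipeline of the Cortex-M0 and Cortex-M3. Its multiplier is a configuration option: the chip designer selects either a fast single-cycle multiplier or a small iterative multiplier that takes 32 cycles. The ARMv6-M instruction set offers only MULS, which returns the low 32 bits of a product. On area-optimized parts, and wherever the upper part of a product is needed, multiplication is therefore slow or must be built in software from shifts and adds, which takes many more clock cycles than a fast hardware multiplier.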
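To make the cost of the software path concrete, here is a minimal sketch of a shift-and-add multiply in C. The function name soft_mul32 is illustrative and not taken from any library; real runtime libraries use more tuned routines, but the loop structure shows why the cycle count grows with the operand width.

```c
#include <stdint.h>

/* Shift-and-add multiply: returns the low 32 bits of a * b.
 * One loop iteration per remaining multiplier bit, so up to 32
 * iterations of test, add, and shift for full-width operands. */
uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the shifted multiplicand for each set bit */
        }
        a <<= 1;           /* the next bit position carries twice the weight */
        b >>= 1;           /* move on to the next multiplier bit */
    }
    return result;
}
```

Each iteration costs several instructions on a simple in-order core, which is where the "tens of cycles" figure for software multiplication comes from.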

The Cortex-M0+ is commonly used in cost-sensitive microcontroller applications where performance is not the primary concern. However, there are many applications where improved multiplier performance can be beneficial, such as digital signal processing, graphics, and encryption algorithms. Adding a hardware multiplier can accelerate these workloads.

Benefits of Adding a Hardware Multiplier

Adding a hardware multiplier unit and MULH instruction to the Cortex-M0+ provides several key benefits:

Faster 16×16-bit multiplications – A hardware multiplier completes a multiply in 1 clock cycle, compared to tens of cycles in software
Improved DSP and graphics performance – Multiplication is central to digital signal processing and graphics algorithms
Accelerated encryption – Many cryptographic algorithms, public-key schemes in particular, rely on repeated multiplication
Similar to Cortex-M3/M4 – Brings multiply capability closer to that of higher-end Cortex-M cores
Easier migration to Cortex-M3/M4 – A more consistent multiplier architecture simplifies code migration

Based on industry benchmarks, a hardware multiplier can deliver 10-100x better performance on multiply-bound DSP and encryption kernels. For microcontroller applications that do substantial arithmetic, that is a large gain.

Tradeoffs of Adding a Multiplier

While there are significant performance benefits to adding a multiplier, there are also some tradeoffs to consider from cost, power, and complexity perspectives:

Increased silicon area – A hardware multiplier takes up more die space, which increases cost
Higher power consumption – The multiplier adds dynamic power when active and leakage power when idle
Design complexity – Changes to the core pipeline are more complex to implement
Potentially longer clock cycle – Adding a multiplier can lengthen the critical path and lower the maximum clock speed

These downsides must be weighed against the performance benefit for the intended workloads. If most applications are not multiplier-heavy, the costs may outweigh the benefits.

Implementation of a Hardware Multiplier and MULH

Here is a high-level overview of how a hardware multiplier unit and MULH instruction can be added to the Cortex-M0+ pipeline:

Add 32×32 multiplier structure – This performs the full 32-bit multiply operation in one cycle.
Connect multiplier to pipeline registers – This feeds operands in and routes results.
Add MULH instruction – Encodes the opcode for retrieving the upper 16 bits of the result.
Merge multiplier result with Execute stage – The MULH result is merged into the main data path.
Stalls and forwarding – Add stalls/forwarding to handle data hazards.
Modify Execute stage timing – May require a longer cycle time to accommodate the longer critical path.
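Before any pipeline work, it can help to pin down the intended result of the instruction in a small reference model. The sketch below assumes, following the description in this article, that MULH returns the upper 16 bits of a 16×16 product. Treating the operands as unsigned is an assumption, since the article does not specify signedness, and the function names are illustrative.

```c
#include <stdint.h>

/* Full 16x16 -> 32-bit unsigned product. The casts to uint32_t matter:
 * without them the operands promote to signed int and large products
 * would overflow. */
uint32_t mul16x16(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Assumed MULH semantics: the upper 16 bits of the 32-bit product. */
uint16_t mulh16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(mul16x16(a, b) >> 16);
}

/* Lower 16 bits of the same product, for comparison. */
uint16_t mull16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(mul16x16(a, b) & 0xFFFFu);
}
```

Concatenating the two halves reconstructs the full product, which gives RTL simulation a simple golden value to compare against.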

Key implementation considerations include:

Multiplier structure – A Booth-encoded array multiplier is likely optimal.
Operand forwarding – Operands need to be forwarded from the ID/EX registers if dependent multiply instructions are issued back-to-back.
Stall cycles – Stalls may be required for a load-use hazard if the operand is not forwarded.
Timing path – The 32×32 multiply and MULH merge point may increase critical path delay.
Test methodology – A vector-based test approach is critical to verify the multiplier fully, including all edge cases.
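As an illustration of the Booth encoding mentioned above, the following C sketch models radix-4 Booth recoding for a signed 16×16 multiply. It is a behavioral model only, not a description of any actual Cortex-M multiplier; the function name is illustrative. Recoding examines overlapping groups of three multiplier bits and produces eight partial products instead of sixteen, which is the property that makes the scheme attractive in hardware.

```c
#include <stdint.h>

/* Behavioral model of a radix-4 Booth multiplier for signed 16-bit operands.
 * Each step looks at bits (2k+1, 2k, 2k-1) of the multiplier and selects a
 * partial product of -2a, -a, 0, +a, or +2a, weighted by 4^k. */
int32_t booth_radix4_mul16(int16_t a, int16_t b)
{
    uint16_t bits = (uint16_t)b;   /* raw two's-complement bit pattern of b */
    int32_t acc = 0;
    int prev = 0;                  /* implicit bit b[-1] is zero */

    for (int i = 0; i < 16; i += 2) {
        int b0 = (bits >> i) & 1;
        int b1 = (bits >> (i + 1)) & 1;
        int digit = b0 + prev - 2 * b1;   /* Booth digit in {-2,-1,0,1,2} */

        /* Partial product: digit * a, weighted by 2^i (i is even). */
        acc += (int32_t)digit * (int32_t)a * ((int32_t)1 << i);
        prev = b1;
    }
    return acc;
}
```

A model like this can also serve as a second, independently written reference when checking the plain product in simulation.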

There are also options regarding how broadly the multiplier is enabled. To optimize area and power, it may make sense to only enable the multiplier in specific configurations rather than always having it instantiated.

Software and Compiler Impact

Adding a MULH instruction to the Cortex-M0+ will require updates to the GCC compiler and software libraries to fully leverage the new capability:

GCC compiler – Generate the MULH instruction in output assembly
Assembler and disassembler – Add the new mnemonic and instruction encoding (linker scripts themselves should not need changes, since they do not depend on instruction encodings)
Startup code – May need modification if the change affects which registers are used
CMSIS headers – Define an intrinsic for the MULH instruction
Libraries – Update key functions (e.g. math, DSP, crypto) to use MULH
Debuggers – Add MULH disassembly and step-through debugging support

The application software itself may also need changes so that it uses the MULH instruction directly instead of its previous software multiply routines. This would provide the biggest performance gains.

Compiler optimizations around multiply operations may also be improved. For example, the cost model can be retuned so that strength reduction no longer rewrites cheap multiplies into longer shift-and-add sequences, and loops can be unrolled more aggressively when a hardware multiplier is available.
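One way application code could prepare for this is to route the operation through a single helper, so that only the helper changes once toolchain support exists. The sketch below is a portable C fallback; the names mulh_u16 and scale_samples are hypothetical, no existing toolchain provides a MULH intrinsic, and the unsigned interpretation is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback for the proposed operation: upper 16 bits of an
 * unsigned 16x16 product. A MULH-aware compiler could map this helper
 * (or recognize the expression) to a single instruction. */
static inline uint16_t mulh_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Scale each sample by an unsigned Q0.16 gain, i.e. gain_q16 / 65536.
 * Results are truncated toward zero. */
void scale_samples(uint16_t *samples, size_t count, uint16_t gain_q16)
{
    for (size_t i = 0; i < count; i++) {
        samples[i] = mulh_u16(samples[i], gain_q16);
    }
}
```

Calling scale_samples with a gain of 0x8000 halves every sample, since 0x8000 / 65536 is 0.5.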

Performance Estimates

Based on other Cortex-M cores with similar hardware multipliers, we can estimate the potential performance improvement:

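32-bit multiply in 1 cycle vs. ~32 cycles for software multiply
10-100X faster on multiply-bound DSP kernels
2-4X faster multiply-heavy cryptography, such as the big-integer arithmetic in RSA or ECC (ciphers and hashes like AES and SHA-1 rely mostly on XORs, rotates, and table lookups, so they gain little)
1.5-2X faster on general benchmarks and application code that uses multiplication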

Exact speedup will depend heavily on the specific workload mix and how effectively the MULH instruction is utilized. Performance should be modeled across target application spaces to determine expected speedup.
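The gap between kernel-level and application-level figures follows from Amdahl's law: only the fraction of run time spent multiplying gets faster. The sketch below is a simple model for exploring that relationship; the function name and the example fractions are illustrative assumptions, not measurements.

```c
/* Amdahl's law: overall speedup when a fraction of the baseline run time
 * (mul_fraction, between 0 and 1) is accelerated by a factor mul_speedup,
 * while the remaining time is unchanged. */
double overall_speedup(double mul_fraction, double mul_speedup)
{
    return 1.0 / ((1.0 - mul_fraction) + mul_fraction / mul_speedup);
}
```

For instance, if half of the baseline run time is spent in multiplies and each multiply becomes 32 times faster, the model gives roughly 1.94X overall, which is in line with the 1.5-2X range quoted for general application code.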

Verification Approach

Verifying an additional hardware unit and new instruction requires extensive testing to ensure correct behavior across devices, process, voltage and temperature variation, and over lifetime degradation.

Recommended verification approach:

Unit testing – Exhaustive vector testbench over all 16×16 operand pairs (2^32 combinations), plus directed corner cases for the 32×32 datapath, where exhaustive coverage (2^64 combinations) is impractical.
Constrained random testing – Fuzz testing using constrained random test vectors.
Application testbenches – Run full application code through a cycle-accurate simulator.
Gate-level testing – Verify multiply behavior on the gate-level netlist.
Silicon bring-up testing – Dedicated production test coverage for the multiplier and MULH.
Lifetime testing – Model the impact of device aging effects on the multiplier over its lifetime.

Testing should target not only functional correctness but also full rated speed operation across voltage and temperature. Analog mixed-signal testing may also be required to ensure no coupling into sensitive analog blocks.
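The functional side of this plan can be prototyped in software long before RTL exists. The sketch below compares a device-under-test function against a golden reference using directed corner cases plus pseudo-random vectors. All names are hypothetical, the unsigned MULH semantics are an assumption, and a real testbench would live in the HDL verification environment rather than in C.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*mulh_fn)(uint16_t a, uint16_t b);

/* Golden reference: upper 16 bits of the unsigned 16x16 product. */
uint16_t mulh_reference(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

/* Deliberately wrong model that treats operands as signed, standing in
 * for the kind of bug the testbench should catch. The uint16_t to int16_t
 * conversion assumes the usual two's-complement behavior. */
uint16_t mulh_signed_bug(uint16_t a, uint16_t b)
{
    int32_t product = (int32_t)(int16_t)a * (int32_t)(int16_t)b;
    return (uint16_t)((uint32_t)product >> 16);
}

/* Small xorshift32 generator so runs are reproducible from a seed. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Returns the number of vectors on which dut disagrees with the reference. */
uint32_t count_mulh_mismatches(mulh_fn dut, uint32_t random_vectors, uint32_t seed)
{
    static const uint16_t corners[] = { 0x0000, 0x0001, 0x7FFF, 0x8000, 0xFFFF };
    const size_t n = sizeof corners / sizeof corners[0];
    uint32_t mismatches = 0;
    uint32_t state = (seed != 0u) ? seed : 1u;   /* xorshift state must be nonzero */

    /* Directed corner cases: every pairing of the boundary values. */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (dut(corners[i], corners[j]) != mulh_reference(corners[i], corners[j])) {
                mismatches++;
            }
        }
    }

    /* Pseudo-random vectors: split one 32-bit draw into two operands. */
    for (uint32_t k = 0; k < random_vectors; k++) {
        uint32_t r = xorshift32(&state);
        uint16_t a = (uint16_t)(r >> 16);
        uint16_t b = (uint16_t)(r & 0xFFFFu);
        if (dut(a, b) != mulh_reference(a, b)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

The corner-case table alone is enough to expose the signed/unsigned mix-up, which shows why directed vectors are worth keeping alongside random ones.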

Power Analysis

The multiplier and MULH instruction will increase both dynamic and leakage power:

Dynamic power will increase due to higher toggling activity during multiplies.
The multiplier contributes leakage power directly through its large gate count.
Increased throughput may enable system modes with higher dynamic power.

Active power will depend on utilization: more frequent use of the multiplier leads to higher dynamic power. Leakage is incurred even when the multiplier is not active.

To estimate impact, power models should be used to analyze typical application use cases. This will provide an estimate of the average increase in active power. Leakage power can be modeled based on the multiplier gate count and memory bit cells.

Power gating may be warranted to completely shut off the multiplier in low power modes. This avoids leakage when the multiplier is idle.
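The relationship between utilization, leakage, and power gating can be captured in a first-order model. The sketch below is an illustrative simplification: the function name, the parameters, and any numbers fed into it are assumptions, and real estimates should come from power analysis tools run on the actual design.

```c
/* First-order average power added by the multiplier, in microwatts.
 *   dynamic_uw       - dynamic power while the multiplier is switching
 *   leakage_uw       - leakage power while its power domain is on
 *   utilization      - fraction of time the multiplier is switching (0..1)
 *   powered_fraction - fraction of time its power domain is on (0..1);
 *                      1.0 models a design with no power gating */
double avg_multiplier_power_uw(double dynamic_uw, double leakage_uw,
                               double utilization, double powered_fraction)
{
    return utilization * dynamic_uw + powered_fraction * leakage_uw;
}
```

Lowering powered_fraction while keeping utilization fixed shows how much of the added power gating could recover for a given workload.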

Area Analysis

The multiplier hardware will increase the total core area. At a high level, the area increase includes:

32×32 multiplier datapath logic
Operand muxes and result registers
MULH merge logic into main data path
Added stall/forwarding logic

Rough estimates based on synthesis and layout of similar multipliers indicate an increase of approximately 15-20% in total core area. The impact increases if the multiplier is always instantiated rather than conditionally enabled.

Area analysis should be performed on placed and routed layout. This will provide an accurate estimate of total area increase, which can then be weighed against the performance and power impact.

Schedule and Risks

A rough schedule for adding MULH support:

Design implementation – 6 weeks
Verification – 8 weeks
Software enablement – 4 weeks
Qualification and release – 6 weeks

Total predicted timeline of 24 weeks. Schedule risks include:

Additional pipeline stalls required – could extend implementation timeline.
Critical path timing closure issues – may require pipeline redesign.
Simulation performance – long verification times impact schedule.
New silicon spin – fixes may require new test chip if bugs found in qualification.

Schedule risks can be mitigated by starting verification earlier, budgeting contingency time, and re-using appropriate verification environments.

Cost Analysis

The cost impact of adding MULH support includes:

Engineering time – 4-6 engineer years for implementation through qualification
Mask and silicon costs – approximately $2M for test chip fabrication
Tool license costs – increased simulator capacity required

Total incremental cost estimated around $3M. Cost may be higher if a new mask spin is required.

These costs would be amortized across the units shipped over the product lifetime, so the percentage cost increase per unit depends on total volume.
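The amortization arithmetic is simple enough to sketch directly. The functions below are illustrative; the volumes and unit costs passed to them are assumptions chosen for the example, and only the $3M total comes from the estimate above.

```c
/* One-time (non-recurring) cost spread evenly over every unit shipped. */
double nre_cost_per_unit(double nre_dollars, double units_shipped)
{
    return nre_dollars / units_shipped;
}

/* The same amortized cost expressed as a percentage of the existing
 * per-unit cost. */
double nre_percent_increase(double nre_dollars, double units_shipped,
                            double unit_cost_dollars)
{
    return 100.0 * nre_cost_per_unit(nre_dollars, units_shipped) / unit_cost_dollars;
}
```

With the $3M estimate and an assumed volume of 10 million units, the added cost is $0.30 per unit; at 1 million units it rises to $3.00 per unit, which would be hard to absorb in a cost-sensitive microcontroller.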

Conclusion

Adding a hardware multiplier unit and MULH instruction is a significant enhancement for the Cortex-M0+ in applications requiring high math performance. Detailed analysis is required to determine whether the benefits outweigh the costs for any particular product. For demanding workloads, however, this change can speed up multiply-bound code by 10-100X to meet growing performance needs.
