Multiplying two 32-bit numbers to get a 64-bit result is a common operation in many embedded and IoT applications running on Cortex-M0/M0+ chips. These cores implement only the Thumb MULS instruction, which returns the low 32 bits of a product, and the multiplier itself is a silicon option: either a single-cycle unit or an iterative unit that takes 32 clock cycles. A full 32×32->64 product therefore has to be synthesized by the compiler from several 32-bit multiplies (typically via a runtime helper such as __aeabi_lmul), which can be slow for time-critical code. This article discusses techniques to significantly speed up 32-bit multiplication on Cortex-M0/M0+ and larger Cortex-M cores using intrinsics, assembly optimization, loop unrolling and software pipelining.
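As a baseline for the techniques below, the portable way to request a full 64-bit product in C is to widen one operand before multiplying; on Cortex-M0/M0+ the compiler lowers this to a short sequence of 32-bit multiplies or a library call:

```c
#include <stdint.h>

// Baseline 32x32 -> 64 multiply. Casting one operand to uint64_t
// forces a full-width product; without the cast the multiplication
// is performed in 32 bits and the high half is silently lost.
uint64_t mult32x32_64(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```

All of the optimizations in this article are ultimately faster ways of computing exactly this function.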

## 1. Using the UMLAL Instruction

UMLAL (Unsigned Multiply Accumulate Long) multiplies two 32-bit operands and accumulates the 64-bit result into a previous 64-bit value. The instruction is not part of ARMv6-M, so it is unavailable on Cortex-M0/M0+ itself, but it exists from Cortex-M3 (ARMv7-M) upward, where it completes in a few cycles (a single cycle on Cortex-M4/M7). There is no dedicated CMSIS intrinsic; compilers emit UMLAL automatically from the multiply-accumulate idiom in C:

```
uint64_t result = 0;
uint32_t op1 = ...;
uint32_t op2 = ...;
// On ARMv7-M targets the compiler lowers this to a single UMLAL
// (or UMULL when result is known to be 0).
result += (uint64_t)op1 * op2;
```

The only constraint is that UMLAL always accumulates. For a simple 32×32->64 multiplication, initialize the accumulator to 0; the compiler then emits the non-accumulating UMULL instruction instead.

## 2. Using the UMAAL Instruction

UMAAL (Unsigned Multiply Accumulate Accumulate Long) multiplies two 32-bit operands and adds two independent 32-bit values to the 64-bit product. It belongs to the DSP extension (ARMv7E-M), so it is available on Cortex-M4/M7 but not on Cortex-M0/M0+ or M3. On supporting cores it also completes in a single cycle. As with UMLAL, compilers recognize the corresponding C idiom:

```
uint32_t lo = ...;   // first 32-bit addend
uint32_t hi = ...;   // second 32-bit addend
uint32_t op1 = ...;
uint32_t op2 = ...;
// UMAAL computes op1 * op2 + lo + hi; the sum always fits in 64 bits,
// since (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1.
uint64_t result = (uint64_t)op1 * op2 + lo + hi;
```

The two independent addends make UMAAL the workhorse of multi-precision (bignum) arithmetic, where one absorbs the existing result limb and the other the running carry word. Setting both addends to 0 yields a simple 32×32->64 multiplication.
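A hedged sketch of that bignum use case, written in portable C in exactly the `a * b + c + d` shape that DSP-extension compilers can map onto UMAAL (the function name `addmul_1` is illustrative, borrowed from common bignum-library conventions):

```c
#include <stddef.h>
#include <stdint.h>

// out[0..n] += in[0..n-1] * w, with n little-endian 32-bit limbs.
// The inner expression product + limb + carry is UMAAL's two-addend
// form and cannot overflow 64 bits.
void addmul_1(uint32_t *out, const uint32_t *in, size_t n, uint32_t w) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)in[i] * w + out[i] + carry;
        out[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    out[n] += carry;  // caller ensures the top limb cannot overflow
}
```

On Cortex-M0/M0+ the same C compiles to plain multiply/add sequences, so the code stays portable across the whole Cortex-M range.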

## 3. Using Assembly Language

Writing the multiplication routine in assembly gives full control over the instruction sequence. On Cortex-M3 and later, the UMULL instruction produces the complete 64-bit product in a register pair, so the routine reduces to a few instructions. (Cortex-M0/M0+ lack UMULL; there the product must be assembled from 16-bit partial products.) The steps are:

- Receive the 32-bit operands in registers (R0 and R1 under the AAPCS calling convention)
- Execute a UMULL instruction to place the low and high words of the 64-bit product in a register pair
- Store both words to the memory location of the 64-bit result

```
void mult32x32_64asm(uint32_t op1, uint32_t op2, uint64_t *result) {
    uint32_t lo, hi;
    __asm volatile(
        " umull %[lo], %[hi], %[op1], %[op2] \n"
        " str   %[lo], [%[result], #0]       \n"
        " str   %[hi], [%[result], #4]       \n"
        : [lo] "=&r" (lo), [hi] "=&r" (hi)
        : [op1] "r" (op1), [op2] "r" (op2), [result] "r" (result)
        : "memory"
    );
}
```

This routine multiplies op1 and op2 with UMULL, leaving the low and high words of the 64-bit product in two registers, then stores both words to the result location. UMULL itself takes a single cycle on Cortex-M4/M7 (3-5 cycles on Cortex-M3), so the routine costs only a handful of cycles in total. Note the earlyclobber ("=&r") outputs and the "memory" clobber, which keep the compiler from reusing input registers or reordering around the stores.
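On Cortex-M0/M0+ themselves, where UMULL is absent, the same result can be computed in portable C from four 16×16->32 partial products, the schoolbook scheme the compiler's runtime helper also uses; a sketch:

```c
#include <stdint.h>

// 32x32 -> 64 multiply using only 32-bit multiplies, suitable for
// ARMv6-M cores that lack UMULL. Each operand is split into 16-bit
// halves, so every partial product fits in 32 bits.
uint64_t mult32x32_64_portable(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   // contributes to bits  0..31
    uint32_t lh = a_lo * b_hi;   // contributes to bits 16..47
    uint32_t hl = a_hi * b_lo;   // contributes to bits 16..47
    uint32_t hh = a_hi * b_hi;   // contributes to bits 32..63

    uint32_t mid       = lh + hl;
    uint32_t mid_carry = (mid < lh);          // carry out of the middle sum
    uint32_t lo        = ll + (mid << 16);
    uint32_t lo_carry  = (lo < ll);           // carry into the high word
    uint32_t hi        = hh + (mid >> 16) + (mid_carry << 16) + lo_carry;

    return ((uint64_t)hi << 32) | lo;
}
```

Only the carries between partial sums need explicit handling; everything else is straight-line code that maps well onto the M0's MULS and ADD instructions.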

## 4. Loop Unrolling

When multiplying arrays of numbers, loop unrolling can be used to improve performance. For example, this code multiplies elements of two arrays into a result array:

```
void multArrays(const uint32_t *input1, const uint32_t *input2, uint64_t *output, size_t size) {
    for (size_t i = 0; i < size; i++) {
        output[i] = (uint64_t)input1[i] * input2[i];  // cast forces a full 64-bit product
    }
}
```

This performs one multiply per iteration, paying the loop overhead (increment, compare, branch) every time. We can unroll the loop to perform several multiplies per iteration:

```
void unrolledMultArrays(const uint32_t *input1, const uint32_t *input2, uint64_t *output, size_t size) {
    size_t i;
    for (i = 0; i + 4 <= size; i += 4) {
        output[i]   = (uint64_t)input1[i]   * input2[i];
        output[i+1] = (uint64_t)input1[i+1] * input2[i+1];
        output[i+2] = (uint64_t)input1[i+2] * input2[i+2];
        output[i+3] = (uint64_t)input1[i+3] * input2[i+3];
    }
    for (; i < size; i++) {  // handle sizes that are not a multiple of 4
        output[i] = (uint64_t)input1[i] * input2[i];
    }
}
```

This unrolled version performs 4 multiplies per loop iteration instead of 1, amortizing the loop overhead over more useful work. The optimal unroll factor depends on code size constraints: higher factors remove more branch and index overhead but also enlarge the code, which matters on flash-constrained parts.

## 5. Software Pipelining

Software pipelining is a loop optimization that overlaps operations from different iterations to improve throughput. Consider this simplified example:

```
for(i=0; i<N; i++) {
    Load data        // Stage 1
    Multiply data    // Stage 2
    Store result     // Stage 3
}
```

Within one iteration the stages depend on each other and cannot overlap. But the code can be rearranged so that each iteration's load starts while the previous iteration is still being multiplied, with a prologue to fill the pipeline and an epilogue to drain it:

```
Load data[0]                 // prologue: Stage 1 of iteration 0

for(i=0; i<N-1; i++) {
    Multiply data[i]         // Stage 2 of iteration i
    Load data[i+1]           // Stage 1 of iteration i+1, overlapped
    Store result[i]          // Stage 3 of iteration i
}

Multiply data[N-1]           // epilogue: finish the last iteration
Store result[N-1]
```

This moves each load earlier so it overlaps the previous iteration's multiply and store. The optimal schedule depends on operation latencies, and on a single-issue in-order core like the M0+ the gains are smaller than on wider machines, but it can still provide worthwhile speedups for core loops.
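A concrete C rendering of this schedule for the element-wise array multiply, with an explicit prologue and epilogue (function and variable names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

// Software-pipelined element-wise multiply: each iteration loads the
// operands for iteration i+1 before storing the product of iteration i.
void pipelinedMultArrays(const uint32_t *in1, const uint32_t *in2,
                         uint64_t *out, size_t n) {
    if (n == 0) return;
    uint32_t a = in1[0], b = in2[0];      // prologue: first load
    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t p = (uint64_t)a * b;     // stage 2: multiply iteration i
        a = in1[i + 1];                   // stage 1: load iteration i+1
        b = in2[i + 1];
        out[i] = p;                       // stage 3: store iteration i
    }
    out[n - 1] = (uint64_t)a * b;         // epilogue: last multiply + store
}
```

The loads for iteration i+1 are issued while the product of iteration i is still live, which is exactly the overlap the pseudocode above describes.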

## 6. Utilizing SIMD Instructions

Cortex-M0/M0+ have no SIMD instructions, but the larger Cortex-M cores with the DSP extension (Cortex-M4/M7) can multiply multiple data values concurrently. For example, the SMLAD instruction performs two 16×16 multiplies on the packed half-words of its operands and accumulates both products into a 32-bit sum in a single cycle, which suits fixed-point dot products. CMSIS exposes it as the __SMLAD intrinsic:

```
uint32_t x = ...;   // two signed 16-bit values packed into one word
uint32_t y = ...;   // two signed 16-bit values packed into one word
uint32_t sum = 0;
// sum += (low half of x * low half of y) + (high half of x * high half of y)
sum = __SMLAD(x, y, sum);
```

When data can be packed this way, the number of multiply instructions in an inner loop is halved. Similar packed instructions exist for 8-bit data, and full vector processing (Helium/MVE) arrives only with Cortex-M55 and later cores.

## 7. Tuning for Code Size vs Performance

All the above techniques optimize for performance by reducing the number of cycles for multiply. But they often increase code size as well due to unrolling, pipelining, SIMD etc. If optimizing primarily for code size, the plain C multiply is likely the best option. But for performance-critical applications, the techniques described earlier provide substantial speedups of 2x or more in many cases. There is always a tradeoff between size and speed that must be balanced based on application requirements.

## 8. Benchmarking for Optimization

It is important to benchmark code performance before and after applying optimizations. Use timing functions or hardware performance counters to measure cycles and instructions needed for the multiplication operation. This helps quantify the actual benefits. Benchmark with different data sets and scenarios to cover corner cases. Optimization that helps in one case may not in another. Profile-guided optimization using real workload data is ideal. Avoid premature optimizations without evidence it would improve performance. Measure first, then optimize based on empirical data.

## 9. Checking for Correctness

When writing and optimizing multiplication routines, ensure correctness is not broken. For cryptographic and safety-critical applications, formally verify optimized assembly code maintains expected behavior. Exhaustively test edge cases with different data values, check for overflows and bounds, and compare results against simpler/slower reference implementations. Optimization should improve performance but not at the cost of broken functionality. Automated unit testing helps catch any regressions early.

## 10. Conclusion

Performing fast 32-bit multiplications is crucial for software efficiency on Cortex-M0/M0+ microcontrollers. Developers can speed up execution with loop unrolling and software pipelining on the M0/M0+ themselves, and with UMULL/UMLAL/UMAAL and the DSP extension when targeting larger Cortex-M cores. Substantial cycle reductions are possible in practice, particularly on parts fitted with the 32-cycle iterative multiplier. But always balance gains against code size overheads and verify functional correctness. Apply optimizations judiciously based on use cases and measured data. With the right techniques, Cortex-M0/M0+ can deliver fast and efficient 32-bit multiply performance.