The Cortex-M0+ processor from Arm is a very popular choice for cost-sensitive, low-power embedded applications, and it is often asked to do light digital signal processing (DSP) work as well. Unlike the Cortex-M4 and Cortex-M7, the Cortex-M0+ implements the Armv6-M architecture and has no DSP extension, no SIMD instructions and no floating point unit, so the performance of math-intensive DSP algorithms depends on careful software and system design. There are several techniques developers can use to speed up DSP workloads on the Cortex-M0+.
Optimize Algorithm Implementation
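The first step is to ensure your DSP algorithm implementation suits the Cortex-M0+ core. Inner loops are built from ordinary integer instructions: 32-bit adds and shifts, and a 32x32-bit multiply (MULS) that returns only the low 32 bits of the product and takes either 1 or 32 cycles depending on how the chip vendor configured the core. Key optimization strategies include:
- Using datatypes like int16_t and q15_t for stored samples, so that a 16x16-bit product fits in the 32-bit multiply result and buffers take half the memory of 32-bit data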
- Minimizing data conversion and type casting
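- Structuring code so inner loops are short and contain few branches, since each taken branch refills the two-stage pipeline
- Loop unrolling to reduce loop counter and branch overhead (the core executes one instruction at a time, so unrolling does not add parallelism)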
- Manual instruction scheduling if needed
Profiling your code can help identify optimization opportunities. Even small tweaks can lead to measurable speedups due to the repetitive nature of DSP code.
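As an illustration of the unrolling strategy above, here is a minimal sketch of a Q15 dot product in portable C. The function name dot_q15 and the unroll factor of four are assumptions chosen for the example, not something the Cortex-M0+ or any library prescribes. The 64-bit accumulator is one way to avoid overflow; it costs a few extra instructions per update on a 32-bit core.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two Q15 vectors. Each 16x16-bit product is Q30 and
 * fits in 32 bits; the sum is kept in 64 bits so it cannot overflow.
 * dot_q15 is a hypothetical name used for this sketch. */
int64_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;

    /* Main loop, unrolled by four: one loop branch per four products. */
    for (; i + 4 <= n; i += 4) {
        int32_t p0 = (int32_t)a[i]     * b[i];
        int32_t p1 = (int32_t)a[i + 1] * b[i + 1];
        int32_t p2 = (int32_t)a[i + 2] * b[i + 2];
        int32_t p3 = (int32_t)a[i + 3] * b[i + 3];
        acc += (int64_t)p0 + p1 + p2 + p3;
    }

    /* Tail loop for the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc; /* Q30 result */
}
```

Whether the unrolled version is actually faster than what the compiler produces on its own is worth confirming with a cycle counter or timer on the target.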
Use Hardware Acceleration
The Cortex-M0+ is a load/store architecture: arithmetic instructions operate only on CPU registers, so every sample has to be loaded from memory before it is processed and stored back afterwards. Transferring data between CPU and memory can therefore become a bottleneck, especially for signal processing involving multi-dimensional buffers and frames.
Hardware acceleration using DMA (Direct Memory Access) can help overcome this bottleneck. The DMA controller can transfer data between peripherals and memory without CPU involvement. This allows the CPU to focus on number crunching.
For example, you may have input data incoming over a high-speed interface like SPI or I2S. Configuring DMA to transfer this data directly to a processing buffer allows the CPU to work in parallel. The processed output can likewise be staged via DMA for output over the interface.
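A common way to organize this is a ping-pong (double) buffer: the DMA fills one half while the CPU processes the other. The sketch below shows only the software side in portable C. The names rx_buf, dma_half_complete_isr, dma_full_complete_isr and process_ready_block are hypothetical, the DMA and interrupt setup is vendor-specific and omitted, and the "processing" is a placeholder sum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 4u /* samples per half buffer, kept tiny for the sketch */

/* Assumed to be filled by a circular DMA transfer from SPI or I2S. */
int16_t rx_buf[2u * BLOCK_LEN];

/* Which half is ready for the CPU: -1 none, 0 first half, 1 second half. */
static volatile int ready_half = -1;

/* Hypothetical interrupt handlers; real names depend on the vendor HAL. */
void dma_half_complete_isr(void) { ready_half = 0; }
void dma_full_complete_isr(void) { ready_half = 1; }

/* Called from the main loop. Processes the half the DMA has finished
 * writing while the DMA continues to fill the other half. */
bool process_ready_block(int32_t *out_sum)
{
    int half = ready_half;
    if (half < 0) {
        return false;            /* nothing to do yet */
    }
    ready_half = -1;             /* mark the block as consumed */

    const int16_t *p = &rx_buf[(size_t)half * BLOCK_LEN];
    int32_t sum = 0;
    for (size_t i = 0; i < BLOCK_LEN; i++) {
        sum += p[i];             /* placeholder for the real DSP work */
    }
    *out_sum = sum;
    return true;
}
```

The design only works if each block is processed in less time than the DMA needs to fill the other half; a production version would also detect overruns.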
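Emulate SIMD in Software
Single Instruction, Multiple Data (SIMD) refers to executing a single instruction on multiple data values concurrently. The Cortex-M0+ has no SIMD instructions; the packed 8-bit and 16-bit arithmetic instructions belong to the DSP extension found on the Cortex-M4 and later cores. Some of the benefit can still be recovered by packing several small values into one 32-bit register and using ordinary instructions:
- Bitwise operations (AND, OR, XOR) act on four 8-bit lanes or two 16-bit lanes of a 32-bit word at once.
- Packed additions are possible if the top bit of each lane is masked so that carries cannot cross lane boundaries.
- A single word load (LDR) or multi-word load (LDM) fetches two halfwords or four bytes per word, which cuts the number of memory accesses.
This works well for algorithms applying the same simple operation across arrays. Multiplication does not pack this way, so a vector dot product still needs one multiply per pair of samples on the Cortex-M0+.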
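Because the Cortex-M0+ has no hardware SIMD instructions, lane-parallel arithmetic has to be built from ordinary 32-bit operations. The following sketch shows the usual masking trick for packed wrap-around addition; the function names add4_u8 and add2_u16 are made up for this example.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed in 32-bit words, each lane
 * wrapping modulo 256. The low 7 bits of every lane are added first so
 * that no carry can reach the neighboring lane; the top bit of each
 * lane is then fixed up with XOR. */
uint32_t add4_u8(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Same idea for two unsigned 16-bit lanes, wrapping modulo 65536. */
uint32_t add2_u16(uint32_t a, uint32_t b)
{
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

Each call replaces four (or two) separate additions, but it also spends instructions on masking and on loading the constants, so it should be benchmarked against a plain loop before being adopted.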
Optimize Data Layout
Memory access patterns can significantly impact performance on microcontrollers. Where possible, optimize data layout to exploit the natural data access width of the Cortex-M0+.
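For example, structure data as arrays of 16-bit or 32-bit quantities instead of 8-bit. Registers are 32 bits wide, and a load costs the same whether it fetches a byte, a halfword or a word, so one word load that carries two 16-bit samples halves the number of memory accesses. Alignment is mandatory rather than optional: the Armv6-M architecture does not support unaligned accesses, and an unaligned halfword or word access raises a HardFault.
Sometimes transposing multidimensional buffers can optimize data access, so that the inner loop walks memory sequentially. Double buffering may help too, as may any flash cache or prefetch buffer the microcontroller provides (the Cortex-M0+ core itself has no cache). The key is profiling and understanding the memory access hotspots.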
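To make the layout point concrete, the sketch below stores interleaved stereo samples as one 32-bit word per frame, so each frame needs a single aligned load. The packing convention (left channel in the low halfword) and the names pack_stereo and sum_stereo are assumptions for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one stereo frame: left sample in bits 0..15, right in bits 16..31. */
uint32_t pack_stereo(int16_t left, int16_t right)
{
    return (uint32_t)(uint16_t)left | ((uint32_t)(uint16_t)right << 16);
}

/* Sum both channels of n packed frames. Using uint32_t for the buffer
 * makes the compiler keep it word-aligned, and each iteration performs
 * one load instead of two. */
void sum_stereo(const uint32_t *frames, size_t n,
                int32_t *sum_left, int32_t *sum_right)
{
    int32_t l = 0;
    int32_t r = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t w = frames[i];                 /* one aligned word load */
        l += (int16_t)(uint16_t)(w & 0xFFFFu);  /* sign-extend low half  */
        r += (int16_t)(uint16_t)(w >> 16);      /* sign-extend high half */
    }
    *sum_left = l;
    *sum_right = r;
}
```

The conversion from uint16_t to int16_t relies on two's complement wrap-around, which the common embedded compilers provide.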
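Run Critical Code and Data from SRAM
The Cortex-M0+ has no Tightly Coupled Memory (TCM) interface; TCM is a feature of larger cores such as the Cortex-M7. On-chip SRAM plays a similar role, though. It is usually accessible without wait states, whereas flash often needs wait states at higher clock speeds.
Placing performance critical code and data, like inner loop routines and filter state, in SRAM can provide a noticeable speedup when flash wait states are the bottleneck. The amount of SRAM available is device-specific. Use compiler section attributes together with the linker script to place specific code/data sections into SRAM.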
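With GCC-style toolchains, placement in fast on-chip RAM is usually done with a section attribute plus a matching linker script entry. The sketch below assumes a section named .ramfunc that the linker script locates in SRAM and that the startup code copies from flash; both the section name and the RAM_CODE macro are hypothetical and must match your own project. On non-Arm hosts the macro expands to nothing so the code still builds for testing.

```c
#include <stddef.h>
#include <stdint.h>

/* RAM_CODE and ".ramfunc" are assumptions for this sketch. */
#if defined(__arm__) && defined(__GNUC__)
#define RAM_CODE __attribute__((section(".ramfunc"), noinline))
#else
#define RAM_CODE /* host build: default placement */
#endif

/* Scale a Q15 buffer in place by a Q15 gain. No saturation is applied,
 * so the gain is assumed to be greater than -1.0 (-32768). The right
 * shift of a negative value is assumed to be arithmetic, as it is with
 * GCC and Clang. */
RAM_CODE void scale_q15(int16_t *buf, size_t n, int16_t gain_q15)
{
    for (size_t i = 0; i < n; i++) {
        int32_t p = (int32_t)buf[i] * gain_q15; /* Q30 product */
        buf[i] = (int16_t)(p >> 15);            /* back to Q15 */
    }
}
```

The map file produced by the linker is the quickest way to confirm the function really ended up at a RAM address.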
Tune Clock Speed vs. Voltage
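Higher clock speeds naturally increase DSP performance on Cortex-M0+. However, the maximum clock speed usually depends on the supply voltage level, and higher clocks may require extra flash wait states, so throughput does not always scale linearly with frequency.
Typical voltage levels are 1.8V or 3.3V. The limits are device-specific and come from the microcontroller datasheet: as a rough illustration, a part might run at 50-100MHz at 3.3V but only 30-50MHz at 1.8V.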
The power reduction from lower voltage may be worth the clock speed tradeoff. But for max DSP performance, operating at 3.3V is better. Designers should evaluate their specific speed, power and thermal requirements.
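A quick way to evaluate the tradeoff is to turn a candidate clock speed into a cycle budget per sample. The helper below is a back-of-the-envelope sketch; the function names are made up, and the cycles-per-tap and overhead figures passed in are values you would measure on your own target, not properties of the Cortex-M0+.

```c
#include <stdint.h>

/* CPU cycles available between two consecutive input samples. */
uint32_t cycles_per_sample(uint32_t clk_hz, uint32_t fs_hz)
{
    return (fs_hz == 0u) ? 0u : clk_hz / fs_hz;
}

/* Largest FIR length that fits the budget, given a measured cost per
 * tap and a fixed per-sample overhead (interrupt entry, I/O and so on). */
uint32_t max_fir_taps(uint32_t clk_hz, uint32_t fs_hz,
                      uint32_t cycles_per_tap, uint32_t overhead_cycles)
{
    uint32_t budget = cycles_per_sample(clk_hz, fs_hz);
    if (cycles_per_tap == 0u || budget <= overhead_cycles) {
        return 0u;
    }
    return (budget - overhead_cycles) / cycles_per_tap;
}
```

For instance, 48MHz and a 16kHz sample rate give 3000 cycles per sample; with an assumed 10 cycles per tap and 200 cycles of overhead that leaves room for 280 taps, and halving the clock to 24MHz reduces it to 130.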
Choose Compiler Optimization Flags
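Compiler optimization flags control how aggressively the toolchain tries to optimize the generated machine code. Important flags for optimizing DSP code on Cortex-M0+ with GCC-style toolchains include:
- -O2 or -O3 for speed optimization (compare both, since -O3 grows code size, which can hurt when flash wait states dominate)
- -ffast-math to relax IEEE 754 rules for floating point code (it does not change integer fixed point code)
- -funroll-loops to unroll tight loops
- -mcpu=cortex-m0plus to target the Cortex-M0+ core
- -mthumb to select the Thumb instruction set, the only one the Cortex-M0+ executes
- -mthumb-interwork is not needed, because there is no Arm-state code to interwork with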
The best set of flags depends on your specific compiler and use case. Profiling with different flag combinations provides empirical data.
Avoid Floating Point
The Cortex-M0+ does not have a floating point unit (FPU). While the compiler can synthesize floating point operations using software libraries, this comes at a significant performance cost.
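DSP algorithms should use fixed point arithmetic where possible. The int16_t and q15_t fractional data types work well for many DSP use cases. Avoid excessive type conversions and keep to integer and fractional types within the algorithm. The core also lacks a hardware divide instruction, so replace divisions by constants with multiplications or shifts where possible.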
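The fixed point approach can be sketched with a few helper routines. Q15 stores a value x in the range -1.0 to just under 1.0 as the integer x * 32768. The names Q15, q15_sat, q15_mul and q15_add below are chosen for this example and are not taken from a particular library; the right shift of a negative value is assumed to be arithmetic, as it is with GCC and Clang.

```c
#include <stdint.h>

/* Convert a constant in [-1.0, 1.0) to Q15 with rounding. Intended for
 * compile-time constants, so no floating point code runs on the target. */
#define Q15(x) ((int16_t)((x) * 32768.0 + ((x) >= 0 ? 0.5 : -0.5)))

/* Clamp a 32-bit intermediate to the 16-bit Q15 range. */
int16_t q15_sat(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Q15 x Q15 -> Q15 with rounding and saturation. The Q30 product fits
 * in 32 bits, so a single 32-bit multiply is enough. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;          /* Q30 */
    return q15_sat((p + 0x4000) >> 15);  /* round, rescale, clamp */
}

/* Saturating Q15 addition. */
int16_t q15_add(int16_t a, int16_t b)
{
    return q15_sat((int32_t)a + b);
}
```

With these helpers, q15_mul(Q15(0.5), Q15(0.5)) yields Q15(0.25), and the one product that cannot be represented, -1.0 times -1.0, saturates to the largest positive value instead of wrapping.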
Use Assembly Language Selectively
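The Cortex-M0+ processor and its Armv6-M Thumb instruction set (mostly 16-bit encodings, with only a handful of 32-bit Thumb-2 instructions) are designed for compact, efficient C code. However, the compiler may not always generate optimal machine code, especially for inner loop DSP operations.
Judicious use of inline Assembly language can selectively optimize hotspots and improve performance. For example, loop unrolling, instruction scheduling, keeping filter state in the low registers (r0-r7) that most Thumb instructions can address, and moving several words at once with LDM/STM are cases where Assembly language may help.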
But Assembly code should be used sparingly, as it can impede code maintenance. Focus only on the most performance sensitive code segments after careful profiling.
Conclusion
The Cortex-M0+ is a versatile processor for cost-sensitive and power-constrained embedded applications, and it can handle modest DSP workloads despite lacking dedicated DSP instructions. Developers can realize further benefits by carefully applying these optimization techniques for DSP workloads. The improvements add up quickly when amortized over repetitive signal processing operations.
A combination of an efficient algorithm, optimized C code, selective Assembly coding, proper memory layout, and intelligent compiler flags can make a big difference. Additional speedups come from hardware acceleration and tuning the clock frequency and voltage.
With some performance tuning effort, the Cortex-M0+ is capable of delivering impressive DSP throughput. The methods outlined here provide a blueprint for developers looking to speed up processing and get the most out of the Cortex-M0+ architecture.