ARM Cortex-M processors are known for their power efficiency and solid performance. One of the key architectural features enabling this combination in the higher-end members of the family, such as the Cortex-M7, Cortex-M55 and Cortex-M85, is the presence of caches.
Caches are small, fast memory arrays that store frequently used data closer to the processor core. This allows the processor to access the data faster compared to loading it from the main system memory every time. Caches exploit the properties of temporal and spatial locality commonly observed in software programs.
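Spatial locality is easy to see in a simple array traversal: walking a 2-D array row by row touches consecutive addresses, so each cache line fill serves several subsequent accesses, while walking it column by column jumps a full row stride between accesses. A minimal host-runnable sketch (the cache effect shows up as a timing difference on real hardware; here only the identical results are checked):

```c
#include <stdint.h>

#define ROWS 64
#define COLS 64

/* Row-major walk: consecutive addresses, cache friendly. */
long sum_row_major(int32_t m[ROWS][COLS]) {
    long s = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Column-major walk: a stride of COLS * 4 bytes between accesses,
 * so each access may land in a different cache line. */
long sum_col_major(int32_t m[ROWS][COLS]) {
    long s = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}
```

Both functions compute the same sum; on a cached core the row-major version typically runs measurably faster for large arrays.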
Types of Caches in ARM Cortex-M
A Cortex-M based system can contain up to three types of caches:
- Instruction cache
- Data cache
- System-level cache
The instruction and data caches are part of the processor core on the models that implement them; a system-level cache, where present, is added by the silicon vendor outside the core.
Instruction Cache
The instruction cache stores executable instruction code fetched from flash or RAM memory. When the processor needs to fetch instructions for execution, it first checks if the instructions are available in the instruction cache. A cache hit avoids fetching from main memory and improves performance.
On cores that implement it, the instruction cache is typically 4 KB to 64 KB, with the size fixed when the chip is designed. Caches come out of reset disabled on Cortex-M: software must enable the instruction cache explicitly (via the CMSIS-Core function SCB_EnableICache() on the Cortex-M7) before it provides any benefit.
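On a Cortex-M7 device the caches are normally switched on early in startup using the CMSIS-Core functions. The sketch below uses host stand-ins for those functions so it can run off target; on real hardware you would include the device header (which pulls in core_cm7.h) instead of defining them, and the hypothetical cache_init() would be called from the startup code.

```c
#include <stdbool.h>

/* Host stand-ins for the CMSIS-Core cache functions (core_cm7.h).
 * On target hardware these program the SCB cache control registers
 * and invalidate the caches before enabling them. */
bool icache_on, dcache_on;
void SCB_EnableICache(void) { icache_on = true; }
void SCB_EnableDCache(void) { dcache_on = true; }

/* Typical early-startup sequence: Cortex-M caches come out of
 * reset disabled and must be enabled explicitly by software. */
void cache_init(void) {
    SCB_EnableICache();
    SCB_EnableDCache();
}
```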
Data Cache
The data cache stores data operands read from or written to the main system memory. When operating on data, the processor checks if the operands are in the data cache, avoiding main memory access on a hit.
The data cache capacity likewise ranges from 4 KB to 64 KB across cache-equipped Cortex-M variants. Data caching is enabled by software (SCB_EnableDCache() in CMSIS-Core), and whether it pays off depends on the application's memory access patterns, since it also brings coherency obligations.
System Cache
A system-level cache is not part of the Cortex-M core itself; some silicon vendors add one between the processor and slower memories such as external flash or SDRAM. Memory-mapped peripheral registers, by contrast, are normally configured non-cacheable, since caching them would return stale status values.
A system-level cache helps networking, digital signal processing and graphics applications whose working sets exceed the core's own caches.
Cache Architecture
In the Cortex-M7, the data cache is 4-way set associative and the instruction cache is 2-way set associative, both with a fixed 32-byte line size. A set-associative cache is divided into sets, each holding a small number of lines (the ways); a memory address maps to exactly one set, selected by its index bits, and may occupy any way within that set.
This degree of associativity provides a good balance between hit rate and hardware complexity. On a miss, a victim line within the target set is chosen by the core's replacement policy.
Because Cortex-M processors have no MMU, there is no address translation, no TLB and no virtual addressing: the caches are physically indexed and physically tagged, which rules out aliasing problems by construction. Memory attributes such as cacheability and shareability are instead defined per region by the Memory Protection Unit (MPU).
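The address split for a set-associative lookup can be sketched for one plausible geometry, a 16 KB, 4-way cache with 32-byte lines (the size is an illustrative assumption; it is fixed by the chip designer):

```c
#include <stdint.h>

/* Example geometry: 16 KB, 4-way, 32-byte lines -> 128 sets.
 * One plausible configuration, used only to show the address split. */
#define LINE_BYTES   32u
#define NUM_WAYS      4u
#define CACHE_BYTES  (16u * 1024u)
#define NUM_SETS     (CACHE_BYTES / (LINE_BYTES * NUM_WAYS))  /* 128 */

/* Byte offset within the 32-byte line (low 5 bits). */
uint32_t line_offset(uint32_t addr) { return addr & (LINE_BYTES - 1u); }

/* Set index: the next 7 bits select one of the 128 sets. */
uint32_t set_index(uint32_t addr) { return (addr / LINE_BYTES) % NUM_SETS; }

/* Tag: the remaining upper bits, stored and compared on lookup. */
uint32_t tag_of(uint32_t addr) { return addr / (LINE_BYTES * NUM_SETS); }
```

Note that any two addresses 4 KB apart (NUM_SETS × LINE_BYTES) land in the same set and therefore compete for its four ways.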
Write Policy
The Cortex-M data cache supports both write-through and write-back policies, selected per memory region through the MPU's memory attributes. With write-through, every store is written to both the cache and memory. With write-back, stores are held in the cache, marking the line dirty, and reach memory only when the line is evicted or explicitly cleaned.
Write-back improves performance by eliminating redundant memory traffic, while write-through offers better real-time determinism and simpler coherency because memory is always up to date.
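The traffic difference between the two policies can be modelled with a deliberately minimal single-line sketch that just counts how many stores actually reach main memory:

```c
#include <stdint.h>

/* Minimal single-line model contrasting the two write policies. */
typedef struct {
    int dirty;        /* write-back only: modified but not written out */
    int mem_writes;   /* stores that reached main memory */
} line_t;

void write_through(line_t *l, int n) {
    for (int i = 0; i < n; i++)
        l->mem_writes++;      /* every store goes to cache AND memory */
}

void write_back(line_t *l, int n) {
    for (int i = 0; i < n; i++)
        l->dirty = 1;         /* stores land in the cache only */
}

void clean_line(line_t *l) {   /* eviction or an explicit clean */
    if (l->dirty) { l->mem_writes++; l->dirty = 0; }
}
```

A burst of 100 stores to one line costs 100 memory writes under write-through but a single write-back when the line is cleaned, which is the performance argument for write-back and the determinism argument for write-through.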
Coherency
When a Cortex-M system contains other bus masters, most commonly DMA controllers or a second core, each master's view of memory can diverge from the contents of the data cache. The Cortex-M7 provides no hardware cache coherency; coherency is maintained in software with cache maintenance operations.
The basic discipline is to clean dirty lines to memory before another master reads a buffer, and to invalidate lines before the CPU reads data another master has written. CMSIS-Core exposes these operations as functions such as SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr().
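A typical software-managed DMA sequence looks like the sketch below. The CMSIS-Core maintenance functions are replaced by host stand-ins that record a trace, so the required ordering can be checked off target; dma_send(), dma_receive() and dma_transfer() are hypothetical illustration names, not a real driver API.

```c
#include <string.h>
#include <stdint.h>

/* Trace of operations, so the required ordering is visible. */
char trace[64];

/* Host stand-ins for the CMSIS-Core (core_cm7.h) maintenance
 * functions; on target they clean/invalidate the affected lines. */
void SCB_CleanDCache_by_Addr(uint32_t *addr, int32_t dsize) {
    (void)addr; (void)dsize; strcat(trace, "clean;");
}
void SCB_InvalidateDCache_by_Addr(uint32_t *addr, int32_t dsize) {
    (void)addr; (void)dsize; strcat(trace, "invalidate;");
}
void dma_transfer(void *buf, int32_t n) {   /* hypothetical driver call */
    (void)buf; (void)n; strcat(trace, "dma;");
}

/* CPU -> peripheral: clean first, so the DMA reads up-to-date data. */
void dma_send(uint32_t *buf, int32_t n) {
    SCB_CleanDCache_by_Addr(buf, n);
    dma_transfer(buf, n);
}

/* Peripheral -> CPU: invalidate after, so the CPU re-reads memory
 * rather than stale cached lines. */
void dma_receive(uint32_t *buf, int32_t n) {
    dma_transfer(buf, n);
    SCB_InvalidateDCache_by_Addr(buf, n);
}
```

In practice DMA buffers should also be aligned and padded to whole 32-byte cache lines, so that cleaning or invalidating them cannot corrupt unrelated data sharing a line.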
Optimizing Cache Performance
There are several ways software can be optimized to maximize cache hit rates, and with them overall performance, on Cortex-M processors.
Loop Optimization
Software loops tend to access a localized set of instructions and data repeatedly. Keeping the loop body small enough to fit in the caches lets tight loops run almost entirely out of cache.
Loop unrolling is a common optimization technique that expands the loop body inline to reduce branch penalties. But this also increases the loop body size, potentially evicting other useful information from caches.
So a balanced approach is required to optimize loops for caches without making the loop body too large.
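The trade-off can be seen in a simple accumulation loop and an unrolled-by-four variant; both compute the same result, but the unrolled body is larger and occupies more instruction cache:

```c
#include <stdint.h>

/* Straightforward accumulation loop. */
int32_t sum_plain(const int32_t *a, int n) {
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: fewer branches per element at the cost of a larger
 * loop body. Compilers often do this themselves at -O2/-O3, so
 * measure before unrolling by hand. */
int32_t sum_unrolled(const int32_t *a, int n) {
    int32_t s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)    /* remainder when n is not a multiple of 4 */
        s += a[i];
    return s;
}
```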
Data Layout
How program data structures are laid out in memory affects cache efficiency through spatial locality. Data elements that are accessed together should be placed contiguously, so that a single cache line fill brings in several useful elements at once.
For structures like arrays, sequential access patterns allow streaming data from cache. For larger data structures that do not fit in cache, techniques like loop tiling help reuse cached data across iterations.
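Loop tiling is easiest to see on a transpose, where one of the two arrays is necessarily walked with a large stride. A sketch with an illustrative tile size (the right TILE depends on the actual cache geometry):

```c
#include <stdint.h>

#define N    32
#define TILE 8   /* chosen so a TILE x TILE block of each array fits in cache */

/* Naive transpose: one array is walked with stride N elements,
 * which thrashes the data cache for large N. */
void transpose_naive(int32_t src[N][N], int32_t dst[N][N]) {
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            dst[c][r] = src[r][c];
}

/* Tiled transpose: process TILE x TILE blocks so the cache lines
 * touched stay resident while they are being reused. */
void transpose_tiled(int32_t src[N][N], int32_t dst[N][N]) {
    for (int rb = 0; rb < N; rb += TILE)
        for (int cb = 0; cb < N; cb += TILE)
            for (int r = rb; r < rb + TILE; r++)
                for (int c = cb; c < cb + TILE; c++)
                    dst[c][r] = src[r][c];
}
```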
Alignment
Aligning data to the cache line size ensures that an access or a buffer does not straddle cache line boundaries unnecessarily, which would split a single logical access across multiple lines.
On the Cortex-M7 the cache line size is fixed at 32 bytes, so aligning performance-critical structures, and especially DMA buffers, to 32-byte boundaries avoids line splits and prevents unrelated data from sharing a line with a buffer that will be cleaned or invalidated.
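In C11 this is expressed directly with alignas; the sketch below aligns and pads a hypothetical DMA buffer to whole 32-byte lines:

```c
#include <stdint.h>
#include <stdalign.h>

#define CACHE_LINE 32u   /* Cortex-M7 cache line size */

/* A buffer aligned and padded to whole cache lines: no line is
 * shared with unrelated data, so clean/invalidate operations on
 * this buffer cannot disturb its neighbours. */
typedef struct {
    alignas(CACHE_LINE) uint8_t data[64];  /* 64 bytes = 2 full lines */
} dma_buffer_t;

static dma_buffer_t rx_buf;   /* illustrative name */
```

Linker-section attributes serve the same purpose when the buffer must also live in a specific memory region.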
Prefetching
The ARMv7-M architecture defines the PLD instruction as a software preload hint that requests data into the cache ahead of use; implementations are free to treat it as a no-op, so its benefit is core-specific. Prefetching exploits predictable access patterns to request data just before it is needed.
Prefetching has to balance bringing data into the cache early enough against evicting still-useful lines. The prefetch distance and frequency require fine-tuning for optimal performance.
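With GCC or Clang, __builtin_prefetch emits the target's preload hint (PLD on cores that honour it) and compiles to nothing elsewhere, so it is safe to sprinkle into portable code. The prefetch distance below is an illustrative tuning knob, not a recommended value:

```c
#include <stdint.h>

#define PF_DIST 8   /* prefetch distance in elements -- a tuning knob */

int64_t sum_with_prefetch(const int32_t *a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* GCC/Clang builtin: a hint to fetch a[i + PF_DIST]
             * into the cache; a harmless no-op where unsupported. */
            __builtin_prefetch(&a[i + PF_DIST]);
        s += a[i];
    }
    return s;
}
```

Too small a distance and the data has not arrived by the time it is used; too large and the prefetched lines may be evicted again before use.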
Real-Time Guarantees
The use of caches raises concerns about deterministic real-time behaviour in time-critical Cortex-M applications.
However, the Cortex-M architecture provides several features that preserve real-time guarantees even in cached systems.
Tightly Coupled Memory
Alongside the caches, the Cortex-M7 provides tightly coupled memories: ITCM for instructions and DTCM for data. TCM is accessed at core speed with fixed latency and is never cached, so interrupt handlers and other time-critical code and data placed there run with fully predictable timing.
Cache Locking
Some embedded cores allow critical code and data to be locked into the cache so that time-sensitive contents cannot be evicted unpredictably. The Cortex-M7 caches do not implement line locking; the equivalent guarantee is obtained by placing deterministic code and data in the tightly coupled memories, which are immune to eviction and require no coherency handling.
Cache Disabling
For the most demanding real-time code paths, caching can be disabled, globally or per region via the MPU's memory attributes, to eliminate caching variability altogether.
Disabling caches costs substantial performance, however, so the application must weigh the determinism gained against the throughput lost.
Conclusion
The caches in ARM's higher-end Cortex-M processors bring application-class performance to microcontrollers. The configurable cache architecture, combined with tightly coupled memories and MPU-controlled memory attributes, lets designers balance speed, determinism and efficiency across a wide range of embedded applications.
Optimizing software for cache reuse, applying cache maintenance and control features judiciously, and choosing appropriate write policies allow applications to tap the full capability of a cached Cortex-M system.