Some ARM Cortex-M processors feature instruction and data caches to improve performance by reducing accesses to slower main memory. Caches take advantage of locality principles – the tendency for programs to access the same data and instructions repeatedly over a short period of time. By storing this frequently accessed information in smaller, faster memory close to the processor, a cache can service many memory accesses itself instead of making the processor wait to retrieve data from main memory. This improves performance by reducing average memory access latency.
Instruction Cache
The instruction cache stores instructions that have been recently fetched from memory. When the processor needs to fetch the next instruction in a program’s execution flow, it first checks if that instruction is available in the instruction cache. If so, the instruction can be read from the faster cache instead of waiting to access main memory. This avoids stalling the processor pipeline while waiting for instructions.
Among the mainstream Cortex-M cores, the Cortex-M7 includes an (optional) instruction cache; the M0+, M3, M4, and M33 do not, though newer cores such as the Cortex-M35P, M55, and M85 add one. The M7 instruction cache is 2-way set associative with a size configurable from 4KB up to 64KB. Each cache line stores 32 bytes (eight 32-bit or sixteen 16-bit Thumb instructions), so a full 64KB configuration holds 1024 sets of two ways each.
When fetching instructions, the processor uses a subset of the instruction’s memory address to determine which set it maps to in the cache. It then checks the four ways in that set to see if the desired instructions are present. If not, it fetches the instructions from main memory and stores them in the cache, potentially displacing a less recently used cache line.
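To make the mapping concrete, here is a minimal host-side sketch of how an address splits into tag, set index, and line offset, assuming the 64KB, 2-way, 32-byte-line geometry described above (the real lookup is of course done in hardware):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry: a 64KB, 2-way cache with 32-byte lines
 * has 64KB / 32B / 2 = 1024 sets. */
#define LINE_BYTES 32u
#define NUM_SETS   1024u

static void decode_address(uint32_t addr)
{
    uint32_t offset = addr % LINE_BYTES;              /* byte within the line  */
    uint32_t set    = (addr / LINE_BYTES) % NUM_SETS; /* which set to probe    */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS); /* compared per way      */
    printf("addr=0x%08lx -> tag=0x%lx set=%lu offset=%lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)set, (unsigned long)offset);
}

int main(void)
{
    decode_address(0x08001234u); /* a typical flash address on many MCUs */
    return 0;
}
```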
Coherency with the instruction cache is software's responsibility: the cache does not automatically track writes to code regions. After copying or modifying code in RAM, software must clean the data cache and invalidate the affected instruction cache contents before executing the new code. The cache also honours the memory map's attributes, leaving instructions fetched from non-cacheable regions uncached.
Benefits of the Instruction Cache
The benefits of the Cortex-M instruction cache include:
- Reduces the average access latency for fetching instructions
- Avoids processor stalling while waiting for instructions from slow memory
- Decouples the core from slower memories, letting it run closer to its full clock rate
- Conserves memory bandwidth since some instruction fetches are serviced from cache
Instruction Cache Performance Factors
The performance benefit of the instruction cache depends on several factors:
- Cache hit rate – The percentage of instruction fetches found in cache. A higher hit rate reduces accesses to main memory.
- Memory access latency – The cycles required to read instructions from main memory. More latency makes caching more impactful.
- Code locality – How well the program code exhibits spatial and temporal locality. This determines the potential hit rate.
- Cache coherency activity – Write invalidations and cache line flushing can reduce hit rates.
Well-structured programs executing tight loops experience high instruction cache hit rates. Large code size, many branches, and poor locality in program flow can lower hit rates. The memory system latency and bus clock ratio also influence the performance benefit.
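These factors combine into the familiar average memory access time figure: AMAT = hit time + miss rate × miss penalty. A small sketch with purely illustrative cycle counts shows how quickly the average climbs as the hit rate drops:

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
 * The cycle counts below are assumptions for illustration only,
 * not figures for any specific Cortex-M part. */
int main(void)
{
    double hit_time     = 1.0;   /* cycles for a cache hit (assumed)     */
    double miss_penalty = 20.0;  /* cycles to fetch from flash (assumed) */

    for (double hit_rate = 0.80; hit_rate <= 1.0001; hit_rate += 0.05) {
        double amat = hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("hit rate %.0f%% -> %.2f cycles average\n",
               hit_rate * 100.0, amat);
    }
    return 0;
}
```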
Data Cache
In addition to the instruction cache, a Cortex-M processor can also have a data cache. It holds data operands fetched from memory, including global and static variables, stack data, and heap-allocated objects.
On the Cortex-M7 the data cache sits alongside the instruction cache as a physically separate memory (a Harvard arrangement). Variants such as the M3, M4, and M33 lack integrated data caching support.
Much like the instruction cache, the processor checks for desired data first in the data cache before accessing main memory. If found, the processor gets the data faster without waiting on the slower memory access. Data caching utilizes the temporal and spatial locality exhibited by program data accesses.
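As a concrete illustration of spatial locality, traversing a 2-D array row by row touches consecutive bytes within each fetched cache line, while a column-first traversal jumps across lines on nearly every access (the array dimensions here are arbitrary):

```c
#include <stdint.h>

#define ROWS 64
#define COLS 64

static uint32_t table[ROWS][COLS];

/* Cache-friendly: consecutive elements share a cache line, so one
 * line fill services several accesses (spatial locality). */
uint32_t sum_row_major(void)
{
    uint32_t sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

/* Cache-hostile: each access lands COLS * 4 bytes away from the
 * previous one, touching a different line almost every time. */
uint32_t sum_column_major(void)
{
    uint32_t sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}
```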
Write Policy
An important consideration for data caching is the cache’s write policy. This determines how the cache handles write operations to cached data:
- Write-through – Data is written to both the cache line and main memory. Memory always holds current data, but every write pays the memory latency and consumes bus bandwidth.
- Write-back – Data is written only to the cache line, which is marked dirty; main memory is updated when the line is evicted or explicitly cleaned. This minimizes write latency and traffic but requires cache maintenance before other bus masters read the data.
Write-through has the advantage of simpler coherency management. Write-back minimizes latency but needs explicit cleaning to synchronize cache contents back to main memory.
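For example, with a write-back cache a freshly written buffer may exist only in the cache, so it must be cleaned to main memory before another bus master such as a DMA controller reads it. A minimal Cortex-M7 sketch using the CMSIS-Core maintenance functions (start_dma_tx is a hypothetical stand-in for your DMA driver):

```c
#include "core_cm7.h"   /* CMSIS-Core; a real project includes its device header */
#include <stdint.h>

/* Buffer aligned to the 32-byte cache line size so the clean
 * operation does not touch neighbouring data. */
static uint8_t tx_buf[256] __attribute__((aligned(32)));

void start_dma_tx(const uint8_t *buf, uint32_t len); /* hypothetical driver */

void send_buffer(void)
{
    /* ... fill tx_buf ... */

    /* With a write-back cache the newest data may only exist in the
     * cache; clean (write back) the lines covering the buffer so the
     * DMA engine sees it in RAM. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, sizeof(tx_buf));
    __DSB();   /* ensure the clean completes before starting the DMA */

    start_dma_tx(tx_buf, sizeof(tx_buf));
}
```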
Data Cache Performance Factors
As with instruction caching, data cache performance depends on:
- Hit rate – Higher rates reduce accesses to main memory
- Memory latency – More cycles waiting on memory makes caching more beneficial
- Locality – Programs with good locality get higher hit rates
- Coherency – Write invalidations and flushing hurt hit rates
The processor’s data access patterns and memory system characteristics determine the potential advantage. Tight execution loops that operate on local data see the most benefit. Large working set sizes that exceed the cache capacity dilute the benefits.
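One common mitigation is to process a large buffer in chunks small enough to stay cache-resident between passes. A minimal sketch, where process() is a hypothetical worker that must visit the data twice:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 4096u  /* assumed to fit comfortably in the data cache */

void process(uint8_t *data, size_t len); /* hypothetical per-pass worker */

/* Instead of making two full passes over a large buffer (evicting it
 * between passes), run both passes chunk by chunk while each chunk is
 * still cache-resident. */
void process_in_chunks(uint8_t *buf, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK_BYTES) {
        size_t n = (total - off < CHUNK_BYTES) ? (total - off) : CHUNK_BYTES;
        process(buf + off, n);   /* pass 1: fills the cache             */
        process(buf + off, n);   /* pass 2: likely served from cache    */
    }
}
```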
Cortex-M Cache Implementation
Let’s take a closer look at how data and instruction caching is implemented on certain Cortex-M variants:
Cortex-M3/M4/M7
- M3 has no data or instruction cache
- M4 likewise has no integrated caches (silicon vendors often compensate with proprietary flash accelerators)
- M7 optionally implements separate instruction and data caches, each configurable up to 64KB
The Cortex-M7 data cache is 4-way set associative with 32-byte lines and a size configurable from 4KB to 64KB. Whether a given region is cached write-through or write-back is controlled by its memory attributes, typically configured via the MPU. The data cache reduces the cycles needed to access variables and arrays by keeping them on-chip, avoiding stalls on reads from slower external memory.
The M7's two caches are independent and sit on separate buses, so data accesses never evict cached instructions and vice versa, and instruction and data fetches can be serviced in the same cycle. The M7 therefore sees performance benefits from both instruction and data caching.
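Because the cache sizes are chosen by the silicon vendor, firmware can query the implemented geometry at run time rather than hard-coding it. A sketch reading the ARMv7-M Cache Size ID register through CMSIS (field offsets per the ARMv7-M architecture reference):

```c
#include "core_cm7.h"   /* CMSIS-Core for Cortex-M7 */
#include <stdint.h>
#include <stdio.h>

/* CSSELR selects which cache CCSIDR reports on (0 = data, 1 = instruction). */
void print_dcache_geometry(void)
{
    SCB->CSSELR = 0u;                /* select the data cache        */
    __DSB();                         /* ensure the selection is seen */
    uint32_t ccsidr = SCB->CCSIDR;

    uint32_t sets  = ((ccsidr >> 13) & 0x7FFFu) + 1u; /* NumSets + 1      */
    uint32_t ways  = ((ccsidr >>  3) & 0x3FFu) + 1u;  /* Associativity +1 */
    uint32_t words = 1u << ((ccsidr & 0x7u) + 2u);    /* words per line   */

    printf("D-cache: %lu sets x %lu ways x %lu bytes/line = %lu KB\n",
           (unsigned long)sets, (unsigned long)ways,
           (unsigned long)(words * 4u),
           (unsigned long)(sets * ways * words * 4u / 1024u));
}
```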
Cortex-M33
Unlike the M7, the Cortex-M33 has no integrated instruction or data caches. Silicon vendors frequently pair it with their own cache or flash-accelerator block in front of slow flash memory; the size, associativity, line length, write policy, and replacement scheme of such a cache are implementation specific, so consult the MCU's reference manual. For designs that need architecturally defined split instruction and data caches, newer cores such as the Cortex-M55 and Cortex-M85 provide them as optional features, allowing simultaneous low-latency access to both instructions and data.
Enabling and Disabling Caches
In some cases, you may want to disable caching in Cortex-M processors, trading throughput for predictability. Reasons include:
- Deterministic real-time response – Caching makes access time non-deterministic
- Low-memory systems – Caching requires dedicated on-chip memory
- Data coherency concerns – Caching complicates coherency management
To disable or bypass caches (a sketch follows the list):
- Clear the cache enable bits in the CCR register; the CMSIS-Core functions SCB_DisableICache() and SCB_DisableDCache() do this safely, performing the required clean and invalidate operations
- Mark selected regions non-cacheable through MPU memory attributes so that only specific buffers bypass the cache
- When toggling caches at run time, use memory barriers and cache maintenance operations so that no stale lines are ever observed
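A minimal sketch using the CMSIS-Core functions available on the Cortex-M7:

```c
#include "core_cm7.h"   /* CMSIS-Core; a real project includes its device header */

/* Enable both caches early in boot (e.g. from SystemInit or main).
 * SCB_EnableICache/SCB_EnableDCache invalidate the cache arrays
 * first, so no stale lines survive from before reset. */
void caches_on(void)
{
    SCB_EnableICache();
    SCB_EnableDCache();
}

/* Disable both caches, e.g. before handing control to a boot stage
 * that assumes uncached memory. SCB_DisableDCache() cleans and
 * invalidates the data cache so dirty lines reach main memory. */
void caches_off(void)
{
    SCB_DisableDCache();
    SCB_DisableICache();
    __DSB();   /* complete outstanding memory transactions        */
    __ISB();   /* flush the pipeline so the new state takes effect */
}
```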
Deterministic hard real-time systems often require caches to be disabled, and multicore coherency may also call for it. This trades throughput for predictability and simpler coordination.
Multicore Cache Coherency
In multiprocessor Cortex-M systems with shared memory, cache coherency is critical: without it, different cores can end up with inconsistent views of memory contents. Techniques for maintaining coherency include:
- Snooping other cores’ memory transactions (where the bus hardware supports it)
- Invalidating local cache lines when another master writes the underlying memory
- Using write-through caches
- Cache line flush operations
- Barriers/fences to order accesses
Cortex-M processors do not provide hardware cache coherency between cores. Dual-core parts (for example, a Cortex-M7 paired with a Cortex-M4) rely on careful software management: explicit clean and invalidate operations around shared buffers, non-cacheable shared regions, or disabling caches for the memory both cores touch.
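A sketch of that software-managed approach for a shared buffer, using CMSIS-Core maintenance calls on the cached (M7) side; the buffer placement and the signaling mechanism are assumptions that depend on your platform:

```c
#include "core_cm7.h"
#include <stdint.h>

/* Shared buffer in RAM visible to both cores, aligned and sized to
 * whole 32-byte cache lines so maintenance never clips neighbours. */
static uint8_t shared_buf[64] __attribute__((aligned(32)));

/* Producer core: publish data for the other core. */
void publish(const uint8_t *src, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        shared_buf[i] = src[i];

    /* Push dirty lines out to the shared RAM the other core reads. */
    SCB_CleanDCache_by_Addr((uint32_t *)shared_buf, sizeof(shared_buf));
    __DSB();
    /* ...then signal the consumer, e.g. via an interprocessor interrupt. */
}

/* Consumer core: discard any stale cached copy before reading. */
void consume(uint8_t *dst, uint32_t len)
{
    SCB_InvalidateDCache_by_Addr((uint32_t *)shared_buf, sizeof(shared_buf));
    for (uint32_t i = 0; i < len; i++)
        dst[i] = shared_buf[i];
}
```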
Optimizing Cache Performance
Software optimizations to improve caching performance involve maximizing locality and minimizing coherency overhead. Recommendations include:
- Organize data structures so accesses are sequential, making full use of each fetched line
- Reuse data soon after first touching it (temporal locality), for example by fusing loops that traverse the same array
- Avoid unnecessary writes to cached shared data
- Use memory barriers to order accesses to shared regions
- Align and pad shared or DMA buffers to cache line boundaries (see the sketch below)
- Minimize the working set size so it fits within the cache
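As an example of the alignment point, the sketch below aligns a DMA buffer and pads a pair of producer/consumer indices so each lives in its own cache line, avoiding false sharing (sizes assume the 32-byte Cortex-M7 line):

```c
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 cache line size in bytes */

/* Each index gets a whole line to itself, so cleaning or
 * invalidating one never disturbs the other (no false sharing).
 * The layout is illustrative; adapt it to your own structures. */
typedef struct {
    volatile uint32_t head;
    uint8_t pad1[CACHE_LINE - sizeof(uint32_t)]; /* own line for head */
    volatile uint32_t tail;
    uint8_t pad2[CACHE_LINE - sizeof(uint32_t)]; /* own line for tail */
} __attribute__((aligned(32))) ring_index_t;

/* DMA buffer aligned and sized to whole cache lines, so maintenance
 * operations on it cover exactly the buffer and nothing else. */
static uint8_t dma_buf[512] __attribute__((aligned(32)));
```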
Toolchain options like loop unrolling and function inlining also help improve instruction cache performance, and profile-guided optimization can guide data structure layout.
Conclusion
In summary, instruction and data caching in Cortex-M processors reduces access latency by exploiting locality. Larger caches with higher hit rates provide better performance. Optimized software also improves locality and caching efficiency. Appropriate coherency management ensures correctness across multiple cores. Configurable caching in Cortex-M enables trading off determinism for faster access in embedded applications.
Through caching, Cortex-M processors deliver high performance despite slower memory systems. Caching leverages on-chip memory to hide memory latency and keep the pipelines filled. For memory-intensive workloads, instruction and data caches are key to achieving high throughput in embedded ARM applications.