Some ARM Cortex-M processors feature instruction and data caches to improve performance by reducing accesses to slower main memory. Caches take advantage of locality principles – the tendency for programs to access the same data and instructions repeatedly over a short period of time. By storing this frequently accessed information in smaller, faster memory close to the processor, a cache can service many memory accesses itself instead of making the processor wait to retrieve data from main memory. This improves performance by reducing average memory access latency.
Instruction Cache
The instruction cache stores instructions that have been recently fetched from memory. When the processor needs to fetch the next instruction in a program’s execution flow, it first checks if that instruction is available in the instruction cache. If so, the instruction can be read from the faster cache instead of waiting to access main memory. This avoids stalling the processor pipeline while waiting for instructions.
Among the mainstream Cortex-M cores, the Cortex-M7 includes an (optional) instruction cache; the M0+, M3, M4, and M33 do not, though newer cores such as the Cortex-M35P, M55, and M85 add one. The M7 instruction cache is 2-way set associative with a size configurable from 4KB up to 64KB. Each cache line stores 32 bytes (eight 32-bit or sixteen 16-bit Thumb instructions), so a full 64KB configuration holds 1024 sets of two ways each.
When fetching instructions, the processor uses a subset of the instruction’s memory address to determine which set it maps to in the cache. It then checks the four ways in that set to see if the desired instructions are present. If not, it fetches the instructions from main memory and stores them in the cache, potentially displacing a less recently used cache line.
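To make the mapping concrete, here is a minimal host-side sketch of how an address splits into tag, set index, and line offset, assuming the 64KB, 2-way, 32-byte-line geometry described above (the real lookup is of course done in hardware):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry: a 64KB, 2-way cache with 32-byte lines
 * has 64KB / 32B / 2 = 1024 sets. */
#define LINE_BYTES 32u
#define NUM_SETS   1024u

static void decode_address(uint32_t addr)
{
    uint32_t offset = addr % LINE_BYTES;              /* byte within the line  */
    uint32_t set    = (addr / LINE_BYTES) % NUM_SETS; /* which set to probe    */
    uint32_t tag    = addr / (LINE_BYTES * NUM_SETS); /* compared per way      */
    printf("addr=0x%08lx -> tag=0x%lx set=%lu offset=%lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)set, (unsigned long)offset);
}

int main(void)
{
    decode_address(0x08001234u); /* a typical flash address on many MCUs */
    return 0;
}
```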
Coherency with the instruction cache is software's responsibility: the cache does not automatically track writes to code regions. After copying or modifying code in RAM, software must clean the data cache and invalidate the affected instruction cache contents before executing the new code. The cache also honours the memory map's attributes, leaving instructions fetched from non-cacheable regions uncached.
Benefits of the Instruction Cache
The benefits of the Cortex-M instruction cache include:
- Reduces the average access latency for fetching instructions
- Avoids processor stalling while waiting for instructions from slow memory
- Decouples the core from slower memories, letting it run closer to its full clock rate
- Conserves memory bandwidth since some instruction fetches are serviced from cache
Instruction Cache Performance Factors
The performance benefit of the instruction cache depends on several factors:
- Cache hit rate – The percentage of instruction fetches found in cache. A higher hit rate reduces accesses to main memory.
- Memory access latency – The cycles required to read instructions from main memory. More latency makes caching more impactful.
- Code locality – How well the program code exhibits spatial and temporal locality. This determines the potential hit rate.
- Cache coherency activity – Write invalidations and cache line flushing can reduce hit rates.
Well-structured programs executing tight loops experience high instruction cache hit rates. Large code size, many branches, and poor locality in program flow can lower hit rates. The memory system latency and bus clock ratio also influence the performance benefit.
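These factors combine into the familiar average memory access time figure: AMAT = hit time + miss rate × miss penalty. A small sketch with purely illustrative cycle counts shows how quickly the average climbs as the hit rate drops:

```c
#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
 * The cycle counts below are assumptions for illustration only,
 * not figures for any specific Cortex-M part. */
int main(void)
{
    double hit_time     = 1.0;   /* cycles for a cache hit (assumed)     */
    double miss_penalty = 20.0;  /* cycles to fetch from flash (assumed) */

    for (double hit_rate = 0.80; hit_rate <= 1.0001; hit_rate += 0.05) {
        double amat = hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("hit rate %.0f%% -> %.2f cycles average\n",
               hit_rate * 100.0, amat);
    }
    return 0;
}
```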
Data Cache
In addition to the instruction cache, a Cortex-M processor can also have a data cache. It holds data operands fetched from memory, including global and static variables, stack data, and heap-allocated objects.
On the Cortex-M7 the data cache sits alongside the instruction cache as a physically separate memory (a Harvard arrangement). Variants such as the M3, M4, and M33 lack integrated data caching support.
Much like the instruction cache, the processor checks for desired data first in the data cache before accessing main memory. If found, the processor gets the data faster without waiting on the slower memory access. Data caching utilizes the temporal and spatial locality exhibited by program data accesses.
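As a concrete illustration of spatial locality, traversing a 2-D array row by row touches consecutive bytes within each fetched cache line, while a column-first traversal jumps across lines on nearly every access (the array dimensions here are arbitrary):

```c
#include <stdint.h>

#define ROWS 64
#define COLS 64

static uint32_t table[ROWS][COLS];

/* Cache-friendly: consecutive elements share a cache line, so one
 * line fill services several accesses (spatial locality). */
uint32_t sum_row_major(void)
{
    uint32_t sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += table[r][c];
    return sum;
}

/* Cache-hostile: each access lands COLS * 4 bytes away from the
 * previous one, touching a different line almost every time. */
uint32_t sum_column_major(void)
{
    uint32_t sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += table[r][c];
    return sum;
}
```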
Write Policy
An important consideration for data caching is the cache’s write policy. This determines how the cache handles write operations to cached data:
- Write-through – Data is written to both the cache line and main memory. Memory always holds current data, but every write pays the memory latency and consumes bus bandwidth.
- Write-back – Data is written only to the cache line, which is marked dirty; main memory is updated when the line is evicted or explicitly cleaned. This minimizes write latency and traffic but requires cache maintenance before other bus masters read the data.
Write-through has the advantage of simpler coherency management. Write-back minimizes latency but needs explicit cleaning to synchronize cache contents back to main memory.
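For example, with a write-back cache a freshly written buffer may exist only in the cache, so it must be cleaned to main memory before another bus master such as a DMA controller reads it. A minimal Cortex-M7 sketch using the CMSIS-Core maintenance functions (start_dma_tx is a hypothetical stand-in for your DMA driver):

```c
#include "core_cm7.h"   /* CMSIS-Core; a real project includes its device header */
#include <stdint.h>

/* Buffer aligned to the 32-byte cache line size so the clean
 * operation does not touch neighbouring data. */
static uint8_t tx_buf[256] __attribute__((aligned(32)));

void start_dma_tx(const uint8_t *buf, uint32_t len); /* hypothetical driver */

void send_buffer(void)
{
    /* ... fill tx_buf ... */

    /* With a write-back cache the newest data may only exist in the
     * cache; clean (write back) the lines covering the buffer so the
     * DMA engine sees it in RAM. */
    SCB_CleanDCache_by_Addr((uint32_t *)tx_buf, sizeof(tx_buf));
    __DSB();   /* ensure the clean completes before starting the DMA */

    start_dma_tx(tx_buf, sizeof(tx_buf));
}
```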
Data Cache Performance Factors
As with instruction caching, data cache performance depends on:
- Hit rate – Higher rates reduce accesses to main memory
- Memory latency – More cycles waiting on memory makes caching more beneficial
- Locality – Programs with good locality get higher hit rates
- Coherency – Write invalidations and flushing hurt hit rates
The processor’s data access patterns and memory system characteristics determine the potential advantage. Tight execution loops that operate on local data see the most benefit. Large working set sizes that exceed the cache capacity dilute the benefits.
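One common mitigation is to process a large buffer in chunks small enough to stay cache-resident between passes. A minimal sketch, where process() is a hypothetical worker that must visit the data twice:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES 4096u  /* assumed to fit comfortably in the data cache */

void process(uint8_t *data, size_t len); /* hypothetical per-pass worker */

/* Instead of making two full passes over a large buffer (evicting it
 * between passes), run both passes chunk by chunk while each chunk is
 * still cache-resident. */
void process_in_chunks(uint8_t *buf, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK_BYTES) {
        size_t n = (total - off < CHUNK_BYTES) ? (total - off) : CHUNK_BYTES;
        process(buf + off, n);   /* pass 1: fills the cache             */
        process(buf + off, n);   /* pass 2: likely served from cache    */
    }
}
```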
Cortex-M Cache Implementation
Let’s take a closer look at how data and instruction caching is implemented on certain Cortex-M variants:
Cortex-M3/M4/M7
- M3 has no data or instruction cache
- M4 likewise has no integrated caches (silicon vendors often compensate with proprietary flash accelerators)
- M7 optionally implements separate instruction and data caches, each configurable up to 64KB
The Cortex-M7 data cache is 4-way set associative with 32-byte lines and a size configurable from 4KB to 64KB. Whether a given region is cached write-through or write-back is controlled by its memory attributes, typically configured via the MPU. The data cache reduces the cycles needed to access variables and arrays by keeping them on-chip, avoiding stalls on reads from slower external memory.
The M7's two caches are independent and sit on separate buses, so data accesses never evict cached instructions and vice versa, and instruction and data fetches can be serviced in the same cycle. The M7 therefore sees performance benefits from both instruction and data caching.
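Because the cache sizes are chosen by the silicon vendor, firmware can query the implemented geometry at run time rather than hard-coding it. A sketch reading the ARMv7-M Cache Size ID register through CMSIS (field offsets per the ARMv7-M architecture reference):

```c
#include "core_cm7.h"   /* CMSIS-Core for Cortex-M7 */
#include <stdint.h>
#include <stdio.h>

/* CSSELR selects which cache CCSIDR reports on (0 = data, 1 = instruction). */
void print_dcache_geometry(void)
{
    SCB->CSSELR = 0u;                /* select the data cache        */
    __DSB();                         /* ensure the selection is seen */
    uint32_t ccsidr = SCB->CCSIDR;

    uint32_t sets  = ((ccsidr >> 13) & 0x7FFFu) + 1u; /* NumSets + 1      */
    uint32_t ways  = ((ccsidr >>  3) & 0x3FFu) + 1u;  /* Associativity +1 */
    uint32_t words = 1u << ((ccsidr & 0x7u) + 2u);    /* words per line   */

    printf("D-cache: %lu sets x %lu ways x %lu bytes/line = %lu KB\n",
           (unsigned long)sets, (unsigned long)ways,
           (unsigned long)(words * 4u),
           (unsigned long)(sets * ways * words * 4u / 1024u));
}
```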
Cortex-M33
Unlike the M7, the Cortex-M33 has no integrated instruction or data caches. Silicon vendors frequently pair it with their own cache or flash-accelerator block in front of slow flash memory; the size, associativity, line length, write policy, and replacement scheme of such a cache are implementation specific, so consult the MCU's reference manual. For designs that need architecturally defined split instruction and data caches, newer cores such as the Cortex-M55 and Cortex-M85 provide them as optional features, allowing simultaneous low-latency access to both instructions and data.
Enabling and Disabling Caches
In some cases, you may want to disable caching in Cortex-M processors, trading throughput for predictability. Reasons include:
- Deterministic real-time response – Caching makes access time non-deterministic
- Low-memory systems – Caching requires dedicated on-chip memory
- Data coherency concerns – Caching complicates coherency management
To disable or bypass caches (a sketch follows the list):
- Clear the cache enable bits in the CCR register; the CMSIS-Core functions SCB_DisableICache() and SCB_DisableDCache() do this safely, performing the required clean and invalidate operations
- Mark selected regions non-cacheable through MPU memory attributes so that only specific buffers bypass the cache
- When toggling caches at run time, use memory barriers and cache maintenance operations so that no stale lines are ever observed
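A minimal sketch using the CMSIS-Core functions available on the Cortex-M7:

```c
#include "core_cm7.h"   /* CMSIS-Core; a real project includes its device header */

/* Enable both caches early in boot (e.g. from SystemInit or main).
 * SCB_EnableICache/SCB_EnableDCache invalidate the cache arrays
 * first, so no stale lines survive from before reset. */
void caches_on(void)
{
    SCB_EnableICache();
    SCB_EnableDCache();
}

/* Disable both caches, e.g. before handing control to a boot stage
 * that assumes uncached memory. SCB_DisableDCache() cleans and
 * invalidates the data cache so dirty lines reach main memory. */
void caches_off(void)
{
    SCB_DisableDCache();
    SCB_DisableICache();
    __DSB();   /* complete outstanding memory transactions        */
    __ISB();   /* flush the pipeline so the new state takes effect */
}
```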
Deterministic hard real-time systems often require caches to be disabled, and multicore coherency may also call for it. This trades throughput for predictability and simpler coordination.
Multicore Cache Coherency
In multiprocessor Cortex-M systems with shared memory, cache coherency is critical: without it, different cores can end up with inconsistent views of memory contents. Techniques for maintaining coherency include:
- Snooping other cores’ memory transactions (where the bus hardware supports it)
- Invalidating local cache lines when another master writes the underlying memory
- Using write-through caches
- Cache line flush operations
- Barriers/fences to order accesses
Cortex-M processors do not provide hardware cache coherency between cores. Dual-core parts (for example, a Cortex-M7 paired with a Cortex-M4) rely on careful software management: explicit clean and invalidate operations around shared buffers, non-cacheable shared regions, or disabling caches for the memory both cores touch.
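A sketch of that software-managed approach for a shared buffer, using CMSIS-Core maintenance calls on the cached (M7) side; the buffer placement and the signaling mechanism are assumptions that depend on your platform:

```c
#include "core_cm7.h"
#include <stdint.h>

/* Shared buffer in RAM visible to both cores, aligned and sized to
 * whole 32-byte cache lines so maintenance never clips neighbours. */
static uint8_t shared_buf[64] __attribute__((aligned(32)));

/* Producer core: publish data for the other core. */
void publish(const uint8_t *src, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        shared_buf[i] = src[i];

    /* Push dirty lines out to the shared RAM the other core reads. */
    SCB_CleanDCache_by_Addr((uint32_t *)shared_buf, sizeof(shared_buf));
    __DSB();
    /* ...then signal the consumer, e.g. via an interprocessor interrupt. */
}

/* Consumer core: discard any stale cached copy before reading. */
void consume(uint8_t *dst, uint32_t len)
{
    SCB_InvalidateDCache_by_Addr((uint32_t *)shared_buf, sizeof(shared_buf));
    for (uint32_t i = 0; i < len; i++)
        dst[i] = shared_buf[i];
}
```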
Optimizing Cache Performance
Software optimizations to improve caching performance involve maximizing locality and minimizing coherency overhead. Recommendations include:
- Organize data structures so accesses are sequential, making full use of each fetched line
- Reuse data soon after first touching it (temporal locality), for example by fusing loops that traverse the same array
- Avoid unnecessary writes to cached shared data
- Use memory barriers to order accesses to shared regions
- Align and pad shared or DMA buffers to cache line boundaries (see the sketch below)
- Minimize the working set size so it fits within the cache
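As an example of the alignment point, the sketch below aligns a DMA buffer and pads a pair of producer/consumer indices so each lives in its own cache line, avoiding false sharing (sizes assume the 32-byte Cortex-M7 line):

```c
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 cache line size in bytes */

/* Each index gets a whole line to itself, so cleaning or
 * invalidating one never disturbs the other (no false sharing).
 * The layout is illustrative; adapt it to your own structures. */
typedef struct {
    volatile uint32_t head;
    uint8_t pad1[CACHE_LINE - sizeof(uint32_t)]; /* own line for head */
    volatile uint32_t tail;
    uint8_t pad2[CACHE_LINE - sizeof(uint32_t)]; /* own line for tail */
} __attribute__((aligned(32))) ring_index_t;

/* DMA buffer aligned and sized to whole cache lines, so maintenance
 * operations on it cover exactly the buffer and nothing else. */
static uint8_t dma_buf[512] __attribute__((aligned(32)));
```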
Toolchain options like loop unrolling and function inlining also help improve instruction cache performance, and profile-guided optimization can guide data structure layout.
Conclusion
In summary, instruction and data caching in Cortex-M processors reduces access latency by exploiting locality. Larger caches with higher hit rates provide better performance. Optimized software also improves locality and caching efficiency. Appropriate coherency management ensures correctness across multiple cores. Configurable caching in Cortex-M enables trading off determinism for faster access in embedded applications.
Through caching, Cortex-M processors deliver high performance despite slower memory systems. Caching leverages on-chip memory to hide memory latency and keep the pipelines filled. For memory-intensive workloads, instruction and data caches are key to achieving high throughput in embedded ARM applications.