The instruction cache found in some ARM Cortex-M microcontrollers is a small, fast memory that stores recently accessed instructions to improve performance. It sits between the CPU and main memory, so the CPU does not have to access slower flash or external memory as often. This speeds up instruction fetches and overall execution.
What is an Instruction Cache?
An instruction cache is a hardware cache that speeds up program execution by reducing the time spent waiting for instructions to be fetched from memory. It is a memory bank that stores copies of recently used instructions from main memory. When the processor needs to fetch an instruction, it first checks the instruction cache. If the instruction is present, it is read from the faster cache memory instead of the slower main memory.
Caches exploit locality of reference in typical programs. Instructions executed close together in time tend to sit close together in memory (spatial locality), and recently executed instructions, such as the body of a loop, tend to be executed again soon (temporal locality). The cache acts as a buffer of recent instruction history, so an instruction that is reused does not have to be re-fetched from main memory.
Why Have an Instruction Cache?
There is a significant speed difference between a processor’s clock speed and main memory access times. This speed gap means the CPU is often idle, waiting for instructions to be fetched from memory. The instruction cache reduces this wait time by providing faster access to cached instructions.
Main benefits of an instruction cache:
- Improves performance by reducing the average cost and time for instruction fetches
- Increases the speed at which a processor can execute instructions and programs
- Helps avoid stalling the CPU on instruction fetch cycles
- Makes better use of the faster CPU clock cycles
- Buffers recent instruction history to capitalize on locality of reference principles
Overall, the instruction cache improves performance by taking advantage of program locality and bridging the gap between CPU and main memory speeds.
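The effect of the hit rate on average fetch time can be sketched with a simple weighted average. This is a minimal illustration, not a model of any particular device: the function names and the cycle counts passed to them are assumptions chosen for the example.

```c
#include <stdio.h>

/* Average cycles per instruction fetch for a given hit rate.
 * hit_cycles and miss_cycles are illustrative inputs, not datasheet values. */
double effective_fetch_cycles(double hit_rate, double hit_cycles, double miss_cycles)
{
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

/* Print the average fetch cost for a few hit rates. */
void print_fetch_table(double hit_cycles, double miss_cycles)
{
    const double rates[] = { 0.0, 0.5, 0.9, 0.99 };
    for (size_t i = 0; i < sizeof rates / sizeof rates[0]; i++) {
        printf("hit rate %.2f -> %.2f cycles per fetch\n",
               rates[i],
               effective_fetch_cycles(rates[i], hit_cycles, miss_cycles));
    }
}
```

Calling `print_fetch_table(1.0, 8.0)` shows how quickly the average approaches the hit cost once most fetches hit, which is the gap-bridging effect described above.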
Instruction Cache in ARM Cortex-M
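The ARM Cortex-M series is a family of 32-bit RISC processor cores designed for embedded and IoT applications. Only some Cortex-M cores can include an instruction cache, and where it is available it is an option chosen by the chip vendor. For example:
- Cortex-M7 can include a 2-way set-associative instruction cache
- Cortex-M4 has no instruction cache in the core, although some vendors add their own cache or flash accelerator around it
- Cortex-M3 does not have an instruction cache
The presence and size of the instruction cache differ between cores and between devices built on the same core. In all cases it serves the same purpose: reducing instruction fetch times to improve performance.
Cortex-M7 Instruction Cache
The Cortex-M7 instruction cache is 2-way set-associative (the 4-way organization belongs to its data cache). The chip vendor fixes its parameters when the device is designed:
- Total cache size – 4KB to 64KB, or no cache at all
- Line length – fixed at 8 words (32 bytes)
- Read latency – a hit is served from on-chip cache RAM; the cost of a miss depends on the memory behind the cache and is device-specific
- Number of sets – 64 to 1024, following from the size (sets = size / (2 ways × 32 bytes))
The CPU first checks the instruction cache when fetching instructions. A cache hit lets the fetch complete without waiting on the slower memory. A cache miss requires fetching the line from flash or external memory.
Set associativity improves the hit rate compared with a direct-mapped cache. With two ways, two different cache lines that map to the same set can reside in the cache at once. When both ways are occupied and a new line is fetched, a replacement policy chooses which line to evict. Least Recently Used (LRU) is the textbook policy; real cores often use cheaper schemes such as pseudo-random replacement, so the Technical Reference Manual for the core is the authority here.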
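How a fetch address maps onto a set can be sketched as follows. The type and function names are hypothetical, and the geometry used in the usage note (16 KB, 2 ways, 32-byte lines) is an illustrative assumption rather than the configuration of any specific device.

```c
#include <stdint.h>

/* Geometry of a set-associative cache. All values are illustrative. */
typedef struct {
    uint32_t size_bytes;   /* total capacity */
    uint32_t ways;         /* lines per set */
    uint32_t line_bytes;   /* bytes per line */
} cache_geometry;

/* The three fields a cache extracts from a fetch address. */
typedef struct {
    uint32_t tag;      /* identifies which memory line occupies a way */
    uint32_t set;      /* selects the set to search */
    uint32_t offset;   /* byte position inside the line */
} cache_address;

uint32_t cache_num_sets(const cache_geometry *g)
{
    return g->size_bytes / (g->ways * g->line_bytes);
}

cache_address cache_split(const cache_geometry *g, uint32_t addr)
{
    uint32_t sets = cache_num_sets(g);
    cache_address a;
    a.offset = addr % g->line_bytes;
    a.set    = (addr / g->line_bytes) % sets;
    a.tag    = addr / (g->line_bytes * sets);
    return a;
}
```

With a geometry of `{ 16384, 2, 32 }` there are 256 sets, and two addresses that are 8192 bytes apart land in the same set with different tags. Those are the lines that compete for the ways of that set.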
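Instruction Caches on Cortex-M4 Devices
The Cortex-M4 core itself does not contain an instruction cache. When a Cortex-M4 microcontroller caches instructions, it does so with a vendor-specific block placed between the core and flash, so the parameters come from the device reference manual rather than from ARM. A small cache of this kind might look like:
- 2-way set associativity
- Up to 128 sets
- Line length of 8 words (32 bytes)
- Cache size from 1KB to 8KB (128 sets × 2 ways × 32 bytes gives the 8KB maximum)
- Read latency from 1 cycle on a hit up to several cycles on a miss, depending on the flash wait states
Again, the cache is checked first for instruction fetches, and a hit avoids the flash wait states. The 2-way associativity improves performance compared to a direct-mapped cache.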
How the Instruction Cache Works
The instruction cache organizes cached instructions into lines, and lines into sets. Its behavior is defined by policies such as:
- Cache line fetch – Fetches aligned blocks of instructions, one full line at a time
- Write policy – The instruction cache is read-only, so it never holds modified data and CPU writes go to memory without passing through it
- Allocation policy – Fetch and cache a new line on a miss
- Replacement policy – Evict an old line when the set is full, for example with an LRU algorithm
When the CPU requests an instruction fetch, the fetch address selects a set and the address tag is compared against the tags stored in that set. On a hit, the instruction is returned from the cache. Otherwise, the full line containing the instruction is fetched from memory, stored in the cache, and returned to the CPU.
The cache holds no valid lines when it is first enabled; on the Cortex-M7, software must invalidate it before enabling it after reset. Cache misses cause line fills until the working set is resident, provided it fits inside the cache. After this warm-up period, the hit rate increases. An LRU replacement policy aims to retain the most recently used lines.
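The lookup, line fill, and warm-up behavior described above can be sketched with a small host-side simulation. This is a toy model under stated assumptions: the `sim_` names are hypothetical, the geometry (4 sets, 2 ways, 32-byte lines) is far smaller than a real cache, and LRU is used because the text names it, not because a given core is known to implement it.

```c
#include <stdint.h>
#include <string.h>

#define SIM_SETS       4u    /* illustrative geometry */
#define SIM_WAYS       2u
#define SIM_LINE_BYTES 32u

typedef struct {
    uint8_t  valid;
    uint32_t tag;
    uint32_t last_used;   /* time of the most recent access, for LRU */
} sim_line;

typedef struct {
    sim_line lines[SIM_SETS][SIM_WAYS];
    uint32_t clock;
    uint32_t hits;
    uint32_t misses;
} sim_cache;

void sim_reset(sim_cache *c)
{
    memset(c, 0, sizeof *c);   /* every line invalid: the cache starts cold */
}

/* Simulate one instruction fetch. Returns 1 on a hit, 0 on a miss. */
int sim_fetch(sim_cache *c, uint32_t addr)
{
    uint32_t line_no = addr / SIM_LINE_BYTES;
    uint32_t set = line_no % SIM_SETS;
    uint32_t tag = line_no / SIM_SETS;
    sim_line *ways = c->lines[set];
    uint32_t victim = 0;

    c->clock++;
    for (uint32_t w = 0; w < SIM_WAYS; w++) {
        if (ways[w].valid && ways[w].tag == tag) {
            ways[w].last_used = c->clock;
            c->hits++;
            return 1;
        }
    }
    /* Miss: prefer an invalid way, otherwise evict the least recently used. */
    for (uint32_t w = 0; w < SIM_WAYS; w++) {
        if (!ways[w].valid) { victim = w; break; }
        if (ways[w].last_used < ways[victim].last_used) victim = w;
    }
    ways[victim].valid = 1;
    ways[victim].tag = tag;
    ways[victim].last_used = c->clock;
    c->misses++;
    return 0;
}

/* Fetch a loop body of `bytes` bytes at `base`, 4 bytes per fetch. */
void sim_run_loop(sim_cache *c, uint32_t base, uint32_t bytes, uint32_t iterations)
{
    for (uint32_t i = 0; i < iterations; i++)
        for (uint32_t off = 0; off < bytes; off += 4u)
            sim_fetch(c, base + off);
}
```

Running a 128-byte loop for ten iterations after `sim_reset` produces one miss per cache line on the first pass and only hits afterwards, which is the warm-up effect in miniature.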
Cache Coherency
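The instruction cache is read-only, so its contents are always a subset of memory and never need to be written back. The coherency risk runs the other way: if memory holding code is modified, the cache can keep serving the stale instructions. Common ways to handle this are:
- Write-through – Stores to code regions reach memory immediately rather than lingering in a data cache
- Invalidate on write – Cached lines covering the modified addresses are marked invalid so the next fetch reloads them
- Write buffering – Pending writes are drained to memory before the affected lines are invalidated
On the Cortex-M7, hardware does not keep the instruction cache coherent with stores or DMA writes. Software that modifies code, such as a bootloader, a flash programmer, or a routine that copies code into RAM, must make the new contents visible in memory and then invalidate the instruction cache.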
Cache Maintenance Operations
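Software cache maintenance operations are provided to invalidate the instruction cache, either entirely or by address. Because the instruction cache never holds modified data, there is nothing to flush back to memory; invalidation simply forces the next fetch to reload from memory. These operations are needed whenever code in memory changes behind the cache's back, for example after flash programming, after copying a routine into RAM, or when another bus master writes to a code region. Software must follow them with barrier instructions (DSB and ISB) so that later instruction fetches see the new contents.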
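A typical maintenance sequence on a Cortex-M7 can be sketched with the CMSIS-Core functions `SCB_CleanDCache_by_Addr()` and `SCB_InvalidateICache()`. The helper `load_code` is hypothetical, and so are the stand-in definitions in the `#else` branch, which exist only so the sketch also compiles on a host. On a real target the vendor's device header supplies the CMSIS functions and the `__ICACHE_PRESENT` macro.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ICACHE_PRESENT) && (__ICACHE_PRESENT == 1U)
/* Target build: the vendor device header, included before this code,
 * provides the CMSIS-Core cache functions used below. */
#else
/* Host build: stand-ins that only record the order of the calls. */
static int maintenance_step;
static int clean_step;
static int invalidate_step;

static void SCB_CleanDCache_by_Addr(void *addr, int32_t dsize)
{
    (void)addr;
    (void)dsize;
    clean_step = ++maintenance_step;
}

static void SCB_InvalidateICache(void)
{
    invalidate_step = ++maintenance_step;
}
#endif

/* Hypothetical helper: copy a code image into executable RAM and make it
 * safe to run. dst is assumed to be aligned to the 32-byte cache line. */
void load_code(void *dst, const void *src, uint32_t len)
{
    memcpy(dst, src, len);
    /* Push the new bytes out of the data cache so memory holds them. */
    SCB_CleanDCache_by_Addr(dst, (int32_t)len);
    /* Drop stale instructions; the CMSIS implementation issues the
     * DSB and ISB barriers around the invalidate itself. */
    SCB_InvalidateICache();
}
```

The order matters: invalidating the instruction cache before the new code has reached memory would let the next fetch reload the old contents.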
Performance Impact
The instruction cache improves performance when:
- There is instruction locality/reuse in the code
- Execution is concentrated in a small part of the code rather than spread evenly over memory
- The same functions and loops run repeatedly
In the best case, the instruction cache brings the effective fetch time close to that of a single-cycle memory. The hit rate depends heavily on the code itself. Code with poor locality, such as long straight-line sequences that run once or jumps scattered across a large image, has a low hit rate, and the cache cannot speed it up.
Cache performance also varies with size. A smaller cache has limited ability to retain hot code sections. Keeping key loops and functions small enough to fit in the cache is important.
When the Cache Fails
There are cases where the instruction cache provides no benefit or even worsens performance:
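- Code with little or no temporal or spatial instruction locality
- Data-dependent branches that scatter execution across many distant targets and so miss constantly
- Thrashing when the cache is too small for the working set
- Cache maintenance overhead that outweighs the benefit of the hits
Performance can get worse rather than just fail to improve because every miss fetches a whole line, which costs more than fetching the one instruction that was needed. For these adverse cases, the cache can be disabled globally or specific regions can be marked as non-cacheable. The trade-off is a higher average access time for any code that would have benefited from the cache.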
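Thrashing can be illustrated by counting misses when several cache lines compete for a single set. This is a minimal sketch under stated assumptions: the function name is hypothetical, replacement is LRU, and the lines are fetched round-robin, which is the worst case for LRU.

```c
#include <stdint.h>

#define MAX_WAYS 8u

/* Count misses when `distinct` cache lines that all map to one set are
 * fetched round-robin for `rounds` passes, with LRU replacement.
 * Returns 0 for an unsupported number of ways. */
uint32_t conflict_misses(uint32_t ways, uint32_t distinct, uint32_t rounds)
{
    uint32_t tag[MAX_WAYS];
    uint32_t age[MAX_WAYS];   /* time of last use; 0 means the way is empty */
    uint32_t now = 0;
    uint32_t misses = 0;

    if (ways == 0 || ways > MAX_WAYS) return 0;
    for (uint32_t w = 0; w < ways; w++) { tag[w] = 0; age[w] = 0; }

    for (uint32_t r = 0; r < rounds; r++) {
        for (uint32_t line = 0; line < distinct; line++) {
            uint32_t hit = 0;
            uint32_t victim = 0;
            now++;
            for (uint32_t w = 0; w < ways; w++) {
                if (age[w] != 0 && tag[w] == line) {
                    age[w] = now;
                    hit = 1;
                    break;
                }
                /* Track the empty or least recently used way seen so far. */
                if (age[w] < age[victim]) victim = w;
            }
            if (!hit) {
                tag[victim] = line;
                age[victim] = now;
                misses++;
            }
        }
    }
    return misses;
}
```

With two ways, two competing lines cost only the two initial misses no matter how many passes are made. Adding a third line makes every fetch in the loop miss, because each line is evicted just before it is needed again.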
Instruction Cache Design Considerations
Key cache design decisions that impact performance:
- Total cache size – Bigger cache retains more hot code
- Line length – Longer lines reduce misses in sequential code but make each miss more expensive
- Associativity – More ways reduce conflict misses and so increase hit rates
- Read latency – Lower is better but uses more power
- Coherency handling – Hardware coherency costs silicon area, while software maintenance costs CPU cycles
These factors determine overall hit rate, average access time, power usage, and implementation cost. The optimal configuration depends on balancing application requirements such as deterministic real-time behavior, low energy use, and maximum performance.
Summary
The instruction cache in ARM Cortex-M microcontrollers is an on-chip memory that stores a subset of recently used instructions to reduce accesses to slower flash or off-chip memory. It exploits spatial and temporal locality to keep hot code sections close to the CPU. When the code is laid out to make good use of it, the instruction cache significantly speeds up instruction fetches.