Cache memory in ARM processors refers to small, fast memory units integrated into the processor that store frequently accessed data and instructions to improve performance. The cache sits between the CPU and main memory and serves as a temporary staging area, keeping data that the CPU will likely need again in the near future. This lets the processor access that data far faster than it could fetch it from the comparatively slow main memory.
Purpose of Cache Memory
The key purpose of cache memory is to reduce the average latency of memory accesses. Main memory speeds have continued to lag behind CPU speeds, creating a gap in which the processor must stall while waiting for data. Cache memory helps bridge this gap by exploiting two locality-of-reference principles:
- Temporal locality – Recently accessed data items are likely to be accessed again soon
- Spatial locality – Data items with nearby addresses tend to be referenced close together in time
By keeping recently used and adjacent data close to the CPU, the cache serves repeated references without re-fetching them from main memory and significantly improves memory subsystem performance. The payoff of locality is visible in ordinary loop code, as the sketch below illustrates.
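To see both principles in action, here is a minimal C sketch (the array name and sizes are illustrative) that walks the same 2D array in two orders. Row-major order touches adjacent addresses and consumes each fetched cache line fully; column-major order strides across lines and misses far more often, which is why it typically runs several times slower on ARM cores despite doing identical work.

```c
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * so each cache line fetched from memory is fully used (spatial locality). */
static long sum_row_major(void) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

/* Column-major traversal: consecutive accesses are COLS * sizeof(int)
 * bytes apart, so nearly every access lands on a different cache line. */
static long sum_col_major(void) {
    long sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}

int main(void) {
    /* Both traversals compute the same result; only cache behavior differs. */
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}
```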
Cache Organization in ARM
ARM processors utilize multiple levels of cache in a hierarchical design to fully exploit locality. Lower level caches integrated into the CPU core are very fast but smaller in capacity. Higher cache levels are progressively larger but have longer access times. Data is moved between cache levels and main memory in cache lines (fixed size blocks).
A typical ARM cache hierarchy consists of:
- L1 Cache – Split into separate instruction and data caches. Very low (1-3 cycle) access latency.
- L2 Cache – Unified cache for both instructions and data. Low latency.
- L3 Cache – Optional. For advanced multicore ARM processors. Very large capacity.
- Main Memory – Large DRAM accessed on cache misses. High (10s-100s cycle) latency.
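On Linux, a program can query the cache geometry it is running on. The sketch below relies on glibc's sysconf cache parameters, which is a platform assumption: the _SC_LEVEL* names are glibc extensions, and some ARM kernels report 0 for values they do not expose (the sysfs files under /sys/devices/system/cpu/cpu0/cache/ are a common fallback).

```c
#include <stdio.h>
#include <unistd.h>

/* Query cache geometry via glibc's sysconf extensions (Linux/glibc
 * assumption; values not exposed by the kernel are reported as 0). */
int main(void) {
    long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2_size = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3_size = sysconf(_SC_LEVEL3_CACHE_SIZE);

    printf("L1D line size: %ld bytes\n", l1_line);
    printf("L1D size:      %ld bytes\n", l1_size);
    printf("L2 size:       %ld bytes\n", l2_size);
    printf("L3 size:       %ld bytes\n", l3_size);  /* 0 if no L3 present */
    return 0;
}
```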
Many ARM designs implement inclusive caching, where each outer cache level subsumes the contents of the levels closer to the core: every line held in L1 is also present in L2, and every line in L2 is also present in L3. This simplifies cache coherence between levels, since checking an outer cache is enough to rule out copies in the inner ones. (Some ARM cores instead use exclusive or non-inclusive policies to avoid duplicating capacity.)
Cache Mapping Techniques
To locate data, the processor splits each address into a tag, a set index, and a line offset. ARM L1 caches are commonly virtually indexed, physically tagged (VIPT): the set index is drawn from virtual-address bits, so set selection can begin in parallel with the Memory Management Unit (MMU) translating the virtual page number (VPN) to a physical page number, while the tag comparison uses physical-address bits so that different virtual aliases of the same physical line still match.
Outer cache levels (and the L1 data caches of many ARMv8 cores) are physically indexed, physically tagged (PIPT). Purely virtual indexing and tagging would create aliasing problems, since multiple virtual addresses can map to the same physical location. PIPT avoids aliasing entirely at the cost of waiting for address translation, while VIPT keeps the parallel-lookup benefit as long as the index bits fall within the page offset.
ARM employs three cache mapping policies:
- Direct mapped – Each memory block maps to exactly one cache line
- Fully associative – A memory block can be placed in any cache line
- Set associative – Compromise between direct and full associativity. Cache is divided into sets with a fixed number of blocks per set.
Set associative mapping is commonly used in ARM as it provides a good balance between hit rate, access time, and design complexity. Popular configurations are 4-way and 8-way set associative caches.
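The address split for a set-associative lookup is straightforward to work through in code. The following sketch assumes an illustrative 32 KB, 4-way cache with 64-byte lines; the geometry is an assumption for the example, not a statement about any particular ARM core. It also checks whether the offset-plus-index bits fit within a 4 KB page offset, the condition under which a VIPT cache is alias-free.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative geometry: 32 KB, 4-way set associative, 64-byte lines. */
#define CACHE_SIZE  (32u * 1024u)
#define NUM_WAYS    4u
#define LINE_SIZE   64u
#define NUM_SETS    (CACHE_SIZE / (NUM_WAYS * LINE_SIZE))  /* 128 sets */

#define OFFSET_BITS 6u   /* log2(LINE_SIZE) */
#define INDEX_BITS  7u   /* log2(NUM_SETS) */
#define PAGE_BITS   12u  /* 4 KB pages */

int main(void) {
    uint64_t addr = 0x80123ABCu;  /* arbitrary example address */

    uint64_t offset = addr & (LINE_SIZE - 1);
    uint64_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);

    /* If index + offset bits fit inside the page offset, virtual and
     * physical addresses agree on those bits, so VIPT indexing is alias-free. */
    printf("VIPT alias-free: %s\n",
           (OFFSET_BITS + INDEX_BITS <= PAGE_BITS) ? "yes" : "no");
    return 0;
}
```

For this geometry the check fails by one bit (13 offset-plus-index bits against a 12-bit page offset), which is why 32 KB 4-way VIPT designs need extra alias handling in hardware, whereas a 32 KB 8-way or 16 KB 4-way cache fits exactly.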
Cache Coherence Protocols
With multiple cores each holding their own caches, ARM employs coherence protocols to keep the cached copies consistent. Coherency ensures that changes made in one cache become visible to the others, preventing cores from reading stale data. ARM utilizes:
- Snooping – Bus-based protocol where caches snoop on each other's transactions. Good for smaller core counts.
- Directory-based – Central directories track cache line states. Scales better for more cores.
Both schemes rely on establishing ownership of cache lines. The owning cache has the right to modify a line, while other caches hold at most read-only copies. When a core writes to a line it does not own, it issues an ownership request that invalidates the copies held elsewhere.
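These ownership rules map naturally onto per-line states. The sketch below models a simplified MESI state machine for a single line; this is a teaching approximation, not ARM's exact protocol (ARM's AMBA ACE and CHI interconnects use richer MOESI-style states).

```c
#include <stdio.h>

/* Simplified MESI state machine for one cache line (a teaching sketch,
 * not ARM's exact ACE/CHI protocol, which adds an Owned-like state). */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static mesi_t next_state(mesi_t s, event_t e, int others_have_copy) {
    switch (e) {
    case LOCAL_READ:   /* a miss fills as Exclusive if no other copy exists */
        return (s == INVALID) ? (others_have_copy ? SHARED : EXCLUSIVE) : s;
    case LOCAL_WRITE:  /* writing requires ownership: other copies invalidated */
        return MODIFIED;
    case SNOOP_READ:   /* another core reads: demote to Shared */
        return (s == INVALID) ? INVALID : SHARED;
    case SNOOP_WRITE:  /* another core writes: our copy becomes stale */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);   /* -> EXCLUSIVE */
    s = next_state(s, LOCAL_WRITE, 0);  /* -> MODIFIED  */
    s = next_state(s, SNOOP_READ, 1);   /* -> SHARED    */
    s = next_state(s, SNOOP_WRITE, 1);  /* -> INVALID   */
    printf("final state: %d\n", s);
    return 0;
}
```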
Cache Performance Optimization
ARM employs various techniques to optimize cache utilization and minimize misses:
- Write buffers – Stores are placed in a write buffer and drained to memory in the background, so the pipeline and cache reads continue uninterrupted.
- Load/store reordering – Scheduling loads and stores out of order to hide memory latency and prevent stalls.
- Prefetching – Predicting future accesses and bringing data into the cache ahead of time (see the sketch after this list).
- Way prediction – Predicting which way of a set-associative cache will hit, reducing access time and energy.
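Prefetching can also be driven from software. This sketch uses the GCC/Clang __builtin_prefetch intrinsic, which on ARM lowers to the PRFM/PLD hint instructions; the lookahead distance is an illustrative tuning knob, not a recommended value.

```c
#include <stddef.h>
#include <stdio.h>

/* Software prefetch sketch: hint each element into cache a fixed
 * distance ahead of its use. __builtin_prefetch is a GCC/Clang
 * builtin that lowers to ARM's PRFM/PLD prefetch-hint instructions. */
#define LOOKAHEAD 16  /* elements ahead; an illustrative tuning knob */

static long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&a[i + LOOKAHEAD], /*rw=*/0, /*locality=*/3);
        sum += a[i];
    }
    return sum;
}

int main(void) {
    static long data[1 << 20];
    size_t n = sizeof data / sizeof data[0];
    for (size_t i = 0; i < n; i++)
        data[i] = (long)i;
    printf("%ld\n", sum_with_prefetch(data, n));
    return 0;
}
```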
Advanced ARM processors may also implement compression in cache to increase effective capacity and allocate cache dynamically between cores depending on workloads.
Conclusion
In summary, cache memory plays a critical role in ARM processors by exploiting locality principles and providing fast access to frequently used data. Multiple cache levels arranged hierarchically balance access speed, hit rate, and cost. ARM employs modern caching techniques such as set associativity, VIPT/PIPT indexing, snoop- and directory-based coherence, write buffering, and prefetching to maximize performance. Caches bridge the processor-memory performance gap and help make ARM an efficient architecture for embedded and mobile designs.