The data cache found in higher-end Arm Cortex-M microcontrollers (such as those built on the Cortex-M7) is a small, fast memory that stores copies of data from main memory. Its purpose is to reduce the number of accesses to the slower main memory and thereby speed up data retrieval.
Overview of Caches
In computer systems, caches are small memories used to store copies of frequently used data. They serve as temporary staging areas for data that the processor is likely to need next. Reading data from a cache is much faster than reading from main memory.
Caches exploit the locality of reference principle – the tendency for programs to reuse data and instructions they have used recently. By keeping copies of recently accessed data in the fast cache, the processor avoids having to read slower main memory every time that data is needed.
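As a small illustration, the loop below shows both kinds of locality at work: the accumulator is reused on every iteration (temporal locality), and the array is walked sequentially (spatial locality), so after the first miss in each cache line the next several elements are hits. A minimal sketch in C:

```c
#include <stddef.h>
#include <stdint.h>

/* With 32-byte cache lines and 4-byte elements, each miss pulls in
 * 8 elements, so roughly 7 of every 8 sequential accesses hit. */
uint32_t sum_array(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;               /* reused every iteration: temporal locality */
    for (size_t i = 0; i < n; i++) {
        sum += data[i];             /* sequential walk: spatial locality */
    }
    return sum;
}
```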
Benefits of Caching
The key benefits of using caches are:
- Reduced latency – Cache hits are faster than memory reads
- Increased throughput – By reducing stalls due to main memory reads, the processor can do more work
- Lower power consumption – A cache hit costs less energy than an access to external memory
- Transparency – Caches hide memory latency from software, with no code changes required
Cache Organization
A cache consists of a cache controller, cache memory, and a cache directory (the tag array). The cache controller manages the flow of data between main memory and the cache memory; the cache memory stores the actual copies of data; and the cache directory records which memory addresses are currently held in which cache locations.
Caches are organized into cache lines (or blocks). Each cache line corresponds to a contiguous block of memory that is copied as a unit to the cache. Typical cache line sizes range from 16 to 128 bytes. Data is moved between memory and cache in units of cache lines.
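To make this concrete, here is a sketch of how an address is decomposed for a hypothetical 16 KB, 4-way set associative cache with 32-byte lines (the geometry is an assumption for illustration): 16 KB / 32 bytes = 512 lines, and 512 lines / 4 ways = 128 sets, giving 5 offset bits and 7 index bits, with the remaining address bits forming the tag.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 16 KB, 4-way, 32-byte lines.
 * 16384 / 32 = 512 lines; 512 / 4 ways = 128 sets. */
#define LINE_SIZE 32u   /* bytes per line -> 5 offset bits */
#define NUM_SETS  128u  /* -> 7 index bits                 */

int main(void)
{
    uint32_t addr   = 0x20001234u;                   /* example address      */
    uint32_t offset = addr % LINE_SIZE;              /* byte within the line */
    uint32_t index  = (addr / LINE_SIZE) % NUM_SETS; /* which set            */
    uint32_t tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line  */

    printf("offset=%u index=%u tag=0x%x\n",
           (unsigned)offset, (unsigned)index, (unsigned)tag);
    return 0;
}
```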
Cache Operation
When the processor needs to read data, it first checks if the data is present in the cache. If so, a cache hit occurs and the data is returned quickly. If not, a cache miss occurs and the data must be read from the slower main memory.
On a cache miss, the cache line containing the requested data is copied from memory into the cache, evicting the line that previously occupied that location. Because a whole line is fetched, data adjacent to the requested word arrives with it, which exploits spatial locality; some processors additionally prefetch subsequent lines to improve performance further.
Write Policies
With write operations, caches implement either a write-through or write-back policy. In a write-through cache, data is written to both the cache and main memory. In a write-back cache, data is only written to the cache initially. Writes are forwarded to main memory later when the cache line is evicted.
Data Cache in Cortex-M
On the Cortex-M7, the data cache is 4-way set associative, with a size fixed by the silicon vendor at anywhere from 4 Kbytes to 64 Kbytes. The cache line size is 8 words (32 bytes). The write policy, write-through or write-back, is not fixed by the cache itself but is selected per memory region through that region's memory attributes.
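Because the policy follows the memory attributes, it is typically set up with the MPU. The sketch below uses the CMSIS-Core v7-M MPU helpers to mark one region write-back and another write-through; the device header name, base addresses, and sizes are assumptions for illustration, not values from any particular part.

```c
#include "ARMCM7_DP.h"  /* CMSIS device header; substitute your device's header */

void configure_cache_policies(void)
{
    /* Region 0: 64 KB SRAM at 0x20000000, write-back, write-allocate
     * (TEX=001, C=1, B=1). Placeholder address and size. */
    ARM_MPU_SetRegion(
        ARM_MPU_RBAR(0u, 0x20000000u),
        ARM_MPU_RASR(0u, ARM_MPU_AP_FULL, 1u, 0u, 1u, 1u, 0u,
                     ARM_MPU_REGION_SIZE_64KB));

    /* Region 1: 32 KB buffer at 0x20010000, write-through
     * (TEX=000, C=1, B=0), e.g. for data also read by DMA. */
    ARM_MPU_SetRegion(
        ARM_MPU_RBAR(1u, 0x20010000u),
        ARM_MPU_RASR(0u, ARM_MPU_AP_FULL, 0u, 0u, 1u, 0u, 0u,
                     ARM_MPU_REGION_SIZE_32KB));

    /* Enable the MPU, keeping the default memory map elsewhere. */
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
}
```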
The Cortex-M data cache sits between the CPU and the bus matrix, where it reduces bus traffic and memory latency. Line-fill buffers assemble incoming cache lines, and write buffers hold pending writes until the bus is available.
Cache Features
Key features of the Cortex-M data cache include:
- 4-way set associative organization
- Write-through or write-back policy, selected per memory region
- Allocate on read misses
- Optional allocate on write misses in write-back regions
- 32-byte (8-word) cache lines
- LRU replacement policy
- Optional ECC protection
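The geometry listed above is not fixed across devices, but it can be read back at runtime from the cache ID registers defined by Armv7-M. A minimal sketch, assuming a CMSIS-Core environment:

```c
#include "ARMCM7_DP.h"  /* CMSIS device header; substitute your device's header */

void read_dcache_geometry(uint32_t *line_bytes, uint32_t *ways, uint32_t *sets)
{
    SCB->CSSELR = 0u;   /* select the Level 1 data cache */
    __DSB();            /* make sure the selection has taken effect */

    uint32_t ccsidr = SCB->CCSIDR;

    /* Field encodings per Armv7-M: line size is stored as
     * log2(words per line) - 2; ways and sets are stored minus one. */
    *line_bytes = 4u << ((ccsidr & 0x7u) + 2u);
    *ways       = ((ccsidr >> 3) & 0x3FFu) + 1u;
    *sets       = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    /* Total size = line_bytes * ways * sets, e.g. 32 * 4 * 128 = 16 KB. */
}
```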
Cache Maintenance
The Cortex-M cache controller provides cache maintenance operations to manage cache coherence and consistency. These operations include:
- Invalidate – Mark cache line as invalid
- Clean – Write dirty data to memory
- Clean and invalidate – Clean then invalidate cache line
- Flush – Clean and invalidate entire cache
In the Armv7-M architecture, these operations are performed by writing to memory-mapped cache maintenance registers in the System Control Block rather than by dedicated instructions. A maintenance operation can target a single line (by address or by set/way) or the entire cache, and CMSIS-Core wraps the registers in helper functions such as SCB_CleanDCache() and SCB_InvalidateDCache_by_Addr().
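The classic place these functions matter is DMA, where the DMA engine reads and writes memory directly, bypassing the cache. A hedged sketch of that pattern follows; the device header and the start_dma_tx/start_dma_rx/wait_dma_done calls are hypothetical placeholders for a real driver.

```c
#include "ARMCM7_DP.h"  /* CMSIS device header; substitute your device's header */
#include <stdint.h>

/* 32-byte alignment matches the cache line size, so maintenance by
 * address does not disturb unrelated neighboring data. */
static uint8_t dma_buf[512] __attribute__((aligned(32)));

extern void start_dma_tx(const uint8_t *buf, uint32_t len);  /* hypothetical */
extern void start_dma_rx(uint8_t *buf, uint32_t len);        /* hypothetical */
extern void wait_dma_done(void);                             /* hypothetical */

void dma_transmit_example(void)
{
    for (uint32_t i = 0; i < sizeof(dma_buf); i++) {
        dma_buf[i] = (uint8_t)i;   /* writes may sit in the data cache */
    }
    /* Clean: push cached writes out to memory so the DMA engine
     * reads the real data rather than stale RAM contents. */
    SCB_CleanDCache_by_Addr((uint32_t *)dma_buf, (int32_t)sizeof(dma_buf));
    start_dma_tx(dma_buf, sizeof(dma_buf));
}

void dma_receive_example(void)
{
    start_dma_rx(dma_buf, sizeof(dma_buf));
    wait_dma_done();
    /* Invalidate: discard stale cached copies so the CPU's next
     * reads fetch what the DMA engine just wrote. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_buf, (int32_t)sizeof(dma_buf));
}
```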
Cache Coherence
In multicore Cortex-M systems, each core has its own data cache, and the hardware does not keep those caches coherent with each other. Coherence for shared data must instead be maintained by software, using the cache maintenance operations described above.
In practice, this means cleaning or invalidating data caches at synchronization points. Multicore semaphores, locks, and shared data structures are designed to force cache maintenance operations when entering and exiting critical sections, which prevents cores from operating on stale cached data.
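A hedged sketch of that discipline for a one-word message; the lock functions and the shared-memory placement are hypothetical placeholders:

```c
#include "ARMCM7_DP.h"  /* CMSIS device header; substitute your device's header */
#include <stdint.h>

/* Assumed to be placed in memory visible to both cores,
 * aligned to the 32-byte cache line size. */
static volatile uint32_t shared_msg[8] __attribute__((aligned(32)));

extern void lock_acquire(void);  /* hypothetical inter-core lock */
extern void lock_release(void);

/* Producer core: write, then clean so the data reaches shared memory. */
void send_message(uint32_t value)
{
    lock_acquire();
    shared_msg[0] = value;
    SCB_CleanDCache_by_Addr((uint32_t *)shared_msg, (int32_t)sizeof(shared_msg));
    lock_release();
}

/* Consumer core: invalidate first so the read misses in its own
 * cache and fetches the producer's data instead of a stale copy. */
uint32_t receive_message(void)
{
    lock_acquire();
    SCB_InvalidateDCache_by_Addr((uint32_t *)shared_msg,
                                 (int32_t)sizeof(shared_msg));
    uint32_t value = shared_msg[0];
    lock_release();
    return value;
}
```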
Cache Performance
The performance benefit of caching depends on the cache hit rate. This is the fraction of memory accesses that are satisfied by the cache without accessing main memory. A higher hit rate results in lower average memory access time.
The hit rate depends on the cache size, access locality of the application, and other policies like replacement and write strategy. By optimizing cache usage, a system can significantly improve performance.
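For example, with an assumed 1-cycle hit time, 95% hit rate, and 20-cycle miss penalty (illustrative numbers rather than figures for any specific device), the average memory access time is 1 + 0.05 × 20 = 2 cycles, compared with roughly 20 cycles if every access went to main memory.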
Guidelines for Optimizing Cache Performance
Here are some guidelines for optimizing cache performance in a Cortex-M system:
- Organize data structures to maximize spatial locality and sequential access (see the sketch after this list)
- Improve temporal locality by reusing data and instructions
- Select a device variant with a larger cache if the hit rate is consistently low
- Keep hot loops and their working sets small enough to fit in the cache
- Minimize cache misses by prefetching data
- Use cache coloring to avoid conflict misses
- Minimize shared mutable data between cores to reduce cache maintenance overhead
- Place frequently accessed stack and global variables in cacheable memory regions
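As a sketch of the first guideline above, the two functions below traverse the same matrix. The row-major loop touches memory sequentially and makes full use of every fetched cache line; the column-major loop jumps a whole row per access, so each access can land in a different line. The dimensions are arbitrary illustrative values.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS 256
#define COLS 256

static uint32_t matrix[ROWS][COLS];   /* stored row-major in C */

/* Cache-friendly: consecutive accesses fall in the same cache line. */
uint32_t sum_row_major(void)
{
    uint32_t sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += matrix[r][c];
    return sum;
}

/* Cache-hostile: each access is COLS * 4 = 1024 bytes from the last,
 * so with 32-byte lines nearly every access misses. */
uint32_t sum_col_major(void)
{
    uint32_t sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += matrix[r][c];
    return sum;
}
```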
Profiling cache behavior and optimizing based on real usage is key. Tools like ARM Streamline can be used to analyze cache performance.
Configuring Cache in Cortex-M
How the Cortex-M data cache behaves is determined partly when the chip is designed and partly at runtime. Key configuration points include:
- Enabling and disabling the cache at runtime
- Cache size, fixed by the silicon vendor at implementation time (4 KB to 64 KB on the Cortex-M7)
- Associativity and way size, likewise fixed at implementation time
- Cacheability and write policy per memory region, set through MPU attributes
- Shareability attributes for multiprocessor systems
- Optional ECC, where the implementation includes it
At runtime, the data cache is enabled or disabled through the DC bit of the Configuration and Control Register (SCB->CCR); CMSIS-Core provides SCB_EnableDCache() and SCB_DisableDCache() for this. The cache is disabled on reset.
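With CMSIS-Core this looks like the sketch below (the device header name is an assumption); SCB_EnableDCache() invalidates the cache before setting the enable bit, so stale contents from before reset can never be returned.

```c
#include "ARMCM7_DP.h"  /* CMSIS device header; substitute your device's header */

void enable_caches(void)
{
    SCB_EnableICache();  /* optional: the separate instruction cache */
    SCB_EnableDCache();  /* invalidates all lines, then sets CCR.DC  */
}

void disable_dcache(void)
{
    /* Cleans and invalidates before clearing CCR.DC,
     * so no dirty data is lost when the cache goes off. */
    SCB_DisableDCache();
}
```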
Use Cases
Some typical use cases for leveraging the Cortex-M data cache are:
- Storing frequently used data structures
- Caching constant data and lookup tables held in slow flash memory (instruction fetches are served by the separate instruction cache)
- Buffering data transferred over low bandwidth buses
- Avoiding wait states when accessing high latency memories
- Prefetching data for fast signal processing algorithms
For memory intensive applications, the data cache can help avoid stalls and improve throughput. It works best when access patterns have good locality.
Limitations
While caches improve performance, they have some limitations:
- Added latency on cache misses
- Complexity of cache coherence in multicore systems
- Software overhead of cache maintenance, particularly around DMA transfers and shared buffers
- Power consumption of cache memories
- Reduced timing determinism, which complicates worst-case execution time analysis
The benefits of caching may be less noticeable for small, deterministic real-time systems. Cache usage should be tailored to the application requirements.
Conclusion
The Cortex-M data cache reduces memory latency by storing local copies of frequently used data. It improves performance by exploiting locality of memory accesses in embedded applications. Cache optimization can provide significant speedups for memory-bound use cases.
Understanding cache organization, operation, and configuration is key to utilizing it effectively. Paying attention to cache usage and tuning cache policies accordingly helps unlock the benefits of caching in embedded Arm processors.