The Cortex-M3 processor has advanced memory access capabilities through the use of caches and shared memory regions. However, these features also impose certain constraints that need to be understood when developing applications. Careful memory architecture design is required to avoid hazards from overlapped or out-of-order memory accesses.
Instruction and Data Caches
The Cortex-M3 instruction and data caches are 4-way set associative with a least recently used (LRU) replacement policy. They can be configured as follows:
- Separate instruction and data caches up to 16 KB each
- Unified 16 KB cache for both instructions and data
- Cache disabled
When enabled, the caches significantly reduce the number of external memory accesses and improve performance. However, they also introduce the possibility of stale data if memory contents are changed by something other than the processor.
Cache Coherency Issues
If multiple bus masters are present, care must be taken to maintain cache coherency. For example, if the Cortex-M3 cache contains a copy of a memory location that is modified by a DMA transfer, the cache will now contain a stale value. Several techniques can help maintain coherency:
- Invalidate the entire cache when DMA transfers complete
- Invalidate cache lines corresponding to DMA target addresses
- Designate DMA regions as “non-cacheable”
- Use hardware coherency support if present
In addition to DMA transfers, self-modifying code can also lead to instruction cache coherency issues. After code has been modified, the data cache must be cleaned and the instruction cache invalidated so that the processor fetches the updated instructions rather than stale ones.
Implications of Cached Memory Regions
When accessing cached memory regions, keep the following in mind:
- The caches have no visibility into ongoing external memory accesses
- A memory update may not be immediately visible to the core if a cache hit occurs
- Code sequences should not rely on precise ordering of memory accesses
- Explicit cache maintenance operations should be used as needed for coherency
Tightly Coupled Memories
In addition to caches, the Cortex-M3 also supports tightly coupled memories (TCMs). These are small, low-latency RAM regions that can hold code and data. The processor can access a TCM in a single cycle, providing deterministic timing without the need for cache maintenance.
TCM Features
- Single-cycle access eliminates wait states
- Ideal for performance-critical code and data
- Simplifies timing analysis by avoiding variable delays
- TCM contents bypass the caches, so they cannot become stale with respect to external memory accesses
TCM Usage Constraints
However, TCMs have limited capacity and programmers should be aware of some constraints:
- Fetching instructions from TCM eliminates prefetch buffering
- Data TCMs are not suitable for DMA access
- Contents are typically placed at link time, not allocated dynamically
Shared Memory Regions
The Cortex-M3 also permits designating certain memory regions as “shared”. Accesses to shared memory are subject to ordering requirements that prevent the processor from reordering them, thus avoiding ordering hazards.
When to Use Shared Memory
Shared memory regions are useful when hardware needs to coordinate with program memory accesses. Typical use cases include:
- Memory-mapped I/O registers
- DMA buffer pools
- Command queues
- Hardware semaphores
Ordering Constraints
The following rules apply to memory marked as shared:
- Instructions may not be reordered around shared-memory accesses
- Accesses to shared regions complete in program order
- Caches and write buffers are drained before and after each access
This prevents situations where a pending cache line write back changes memory after a subsequent instruction has accessed it.
Performance Impact
Due to the strict ordering requirements, shared memory accesses are typically slower than unrestricted memory. Performance critical data structures and code should remain in normal memory.
Atomic Accesses
The Cortex-M3 supports atomic read-modify-write operations on memory for building synchronization primitives such as semaphores. These are typically applied to shared regions to prevent two masters from updating a value simultaneously.
Atomic Instructions
- LDREX – Load exclusive register
- STREX – Store exclusive register
- CLREX – Clear exclusive monitor
An LDREX/STREX pair performs an atomic read-modify-write sequence: the STREX succeeds only if the exclusive monitor set by the LDREX is still intact, and it returns a status value indicating the result. CLREX simply clears the exclusive monitor.
Hazards
Care must be taken with atomic instruction sequences, since they introduce hazards of their own:
- Any exception between LDREX and STREX clears the exclusive monitor, so the STREX fails and the sequence must be retried
- LDREX tags an address for exclusive monitoring, which limits parallelism
- Only one LDREX/STREX pair can be pending at a time
Memory Access Guidelines
Given the complex interactions between caching, TCM regions, shared memory, and atomic accesses, follow these guidelines for robust operation:
- Ensure cache coherency with DMA transfers and self-modifying code
- Use TCM for critical code/data but beware size constraints
- Designate shared regions appropriately to avoid reordering
- Use atomic operations carefully to prevent aborts
- Analyze all memory access paths to identify potential issues
Conclusion
The Cortex-M3 provides very capable memory access mechanisms, but they also impose constraints that software must consider carefully. Keeping caches coherent, avoiding shared memory hazards, and using atomic operations correctly are key. With good architecture and access hygiene, the advanced memory system can be safely leveraged to build high-performance real-time embedded applications.