
Reducing Load/Store Instruction Latency on Cortex M4

David Moore
Last updated: October 5, 2023 10:08 am

The Cortex-M4 processor is designed to provide high performance and low power consumption in embedded applications. However, load and store instructions can exhibit high latency due to slow memories, cache misses (in systems that include caches), and bus contention. This article examines techniques to reduce load/store latency and improve overall performance on Cortex-M4 based systems.

Contents
  • Understanding Load/Store Latency
  • Analyzing Load/Store Behavior
  • Optimizing Memory Architecture
  • Cache Configuration
  • Bus Bandwidth Allocation
  • Faster Memory Mapping
  • Compiler Optimizations
  • Manual Assembly Coding
  • Data Access Patterns
  • Parallelism Techniques
  • SIMD Instructions
  • Multithreading
  • Transaction Level Parallelism
  • Hardware Acceleration
  • Conclusion

Understanding Load/Store Latency

The Cortex-M4 has a three-stage pipeline: Fetch, Decode, and Execute. Load and store instructions access the data bus during the Execute stage. Note that the Cortex-M4 core itself does not include built-in caches; caches, flash accelerators, and similar buffers are added by the SoC vendor. In systems that have them, a data access that misses in the cache must be fetched from slower memory, introducing additional latency cycles. The processor stalls until the data becomes available, which increases the effective execution time of the load/store instruction.

In addition to cache misses, load/store instructions contend for the shared bus. Even when the requested data is immediately available, an instruction may have to wait for ongoing transactions by other bus masters. The Cortex-M4 typically sits behind a bus matrix that arbitrates between masters (round-robin is a common scheme), and this arbitration can still add latency to individual load/store instructions.

Analyzing Load/Store Behavior

To reduce load/store latency, we first need to analyze their behavior in the target application. Some key aspects to examine are:

  • Frequency of load/store instructions
  • Cache hit/miss ratios for data accesses
  • Bus contention scenarios
  • Nature of data – random or sequential addresses
  • Possibility of conflicts between instruction and data accesses

Profiling tools like Arm DS-5 Streamline can be used to get detailed statistics regarding stall cycles, cache performance and bus transactions. This data will pinpoint the exact load/store instructions that need optimization.

Optimizing Memory Architecture

The first step is optimizing the memory architecture for our specific application requirements. This involves tuning the cache configuration, bus bandwidth allocation and mapping critical data into faster memories.

Cache Configuration

Caches in Cortex-M4 systems are provided by the SoC vendor rather than by the core itself, and are typically small (a few kilobytes to a few tens of kilobytes). Choosing an optimal cache size avoids wasting on-chip memory. The Cortex-M4 already exposes separate instruction and data bus interfaces (a Harvard arrangement), which prevents contention between instruction fetches and data accesses; vendor caches should preserve this separation.

If the application has significant spatial locality in its data accesses, a larger cache line size can improve the hit rate. The write policy (write-through versus write-back) is another parameter that can be tuned.

Bus Bandwidth Allocation

The bus matrix allows assigning priorities and bandwidth to different bus masters. Latency-sensitive transactions, such as data fetches for the CPU, should be given higher priority than bulk transfers by DMA. An interconnect such as the Arm CoreLink NIC-400 can also help reduce contention in larger systems.

Faster Memory Mapping

DDR memory has higher access latency than on-chip SRAM. Critical data structures and code segments can be mapped into on-chip memory regions to avoid going to external DRAM. Similarly, flash access can be accelerated by mapping hot code regions into SRAM rather than executing directly from flash.
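As a minimal sketch of mapping hot code and data into faster memory, the usual mechanism is a named linker section. The section names below (.ramfunc, .sram_data) are assumptions; the real names depend on the vendor's linker script:

```c
#include <stdint.h>

/* Hypothetical section names -- the actual names depend on the
 * vendor's linker script, which places .ramfunc in on-chip SRAM. */
__attribute__((section(".ramfunc"), noinline))
uint32_t checksum(const uint32_t *buf, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Hot data placed in on-chip SRAM rather than external DRAM. */
__attribute__((section(".sram_data")))
uint32_t lookup_table[4] = {1, 2, 3, 4};
```

On a real target the linker script must define these sections and the startup code must copy .ramfunc from flash to SRAM before use.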

Compiler Optimizations

The compiler plays a major role in scheduling load/store instructions to minimize stalls. Following compiler techniques help reduce latency:

  • Scheduling loads early, well before the instructions that use their results
  • Software pipelining to overlap cache misses
  • Loop inversion and interchange to improve locality
  • Unrolling small loops that reuse data
  • Allocating variables to registers instead of memory

Compiler flags like -O3 enable the highest level of optimizations. The Arm Compiler 6 toolchain has “optimize for speed” and “optimize for size” modes that impact load/store ordering.
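To illustrate the unrolling point above, here is a minimal sketch (function names are illustrative) of a plain loop versus a four-way unrolled version with independent accumulators, which gives the compiler freedom to issue the loads back to back and overlap a stalled load with useful work:

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward version: each iteration's multiply-add waits on its load. */
int32_t dot_plain(const int32_t *a, const int32_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by four with independent accumulators: the four loads have no
 * dependences between them, so the compiler can schedule them together.
 * n is assumed to be a multiple of 4 to keep the sketch short. */
int32_t dot_unrolled(const int32_t *a, const int32_t *b, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    return acc0 + acc1 + acc2 + acc3;
}
```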

Manual Assembly Coding

For critical code sections, assembly level programming can be used to carefully schedule load/store instructions. Some manual optimizations include:

  • Separating instruction and data accesses
  • Using double buffering techniques
  • Aligning data structures to cache line size
  • Prefetching data using PLD instruction
  • Unrolling small loops for better pipelining

The exclusive-access instructions LDREX and STREX are also useful for building lock-free synchronization primitives, avoiding the overhead of mutexes.
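To illustrate the alignment point, here is a minimal C sketch; the 32-byte line size is an assumption and should be replaced with the actual cache or interconnect line size of the target system:

```c
#include <stdint.h>

/* 32 is an assumed line size -- check the actual line size of the
 * target system's cache or interconnect. */
#define LINE_SIZE 32

/* Aligning the buffer so each line-sized chunk starts on a line
 * boundary, avoiding accesses that straddle two lines. */
__attribute__((aligned(LINE_SIZE)))
static uint8_t rx_buffer[256];

/* Padding a frequently accessed control struct to a full line keeps
 * two adjacent instances from sharing one line. */
typedef struct {
    uint32_t head;
    uint32_t tail;
    uint8_t  pad[LINE_SIZE - 2 * sizeof(uint32_t)];
} __attribute__((aligned(LINE_SIZE))) ring_ctrl_t;
```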

Data Access Patterns

How the data is accessed can significantly impact load/store latency. Sequential accesses perform better than random accesses. Consolidating disjoint data into structures improves locality. Some ways to optimize data access include:

  • Migrating from linked lists to contiguous arrays
  • Sorting data based on access order
  • Blocking for matrix or image processing
  • Packing multiple small fields into a single word or structure
  • Using DMA for bulk transfers

The PLD (preload data) instruction provides a software prefetch hint. Note that on cores without a data cache, including a bare Cortex-M4, PLD executes as a NOP, so its benefit depends on the cache hardware present in the system.
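The difference between pointer chasing and sequential access can be sketched as follows (names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Linked-list node: each step is a dependent load (pointer chase),
 * and nodes may be scattered across memory. */
typedef struct node {
    int32_t value;
    struct node *next;
} node_t;

int32_t sum_list(const node_t *head)
{
    int32_t sum = 0;
    for (const node_t *n = head; n != NULL; n = n->next)
        sum += n->value;  /* next address unknown until this load completes */
    return sum;
}

/* Same data in a contiguous array: sequential addresses, so accesses
 * are predictable and friendly to burst transfers. */
int32_t sum_array(const int32_t *v, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```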

Parallelism Techniques

Various parallel computing techniques can help hide the load/store latency by executing independent instructions during stall cycles.

SIMD Instructions

The Cortex-M4 SIMD instructions operate on multiple data elements concurrently. This improves performance for multimedia and signal processing code.
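For example, the SADD16 instruction (exposed through the CMSIS __SADD16 intrinsic) adds the two 16-bit halves of each operand in a single instruction. A portable C model of its wrapping behavior, for illustration only:

```c
#include <stdint.h>

/* Portable model of the Cortex-M4 SADD16 instruction: adds the two
 * 16-bit halfwords of each operand in parallel, wrapping on overflow.
 * On target this is one instruction (the CMSIS __SADD16 intrinsic). */
uint32_t sadd16_model(uint32_t x, uint32_t y)
{
    int16_t lo = (int16_t)(x & 0xFFFF) + (int16_t)(y & 0xFFFF);
    int16_t hi = (int16_t)(x >> 16)    + (int16_t)(y >> 16);
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}
```

The saturating variant QADD16 clips instead of wrapping, which is often what signal-processing code wants.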

Multithreading

Executing multiple threads or tasks, for example under an RTOS or on a multi-core system, allows computation and data access to overlap. Context switching to a ready thread helps tolerate long-latency accesses.

Transaction Level Parallelism

Interconnects and memory controllers that support multiple outstanding transactions can keep DRAM bandwidth well utilized. The Arm AXI protocol enables out-of-order completion to improve overlap; note that the Cortex-M4 core itself uses AHB-Lite interfaces, so this parallelism applies at the system interconnect level.

Hardware Acceleration

Additional hardware blocks can offload the Cortex-M4 core from intensive data processing tasks. Some options are:

  • DMA engines for efficient bulk transfers
  • External accelerators and co-processors
  • Cryptographic coprocessors for encryption
  • Graphic accelerators for image processing

This avoids the M4 core stalling on frequent load/store accesses.
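As a sketch of how DMA double buffering offloads the core, the following host-runnable example uses a hypothetical dma_start_copy helper (modelled here with memcpy; a real driver call would return immediately and signal completion via an interrupt). The stream length is assumed to be a multiple of the chunk size:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define CHUNK 64

/* Hypothetical DMA start call -- modelled with memcpy so the sketch
 * runs anywhere. A real driver would start the transfer and return. */
static void dma_start_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Process a stream in CHUNK-sized pieces: while the CPU works on one
 * buffer, the DMA engine fills the other, so the core's loads hit
 * on-chip SRAM instead of stalling on the source memory. */
uint32_t process_stream(const uint8_t *src, size_t len)
{
    static uint8_t buf[2][CHUNK];
    uint32_t sum = 0;
    size_t off = 0;
    int active = 0;

    dma_start_copy(buf[active], src, CHUNK);     /* prime first buffer */
    while (off < len) {
        size_t next = off + CHUNK;
        if (next < len)                          /* prefetch next chunk */
            dma_start_copy(buf[active ^ 1], src + next, CHUNK);
        for (size_t i = 0; i < CHUNK; i++)       /* CPU consumes chunk */
            sum += buf[active][i];
        active ^= 1;
        off = next;
    }
    return sum;
}
```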

Conclusion

Optimizing load/store instruction latency requires a combination of memory architecture tuning, compiler optimizations, efficient data access patterns and parallelism techniques. Profiling the application usage is key to identifying optimization opportunities. For minimal latency, critical code segments can be hand-coded in assembly language. Hardware accelerators are useful to offload intensive data processing tasks. Overall, a balanced system-level approach is needed to maximize Cortex-M4 performance.
