Demystifying Cortex-M4 LDR/STR Instruction Timing

The Cortex-M4 processor implements the ARMv7E-M architecture. Its LDR (load register) and STR (store register) instructions transfer data between memory and registers, and their timing is often misunderstood. This article looks at how LDR and STR flow through the Cortex-M4 pipeline and how many clock cycles they take.

Contents

LDR/STR Instruction Overview
Cortex-M4 Pipeline Stages
LDR Instruction Timing
STR Instruction Timing
Load/Store Multiple Instructions
Memory Wait States
Other Factors Affecting Timing
Summary

LDR/STR Instruction Overview

The LDR and STR instructions on the Cortex-M4 processor transfer a 32-bit word between a register and memory. The syntax is:

    LDR Rd, [Rn, #offset]
    STR Rd, [Rn, #offset]

Where:

Rd is the destination register for LDR, or the source register for STR
Rn is the base register containing the address
offset is an optional immediate offset added to the base address

Some examples:

    LDR R1, [R2, #8]   ; Load word from address in R2 + 8 into R1
    STR R5, [R3]       ; Store R5 into address in R3

The key thing to note is that the memory access happens using the address obtained by adding the base register and offset. This provides flexibility in accessing different memory locations.
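The address calculation can be sketched in C. This is only a host-side model under stated assumptions: the 64-byte mem array and the names str_word and ldr_word are hypothetical stand-ins for real memory and for the two instructions, and no bounds checking is done.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte memory; not a real Cortex-M4 memory map. */
#define MEM_SIZE 64u
static uint8_t mem[MEM_SIZE];

/* Model of STR Rd, [Rn, #offset]: store the word in rd at rn + offset.
 * No bounds checking: the caller keeps the address + 4 within MEM_SIZE. */
void str_word(uint32_t rd, uint32_t rn, int32_t offset)
{
    uint32_t addr = rn + (uint32_t)offset;   /* effective address */
    memcpy(&mem[addr], &rd, sizeof rd);
}

/* Model of LDR Rd, [Rn, #offset]: return the word found at rn + offset. */
uint32_t ldr_word(uint32_t rn, int32_t offset)
{
    uint32_t addr = rn + (uint32_t)offset;   /* effective address */
    uint32_t rd;
    memcpy(&rd, &mem[addr], sizeof rd);
    return rd;
}
```

Storing with base 16 and offset 8 and then loading with base 24 and offset 0 reaches the same word, since both forms resolve to address 24.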

Cortex-M4 Pipeline Stages

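To understand the timing of LDR/STR instructions, we first need to look at the pipeline of the Cortex-M4 processor. The pipeline consists of 3 stages:

Fetch – Instruction is fetched from memory
Decode – Instruction is decoded
Execute – Instruction is executed

The Cortex-M4 has no separate Memory or Writeback stages. A memory access instruction like LDR/STR performs its bus transfer from the Execute stage, in 2 phases:

Address phase – Address is sent to memory
Data phase – Data is transferred, and for a load it is written to the destination register

The stages overlap: while one instruction executes, the next is being decoded and the one after that is being fetched. What matters for timing is therefore how many cycles an instruction occupies the Execute stage, not how many stages it passes through. Most instructions occupy Execute for 1 cycle; the data phase is what makes a load take longer.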
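The difference between pipeline depth and per-instruction cost can be sketched with a small model. It assumes an ideal in-order pipeline with no stalls or branches, where only the final stage can take more than one cycle; pipeline_total_cycles and its parameters are hypothetical names, and the stage count is a parameter (the Cortex-M4 Technical Reference Manual describes a three-stage pipeline).

```c
#include <stddef.h>

/* Total cycles for `count` instructions to drain through an ideal
 * in-order pipeline of `stages` stages.
 * exec_cycles[i] is the number of cycles instruction i occupies the
 * final (execute) stage. Earlier stages overlap with the execution of
 * older instructions, so they cost cycles only once, while the pipeline
 * fills. */
unsigned long pipeline_total_cycles(unsigned stages,
                                    const unsigned *exec_cycles,
                                    size_t count)
{
    if (stages == 0 || count == 0) {
        return 0;
    }
    unsigned long total = stages - 1;   /* fill: first instruction reaches the final stage */
    for (size_t i = 0; i < count; i++) {
        total += exec_cycles[i];        /* each instruction's time in the final stage */
    }
    return total;
}
```

With 3 stages, four single-cycle instructions finish in 6 cycles rather than 12, which is why the depth of the pipeline is not the same thing as the cost of an instruction.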

LDR Instruction Timing

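When an LDR instruction is executed on the Cortex-M4, it goes through the following steps:

Fetch – LDR instruction fetched from memory
Decode – LDR instruction decoded
Execute (address phase) – Address formed from base register + offset is sent to memory
Execute (data phase) – Word returned by memory is written to the destination register

The Cortex-M4 Technical Reference Manual lists LDR at 2 cycles with zero-wait-state memory. Fetch and Decode overlap with the execution of earlier instructions, so in a steady instruction stream they do not add to the cost. The timing diagram for an isolated LDR instruction looks like:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase)

So in summary, an LDR instruction occupies the Execute stage for 2 clock cycles on the Cortex-M4. Neighboring loads and stores can pipeline their address and data phases, so a run of N back-to-back LDR instructions can complete in about N + 1 cycles rather than 2N.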
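As a rough sketch of the figures in the Cortex-M4 Technical Reference Manual, which lists a single LDR at 2 cycles and notes that neighboring loads can pipeline their address and data phases, the cost of a run of loads can be estimated as below. The function name ldr_run_cycles is hypothetical, and the model assumes zero-wait-state memory and that every load in the run can overlap with the next, which the manual says does not hold in all cases (for example loads to PC).

```c
/* Estimated Execute-stage cycles for n back-to-back LDR instructions.
 * Assumes zero wait states and that the address phase of each load
 * overlaps the data phase of the previous one: 1 cycle for the first
 * address phase, then 1 data-phase cycle per load. */
unsigned ldr_run_cycles(unsigned n)
{
    return (n == 0u) ? 0u : n + 1u;
}
```

A single load costs 2 cycles under this model and four consecutive loads cost 5, which is a useful first estimate before measuring on hardware.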

STR Instruction Timing

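The STR instruction flows through the same 3 pipeline stages as LDR:

Fetch – STR instruction fetched
Decode – STR instruction decoded
Execute – Address (base register + offset) and data sent to memory

A store has no register to write back, and the processor does not have to wait for the data to arrive in memory. The Cortex-M4 Technical Reference Manual lists STR at 2 cycles, like LDR, but notes that a store with an immediate offset, STR Rx, [Ry, #imm], always takes 1 cycle: the address is generated in the first cycle and the data is written while the next instruction executes. The timing diagram for an isolated STR is:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Data written to memory while the next instruction executes

In summary, an STR costs 1 to 2 clock cycles in the Execute stage. If the write buffer is full or disabled, the next instruction is delayed until the store completes.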

Load/Store Multiple Instructions

The LDM and STM instructions on Cortex-M4 allow transferring multiple words between memory and registers. For example:

    LDM R1!, {R2-R5}   ; Load words into R2-R5 from address in R1
    STM R3!, {R4-R8}   ; Store R4-R8 into address in R3

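These instructions transfer the listed registers one after another within a single instruction, so the setup cost is paid once rather than once per register. The Cortex-M4 Technical Reference Manual lists both at 1 + N cycles for N registers:

1 register = 2 cycles
2 registers = 3 cycles
3 registers = 4 cycles

And so on. So for N registers, the total time is 1 + N clock cycles with zero-wait-state memory.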
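The Cortex-M4 Technical Reference Manual gives the cost of LDM and STM as 1 + N cycles for N registers. The sketch below applies that formula to a register list expressed as a bit mask, with bit i standing for Ri. The mask representation and the names reg_count and ldm_stm_cycles are assumptions made for illustration, and the result only holds for zero-wait-state memory.

```c
#include <stdint.h>

/* Number of registers named in a register-list bit mask (bit i = Ri). */
unsigned reg_count(uint16_t reg_list)
{
    unsigned n = 0;
    while (reg_list != 0u) {
        n += reg_list & 1u;                 /* count the lowest bit */
        reg_list = (uint16_t)(reg_list >> 1);
    }
    return n;
}

/* Estimated cycles for LDM/STM with the given register list:
 * 1 + N for N registers, assuming zero wait states. */
unsigned ldm_stm_cycles(uint16_t reg_list)
{
    unsigned n = reg_count(reg_list);
    return (n == 0u) ? 0u : 1u + n;
}
```

For the examples above, {R2-R5} is the mask 0x003C (4 registers, 5 cycles) and {R4-R8} is 0x01F0 (5 registers, 6 cycles).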

Memory Wait States

The LDR/STR timing shown above assumes zero-wait-state memory. Slower memories, such as on-chip flash at high clock frequencies, require wait states. Wait states are a property of the memory system rather than of the Cortex-M4 core: they are typically configured in a vendor-specific flash or memory controller, and the available range depends on the device.

Each wait state stalls the pipeline for one additional cycle during the data phase of the access. For example, an LDR with 3 wait states:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase, stalled)
    Cycle 5: Execute (data phase, stalled)
    Cycle 6: Execute (data phase, stalled)
    Cycle 7: Execute (data phase completes)

So with N wait states, each memory access takes N additional clock cycles.
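The effect of wait states on a run of loads can be estimated with a simple bus model. This is a sketch under stated assumptions, not a description of any particular device: it assumes an AHB-style bus where the address phase of each access overlaps the data phase of the previous one, that every access sees the same number of wait states, and that instruction fetches do not compete for the same memory. The name load_run_cycles is hypothetical.

```c
/* Estimated Execute-stage cycles for n back-to-back word loads when each
 * data phase is extended by `wait_states` cycles.
 * 1 cycle for the first address phase, then (1 + wait_states) cycles of
 * data phase per load. */
unsigned load_run_cycles(unsigned n, unsigned wait_states)
{
    if (n == 0u) {
        return 0u;
    }
    return 1u + n * (1u + wait_states);
}
```

With zero wait states a single load takes 2 cycles, and with 3 wait states it takes 5. Four loads with 1 wait state each take 9 cycles, so wait states quickly dominate the cost of a load-heavy loop.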

Other Factors Affecting Timing

There are some other considerations as well when looking at LDR/STR timing:

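Pipeline interlocks can add stalls and increase timing.
The Cortex-M4 core has no built-in cache, but many microcontrollers add a flash accelerator or cache in front of flash, so hits vs misses will affect the memory access time.
Bus contention from other masters, such as DMA controllers, can delay memory access.
Unaligned accesses are split into multiple aligned bus transfers and take extra cycles.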
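Whether an access is unaligned is easy to check in code. The helper below is a minimal sketch with a hypothetical name, is_naturally_aligned; it assumes the access size is a power of two (1, 2 or 4 bytes) and says nothing about how many extra cycles an unaligned access costs.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes at `addr` is naturally aligned.
 * `size` must be a power of two (1, 2 or 4 for byte, halfword, word). */
bool is_naturally_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1u)) == 0u;
}
```

A word access at 0x20000002 fails this check, while a halfword access at the same address passes.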

So in a complex system, actual timings can vary quite a bit from the ideal scenarios described here. But this provides a baseline understanding to build upon.

Summary

Key points:

LDR on the Cortex-M4 takes 2 cycles under ideal conditions and STR takes 1 to 2; neighboring loads and stores can overlap.
Load/store multiple instructions take 1 + N cycles for N registers.
Wait states in the memory system add stall cycles to each access.
Real-world timings are affected by many other factors.

By understanding the pipeline and how instructions flow through it, we can get a better idea of the Load/Store timing. This sets realistic performance expectations and also helps identify optimization opportunities in code.
