The Cortex-M4 processor implements the ARMv7E-M architecture. Among its most frequently used instructions are LDR (load register) and STR (store register), which transfer data between memory and registers. However, the timing of these instructions is often misunderstood. This article takes a detailed look at how the LDR and STR instructions work on the Cortex-M4, including their pipeline stages and timing.
LDR/STR Instruction Overview
The LDR and STR instructions on the Cortex-M4 processor transfer a 32-bit word between a register and memory. The immediate-offset syntax is:
    LDR Rt, [Rn, #offset]
    STR Rt, [Rn, #offset]
Where:
- Rt is the register to load into (LDR) or to store from (STR)
- Rn is the base register containing the address
- offset is an optional immediate offset added to the base address
Some examples:
    LDR R1, [R2, #8]   ; Load word from address R2 + 8 into R1
    STR R5, [R3]       ; Store R5 to the address in R3
The key thing to note is that the memory access happens using the address obtained by adding the base register and offset. This provides flexibility in accessing different memory locations.
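The address calculation can be illustrated with a small host-side model in C. This is only a sketch: the 64-byte `ram` array and the helper names `ldr_word` and `str_word` are hypothetical, and the model performs no bounds or alignment checks.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-byte RAM standing in for the target's memory. */
static uint8_t ram[64];

/* Models LDR Rt, [Rn, #offset]: returns the word at address rn + offset. */
static uint32_t ldr_word(uint32_t rn, uint32_t offset)
{
    uint32_t value;
    memcpy(&value, &ram[rn + offset], sizeof value);
    return value;
}

/* Models STR Rt, [Rn, #offset]: writes rt to the word at address rn + offset. */
static void str_word(uint32_t rt, uint32_t rn, uint32_t offset)
{
    memcpy(&ram[rn + offset], &rt, sizeof rt);
}
```

In this model, a store with base 16 and offset 8 and a load with base 24 and offset 0 touch the same word, because both resolve to address 24.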
Cortex-M4 Pipeline Stages
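To understand the timing of LDR/STR instructions, we first need to look at the pipeline of the Cortex-M4 processor. The pipeline has 3 stages:
- Fetch – Instruction is fetched from memory
- Decode – Instruction is decoded and its register operands are read
- Execute – Instruction is executed
Unlike the classic 5-stage RISC pipeline, the Cortex-M4 has no separate Memory or Writeback stage. A memory access instruction like LDR/STR performs its data access while it is in the Execute stage, in 2 bus phases:
- Address phase – Address is sent to memory
- Data phase – Data is transferred, and for a load the result is written to the register
So a memory access instruction still passes through only 3 stages, but it occupies Execute for more than one clock cycle. Because the stages overlap, with one instruction executing while the next is decoded and a third is fetched, the cost of an instruction is counted as the cycles it spends in Execute, not as the number of stages it passes through.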
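The effect of pipeline overlap on a sequence of instructions can be sketched in C. The model below assumes the 3-stage Fetch/Decode/Execute pipeline of the Cortex-M4 with no stalls or branches: the first instruction needs 2 cycles to reach Execute, and after that only the Execute cycles of each instruction add to the total. The function name and the fill constant are illustrative, not taken from any Arm document.

```c
#include <stddef.h>

/* Cycles before the first instruction reaches Execute (Fetch + Decode). */
enum { PIPELINE_FILL_CYCLES = 2 };

/*
 * Idealized total for a straight-line sequence: pipeline fill plus the
 * Execute cycles of every instruction. Stalls and branches are ignored.
 */
static unsigned total_cycles(const unsigned *exec_cycles, size_t count)
{
    unsigned total;
    size_t i;

    if (count == 0) {
        return 0;
    }
    total = PIPELINE_FILL_CYCLES;
    for (i = 0; i < count; i++) {
        total += exec_cycles[i];
    }
    return total;
}
```

For example, a 1-cycle instruction followed by a 2-cycle instruction and another 1-cycle instruction gives 2 + 1 + 2 + 1 = 6 cycles in this model.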
LDR Instruction Timing
When an LDR instruction is executed on the Cortex-M4, it goes through the following steps:
- Fetch – LDR instruction fetched from memory
- Decode – LDR instruction decoded
- Execute, address phase – Address (base register + offset) presented to memory
- Execute, data phase – Word returned by memory and written to the destination register
The Cortex-M4 Technical Reference Manual lists a single LDR at 2 cycles, assuming zero-wait-state memory. These are the 2 cycles spent in Execute; Fetch and Decode overlap with earlier instructions and are not charged to the LDR. Followed from fetch to completion, an isolated LDR looks like:
    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase)
So in summary, an LDR instruction costs 2 clock cycles on the Cortex-M4. Neighboring load and store instructions can pipeline their address and data phases, so in a run of back-to-back loads each additional LDR can complete in a single cycle.
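As a rough illustration of how back-to-back loads add up, here is a sketch in C. It assumes the figures given in the Cortex-M4 Technical Reference Manual: 2 cycles for a single LDR, with neighboring loads pipelining their address and data phases so that each further load adds 1 cycle. It also assumes zero-wait-state memory and no address dependencies between the loads. The names are hypothetical.

```c
/* Assumed costs: figures from the Cortex-M4 TRM, ideal memory. */
enum {
    SINGLE_ACCESS_CYCLES = 2,    /* isolated LDR: address + data phase */
    PIPELINED_ACCESS_CYCLES = 1  /* each further back-to-back LDR      */
};

/* Estimated cycles for `count` independent, back-to-back LDR instructions. */
static unsigned load_run_cycles(unsigned count)
{
    if (count == 0) {
        return 0;
    }
    return SINGLE_ACCESS_CYCLES + (count - 1) * PIPELINED_ACCESS_CYCLES;
}
```

Under these assumptions a run of 4 loads is estimated at 5 cycles rather than 8, which is why grouping loads together tends to help.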
STR Instruction Timing
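The STR instruction timing is similar to LDR, with the same 3 pipeline stages:
- Fetch – STR instruction fetched
- Decode – STR decoded
- Execute, address phase – Address sent to memory
- Execute, data phase – Data sent to memory; no register is written
The Cortex-M4 Technical Reference Manual also lists a single STR at 2 cycles. Followed from fetch to completion, the timing diagram is:
    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase)
In summary, STR is also listed at 2 clock cycles. Because a store has no result to write back to a register, its data phase can overlap with the execution of the next instruction, so in practice a store can cost as little as 1 cycle.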
Load/Store Multiple Instructions
The LDM and STM instructions on Cortex-M4 transfer multiple words between memory and registers. For example:
    LDM R1!, {R2-R5}   ; Load words into R2-R5 from address in R1
    STM R3!, {R4-R8}   ; Store R4-R8 starting at address in R3
These transfer the registers one after another, pipelining the address and data phases of successive words. The Cortex-M4 Technical Reference Manual lists both instructions at 1 + N cycles for N registers:
- 1 register = 2 cycles
- 2 registers = 3 cycles
- 3 registers = 4 cycles
And so on. So for N registers, the total time is 1 + N clock cycles, assuming zero-wait-state memory.
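The register-count rule for LDM/STM can be sketched in C. This assumes the 1 + N cycle figure that the Cortex-M4 Technical Reference Manual gives for both instructions with zero-wait-state memory. The register list is represented as a 16-bit mask with bit i set for Ri, and the function names are hypothetical.

```c
#include <stdint.h>

/* Counts the registers named in a register-list bitmask (bit i = Ri). */
static unsigned count_registers(uint16_t register_list)
{
    unsigned count = 0;

    while (register_list != 0) {
        /* Clear the lowest set bit; one iteration per listed register. */
        register_list &= (uint16_t)(register_list - 1);
        count++;
    }
    return count;
}

/* Estimated LDM/STM cost, assuming the TRM's 1 + N cycles and ideal memory. */
static unsigned ldm_stm_cycles(uint16_t register_list)
{
    return 1u + count_registers(register_list);
}
```

With this model, LDM R1!, {R2-R5} corresponds to the mask 0x003C (4 registers, 5 cycles) and STM R3!, {R4-R8} to 0x01F0 (5 registers, 6 cycles). An empty list is not a valid instruction, so the result for a mask of 0 has no meaning.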
Memory Wait States
The LDR/STR timing shown above assumes a single cycle memory access. However, accessing slower memories can require wait states. Wait states are not a setting of the Cortex-M4 core itself. They are inserted by the memory system, for example a flash or external memory controller, and the number that can be configured depends on the chip vendor's implementation.
Each wait state extends the data phase of the access by one stall cycle. For example, an isolated load with 3 wait states:
    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase, stall)
    Cycle 5: Execute (data phase, stall)
    Cycle 6: Execute (data phase, stall)
    Cycle 7: Execute (data phase completes)
So with N wait states, the cost of an isolated load becomes 2 + N clock cycles.
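The wait-state arithmetic can be written as a small C sketch. It assumes an isolated access made up of one address-phase cycle and one data-phase cycle, which is the 2-cycle figure the Cortex-M4 Technical Reference Manual lists for a single LDR or STR, with every wait state stretching the data phase by one cycle. Fetch and Decode are not counted, and the names are hypothetical.

```c
/* Assumed phase lengths for an isolated access to ideal memory. */
enum { ADDRESS_PHASE_CYCLES = 1, DATA_PHASE_CYCLES = 1 };

/* Estimated Execute cycles for one isolated LDR/STR with wait states. */
static unsigned access_cycles(unsigned wait_states)
{
    return ADDRESS_PHASE_CYCLES + DATA_PHASE_CYCLES + wait_states;
}
```

For 3 wait states this gives 5 Execute cycles for an isolated access.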
Other Factors Affecting Timing
There are some other considerations as well when looking at LDR/STR timing:
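- Pipeline interlocks can add stalls, for example when the address of a load or store depends on the result of the instruction just before it.
- The Cortex-M4 core has no built-in cache, but many vendors add a flash accelerator or cache, and hits vs misses there will affect the memory access time.
- Bus contention from other masters, such as a DMA controller, can delay memory access.
- Unaligned accesses may require extra cycles, because they can be split into more than one bus transfer.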
So in a complex system, actual timings can vary quite a bit from the ideal scenarios described here. But this provides a baseline understanding to build upon.
Summary
Key points:
- LDR and STR on Cortex-M4 are listed at 2 cycles each under ideal conditions, and neighboring loads and stores can pipeline so that each takes a single cycle.
- Load/store multiple takes 1 + N cycles for N registers.
- Wait states inserted by slow memory add to these figures.
- Real-world timings are affected by many other factors.
By understanding the pipeline and how instructions flow through it, we can get a better idea of the Load/Store timing. This sets realistic performance expectations and also helps identify optimization opportunities in code.