Demystifying Cortex M4 LDR/STR Instruction Timing

Andrew Irwin
Last updated: October 5, 2023 10:08 am

The Cortex-M4 processor implements the ARMv7E-M architecture. This is a load/store architecture: data moves between memory and registers only through load and store instructions such as LDR (load register) and STR (store register). However, the timing of these instructions can sometimes be unclear. This article takes a detailed look at how the LDR and STR instructions work on the Cortex-M4, including how they pass through the pipeline and how many cycles they cost.

Contents
  • LDR/STR Instruction Overview
  • Cortex-M4 Pipeline Stages
  • LDR Instruction Timing
  • STR Instruction Timing
  • Load/Store Multiple Instructions
  • Memory Wait States
  • Other Factors Affecting Timing
  • Summary

LDR/STR Instruction Overview

The LDR and STR instructions on the Cortex-M4 processor allow transferring a word (32-bit) between a register and memory. The syntax is:

    LDR Rd, [Rn, #offset]
    STR Rd, [Rn, #offset]

Where:

  • Rd is the destination register for LDR and the source register for STR (Arm documentation usually calls it Rt)
  • Rn is the base register containing the address
  • #offset is an optional immediate offset added to the base address

Some examples:

    LDR R1, [R2, #8] ; Load word from address in R2 + 8 into R1
    STR R5, [R3]     ; Store R5 into address in R3

The key thing to note is that the memory access happens using the address obtained by adding the base register and offset. This provides flexibility in accessing different memory locations.
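The address arithmetic can be modelled in a few lines of C. This is only an illustrative sketch: the `mem` array and the helpers `effective_address`, `ldr_imm`, and `str_imm` are hypothetical names standing in for memory and the two instructions, and word-aligned addresses inside a tiny 64-byte memory are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 16u

/* Hypothetical 64-byte memory, indexed by word internally. */
static uint32_t mem[MEM_WORDS];

/* Effective address = base register + immediate offset (modulo 2^32). */
uint32_t effective_address(uint32_t rn, int32_t offset)
{
    return rn + (uint32_t)offset;
}

/* Model of LDR Rt, [Rn, #offset]: returns the loaded word. */
uint32_t ldr_imm(uint32_t rn, int32_t offset)
{
    uint32_t addr = effective_address(rn, offset);
    assert(addr % 4u == 0u && addr / 4u < MEM_WORDS);
    return mem[addr / 4u];
}

/* Model of STR Rt, [Rn, #offset]: writes rt to memory. */
void str_imm(uint32_t rt, uint32_t rn, int32_t offset)
{
    uint32_t addr = effective_address(rn, offset);
    assert(addr % 4u == 0u && addr / 4u < MEM_WORDS);
    mem[addr / 4u] = rt;
}
```

Storing through base 8 with offset 4 and loading through base 12 with offset 0 touch the same word, which is the flexibility the base-plus-offset form provides.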

Cortex-M4 Pipeline Stages

To understand the timing of LDR/STR instructions, we need to first look at the pipeline stages of the Cortex-M4 processor. The pipeline consists of 3 main stages:

  1. Fetch – Instruction is fetched from memory
  2. Decode – Instruction is decoded
  3. Execute – Instruction is executed

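Memory access instructions like LDR/STR do not add pipeline stages. Instead, they occupy the Execute stage for more than one cycle, because a data access on the bus has two phases:

  1. Address phase – Address is sent to memory
  2. Data phase – Data is read from or written to memory

So the Cortex-M4 pipeline has 3 stages in total, and a memory access instruction normally spends 2 cycles in Execute where most data-processing instructions spend 1. Fetch and Decode overlap with the execution of earlier instructions, so in steady state they do not add to an instruction's cost.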
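As a rough model of a three-stage pipeline (Fetch, Decode, Execute), the first instruction needs two cycles to reach Execute, and after that the total is governed by how long each instruction occupies Execute. The sketch below assumes no stalls and that Fetch and Decode always overlap with earlier instructions; `pipeline_cycles` is a hypothetical helper, not part of any Arm tool or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cycles spent in Fetch and Decode before the first Execute. */
#define PIPELINE_FILL_CYCLES 2u

/*
 * Idealised cycle count for a straight-line instruction sequence.
 * exec_cycles[i] is the number of cycles instruction i spends in Execute.
 */
uint32_t pipeline_cycles(const uint8_t *exec_cycles, size_t count)
{
    uint32_t total;

    assert(exec_cycles != NULL || count == 0u);
    if (count == 0u) {
        return 0u;
    }

    total = PIPELINE_FILL_CYCLES;
    for (size_t i = 0; i < count; i++) {
        total += exec_cycles[i];
    }
    return total;
}
```

For example, a 1-cycle ADD, a 2-cycle LDR, and another 1-cycle ADD give 2 + 1 + 2 + 1 = 6 cycles from the first fetch to the last result under these assumptions.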

LDR Instruction Timing

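When an LDR instruction is executed on the Cortex-M4, it goes through the following steps:

  1. Fetch – LDR instruction fetched from memory
  2. Decode – LDR instruction decoded
  3. Execute, first cycle – Address calculated using base register + offset and sent to memory (address phase)
  4. Execute, second cycle – Word returned by memory and written to the destination register (data phase)

Fetch and Decode overlap with the execution of earlier instructions, so the cost that matters is the time spent in Execute. The Cortex-M4 Technical Reference Manual gives this as 2 cycles for an LDR from zero-wait-state memory. The timing diagram for a single LDR looks like:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase, register written)

When loads and stores follow each other directly, the address phase of one access can overlap the data phase of the previous one, so each additional access can cost as little as 1 cycle.

So in summary, a single LDR costs 2 clock cycles on the Cortex-M4 under ideal conditions, not counting the Fetch and Decode cycles that overlap with other instructions.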

STR Instruction Timing

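The STR instruction timing is similar to LDR, with the register value travelling to memory instead:

  1. Fetch – STR instruction fetched
  2. Decode – STR decoded
  3. Execute, first cycle – Address calculated and sent to memory (address phase)
  4. Execute, second cycle – Register value written to memory (data phase)

A store has no register writeback. The timing diagram is:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase)

The Technical Reference Manual lists STR at 2 cycles. In practice a store often costs less: the manual notes that the data phase of an STR with an immediate offset proceeds while the next instruction executes, so it typically costs 1 cycle unless the write buffer is full or disabled.

In summary, STR takes at most 2 clock cycles under ideal conditions, and often effectively 1.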
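The effect of overlapping address and data phases on a run of single loads and stores can be estimated with a small model. This sketch uses the Technical Reference Manual figures of 2 cycles for an isolated LDR or STR and 1 extra cycle for each neighbouring access that pipelines; it assumes zero wait states and that every access in the run can be pipelined, which real code does not always achieve. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SINGLE_ACCESS_CYCLES 2u  /* isolated LDR/STR, zero wait states */

/* Cost if no access overlaps with its neighbour. */
uint32_t unpipelined_cycles(uint32_t n_accesses)
{
    assert(n_accesses <= UINT32_MAX / SINGLE_ACCESS_CYCLES);
    return n_accesses * SINGLE_ACCESS_CYCLES;
}

/* Best case: the first access costs 2 cycles, each further one adds 1. */
uint32_t pipelined_cycles(uint32_t n_accesses)
{
    assert(n_accesses < UINT32_MAX);
    return (n_accesses == 0u) ? 0u : n_accesses + 1u;
}
```

Under these assumptions four back-to-back loads cost between 5 and 8 cycles, which is why grouping memory accesses together tends to help on this core.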

Load/Store Multiple Instructions

The LDM and STM instructions on Cortex-M4 allow transferring multiple words between memory and registers. For example:

    LDM R1!, {R2-R5} ; Load words into R2-R5 from address in R1
    STM R3!, {R4-R8} ; Store R4-R8 into address in R3

These transfer one word per register, but the transfers are pipelined on the bus, so each additional register costs one extra cycle rather than a full LDR or STR. With zero-wait-state memory, the timing depends on how many registers are being transferred:

  • 1 register = 2 cycles
  • 2 registers = 3 cycles
  • 3 registers = 4 cycles

And so on. So for N registers, the total time is 1 + N clock cycles.
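The register-count rule can be checked against the two example instructions with a short sketch. It uses the Technical Reference Manual figure of 1 + N cycles for LDM and STM, and represents the register list as a 16-bit mask in which bit i stands for Ri. It assumes zero wait states and ignores the extra pipeline refill cycles incurred when PC is in the list; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Count the registers named in a 16-bit register list (bit i = Ri). */
unsigned reglist_count(uint16_t reglist)
{
    unsigned count = 0;

    while (reglist != 0u) {
        count += reglist & 1u;
        reglist >>= 1;
    }
    return count;
}

/* Idealised LDM/STM cost: 1 + N cycles for N registers. */
unsigned ldm_stm_cycles(uint16_t reglist)
{
    unsigned n = reglist_count(reglist);

    assert(n >= 1u);  /* an empty register list is not a valid encoding */
    return 1u + n;
}
```

With this model, `LDM R1!, {R2-R5}` (mask 0x003C) comes to 5 cycles and `STM R3!, {R4-R8}` (mask 0x01F0) to 6 cycles.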

Memory Wait States

The LDR/STR timing shown above assumes a single cycle memory access. However, accessing slower memories can require wait states. The number of wait states is not set in the Cortex-M4 core itself. It is determined by the memory system, for example the flash controller of the microcontroller, which typically needs more wait states as the clock frequency rises.

Each wait state extends the data phase by one cycle and stalls the pipeline for that cycle. For example, with 3 wait states:

    Cycle 1: Fetch
    Cycle 2: Decode
    Cycle 3: Execute (address phase)
    Cycle 4: Execute (data phase, stall)
    Cycle 5: Execute (data phase, stall)
    Cycle 6: Execute (data phase, stall)
    Cycle 7: Execute (data phase completes)

So with N wait states, a single LDR costs 2 + N clock cycles instead of 2.
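Wait states can be folded into one estimate that covers both single and multiple transfers. The sketch below takes the Technical Reference Manual baseline of 1 + N cycles for N words (so 2 cycles for a single LDR or STR) and assumes that every data phase is extended by the same number of wait cycles. That uniformity is a simplification, since real memory systems may treat sequential accesses differently, and `transfer_cycles` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Idealised cost of transferring n_words with a fixed number of
 * wait states per data phase: one address cycle up front, then
 * (1 + wait_states) cycles for each word.
 */
uint32_t transfer_cycles(uint32_t n_words, uint32_t wait_states)
{
    assert(n_words >= 1u);
    return 1u + n_words * (1u + wait_states);
}
```

A single load with 3 wait states comes to 5 cycles in this model, and a four-register LDM with 1 wait state to 9 cycles.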

Other Factors Affecting Timing

There are some other considerations as well when looking at LDR/STR timing:

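  • Data dependencies and some addressing modes can prevent back-to-back loads and stores from being pipelined, which adds cycles.
  • The Cortex-M4 core itself has no cache, but many microcontrollers add a flash accelerator or cache, and hits vs misses there will affect the memory access time.
  • Bus contention from other masters, such as a DMA controller, can delay memory access.
  • Unaligned accesses are split into several bus transfers and require extra cycles.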

So in a complex system, actual timings can vary quite a bit from the ideal scenarios described here. But this provides a baseline understanding to build upon.
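Because so many factors interact, measuring on the actual device is usually more reliable than estimating. One common approach is the DWT cycle counter defined by the ARMv7-M architecture. The sketch below is written so it can also be exercised off-target: the registers are reached through pointers, and `cycle_counter_on_target` fills in the architectural addresses (DEMCR at 0xE000EDFC, DWT_CTRL at 0xE0001000, DWT_CYCCNT at 0xE0001004). The struct and function names are hypothetical, and the code assumes the device implements the DWT cycle counter, which is optional.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMCR_TRCENA        (1u << 24)  /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enables the cycle counter */

typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cycle_counter_t;

/* Architectural register addresses; only meaningful on the target. */
cycle_counter_t cycle_counter_on_target(void)
{
    cycle_counter_t cc = {
        (volatile uint32_t *)(uintptr_t)0xE000EDFCu,
        (volatile uint32_t *)(uintptr_t)0xE0001000u,
        (volatile uint32_t *)(uintptr_t)0xE0001004u,
    };
    return cc;
}

/* Turn on tracing, reset the counter, and start it. */
void cycle_counter_enable(const cycle_counter_t *cc)
{
    assert(cc != NULL);
    *cc->demcr |= DEMCR_TRCENA;
    *cc->dwt_cyccnt = 0u;
    *cc->dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction stays correct across one counter wrap. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, read `*cc.dwt_cyccnt` immediately before and after the code under test and pass both values to `cycles_elapsed`. Measuring an empty region first gives the overhead of the reads themselves, which should be subtracted from the result.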

Summary

Key points:

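  • A single LDR or STR on Cortex-M4 costs 2 cycles under ideal conditions, and back-to-back accesses can pipeline down to 1 cycle each.
  • Load/store multiple instructions take 1 + N cycles for N registers.
  • Wait states from slow memory extend the data phase and add directly to the cycle count.
  • Real-world timings are affected by many other factors, so measure on the target where it matters.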

By understanding the pipeline and how instructions flow through it, we can get a better idea of the Load/Store timing. This sets realistic performance expectations and also helps identify optimization opportunities in code.
