Pipelining Instructions After LDR vs STR on Cortex-M4

Load (LDR) and store (STR) instructions on the Cortex-M4 affect how the instructions that follow them move through the pipeline. The Cortex-M4 implements a 3-stage pipeline: fetch, decode, and execute. An LDR or STR can introduce stalls and bubbles when the instructions after it depend on it, so scheduling the code that follows loads and stores carefully helps performance.

Contents

The Cortex-M4 3-Stage Pipeline
LDR and STR Instructions
Pipelining After LDR
Pipelining After STR
Pipelining Principles
Example Code Sequence
Conclusion

The Cortex-M4 3-Stage Pipeline

The Cortex-M4 CPU implements a 3-stage pipeline consisting of fetch, decode, and execute stages. In the fetch stage, instructions are read from memory. In the decode stage, each instruction is decoded and its operands are identified. Finally, in the execute stage, the operation is carried out by the appropriate functional unit.

Pipelining increases performance by allowing multiple instructions to be in different stages of execution simultaneously. However, dependencies between instructions can cause stalls and bubbles. Understanding these potential hazards is key to efficient pipelining.
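As a rough illustration of why overlap helps and what stalls cost, the sketch below counts cycles for an idealized pipeline. It is a textbook simplification, not a Cortex-M4 timing model: the names `pipeline_cycles` and `unpipelined_cycles` are made up for this example, and the code assumes every instruction spends exactly one cycle in each stage.

```python
def pipeline_cycles(num_instructions: int, stages: int = 3, stall_cycles: int = 0) -> int:
    """Cycles to retire num_instructions on an idealized in-order pipeline.

    Assumption: each instruction spends one cycle per stage, so the first
    instruction needs `stages` cycles and every later one retires one cycle
    after its predecessor. Each stall cycle (bubble) adds one cycle.
    """
    if num_instructions <= 0:
        return 0
    return stages + (num_instructions - 1) + stall_cycles


def unpipelined_cycles(num_instructions: int, stages: int = 3) -> int:
    """Cycles if each instruction had to finish all stages before the next starts."""
    return max(num_instructions, 0) * stages
```

Under these assumptions, 10 instructions take 12 cycles on a 3-stage pipeline against 30 without overlap, and two bubbles raise the pipelined figure to 14.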

LDR and STR Instructions

LDR and STR are the load and store instructions of the ARM architecture. LDR loads data from memory into a register, while STR stores data from a register into memory. Their latency depends on the memory system, for example on the number of wait states.

For example:

    LDR R1, [R2]    ; Load value at address in R2 into R1
    STR R3, [R4]    ; Store value in R3 to address in R4

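This article works with a 3-cycle load latency for memory with 0 wait states, meaning that an instruction needing the loaded value cannot use it for 3 cycles. Treat that as a conservative working figure rather than a specification: the Cortex-M4 Technical Reference Manual lists a single LDR or STR as 2 cycles, notes that neighboring loads and stores can pipeline their address and data phases, and any wait states in the memory system add further cycles.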

Pipelining After LDR

When scheduling instructions after an LDR, we must consider data hazards. With the 3-cycle figure above, the loaded data is not available for 3 cycles, so an instruction that uses it as a source operand before then stalls the pipeline.

For example:

    LDR R1, [R2]
    ADD R3, R1, R4    ; Stall: R1 not ready yet

To avoid stalls, schedule independent instructions between the LDR and the first use of the loaded register. Good candidates are instructions that neither read nor write the loaded register:

    LDR R1, [R2]
    ADD R5, R6, R7      ; Independent: does not use R1
    MUL R8, R9, R10     ; Independent: does not use R1
    SUB R11, R12, R0    ; Independent: does not use R1
    ADD R3, R1, R4      ; Now R1 is ready

At least 3 independent instructions should separate the LDR from consumption of its result. This prevents pipeline stalls.
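The arithmetic behind that rule can be sketched as a one-line model. This is a minimal sketch under the article's own assumptions, not measured Cortex-M4 behavior: `load_use_stall` is a hypothetical helper, the default latency of 3 is the working figure used in this article, and each independent instruction is assumed to take exactly one cycle.

```python
def load_use_stall(independent_between: int, load_latency: int = 3) -> int:
    """Stall cycles seen by the first consumer of a loaded register.

    Assumptions (illustrative only): the loaded value becomes usable
    `load_latency` cycles after the LDR, and every independent instruction
    placed between the LDR and its consumer hides exactly one of those cycles.
    """
    if independent_between < 0:
        raise ValueError("independent_between must be non-negative")
    return max(0, load_latency - independent_between)
```

With the default figure, a consumer placed directly after the load waits 3 cycles, two independent instructions leave 1 stall cycle, and three or more leave none. Passing a different `load_latency` shows how the required gap changes with the memory system.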

Pipelining After STR

STR instructions present fewer pipelining challenges than LDR, since they do not produce a result that later instructions depend on. However, some considerations remain.

First, an STR should be separated from the preceding instruction that sets the stored register. For example, this STR depends on the ADD directly before it:

    ADD R3, R1, R2
    STR R3, [R4]      ; Stores the newly calculated R3

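Second, subsequent instructions should not modify the stored register until the STR completes:

    STR R3, [R4]
    ADD R3, R5, R6    ; BAD: alters R3 too soon

This is a scheduling concern rather than a correctness one: the core executes instructions in order, so the STR still writes the value R3 held before the ADD.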

Finally, a branch immediately after an STR may cause bubbles:

    STR R3, [R4]
    BEQ label         ; Branch execution overlaps STR writeback

Ideally, independent instructions sit between the STR and any instruction that overwrites its source register or branches:

    ADD R3, R1, R2
    STR R3, [R4]
    ADD R7, R8, R9    ; Independent instruction
    CMP R10, #0       ; Independent instruction
    BEQ label         ; Now safe to branch

Pipelining Principles

To summarize, optimal pipelining after LDR and STR on Cortex-M4 should follow these principles:

Insert at least 3 independent instructions between LDR and usage of its result
Separate STRs from instructions setting the stored register
Avoid modifying the stored register too soon after STR
Use independent instructions to prevent stalls and bubbles

Proper scheduling is key to maximizing performance after loads and stores. By understanding these hazards, you can maintain high throughput on the Cortex-M4 pipeline.
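The load-related principle can be checked mechanically. The sketch below is a hypothetical checker, not an existing tool: `Instr`, `find_load_use_hazards`, and the default `min_gap` of 3 (the working figure from this article) are all assumptions introduced for illustration. It models each instruction only by the register it writes and the registers it reads.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "LDR", "ADD", "STR"
    dest: str               # register written, "" if none (as for STR)
    srcs: Tuple[str, ...]   # registers read


def find_load_use_hazards(program: List[Instr], min_gap: int = 3) -> List[Tuple[int, int, int]]:
    """Return (load_index, use_index, gap) for each LDR whose first consumer
    follows with fewer than min_gap instructions in between."""
    hazards = []
    for i, ins in enumerate(program):
        if ins.op != "LDR":
            continue
        for j in range(i + 1, len(program)):
            later = program[j]
            if ins.dest in later.srcs:        # first instruction reading the loaded register
                gap = j - i - 1               # instructions between load and use
                if gap < min_gap:
                    hazards.append((i, j, gap))
                break
            if later.dest == ins.dest:        # overwritten before any use
                break
    return hazards


# Consumer directly after the load.
tight = [
    Instr("LDR", "R1", ("R2",)),
    Instr("ADD", "R3", ("R1", "R4")),
]

# Three independent instructions between the load and its consumer.
scheduled = [
    Instr("LDR", "R1", ("R2",)),
    Instr("ADD", "R3", ("R4", "R5")),
    Instr("MUL", "R6", ("R7", "R8")),
    Instr("ORR", "R9", ("R10", "R11")),
    Instr("ADD", "R12", ("R1", "R3")),
    Instr("STR", "", ("R12", "R0")),
]
```

Calling `find_load_use_hazards(tight)` reports the adjacent load and use, while `find_load_use_hazards(scheduled)` returns an empty list. The checker deliberately ignores memory aliasing, branches, and the store-related principles; it only measures the distance from each load to its first consumer.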

Example Code Sequence

Here is an example code sequence illustrating the scheduling described above for LDR and STR on the Cortex-M4:

    ; Load data
    LDR R1, [R2]
    ; 3 independent instructions before using R1
    ADD R3, R4, R5
    MUL R6, R7, R8
    ORR R9, R10, R11
    ; Now safe to use R1
    ADD R12, R1, R3
    ; Store result (R0 is assumed to hold a valid destination address)
    STR R12, [R0]
    ; 2 independent instructions before branching
    ANDS R10, R11, #1
    LSLS R10, R10, #3
    ; Branch
    BEQ done

This sequence schedules independent operations while data is being loaded and stored, so the pipeline keeps doing useful work instead of stalling.

Conclusion

Scheduling the instructions around loads and stores well is key to maximizing Cortex-M4 performance. Understanding the data hazards involved makes it possible to avoid stalls and bubbles. Follow these best practices:

Separate an LDR from the use of its result by 3 or more independent instructions
Space STRs away from instructions that set the stored register
Use independent instructions to prevent hazards

Good scheduling interleaves loads, stores, and other operations so that none of them waits on another. With these guidelines, the 3-stage Cortex-M4 pipeline can sustain high instruction throughput.
