
ARM Cortex-M0: Number of Clock Cycles for LDR Instruction

Eileen David
Last updated: September 15, 2023 3:31 am

The ARM Cortex-M0 is one of the most popular ARM processor cores used in microcontrollers and other embedded systems. It is a highly energy-efficient 32-bit RISC CPU that is widely used in IoT and wearable devices. One of the key questions embedded developers ask about any processor is how many clock cycles each instruction takes to execute, because this determines performance and how efficiently algorithms can be coded. This article focuses on the number of clock cycles the LDR (Load Register) instruction takes on the Cortex-M0.

Contents
  • What is the LDR Instruction in ARM Cortex-M0?
  • Clock Cycles for LDR Instruction
  • 1. Loading from Flash Memory
  • 2. Loading from SRAM Memory
  • 3. Loading from Peripherals
  • 4. Cached vs Non-Cached Access
  • 5. Effect of Speculation
  • Real-World LDR Performance
  • Coding Efficiently Using LDR Cycle Counts
  • Conclusion

What is the LDR Instruction in ARM Cortex-M0?

The LDR instruction in ARM Cortex-M0 and other ARM cores is used for loading data from memory into a register. The syntax for LDR is:

LDR Rd, [Rn, #offset]

Here,

  • Rd = Destination Register where the data is loaded
  • Rn = Base Register which contains the address from where data is loaded
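  • offset = An optional immediate offset that is added to the base address in Rn. For a word load with a base register in R0 to R7, it must be a multiple of 4 between 0 and 124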

For example:

LDR R1, [R2, #8]

This instruction loads the 32-bit data value from the memory address in R2 + 8 into R1. The offset value of 8 bytes is added to the address in R2 to get the final memory location.
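
To make the address calculation concrete, here is a minimal C sketch of what the example above computes. The helper name ldr_word is hypothetical and only models the instruction on a host machine; it is not how a compiler or the hardware implements LDR.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of any library): models the effect of
 * LDR Rd, [Rn, #offset]. 'base' plays the role of Rn, 'offset' is the
 * byte offset, and the return value is what would land in Rd. */
static uint32_t ldr_word(const void *base, uint32_t offset)
{
    uint32_t value;
    /* Effective address = base + offset, counted in bytes.
     * memcpy keeps the host-side model free of aliasing problems. */
    memcpy(&value, (const uint8_t *)base + offset, sizeof value);
    return value;
}
```

With a four-word array holding 10, 20, 30 and 40, ldr_word(array, 8) returns 30, because a byte offset of 8 selects the third 32-bit word.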

Clock Cycles for LDR Instruction

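The number of clock cycles taken by the LDR instruction on Cortex-M0 starts from a fixed baseline: according to the instruction timing table in the Cortex-M0 Technical Reference Manual, a single load takes 2 clock cycles when the memory responds with zero wait states. On top of that baseline, the count depends on a few factors:

  • Where the data is located (flash, SRAM, peripherals etc.), because each region can add its own wait states
  • Alignment of the data address, because the Cortex-M0 does not support unaligned accesses
  • Chip-level features around the core, such as flash prefetch buffers or accelerators added by the silicon vendor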

Let’s go through the details one by one:

1. Loading from Flash Memory

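Flash memory refers to the internal flash integrated within microcontrollers. This is where the executable program code and constant data are stored. Flash is often slower than the CPU clock, so many chips insert wait states on flash reads at higher clock frequencies. When loading data from flash using LDR, the following clock cycles are applicable on Cortex-M0:

  • Aligned Word Access (LDR) – 2 clock cycles plus flash wait states
  • Halfword Access (LDRH) – 2 clock cycles plus flash wait states
  • Byte Access (LDRB) – 2 clock cycles plus flash wait states
  • Unaligned Word Access – not supported; the access raises a HardFault

As we can see, word, halfword and byte loads all cost the same. With one flash wait state, for example, a load takes about 3 clock cycles. A word is 32 bits or 4 bytes on ARM Cortex-M0. Aligned means the data address is a multiple of 4 for a word and a multiple of 2 for a halfword. The Cortex-M0 has no hardware support for unaligned loads, so an unaligned word or halfword access faults instead of taking extra cycles.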
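
The relationship between wait states, alignment and cycle count can be sketched as a small C model. This assumes the 2-cycle zero-wait-state baseline for a single load from the Cortex-M0 instruction timing table and assumes each wait state adds one cycle; the function names are hypothetical, and the wait-state count for a given clock frequency has to come from the chip's datasheet.

```c
#include <stdint.h>

/* Assumed baseline: a single LDR/LDRH/LDRB takes 2 cycles on Cortex-M0
 * when the memory responds with zero wait states. */
#define LDR_BASE_CYCLES 2u

/* Hypothetical estimate: each wait state stalls the load by one cycle. */
static uint32_t ldr_cycles(uint32_t wait_states)
{
    return LDR_BASE_CYCLES + wait_states;
}

/* A word access is aligned when the address is a multiple of 4.
 * Returns 1 if aligned, 0 otherwise. */
static int is_word_aligned(uint32_t address)
{
    return (address & 3u) == 0u;
}
```

For example, ldr_cycles(1) gives 3 for a flash configured with one wait state, and is_word_aligned(0x20000002) is 0, which on real hardware would mean a fault rather than a slower load.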

2. Loading from SRAM Memory

SRAM refers to the internal SRAM integrated on most microcontrollers. SRAM usually provides faster access than flash, and data can also be written to SRAM. The clock cycles for LDR from zero-wait-state SRAM are:

  • Aligned Word Access (LDR) – 2 clock cycles
  • Halfword Access (LDRH) – 2 clock cycles
  • Byte Access (LDRB) – 2 clock cycles
  • Unaligned Word Access – not supported; the access raises a HardFault

On most chips SRAM gives the fastest access because it responds without wait states at the full CPU clock. Even so, a load takes 2 clock cycles rather than 1: the Cortex-M0 has a single bus interface shared by instruction fetches and data accesses, so the data transfer occupies a bus cycle of its own.

3. Loading from Peripherals

Microcontrollers contain various integrated peripherals like timers, ADCs, UARTs, I2C etc. The ARM Cortex-M0 core accesses their memory-mapped registers using ordinary LDR instructions. The core-side cost is the same 2 cycles, and the peripheral bus adds its own wait states on top. The total depends on the specific peripheral and chip implementation. As a rough guide, on chips where peripherals sit behind a bus bridge, a word read of a peripheral register often takes around 3 to 5 clock cycles.
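
In C, a peripheral register read is usually written as a volatile access, which the compiler turns into a load with the register's offset. The sketch below uses a made-up register layout (demo_periph_t and its field names are assumptions for illustration, not a real device header) and can be exercised on a host by pointing it at an ordinary struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block, for illustration only. */
typedef struct {
    volatile uint32_t CTRL;    /* offset 0 */
    volatile uint32_t STATUS;  /* offset 4 */
    volatile uint32_t DATA;    /* offset 8 */
} demo_periph_t;

/* The layout must match the assumed register offsets. */
_Static_assert(offsetof(demo_periph_t, DATA) == 8, "DATA must be at offset 8");

/* 'volatile' forces a real load on every call; on Cortex-M0 a compiler
 * would typically emit something like LDR Rd, [Rn, #8] for this read. */
static uint32_t periph_read_data(const demo_periph_t *p)
{
    return p->DATA;
}
```

On a real chip the pointer would be the peripheral's base address from the vendor's device header, and each call would pay the bus wait states described above.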

4. Cached vs Non-Cached Access

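The Cortex-M0 core itself has no cache. Some Cortex-M0 chips add a vendor-specific flash accelerator, prefetch buffer or small cache in front of the flash to hide its wait states. When an LDR targets flash on such a chip, it usually takes:

  • Hit in the accelerator or buffer – 2 clock cycles (0 wait states)
  • Miss, or accelerator disabled – 2 clock cycles plus the flash wait states

So with such a feature enabled, repeated loads from recently accessed flash addresses can run at the zero-wait-state rate. Loads from SRAM are unaffected.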

5. Effect of Speculation

Speculative and out-of-order execution are techniques that let a processor start work on instructions before dependencies and branch outcomes are resolved. The Cortex-M0 uses neither: it has a simple 3-stage in-order pipeline with no branch prediction, and a taken branch costs 3 clock cycles while the pipeline refills. What some chips do provide is a flash prefetch buffer that reads ahead sequentially. This mainly helps instruction fetches, though a load that happens to hit the buffer may complete in 2 cycles instead of 3 on a part with one flash wait state.

Real-World LDR Performance

In real embedded applications, LDR access times depend on the memory region, the configured wait states and any flash acceleration implemented around the Cortex-M0 core. Typical real-world observations are:

  • LDR from zero-wait-state SRAM takes 2 clock cycles, regardless of access order
  • LDR from flash takes 2 clock cycles plus the configured wait states, unless a prefetch buffer or accelerator hides them
  • The gain from a flash accelerator depends on clock frequency, wait states and access pattern, so it is best measured on the target hardware
  • Taken branches cost 3 clock cycles each, since there is no branch prediction to hide the pipeline refill

So while the instruction timing table gives ideal figures, real embedded software will see some variation depending on data placement, wait states and how the surrounding code branches.
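
One common way to check these figures on hardware is to read the SysTick counter before and after a block of loads. The sketch below covers only the arithmetic, so it runs on a host: it assumes SysTick is clocked from the CPU clock, that its reload value is 0xFFFFFF (the full 24-bit range), and that fewer than 2^24 cycles elapse between the two readings. The function names are hypothetical.

```c
#include <stdint.h>

/* SysTick is a 24-bit down-counter. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Cycles elapsed between two readings of the current-value register.
 * Because the counter counts DOWN, elapsed = start - end, wrapped to
 * 24 bits (valid only with the assumed reload value of 0xFFFFFF). */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MASK;
}

/* Average cost per load: time a loop with the loads, time the same loop
 * without them, and divide the difference by the number of loads.
 * n_loads must be non-zero. */
static uint32_t cycles_per_load(uint32_t with_loads, uint32_t baseline,
                                uint32_t n_loads)
{
    return (with_loads - baseline) / n_loads;
}
```

If a loop containing 1000 loads measures 2100 cycles and the same loop without them measures 100, cycles_per_load(2100, 100, 1000) gives 2, which would match zero-wait-state memory.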

Coding Efficiently Using LDR Cycle Counts

Knowing the LDR cycle timings on Cortex-M0 allows developers to write efficient code by:

  • Placing frequently accessed data in SRAM rather than flash when the flash runs with wait states
  • Keeping data naturally aligned, and avoiding pointer casts that could produce unaligned accesses and faults
  • Loading several registers with one LDM where possible, which takes 1 + N cycles for N registers instead of 2 cycles per separate LDR
  • Reusing values already held in registers instead of reloading them from memory
  • Keeping taken branches out of tight loops where possible, since each one costs 3 cycles

Performance optimization on microcontrollers depends heavily on memory placement, wait states and how many loads and branches a loop really needs. The number of cycles for basic instructions like LDR provides key insights for developers to optimize their Cortex-M0 code.
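
As a rough illustration of why combining loads can pay off, the sketch below compares N separate LDR instructions with a single LDM of N registers. It assumes zero-wait-state memory and the Cortex-M0 timing table figures of 2 cycles per LDR and 1 + N cycles per LDM; the function names are hypothetical.

```c
#include <stdint.h>

/* N separate single loads at the assumed 2 cycles each. */
static uint32_t separate_ldr_cycles(uint32_t n_regs)
{
    return 2u * n_regs;
}

/* One load-multiple of N registers at the assumed 1 + N cycles. */
static uint32_t ldm_cycles(uint32_t n_regs)
{
    return 1u + n_regs;
}

/* Cycles saved by using one LDM instead of N LDRs (valid for N >= 1). */
static uint32_t ldm_savings(uint32_t n_regs)
{
    return separate_ldr_cycles(n_regs) - ldm_cycles(n_regs);
}
```

For four registers this gives 8 cycles for separate loads against 5 for one LDM, a saving of 3 cycles. For a single register there is no saving, so the benefit only appears when several consecutive words are needed.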

Conclusion

To summarize, the ARM Cortex-M0 LDR instruction takes 2 clock cycles when loading from zero-wait-state memory such as on-chip SRAM, 2 cycles plus any wait states when loading from flash, and often around 3 to 5 cycles from peripheral registers that sit behind a bus bridge. Word, halfword and byte loads cost the same, and unaligned accesses are not supported: they raise a HardFault. The core has no cache, branch prediction or speculative execution, although vendor flash accelerators and prefetch buffers can hide flash wait states. Real-world timings therefore vary from chip to chip. By understanding the baseline cycle count and what adds to it, developers can use data placement, alignment, register reuse and careful branching to optimize LDR access and the performance of Cortex-M0 programs.
