Cortex-M4 Write Buffer Explained

The Cortex-M4 processor includes a write buffer to improve performance when writing data to memory. The write buffer allows the processor to collect multiple writes into a buffer before committing them to memory. This increases efficiency by reducing the number of individual write transactions.

Why Use a Write Buffer?

Without a write buffer, the processor would have to perform a write to memory for every store instruction. This requires setting up the address and data for each write, then waiting for the write transaction to complete before moving on. With a buffer, the processor can collect multiple writes together and commit them all at once.

For example, consider a sequence of instructions like:

    STR R1, [R2]
    STR R3, [R4]
    STR R5, [R6]

Without a buffer, each store would require a separate write transaction, and the processor would wait for each one to finish. With a 4-entry write buffer, all 3 writes can be collected in the buffer while the processor keeps executing, and if the target addresses are consecutive they can be committed together. This improves performance by minimizing the number of individual write transactions.

Write Buffer Operation

The Cortex-M4 write buffer operates by collecting write data and addresses as store instructions are executed. The writes are held temporarily in a FIFO buffer. When the buffer becomes full or a context switch occurs, the buffered writes are drained by performing a burst write to memory.

The key characteristics of the Cortex-M4 write buffer are:

4-entry FIFO buffer
Writes up to 4 words per entry
Only buffers writes, not reads
Drains on buffer full or context switch

The buffer entries hold both the data to be written and the address it should be written to. Each entry can buffer up to 4 words (16 bytes) of data. The processor can continue executing instructions while the buffer collects writes.

When the buffer fills up, or when a context switch occurs, the processor stalls while the buffered writes are drained. The writes are drained by performing a burst write transaction to commit all the collected data to memory at once. This avoids having to commit each write separately.
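
The behaviour described above can be sketched as a small host-side model. This is a toy simulation of the buffer as this article describes it (entries queued in FIFO order, drained when full), not a description of the real hardware. The names `wb_model`, `wb_store`, `wb_drain` and the constant `WB_ENTRIES` are invented here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES 4u  /* capacity taken from the article's description (assumption) */

typedef struct {
    uint32_t addr[WB_ENTRIES];  /* byte address of each buffered write */
    uint32_t data[WB_ENTRIES];  /* word to be written */
    size_t   count;             /* entries currently buffered */
    unsigned drains;            /* number of drains performed so far */
} wb_model;

/* Commit every buffered entry to "memory" in FIFO order, then empty the buffer. */
static void wb_drain(wb_model *wb, uint32_t *mem, size_t mem_words)
{
    if (wb->count == 0) {
        return;
    }
    for (size_t i = 0; i < wb->count; i++) {
        size_t idx = wb->addr[i] / 4u;   /* word index in the modelled memory */
        if (idx < mem_words) {
            mem[idx] = wb->data[i];
        }
    }
    wb->count = 0;
    wb->drains++;
}

/* Queue one word store; drain as soon as the buffer becomes full. */
static void wb_store(wb_model *wb, uint32_t *mem, size_t mem_words,
                     uint32_t addr, uint32_t data)
{
    wb->addr[wb->count] = addr;
    wb->data[wb->count] = data;
    wb->count++;
    if (wb->count == WB_ENTRIES) {
        wb_drain(wb, mem, mem_words);
    }
}
```

In this model the first three stores leave memory untouched, and the fourth triggers a single drain that commits all four words in order.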

Write Buffer Enable Control

The Cortex-M4 write buffer is enabled by default out of reset. It can be disabled by setting the DISDEFWBUF bit (bit 1) in the Auxiliary Control Register (ACTLR, address 0xE000E008) and re-enabled by clearing it. Note the inverted sense: setting the bit disables the buffer.

    // Enable write buffer (reset default)
    ACTLR.DISDEFWBUF = 0
    // Disable write buffer
    ACTLR.DISDEFWBUF = 1

Disabling the write buffer forces the processor to complete each store before executing the next instruction. This reduces efficiency but ensures writes complete immediately, and it makes every BusFault caused by a store precise, which can help when debugging.

Enabling the write buffer provides better performance for bulk writes. But data remains buffered until the buffer drains, so individual writes may not become visible in memory until later.
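
In C, the control bit is usually reached through a pointer to the register. Arm's documentation abbreviates the register as ACTLR (address 0xE000E008) and names the bit DISDEFWBUF (bit 1), which disables the buffer when set. The sketch below takes the register pointer as a parameter so that it can also be exercised off-target; the function and macro names are illustrative, not part of any vendor header.

```c
#include <stdbool.h>
#include <stdint.h>

/* ACTLR address and DISDEFWBUF bit position as documented for the Cortex-M4. */
#define ACTLR_ADDR            0xE000E008u
#define ACTLR_DISDEFWBUF_MSK  (1u << 1)

/* Enable or disable the write buffer through a pointer to ACTLR.
 * On target this would be called as:
 *   write_buffer_enable((volatile uint32_t *)ACTLR_ADDR, false);
 */
static void write_buffer_enable(volatile uint32_t *actlr, bool enable)
{
    if (enable) {
        *actlr &= ~ACTLR_DISDEFWBUF_MSK;  /* bit clear: buffer in use */
    } else {
        *actlr |= ACTLR_DISDEFWBUF_MSK;   /* bit set: buffer disabled */
    }
}

/* Report whether the buffer is currently in use. */
static bool write_buffer_is_enabled(const volatile uint32_t *actlr)
{
    return (*actlr & ACTLR_DISDEFWBUF_MSK) == 0u;
}
```

Projects that use CMSIS would normally access the same register as `SCnSCB->ACTLR` rather than through a raw address.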

Write Buffer Draining

The Cortex-M4 write buffer drains and commits buffered writes to memory when:

The buffer becomes full
A context switch/exception occurs
Execution reaches a DSB (Data Synchronization Barrier) instruction
The processor goes to Sleep or Deep Sleep mode

Reaching any of these conditions will trigger the write buffer to drain. The processor stalls while the buffered writes are committed to memory using a burst transaction.

This ensures that the buffered data gets written to memory in a timely manner. The processor does not stall until necessary, minimizing impact on performance.

Buffer Full

The write buffer will drain automatically whenever it becomes full. This occurs when:

All 4 entries are valid
A single entry holds the maximum of 4 words (16 bytes)

Draining on full allows the buffer to collect as many writes as possible between drains. But data will not be left buffered indefinitely.

Context Switch

The processor drains the buffer on any context switch or exception entry. This ensures that all writes from the current context are visible before switching to a new context.

For example, the buffer will drain when:

Switching threads
Taking an IRQ, fault, or exception
Returning from an exception

Draining on context switches prevents data from being lost or corrupted across contexts.

DSB Instruction

Executing a DSB (Data Synchronization Barrier) instruction forces the write buffer to drain before execution continues. This allows software to force pending writes to complete if necessary. An ISB, by contrast, only flushes the processor pipeline and is not guaranteed to wait for outstanding writes.

For example:

    STR R1, [R2]    // Buffer write
    DSB             // Drain buffer
    LDR R3, [R4]    // Executes only after the store has completed

The DSB acts as a synchronization point to ensure the buffered store is committed before any subsequent instruction executes.
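
From C, the same pattern needs a barrier that the compiler will not reorder or remove. The sketch below uses DSB, the barrier the Arm architecture defines as waiting for outstanding memory accesses to complete. It is a minimal illustration assuming GCC or Clang; the macro `DATA_SYNC_BARRIER` and the function `store_and_wait` are names invented here, and the host fallback exists only so that the code can be compiled off-target.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Cortex-M3/M4 build: emit a real DSB and stop the compiler reordering. */
#define DATA_SYNC_BARRIER() __asm__ volatile ("dsb 0xF" ::: "memory")
#else
/* Host build: a full fence stands in for DSB (assumption, for testing only). */
#include <stdatomic.h>
#define DATA_SYNC_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Store a value and do not return until the barrier has completed. */
static void store_and_wait(volatile uint32_t *dst, uint32_t value)
{
    *dst = value;         /* may be held in the write buffer */
    DATA_SYNC_BARRIER();  /* wait for outstanding writes to finish */
}
```

Projects that use CMSIS would typically call the `__DSB()` intrinsic instead of writing the inline assembly by hand.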

Sleep Mode

The buffer will also drain when entering Sleep or Deep Sleep low power modes. This ensures memory consistency before suspending execution.
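
Many codebases also place an explicit DSB immediately before WFI, so that outstanding writes are known to be complete regardless of how the buffer behaves on a given device. A minimal sketch of that sequence follows, assuming GCC or Clang. The function name and macros are illustrative, and the register pointer is passed in so the logic can be checked off-target; SLEEPDEEP is bit 2 of the System Control Register at 0xE000ED10.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCR_ADDR           0xE000ED10u   /* System Control Register */
#define SCR_SLEEPDEEP_MSK  (1u << 2)

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Target build: complete outstanding writes, then wait for an interrupt. */
#define DSB_THEN_WFI() __asm__ volatile ("dsb 0xF\n\twfi" ::: "memory")
#else
/* Host build: nothing to wait for (assumption, for testing only). */
#define DSB_THEN_WFI() ((void)0)
#endif

/* Select Sleep or Deep Sleep, then enter it.
 * On target: enter_sleep((volatile uint32_t *)SCR_ADDR, true);
 */
static void enter_sleep(volatile uint32_t *scr, bool deep)
{
    if (deep) {
        *scr |= SCR_SLEEPDEEP_MSK;
    } else {
        *scr &= ~SCR_SLEEPDEEP_MSK;
    }
    DSB_THEN_WFI();
}
```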

Write Buffer Hazards

Because the write buffer stores data temporarily, it can cause hazards where later reads appear to execute before prior writes. There are two main hazards to be aware of:

Read after Write (RAW) Hazard
Write after Write (WAW) Hazard

Careful use of barrier instructions can avoid these issues by forcing completion of buffered writes.

Read after Write Hazard

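A RAW hazard occurs when a read appears to occur before a prior buffered write. For example:

    STR R1, [R2]    // Buffer write A
    LDR R3, [R2]    // Read of A

When the same processor performs both accesses, as here, the load always returns the value just stored, because the architecture requires a processor to observe its own accesses to an address in program order. The hazard arises when a different observer, such as a DMA controller or a peripheral, reads the location before the buffered write has reached memory. It can then read an old, stale value.

RAW hazards of this kind can be avoided by placing a DSB barrier between the write and whatever triggers the other observer's read:

    STR R1, [R2]    // Buffer write
    DSB             // Wait for the write to complete
    STR R5, [R6]    // Hypothetical trigger, e.g. a peripheral start register

The DSB drains the buffer to ensure the write completes first.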

Write after Write Hazard

A WAW hazard occurs when a buffered write is overwritten by a later write to the same address before it reaches memory. For example:

    STR R1, [R2]    // Buffer write A
    STR R3, [R2]    // Write B to the same address

Two stores to the same address always take effect in program order, so memory ends up holding the second value. For Normal memory, however, the architecture permits writes to be merged, so the first value may never appear in memory at all. This matters only if something else needs to observe the intermediate value.

If each write must reach memory individually, place a DSB between the writes:

    STR R1, [R2]    // Buffer write A
    DSB             // Drain buffer
    STR R3, [R2]    // Write B comes after A

The DSB drains the first write before the second write is issued.

Write Buffer Bypass

Not every write is treated the same way. Whether a write may be held in the write buffer depends on the memory type of the target address.

The memory types behave as follows:

Normal memory: writes can be buffered
Device memory: writes can be buffered, but stay in order relative to other Device and Strongly-ordered accesses
Strongly-ordered memory: writes are never buffered

In the default memory map, the peripheral region (0x40000000 to 0x5FFFFFFF) is Device memory and the Private Peripheral Bus (0xE0000000 to 0xE00FFFFF) is Strongly-ordered. Normal writes to RAM will still utilize the buffer.

Memory-Mapped Registers

Writes to registers of memory-mapped peripherals stay in program order with respect to each other, so the side effects of register writes occur in order. They are not necessarily committed immediately, however: a write to a peripheral in Device memory can still be in the buffer when the next instruction executes.

This matters most when clearing an interrupt flag at the end of a handler. If the handler returns before the write reaches the peripheral, the interrupt can be taken a second time. A DSB after the write avoids this.

Exclusive Access Instructions

The Cortex-M4 implements the ARMv7-M architecture, which has no Load-Acquire or Store-Release instructions. Its synchronization primitives are the exclusive access instructions LDREX and STREX.

LDREX and STREX do not bypass the write buffer on their own. Code that uses them to implement a lock or semaphore typically adds a DMB or DSB barrier where ordering against other memory accesses matters.

Strongly-Ordered Memory

Writes to Strongly-ordered memory regions bypass the buffer. Strongly-ordered access guarantees that each access completes in order, and the processor waits for the write to finish before continuing.

Buffering is not used, in order to meet the sequential access requirements of Strongly-ordered memory.

Device Memory

Writes to any memory region configured as Device memory, including external memory regions set up as Device memory, keep their order relative to other Device and Strongly-ordered accesses.

The Device attribute still allows these writes to be buffered, so a DSB is needed wherever software must know that a Device write has completed.
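
A widely used pattern on Cortex-M devices is to follow the write that clears an interrupt flag with a DSB before returning from the handler, so that the handler cannot be re-entered because the clear had not yet reached the peripheral. The sketch below illustrates this under stated assumptions: `demo_periph_t`, `DEMO_IRQ_FLAG` and `demo_irq_handler` are hypothetical, the register is assumed to be write-1-to-clear, and the host fallback for the barrier exists only for off-target compilation.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Target build: real DSB. */
#define PERIPH_WRITE_BARRIER() __asm__ volatile ("dsb 0xF" ::: "memory")
#else
/* Host build: full fence as a stand-in (assumption, for testing only). */
#include <stdatomic.h>
#define PERIPH_WRITE_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical peripheral with a write-1-to-clear status register. */
typedef struct {
    volatile uint32_t STATUS_CLEAR;
} demo_periph_t;

#define DEMO_IRQ_FLAG (1u << 0)

/* Body of an interrupt handler for the hypothetical peripheral. */
static void demo_irq_handler(demo_periph_t *p)
{
    /* ... service the peripheral here ... */

    p->STATUS_CLEAR = DEMO_IRQ_FLAG;  /* request that the flag be cleared */
    PERIPH_WRITE_BARRIER();           /* wait until the write has completed */
}
```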

Write Buffer and DMA

When using DMA, it’s important to understand the interactions with the write buffer. Key points include:

DMA writes bypass the CPU write buffer
DMA reads can occur before buffered writes are visible
Cache maintenance may be needed around DMA reads/writes if the device adds a system-level cache

These hazards require careful management of the buffer, and of any cache, when using DMA.

DMA Writes Bypass Buffer

DMA writes go directly to memory and do not pass through the CPU’s write buffer. The DMA controller has no knowledge of writes still held in that buffer.

For example, consider CPU writes followed by a DMA write:

    STR R1, [R2]          // CPU write buffered
    STR R3, [R4]          // CPU write buffered
    DMA_WRITE [R2], R5    // DMA write not buffered (pseudocode)

If the buffered CPU write to [R2] drains after the DMA write, it overwrites the DMA data. Draining the buffer with a DSB before starting the DMA transfer fixes the order of the two writes.

DMA Reads May See Old Data

DMA reads can occur before prior CPU writes have drained from the buffer, so stale data may be read.

For example:

    STR R1, [R2]          // CPU write buffered
    DMA_READ R5, [R2]     // May read old value! (pseudocode)

Placing a DSB before starting the DMA read avoids this issue by draining the buffer first.
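
In C, the ordering can be made explicit by filling the buffer, issuing a barrier, and only then writing the register that starts the transfer. The sketch below uses DSB, the barrier the Arm architecture defines as waiting for outstanding memory accesses to complete. The register layout `demo_dma_t` and the function `dma_send` are hypothetical, since real DMA controllers are device-specific, and the host fallback for the barrier is only there for off-target compilation.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
/* Target build: real DSB. */
#define DMA_BARRIER() __asm__ volatile ("dsb 0xF" ::: "memory")
#else
/* Host build: full fence as a stand-in (assumption, for testing only). */
#include <stdatomic.h>
#define DMA_BARRIER() atomic_thread_fence(memory_order_seq_cst)
#endif

/* Hypothetical DMA controller registers. */
typedef struct {
    volatile uint32_t SRC_ADDR;
    volatile uint32_t LENGTH;
    volatile uint32_t START;
} demo_dma_t;

/* Fill a buffer with a value and hand it to the DMA controller. */
static void dma_send(demo_dma_t *dma, uint32_t *buf, size_t words, uint32_t fill)
{
    for (size_t i = 0; i < words; i++) {
        buf[i] = fill;                 /* CPU writes, possibly buffered */
    }
    DMA_BARRIER();                     /* make sure the data is in memory */
    dma->SRC_ADDR = (uint32_t)(uintptr_t)buf;
    dma->LENGTH   = (uint32_t)words;
    dma->START    = 1u;                /* only now may the DMA read the buffer */
}
```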

Cache Maintenance

The Cortex-M4 core itself has no data cache, so cache maintenance around DMA reads/writes is needed only on devices that add a system-level cache, and then depends on the cache settings for the region.

For memory regions with caching enabled, a data synchronization barrier (DSB) and cache clean/invalidate operations may be required. This ensures cache and buffer contents are coherent with memory for DMA.

Refer to the Cortex-M4 TRM and the device reference manual for full details on DMA and cache maintenance requirements.

Write Buffer Delay

The Cortex-M4 write buffer introduces some delay in write completion. The total delay depends on:

Buffer depth
Bus arbitration time
Burst length

The buffer depth determines how long a write can be held before draining. Arbitration and burst length add delays when draining.

Buffer Depth

The 4-entry buffer allows writes to be delayed for multiple subsequent writes. A write could be held in the buffer for up to 3 additional writes before draining.

So the buffer alone can delay a write by the time to execute up to 3 additional store instructions.

Arbitration Time

When draining, the processor must arbitrate for bus access before writing data to memory. This arbitration delay depends on the bus loading from other bus masters.

Heavier bus utilization increases arbitration time, adding to the write delay.

Burst Length

Draining utilizes a burst write transaction. The length depends on buffer contents:

1-4 words for a single entry
Up to 16 words max for a full buffer

The burst length determines the memory access time required to complete the drain. Longer bursts require more time to fully commit to memory.

Write Buffer Performance

The Cortex-M4 write buffer improves performance for workloads involving sequential writes. Benefits include:

Reducing total number of write transactions
Converting short writes to longer bursts
Absorbing write latency by buffering

Exact performance gains depend on the application and memory system characteristics.

Reducing Transactions

Buffering writes reduces the total number of write transactions required. This saves overhead cycles required for each transaction.

For example, buffering 4 sequential word writes reduces 4 single transactions down to just 1 burst transaction. This saves on address setup and handshaking overhead.
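
The saving can be put into a simple cost model. The sketch below is illustrative arithmetic only: `overhead` (cycles of address setup and handshaking per transaction) and `per_word` (cycles per data word) are hypothetical parameters that depend on the memory system, and the function names are invented here.

```c
#include <stdint.h>

/* Cycles for n separate single-word writes: every write pays the overhead. */
static uint32_t cycles_single(uint32_t n, uint32_t overhead, uint32_t per_word)
{
    return n * (overhead + per_word);
}

/* Cycles for one burst of n words: the overhead is paid once. */
static uint32_t cycles_burst(uint32_t n, uint32_t overhead, uint32_t per_word)
{
    return (n == 0u) ? 0u : overhead + n * per_word;
}
```

Under this model the burst saves `(n - 1) * overhead` cycles, so the benefit grows with both the number of words grouped together and the per-transaction overhead of the bus.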

Burst Writes

The write buffer allows converting short sequential writes into a longer burst. Burst writes are more efficient than single writes.

Many memories can accept bursts much more quickly than single writes to sequential addresses.

Absorbing Latency

The write buffer allows the processor to absorb some amount of write latency by buffering rather than stalling. This helps mask slow memory speeds.

Buffering writes helps prevent write transactions on the bus from immediately blocking execution. The processor can continue while the buffer collects more writes.

Optimizing for Write Buffer

Software can be optimized to make best use of the write buffer. Some techniques include:

Group sequential writes together
Minimize context switches and DSB instructions
Tune burst lengths for memory system

Group Writes

The buffer works best when sequential writes are grouped together in blocks without intervening reads. This allows the buffer to fill completely before draining.

Code should be structured to batch writes together as much as possible.
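
As a rough illustration of batching, the two functions below compute the same result. The first alternates loads and stores; the second loads four inputs and then issues four stores back to back. This is a sketch with invented names, not a guaranteed optimization: an optimizing compiler may reorder either loop, and any gain depends on the memory system, so it is worth measuring on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved: each store is followed by a load for the next element. */
static void scale_interleaved(uint32_t *dst, const uint32_t *src,
                              size_t n, uint32_t k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}

/* Batched: load four inputs first, then store four results in a row.
 * dst and src must not overlap. */
static void scale_batched(uint32_t *dst, const uint32_t *src,
                          size_t n, uint32_t k)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t a = src[i] * k;
        uint32_t b = src[i + 1] * k;
        uint32_t c = src[i + 2] * k;
        uint32_t d = src[i + 3] * k;
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < n; i++) {
        dst[i] = src[i] * k;  /* remaining elements */
    }
}
```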

Minimize Drains

Context switches and DSB instructions force the buffer to drain, stalling the processor. Minimizing unnecessary context switches allows better write throughput.

Avoid using DSB instructions unless necessary to prevent buffer-related hazards.

Tune Burst Lengths

Larger burst writes achieve higher throughput on many memory systems. But over-long bursts can also waste bus cycles.

Tuning code to generate burst lengths that match the optimal performance profile of the target memory can help maximize efficiency.

Conclusion

The Cortex-M4 write buffer is an important performance optimization for workloads involving sequential writes. Proper usage can help improve throughput and mask write latencies.

However, the write buffer also introduces hazards that software must properly synchronize using barriers and cache maintenance operations. Failing to handle these hazards correctly can lead to memory ordering issues.

Understanding the detailed operation of the buffer helps software make best use of it while avoiding pitfalls. Careful buffer management allows software to extract maximum performance from the Cortex-M4 processor.