The Cortex-M3 CPU implements an instruction prefetch unit and branch prediction unit to improve performance by reducing stalls due to instruction fetches. This allows the CPU to execute instructions more efficiently by prefetching instructions before they are needed and predicting the outcomes of branches to avoid pipeline stalls.
Instruction Prefetch Unit
The instruction prefetch unit fetches instructions from memory before they are required for execution. It contains a 16-byte instruction prefetch buffer that can hold up to 4 instructions. The prefetch unit continually fetches instructions sequentially from memory to fill this buffer. This reduces stalls because the CPU can access the prefetched instructions immediately instead of waiting on a memory fetch.
Whenever the CPU needs an instruction that is not present in the prefetch buffer, it triggers the prefetch unit to fetch the next set of instructions from memory while the current instruction is being decoded and executed. By the time the current instruction finishes executing, the next instructions are readily available in the prefetch buffer.
The instruction prefetch unit fetches code from the memory system in 16-byte blocks. It first fetches the 16-byte block containing the current PC. Once that fetch completes, it fetches the next contiguous 16-byte block, and so on for as long as sequential execution continues. It can thus prefetch up to three 16-byte blocks ahead of the current PC.
The instruction prefetch buffer is filled on 16-byte boundaries regardless of instruction boundaries. This scheme maximizes bus utilization for instruction fetches and matches the CPU's 16-byte memory access granularity.
Prefetch Buffer
The 16-byte prefetch buffer holds prefetched instructions. It is filled from memory 16 bytes at a time. The buffer comprises 4 entries, each holding 4 bytes. It operates as a circular buffer with a read pointer and write pointer.
Whenever the CPU reads an instruction for execution, the read pointer advances. Whenever the prefetch unit brings in a new 16-byte block from memory, the write pointer advances. The pointers wrap around at the end of the buffer.
If the read pointer catches up to the write pointer, the buffer is empty. This causes a stall while the prefetch unit fetches the next 16-byte block from memory. As long as instructions are consumed more slowly than they are prefetched, the buffer stays filled and stalls are avoided.
The optimal prefetch distance ahead of the current PC is determined dynamically based on code behavior. When branches are taken or exceptions occur, the prefetched instructions are flushed and prefetch restarts from the new PC location.
Enabling and Disabling
The instruction prefetch unit is enabled by default out of reset. Software can disable and re-enable it by setting the appropriate bits in the Auxiliary Control Register. Disabling prefetch may save some power in an idle system.
When prefetch is disabled, the processor stalls on each instruction fetch from memory, so performance is optimal only with prefetching enabled.
Branch Prediction
The Cortex-M3 implements static branch prediction to avoid pipeline stalls due to branches. Any time the CPU encounters a conditional branch instruction, it must decide whether the branch will be taken or not taken. Without branch prediction, the CPU would have to wait until the branch condition is evaluated before it knows which instruction to fetch next. This leads to a one-cycle stall for every branch.
With branch prediction, the CPU uses a fixed algorithm to predict if branches are taken or not taken. It assumes the prediction is correct and starts speculatively executing ahead. This avoids the stall if the prediction was right. Branch mispredictions incur a penalty when the CPU has to flush the pipeline and correct the speculative execution.
Prediction Algorithm
The Cortex-M3 uses a simple rule for static branch prediction. Backward branches are predicted as taken and forward branches as not taken. A backward branch is one where the target address is lower than the current PC. This predicts that loops will iterate and functions will return. Forward branches are predicted as not taken, meaning execution will continue sequentially. This minimizes prefetch buffer flushing on forward branches.
To determine the branch target address and direction, the CPU fully decodes the branch instruction through the decode stage. This decoded branch information feeds into the branch prediction unit.
The branch predictor assumes branches behave consistently. It does not detect or adapt to changing branch behavior over time. More advanced processors implement dynamic branch predictors that can learn and adapt to program behavior.
Updating PC
Based on the branch prediction, the predicted target address is fed to the instruction fetch unit to become the next PC. Fetches start occurring from the predicted address even before the branch instruction executes.
If later the prediction turns out wrong, the pipeline is flushed and the correct target address is used to restart fetching and execution. All the speculative instructions after the mispredicted branch are discarded.
Prediction Performance
The simple static branch predictor provides substantial performance gains over no prediction, because correctly predicted taken branches incur no stall. Since most backward branches are loop branches that are taken, they are predicted correctly and stalls are avoided.
The mispredict penalty is only 1 cycle, since the branch target address is calculated in the decode stage. Even with occasional mispredictions, this branch architecture avoids most stalls and reduces wasted instruction fetches.
In the Cortex-M3 pipeline, branch instructions can execute in 1 cycle with no stalls. This single cycle branch handling boosts real-time interrupt performance.
Interaction Between Prefetch and Prediction
The prefetch unit and branch predictor work together to keep the pipeline filled and executing efficiently.
When a branch is predicted taken, the target address is given to the prefetch unit which starts fetching from there. This avoids stalls following taken branches since target instructions get prefetched.
On a not taken prediction, sequential prefetching continues without interruption. Prefetched instructions may get flushed on a misprediction.
When the branch outcome is resolved, the PC is updated with the correct target address. Any instructions fetched down the wrong path due to a misprediction are discarded from the pipeline stages and the prefetch buffer.
Memory System Interface
The instruction prefetch unit interfaces with the memory system to fetch code. It is connected as a bus master to the Memory Interface Unit (MIU) which acts as the bus slave. The MIU handles all interface details with the Instruction Tightly Coupled Memory (ITCM) and Instruction bus.
The prefetch unit sends out 16-byte aligned fetch requests to the MIU. If the ITCM can service the request in a single cycle, the instructions get returned to the prefetch buffer without any stalls. This allows zero wait state instruction fetches when code resides in ITCM for maximum performance.
For external memory fetches, the MIU handles breaking up the request into transactions compliant with the external bus protocol. Wait states get inserted while fetching from slower memories. The prefetch unit and pipeline stall until the fetch request is serviced and data returned.
Bus Interface
The instruction interface connects to the system bus through the Memory Interface Unit. All bus protocol details are handled in the MIU hardware.
The Instruction bus is 32 bits wide. Instructions are fetched as word transfers using the non-sequential transfer type. Non-sequential transfers arbitrate for the bus on each transfer, allowing other bus masters fair access.
The Instruction interface bus uses the AHB-Lite protocol, a subset of the full AMBA AHB protocol optimized for the microcontroller environment.
Memory Regions
Up to 8 different memory regions can be present in the system for instruction accesses. The Memory Protection Unit (MPU) contains registers defining the memory regions. The MPU performs address checking and access permission checking for each instruction fetch.
If an instruction fetch violates the configured memory protection, a Memory Management fault occurs. This causes exception entry, allowing the access violation to be handled.
Reading PC
The PC value contains the address of the current instruction being executed. Reading the PC returns an address value pointing to the instruction in the Execute stage pipeline register.
In ARM documentation, this PC value is referred to as the Execute PC, or PC(Exe). The value returned can differ from the Fetch PC held in the instruction fetch pipeline stage, for example when the read occurs during a taken branch.
PC Synchronization
The pipeline includes synchronizing logic to handle PC updates correctly during taken branches and exceptions, ensuring that the value read from PC is always consistent with the current instruction state.
For regular sequential execution, the PC matches the Fetch and Execute stages. When branches occur, PC is updated from the target address. For exceptions, the vector address becomes the next PC.
In all cases, PC reads by instructions or via register access return the correct synchronized value corresponding to the instruction state.
Without this sync mechanism, PC could read incorrectly during taken branches or exceptions as prefetch causes the Fetch PC to diverge from the executing instruction stream.
Effects on Timing
Enabling the instruction prefetch unit and branch prediction has several positive effects on performance:
- Branches can be handled in a single cycle, avoiding stalls; this alone can improve performance by over 20%
- Loop iteration performance is boosted by steady delivery of instructions
- Delays due to slow instruction memory accesses are reduced
- Prefetch efficiency over slower memory improves code execution speed
- Overall CPI is reduced enhancing all code execution
The simplifications in the Cortex-M3 pipeline allow very efficient branch handling. This enables short interrupt service routines to execute quickly.
However, instruction prefetching does have some negative effects:
- Power consumption is increased due to continual instruction fetches
- Memory bus traffic increases due to speculative prefetching
- More firmware testing is required for prefetch-induced errata
- Code that branches frequently near prefetch-block boundaries may suffer extra stalls
So while prefetching improves average execution speed, it can marginally degrade worst-case interrupt latency and overall determinism. The CPU architecture allows prefetch to be disabled when lowest latency or power is critical.
Overall, the Cortex-M3 architecture shows how even simple prefetch and branch prediction techniques can substantially boost performance for embedded microcontrollers.
Conclusion
The Cortex-M3 implements an instruction prefetch unit and static branch prediction to improve real-world performance in microcontroller applications. Prefetching helps deliver instructions quickly to the pipeline to avoid stalls. Branch prediction avoids pipeline flushes on taken branches. Together these mechanisms reduce wasted cycles, improve bus utilization efficiency, and speed up code execution on average.
The prefetch unit fetches up to three 16-byte blocks ahead of the current PC. It decouples instruction access from use to smooth out delays. The branch predictor assumes backward branches are taken, allowing prefetch to continue. Forward branches are assumed not taken to avoid flushing useful prefetched instructions.
The simple static predictor avoids most stalls providing substantial gains for small silicon area and power cost. More advanced dynamic prediction techniques could boost performance further still. The Cortex-M3 architecture shows how even basic prefetch and prediction can deliver excellent results for deeply embedded systems.