The ARM Cortex-M series of processor cores uses a modified Harvard architecture. This architecture separates instruction and data memories into distinct regions, while still allowing tight coupling between the two for flexibility. The separation provides several advantages in embedded systems, such as deterministic instruction fetch, higher performance, and a simpler memory subsystem design.
Overview of Harvard Architecture
The Harvard architecture is a computer architecture that separates the instruction and data memories into two distinct address spaces. This is in contrast to the Von Neumann architecture used in most modern computers, where instructions and data share the same memory space.
In a pure Harvard architecture, instruction and data memories are physically separate: instructions can be fetched only from the instruction memory, and data accesses are limited to the data memory. This strict separation completely isolates the instruction and data streams.
Some of the key advantages of the Harvard architecture include:
- Instruction fetch is deterministic since executable code cannot be modified
- Instruction and data accesses can occur concurrently without interference
- Instruction memory can be optimized for code density
- Data memory can use wide interfaces optimized for throughput
However, strict Harvard architecture also imposes some limitations. For example, constants and lookup tables need to be duplicated in both instruction and data memory. Self-modifying code is not possible since instructions cannot write to the instruction memory.
Modified Harvard Architecture
The modified Harvard architecture aims to combine the strengths of both approaches by relaxing the strict separation between instruction and data memories. Depending on the design, program memory contents can be read as data (useful for constants and lookup tables), and data memory contents can be executed as instructions (enabling self-modifying code).
In the modified Harvard architecture, instruction and data memories remain logically separated: fetches normally come from the instruction memory and stores target the data memory. However, load instructions can also read the contents of either memory.
This improves efficiency by eliminating the need to duplicate read-only data, such as constants, in both memories. It also enables new capabilities like self-modifying code, where a program writes new instructions into memory and executes them later. However, deterministic instruction fetch and fully concurrent operation are no longer guaranteed, since instruction and data traffic can now target the same memory.
ARM Cortex-M Architecture
The ARM Cortex-M series of processor cores implement a modified Harvard architecture optimized for embedded applications. The key architectural features include:
- Dedicated code and SRAM regions within a single 4 GB memory map
- Instruction fetches normally target the code region
- Mutable program data (variables, stack, heap) lives in the SRAM region
- Standard load instructions (LDR and variants) can read data from either region
- Separate instruction and data buses (I-Code and D-Code) on the larger variants such as Cortex-M3, M4, and M7
- Optional Memory Protection Unit (MPU) to control access and execute permissions per region
This provides software flexibility via the modified architecture while still retaining many of the performance benefits of strict Harvard. The logical separation is enforced by the processor hardware and does not rely on software conventions.
Instruction Memory
The instruction memory in Cortex-M processors is referred to as code memory. It contains the executable instructions and read-only constants used by the program. The code region occupies a fixed portion of the address map; the amount of physical flash or ROM behind it varies among devices, from tens of KB up to several MB.
Code memory is accessed via the Instruction-Bus (I-Bus), which is optimized for sequential instruction fetches and often includes features such as prefetch buffers that the D-Bus does not need.
To enable efficient instruction access, the code memory is usually implemented with high-density NOR flash, SRAM, or ROM in embedded systems. Slow memory types like NAND flash require special treatment to avoid stall cycles during instruction fetch.
Data Memory
The data memory in Cortex-M processors is referred to as SRAM memory. It holds global and local variables, stack space, heap memory, and other mutable program data. Supported SRAM sizes range from a few KB up to hundreds of KB on larger Cortex-M devices.
SRAM memory is accessed via the Data-Bus (D-Bus), which is optimized for data throughput. It supports features like burst transfers, wider data paths, and DMA access to enable high data bandwidth.
For data storage, fast SRAM chips are typically used owing to the random access patterns. Slower memories require caches to avoid stalls during data access. External memory like SDRAM can also be added via the D-Bus.
Memory Map
By default, Cortex-M processors use a fixed 4 GB memory map with dedicated regions for code (starting at address 0x00000000) and SRAM (starting at 0x20000000). This keeps instruction and data traffic naturally separated onto their respective buses.
The map is unified in the sense that loads and stores can address any region. The optional Memory Protection Unit (MPU) controls per-region access permissions, including the execute-never (XN) attribute that determines whether instructions may be fetched from a given SRAM region. This preserves security by restricting which SRAM regions are executable.
Harvard vs Modified Harvard in Cortex-M
The modified Harvard architecture in Cortex-M processors provides a good balance of flexibility and performance.
Compared to a pure Harvard implementation, the modifications introduce additional complexity in the memory subsystem and interfaces. Instruction fetch is no longer deterministic and needs to account for variable latency data memory access. The buses must now support both instruction and data traffic.
However, the benefits outweigh these costs for most embedded applications. Self-modifying code can be used for efficient table-based algorithms. Constants and lookup tables can be stored just once in SRAM avoiding duplicates. Performance is enhanced by eliminating unnecessary memory transfers.
Overall the architecture enables high code density, excellent performance and flexibility vital for the tight memory constraints and real-time requirements of embedded systems. It has proven very successful across the wide adoption of Cortex-M processors in IoT and edge devices.
Instruction Access to Data Memory
While Cortex-M processors keep instruction and data memories logically separate, instructions can directly access data memory contents using the standard load instructions.
A load reads the data memory location at the specified address and returns the value in a register, rather than fetching it for execution as an instruction.
Some examples include:
- LDR – Load a 32-bit word into a register
- LDRH – Load a 16-bit halfword, zero-extended, into a register
- LDRSB – Load an 8-bit byte, sign-extended, into a register
- LDRSH – Load a 16-bit halfword, sign-extended, into a register
This allows constants, tables and other read-only data to be stored just once in the data memory. The processor can access them as data via the D-Bus when required instead of needing dedicated copies in instruction memory.
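As a sketch in C (table names and values are hypothetical), each constant table below is stored once, and every access compiles to one of the load instructions listed above; which variant the compiler typically selects depends on the element type:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tables: each is stored once in memory; accesses compile
 * to ordinary loads. The instruction a compiler typically emits for
 * each element type is noted in the comments. */
static const uint32_t crc_seeds[4] = { 0x04C11DB7u, 0xEDB88320u,
                                       0x1EDC6F41u, 0x82F63B78u };
static const int16_t  sine_q15[4]  = { 0, 12539, 23170, 30273 };
static const int8_t   deltas[4]    = { -2, -1, 1, 2 };

uint32_t get_seed(int i)  { return crc_seeds[i]; } /* typically LDR   (32-bit word)        */
int32_t  get_sine(int i)  { return sine_q15[i];  } /* typically LDRSH (sign-extended 16)   */
int32_t  get_delta(int i) { return deltas[i];    } /* typically LDRSB (sign-extended 8)    */
```

The sign-extending variants matter here: returning `deltas[0]` through a plain unsigned byte load would yield 254 rather than -2.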
Access to mutable data memory also enables self-modifying code. The program can modify data memory contents and then execute them as instructions later. This allows efficient table-based algorithms for operations like trigonometric, logarithmic and floating point functions.
Implications
Granting instruction access to data memory has the following implications:
- Data memory reads require coordination between I and D buses impacting performance
- Timing-sensitive instruction fetches can no longer assume constant latency
- Bus arbitration and interconnects require added logic to handle concurrent I and D transfers
- Modified Harvard architecture loses some determinism advantages compared to pure Harvard
However, for most embedded applications the benefits outweigh these effects. The Cortex-M architecture includes features to manage the coordinated data access efficiently.
Constant Data in Cortex-M
Constant data like strings, tables and other read-only data are extensively used in embedded programs. The modified Harvard architecture in Cortex-M enables efficient storage for such constants.
Constant data can be located either in code memory or SRAM. Placing it in code memory avoids data-memory traffic but consumes instruction space. Locating it in SRAM requires loads from data memory.
Code Memory Constants
Read-only constants can be stored directly in the instruction memory alongside code:
- Efficient when small amounts of constant data are required
- Read access is simple and fast as the constant has an instruction address
- No SRAM access is required eliminating contention for data bus
- Wastes limited instruction memory reducing program size capacity
Typical usage includes small tables, enum values, and short strings. The compiler performs constant pooling, storing identical constants only once.
SRAM Constants
Large read-only data structures are generally located in data SRAM:
- Avoids bloating instruction memory size
- Requires data memory access on constant read increasing latency
- Can introduce D-Bus contention with instruction fetches
This approach is used for sizable tables, long strings, and other large constant arrays. Load instructions fetch the constants from SRAM when needed.
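A common embedded pattern, sketched below with illustrative names and values, keeps the master copy of a table in flash (a file-scope `const` array lands there on a typical Cortex-M build) and copies it into an SRAM buffer at startup so that repeated lookups hit fast data memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Master copy: on a real Cortex-M build the toolchain places this
 * const array in flash (the code region). Values are illustrative. */
static const uint16_t filter_taps_flash[8] = {
    120, 480, 1090, 1520, 1520, 1090, 480, 120
};

/* Working copy in SRAM for low-latency lookups. */
static uint16_t filter_taps_ram[8];

void taps_init(void)
{
    /* Copy once at startup, e.g. early in main(). */
    memcpy(filter_taps_ram, filter_taps_flash, sizeof filter_taps_ram);
}

uint16_t tap(int i) { return filter_taps_ram[i]; }
```

This trades a small amount of SRAM for faster, contention-free access on devices where flash reads incur wait states.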
Implications
The modified Harvard architecture enables both code and SRAM memory to store constants efficiently. This provides software great flexibility to optimize constant data placement for performance and memory utilization in embedded designs.
Self-Modifying Code
Self-modifying code refers to instructions in a program that can modify other instructions stored in memory at runtime, and execute those modified instructions later.
Self-modifying code is not possible in pure Harvard architectures since instruction memory is read-only. However, the modified Harvard architecture in Cortex-M processors permits self-modifying code by allowing instructions to write and read data memory.
Uses of Self-Modifying Code
Some common uses of self-modifying code include:
- Lookup tables – Values encoded as instructions can be modified based on context
- Function pointers – Change target address stored in instruction
- Patching code – Fixes and updates can modify original instructions
- Compression – Decompress instructions on the fly by modifying decoded instructions
- Obfuscation – Security technique to decrypt scrambled instructions at runtime
This technique has been used to implement table-driven algorithms where the tables are encoded as modifiable instructions rather than data. Example applications include trigonometric, logarithmic, and floating-point functions.
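A safer data-side analogue, sketched here in C with hypothetical names, achieves much of the same flexibility without rewriting instructions: a dispatch table in mutable SRAM is patched at runtime, and the instruction stream itself never changes:

```c
#include <assert.h>
#include <stdint.h>

static int32_t op_add(int32_t a, int32_t b) { return a + b; }
static int32_t op_sub(int32_t a, int32_t b) { return a - b; }
static int32_t op_mul(int32_t a, int32_t b) { return a * b; }

/* The dispatch table lives in mutable data memory (SRAM), so the
 * program can re-point entries at runtime instead of modifying code. */
static int32_t (*ops[2])(int32_t, int32_t) = { op_add, op_sub };

void patch_op(int slot, int32_t (*fn)(int32_t, int32_t)) { ops[slot] = fn; }

int32_t dispatch(int slot, int32_t a, int32_t b) { return ops[slot](a, b); }
```

Because only data changes, no pipeline or cache maintenance is required, sidestepping most of the challenges listed below.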
Challenges
Key challenges with self-modifying code on Cortex-M include:
- Requires pipeline flushes and cache maintenance (e.g., ISB/DSB barriers) after modifying instructions
- Makes timing analysis difficult due to variable latency instruction fetch
- Imposes software challenges for concurrency and reentrancy
- Can complicate testing and debug due to non-determinism
As a result, self-modifying code is best utilized for niche applications rather than pervasively in Cortex-M programs.
Performance Optimization
The modified Harvard architecture in Cortex-M enables several performance optimization techniques by allowing instruction access to data memory.
Lookup Table Optimization
Lookup tables can be stored in data memory and accessed with ordinary load instructions, avoiding a duplicate copy in instruction memory.
Example:
LDR R1, =Table // R1 = memory address of Table
LDR R2, [R1, #offset] // R2 = value loaded from Table
Benefits include:
- Saves code size by storing tables in data memory
- Preserves instruction-bus bandwidth, since lookups travel over the data bus
- Enables runtime updates, since table values in data memory can be modified at any time
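The assembly example above corresponds roughly to the following C sketch (the table name and contents are hypothetical): the compiler materializes the table's address, then issues an offset load, matching the LDR pair shown:

```c
#include <assert.h>
#include <stdint.h>

static const uint32_t Table[4] = { 10, 20, 30, 40 };

uint32_t table_read(int index)
{
    /* Conceptually compiles to:
     *   LDR R1, =Table        ; R1 = address of Table
     *   LDR R2, [R1, #offset] ; R2 = element at the given offset
     */
    return Table[index];
}
```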
Instruction Stream Compression
Self-modifying code can be used to decompress instructions on the fly saving code size:
- Store compressed instructions in memory
- Decompress into RAM at runtime and execute from there
- Saves storage for instruction memory
- Costs extra CPU time to perform the decompression
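The first two steps can be sketched as below, assuming a simple run-length scheme (the compression format, buffer sizing, and the final step of branching into the decompressed buffer after cache/pipeline synchronization are all illustrative assumptions, not a specific Cortex-M facility):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand (count, byte) pairs into a RAM buffer. On a real target the
 * output buffer would sit in an executable SRAM region, and the caller
 * would branch into it only after synchronizing caches and pipeline. */
size_t rle_expand(const uint8_t *in, size_t in_len,
                  uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        for (uint8_t k = 0; k < in[i]; ++k) {
            if (n == out_cap) return n;  /* output buffer full */
            out[n++] = in[i + 1];
        }
    }
    return n;
}
```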
Instruction Cache Optimized Fetch
Program locality can be improved by rearranging instruction order to optimize instruction cache performance. Useful for loops and branches.
Example:
LDR R0, =LoopInsn // Load address of LoopInsn block
BX R0 // Branch to LoopInsn
LoopInsn:
... // Loop body instructions
Benefits:
- Arranges instructions for optimal caching
- Avoids conflict misses by spacing out loops and branches
- Hides fetch latency via instruction interleaving
Memory Management
The Cortex-M memory architecture requires careful management of the instruction and data memory spaces for optimal performance.
Split Memory Maps
By default, code and SRAM occupy separate, fixed regions of the Cortex-M address map. There is no address translation hardware (no MMU): each region maps directly onto the physical memory the SoC designer attaches to the corresponding bus.
Benefits:
- Software sees one consistent, flat address space
- Accesses are routed to the appropriate bus purely by address
- Access timing is predictable, with no translation overhead
Memory Protection Unit
The Memory Protection Unit (MPU) provides optional control over how regions of the memory map may be used:
- Defines protection regions with per-region access permissions
- The execute-never (XN) attribute controls whether instructions may be fetched from a region such as SRAM
- Preserves security by restricting which SRAM regions are executable
Useful for:
- Self-modifying code execution
- Accessing constant data in SRAM memory
- Shared memory communication between threads
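As a hedged sketch of the ARMv7-M region attribute encoding (field positions follow the ARMv7-M RASR register; the macro names are my own, and on real hardware the value would be written to the MPU's RASR register after programming RBAR, e.g. via CMSIS helpers — verify against the ARMv7-M Architecture Reference Manual before use):

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU RASR bit fields (macro names are illustrative). */
#define RASR_ENABLE     (1u << 0)
#define RASR_SIZE(bits) ((uint32_t)((bits) - 1) << 1) /* region = 2^bits bytes */
#define RASR_AP_FULL    (3u << 24)                    /* privileged + user RW  */
#define RASR_XN         (1u << 28)                    /* execute-never         */

/* Attribute word for an executable 32 KB SRAM region: XN left clear
 * so the core may fetch instructions from it. */
uint32_t exec_sram_rasr(void)
{
    return RASR_ENABLE | RASR_SIZE(15) | RASR_AP_FULL; /* 2^15 = 32 KB */
}
```

Setting RASR_XN instead would make the same region data-only, which is the usual hardening choice for SRAM that never holds code.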
Caching and Prefetching
Caches and prefetch buffers help overcome the latency gap between processor and memory:
- Exploit locality and sequential access patterns
- Instruction caches reduce average instruction fetch time
- Prefetch buffers hide stall cycles on instruction misses
These techniques are essential to leverage full processor throughput, especially when slower memories are used.
ARMv6-M and ARMv7-M Architectures
Classic ARM Cortex-M processors implement one of two architecture variants, ARMv6-M or ARMv7-M (newer cores implement ARMv8-M). They provide different capabilities:
ARMv6-M Architecture
- Minimalist architecture optimized for low-cost MCUs
- Very compact instruction set: mostly 16-bit Thumb encodings, with only a handful of 32-bit instructions
- No DSP or floating-point extensions
- Ideal for simple embedded applications like sensors and IoT nodes
ARMv7-M Architecture
- Enhanced architecture with optional DSP and floating-point support
- Full Thumb-2 instruction set, a superset of ARMv6-M
- The ARMv7E-M variant adds DSP extensions for SIMD operations
- Optional FPU: single precision on Cortex-M4, single and double precision on Cortex-M7
- Suited for industrial control, motor drives, automation, etc.
Cortex-M Processor Variants
ARM offers a wide range of Cortex-M processor variants targeting diverse performance points:
Ultra-low Power Variants
- Cortex-M0+ – 32-bit CPU optimized for lowest cost and power
- Cortex-M1 – Variant of the Cortex-M0 generation optimized for FPGA implementation
- Cortex-M23 – ARMv8-M baseline successor to the M0+, adding TrustZone security
- Cortex-M33 – ARMv8-M mainline core with TrustZone and optional DSP/FPU, closer to the Cortex-M4 class
These target extremely cost- and power-sensitive applications such as sensors and wearables. Their very compact code footprint makes them suitable even as 8/16-bit replacements.
Mainstream Low-power Variants
- Cortex-M3 – Older mainstream 32-bit variant
- Cortex-M4 – Mainstream core adding DSP extensions and an optional single-precision FPU
- Cortex-M7 – Highest-performance Cortex-M core, with caches and an optional double-precision FPU, for demanding real-time applications
Balances power efficiency with high performance. Widely used in embedded IoT, industrial, consumer and automotive applications.
Real-Time Variants (Cortex-R Series)
- Cortex-R4 and Cortex-R5 – Real-time processors for safety-critical systems
- Cortex-R4F and Cortex-R5F – R4/R5 with hardware floating-point units
- Cortex-R52 – ARMv8-R processor designed for functional safety
Designed for the reliability and functional-safety requirements of motor control, industrial transport, robotics, etc. Dual-core lock-step configurations support safety integrity levels up to SIL 3.
Conclusion
The modified Harvard architecture adopted by Cortex-M processors provides an optimized balance of performance and flexibility for embedded systems. Dedicated code and SRAM regions keep instruction and data traffic largely independent, while the relaxed separation permits single-copy constants and, where needed, self-modifying code. Features like the MPU build on this foundation to enable customizable system implementations.
ARM has leveraged this memory architecture across an extensive range of Cortex-M processor variants from ultra low-power to high-performance. Their widespread adoption in IoT, industrial, automotive and consumer applications is a testament to the efficiency and versatility of the underlying modified Harvard architecture.