The ARM Cortex-M series of processor cores uses a modified Harvard architecture. This architecture separates instruction and data memories into distinct regions, while still allowing tight coupling between the two for flexibility. The separation provides several advantages in embedded systems, such as deterministic instruction fetch, higher performance, and a simpler memory subsystem design.
Overview of Harvard Architecture
The Harvard architecture is a computer architecture that separates the instruction and data memories into two distinct address spaces. This is in contrast to the Von Neumann architecture used in most modern computers, where instructions and data share the same memory space.
In a pure Harvard architecture, instruction and data memories are physically separate: instructions can be fetched only from the instruction memory, and data accesses are limited to the data memory. This strict separation completely isolates the instruction and data streams.
Some of the key advantages of the Harvard architecture include:
- Instruction fetch is deterministic since executable code cannot be modified
- Instruction and data accesses can occur concurrently without interference
- Instruction memory can be optimized for code density
- Data memory can use wide interfaces optimized for throughput
However, strict Harvard architecture also imposes some limitations. For example, constants and lookup tables need to be duplicated in both instruction and data memory. Self-modifying code is not possible since instructions cannot write to the instruction memory.
Modified Harvard Architecture
The modified Harvard architecture aims to combine the strengths of both approaches by relaxing the strict separation between instruction and data memories. Depending on the design, program memory contents can be read as data (useful for constants and lookup tables), and data memory contents can be executed as instructions (enabling self-modifying code).
In the modified Harvard architecture, instruction and data memories remain logically separated: fetches normally come from the instruction memory and stores target the data memory. However, load instructions can also read the contents of either memory.
This improves efficiency by eliminating the need to duplicate read-only data, such as constants, in both memories. It also enables new capabilities like self-modifying code, where a program writes new instructions into memory and executes them later. However, deterministic instruction fetch and fully concurrent operation are no longer guaranteed, since instruction and data traffic can now target the same memory.
ARM Cortex-M Architecture
The ARM Cortex-M series of processor cores implement a modified Harvard architecture optimized for embedded applications. The key architectural features include:
- Dedicated code and SRAM regions within a single 4 GB memory map
- Instruction fetches normally target the code region
- Mutable program data (variables, stack, heap) lives in the SRAM region
- Standard load instructions (LDR and variants) can read data from either region
- Separate instruction and data buses (I-Code and D-Code) on the larger variants such as Cortex-M3, M4, and M7
- Optional Memory Protection Unit (MPU) to control access and execute permissions per region
This provides software flexibility via the modified architecture while still retaining many of the performance benefits of strict Harvard. The logical separation is enforced by the processor hardware and does not rely on software conventions.
Instruction Memory
The instruction memory in Cortex-M processors is referred to as code memory. It contains the executable instructions and read-only constants used by the program. The code region occupies a fixed portion of the address map; the amount of physical flash or ROM behind it varies among devices, from tens of KB up to several MB.
Code memory is accessed via the Instruction-Bus (I-Bus), which is optimized for sequential instruction fetches and often includes features such as prefetch buffers that the D-Bus does not need.
To enable efficient instruction access, the code memory is usually implemented with high-density NOR flash, SRAM, or ROM in embedded systems. Slow memory types like NAND flash require special treatment to avoid stall cycles during instruction fetch.
Data Memory
The data memory in Cortex-M processors is referred to as SRAM memory. It holds global and local variables, stack space, heap memory, and other mutable program data. Supported SRAM sizes range from a few KB up to hundreds of KB on larger Cortex-M devices.
SRAM memory is accessed via the Data-Bus (D-Bus), which is optimized for data throughput. It supports features like burst transfers, wider data paths, and DMA access to enable high data bandwidth.
For data storage, fast SRAM chips are typically used owing to the random access patterns. Slower memories require caches to avoid stalls during data access. External memory like SDRAM can also be added via the D-Bus.
Memory Map
By default, Cortex-M processors use a fixed 4 GB memory map with dedicated regions for code (starting at address 0x00000000) and SRAM (starting at 0x20000000). This keeps instruction and data traffic naturally separated onto their respective buses.
The map is unified in the sense that loads and stores can address any region. The optional Memory Protection Unit (MPU) controls per-region access permissions, including the execute-never (XN) attribute that determines whether instructions may be fetched from a given SRAM region. This preserves security by restricting which SRAM regions are executable.
Harvard vs Modified Harvard in Cortex-M
The modified Harvard architecture in Cortex-M processors provides a good balance of flexibility and performance.
Compared to a pure Harvard implementation, the modifications introduce additional complexity in the memory subsystem and interfaces. Instruction fetch is no longer deterministic and needs to account for variable latency data memory access. The buses must now support both instruction and data traffic.
However, the benefits outweigh these costs for most embedded applications. Self-modifying code can be used for efficient table-based algorithms. Constants and lookup tables can be stored just once in SRAM avoiding duplicates. Performance is enhanced by eliminating unnecessary memory transfers.
Overall the architecture enables high code density, excellent performance and flexibility vital for the tight memory constraints and real-time requirements of embedded systems. It has proven very successful across the wide adoption of Cortex-M processors in IoT and edge devices.
Instruction Access to Data Memory
While Cortex-M processors keep instruction and data memories logically separate, instructions can directly access data memory contents using the standard load instructions.
A load reads the data memory location at the specified address and returns the value in a register, rather than fetching it for execution as an instruction.
Some examples include:
- LDR – Load a 32-bit word into a register
- LDRH – Load a 16-bit halfword, zero-extended, into a register
- LDRSB – Load an 8-bit byte, sign-extended, into a register
- LDRSH – Load a 16-bit halfword, sign-extended, into a register
This allows constants, tables and other read-only data to be stored just once in the data memory. The processor can access them as data via the D-Bus when required instead of needing dedicated copies in instruction memory.
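As a sketch in C (table names and values are hypothetical), each constant table below is stored once, and every access compiles to one of the load instructions listed above; which variant the compiler typically selects depends on the element type:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tables: each is stored once in memory; accesses compile
 * to ordinary loads. The instruction a compiler typically emits for
 * each element type is noted in the comments. */
static const uint32_t crc_seeds[4] = { 0x04C11DB7u, 0xEDB88320u,
                                       0x1EDC6F41u, 0x82F63B78u };
static const int16_t  sine_q15[4]  = { 0, 12539, 23170, 30273 };
static const int8_t   deltas[4]    = { -2, -1, 1, 2 };

uint32_t get_seed(int i)  { return crc_seeds[i]; } /* typically LDR   (32-bit word)        */
int32_t  get_sine(int i)  { return sine_q15[i];  } /* typically LDRSH (sign-extended 16)   */
int32_t  get_delta(int i) { return deltas[i];    } /* typically LDRSB (sign-extended 8)    */
```

The sign-extending variants matter here: returning `deltas[0]` through a plain unsigned byte load would yield 254 rather than -2.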
Access to mutable data memory also enables self-modifying code. The program can modify data memory contents and then execute them as instructions later. This allows efficient table-based algorithms for operations like trigonometric, logarithmic and floating point functions.
Implications
Granting instruction access to data memory has the following implications:
- Data memory reads require coordination between I and D buses impacting performance
- Timing-sensitive instruction fetches can no longer assume constant latency
- Bus arbitration and interconnects require added logic to handle concurrent I and D transfers
- Modified Harvard architecture loses some determinism advantages compared to pure Harvard
However, for most embedded applications the benefits outweigh these effects. The Cortex-M architecture includes features to manage the coordinated data access efficiently.
Constant Data in Cortex-M
Constant data like strings, tables and other read-only data are extensively used in embedded programs. The modified Harvard architecture in Cortex-M enables efficient storage for such constants.
Constant data can be located either in code memory or SRAM. Placing it in code memory avoids data-memory traffic but consumes instruction space. Locating it in SRAM requires loads from data memory.
Code Memory Constants
Read-only constants can be stored directly in the instruction memory alongside code:
- Efficient when small amounts of constant data are required
- Read access is simple and fast as the constant has an instruction address
- No SRAM access is required eliminating contention for data bus
- Wastes limited instruction memory reducing program size capacity
Typical usage includes small tables, enum values, and short strings. The compiler performs constant pooling, storing identical constants only once.
SRAM Constants
Large read-only data structures are generally located in data SRAM:
- Avoids bloating instruction memory size
- Requires data memory access on constant read increasing latency
- Can introduce D-Bus contention with instruction fetches
This approach is used for sizable tables, long strings, and other large constant arrays. Load instructions fetch the constants from SRAM when needed.
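A common embedded pattern, sketched below with illustrative names and values, keeps the master copy of a table in flash (a file-scope `const` array lands there on a typical Cortex-M build) and copies it into an SRAM buffer at startup so that repeated lookups hit fast data memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Master copy: on a real Cortex-M build the toolchain places this
 * const array in flash (the code region). Values are illustrative. */
static const uint16_t filter_taps_flash[8] = {
    120, 480, 1090, 1520, 1520, 1090, 480, 120
};

/* Working copy in SRAM for low-latency lookups. */
static uint16_t filter_taps_ram[8];

void taps_init(void)
{
    /* Copy once at startup, e.g. early in main(). */
    memcpy(filter_taps_ram, filter_taps_flash, sizeof filter_taps_ram);
}

uint16_t tap(int i) { return filter_taps_ram[i]; }
```

This trades a small amount of SRAM for faster, contention-free access on devices where flash reads incur wait states.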
Implications
The modified Harvard architecture enables both code and SRAM memory to store constants efficiently. This provides software great flexibility to optimize constant data placement for performance and memory utilization in embedded designs.
Self-Modifying Code
Self-modifying code refers to instructions in a program that can modify other instructions stored in memory at runtime, and execute those modified instructions later.
Self-modifying code is not possible in pure Harvard architectures since instruction memory is read-only. However, the modified Harvard architecture in Cortex-M processors permits self-modifying code by allowing instructions to write and read data memory.
Uses of Self-Modifying Code
Some common uses of self-modifying code include:
- Lookup tables – Values encoded as instructions can be modified based on context
- Function pointers – Change target address stored in instruction
- Patching code – Fixes and updates can modify original instructions
- Compression – Decompress instructions on the fly by modifying decoded instructions
- Obfuscation – Security technique to decrypt scrambled instructions at runtime
This technique has been used to implement table-driven algorithms where the tables are encoded as modifiable instructions rather than data. Example applications include trigonometric, logarithmic, and floating-point functions.
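A safer data-side analogue, sketched here in C with hypothetical names, achieves much of the same flexibility without rewriting instructions: a dispatch table in mutable SRAM is patched at runtime, and the instruction stream itself never changes:

```c
#include <assert.h>
#include <stdint.h>

static int32_t op_add(int32_t a, int32_t b) { return a + b; }
static int32_t op_sub(int32_t a, int32_t b) { return a - b; }
static int32_t op_mul(int32_t a, int32_t b) { return a * b; }

/* The dispatch table lives in mutable data memory (SRAM), so the
 * program can re-point entries at runtime instead of modifying code. */
static int32_t (*ops[2])(int32_t, int32_t) = { op_add, op_sub };

void patch_op(int slot, int32_t (*fn)(int32_t, int32_t)) { ops[slot] = fn; }

int32_t dispatch(int slot, int32_t a, int32_t b) { return ops[slot](a, b); }
```

Because only data changes, no pipeline or cache maintenance is required, sidestepping most of the challenges listed below.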
Challenges
Key challenges with self-modifying code on Cortex-M include:
- Requires pipeline flushes and cache maintenance (e.g., ISB/DSB barriers) after modifying instructions
- Makes timing analysis difficult due to variable latency instruction fetch
- Imposes software challenges for concurrency and reentrancy
- Can complicate testing and debug due to non-determinism
As a result, self-modifying code is best utilized for niche applications rather than pervasively in Cortex-M programs.
Performance Optimization
The modified Harvard architecture in Cortex-M enables several performance optimization techniques by allowing instruction access to data memory.
Lookup Table Optimization
Lookup tables can be stored in data memory and accessed with ordinary load instructions, avoiding a duplicate copy in instruction memory.
Example:
LDR R1, =Table // R1 = memory address of Table
LDR R2, [R1, #offset] // R2 = value loaded from Table
Benefits include:
- Saves code size by storing tables in data memory
- Preserves instruction-bus bandwidth, since lookups travel over the data bus
- Enables runtime updates, since table values in data memory can be modified at any time
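The assembly example above corresponds roughly to the following C sketch (the table name and contents are hypothetical): the compiler materializes the table's address, then issues an offset load, matching the LDR pair shown:

```c
#include <assert.h>
#include <stdint.h>

static const uint32_t Table[4] = { 10, 20, 30, 40 };

uint32_t table_read(int index)
{
    /* Conceptually compiles to:
     *   LDR R1, =Table        ; R1 = address of Table
     *   LDR R2, [R1, #offset] ; R2 = element at the given offset
     */
    return Table[index];
}
```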
Instruction Stream Compression
Self-modifying code can be used to decompress instructions on the fly saving code size:
- Store compressed instructions in memory
- Decompress into RAM at runtime and execute from there
- Saves storage for instruction memory
- Costs extra CPU time to perform the decompression
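The first two steps can be sketched as below, assuming a simple run-length scheme (the compression format, buffer sizing, and the final step of branching into the decompressed buffer after cache/pipeline synchronization are all illustrative assumptions, not a specific Cortex-M facility):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand (count, byte) pairs into a RAM buffer. On a real target the
 * output buffer would sit in an executable SRAM region, and the caller
 * would branch into it only after synchronizing caches and pipeline. */
size_t rle_expand(const uint8_t *in, size_t in_len,
                  uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        for (uint8_t k = 0; k < in[i]; ++k) {
            if (n == out_cap) return n;  /* output buffer full */
            out[n++] = in[i + 1];
        }
    }
    return n;
}
```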
Instruction Cache Optimized Fetch
Program locality can be improved by rearranging instruction order to optimize instruction cache performance. Useful for loops and branches.
Example:
LDR R0, =LoopInsn // Load address of LoopInsn block
BX R0 // Branch to LoopInsn
LoopInsn:
... // Loop body instructions
Benefits:
- Arranges instructions for optimal caching
- Avoids conflict misses by spacing out loops and branches
- Hides fetch latency via instruction interleaving
Memory Management
The Cortex-M memory architecture requires careful management of the instruction and data memory spaces for optimal performance.
Split Memory Maps
By default, code and SRAM occupy separate, fixed regions of the Cortex-M address map. There is no address translation hardware (no MMU): each region maps directly onto the physical memory the SoC designer attaches to the corresponding bus.
Benefits:
- Software sees one consistent, flat address space
- Accesses are routed to the appropriate bus purely by address
- Access timing is predictable, with no translation overhead
Memory Protection Unit
The Memory Protection Unit (MPU) provides optional control over how regions of the memory map may be used:
- Defines protection regions with per-region access permissions
- The execute-never (XN) attribute controls whether instructions may be fetched from a region such as SRAM
- Preserves security by restricting which SRAM regions are executable
Useful for:
- Self-modifying code execution
- Accessing constant data in SRAM memory
- Shared memory communication between threads
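As a hedged sketch of the ARMv7-M region attribute encoding (field positions follow the ARMv7-M RASR register; the macro names are my own, and on real hardware the value would be written to the MPU's RASR register after programming RBAR, e.g. via CMSIS helpers — verify against the ARMv7-M Architecture Reference Manual before use):

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU RASR bit fields (macro names are illustrative). */
#define RASR_ENABLE     (1u << 0)
#define RASR_SIZE(bits) ((uint32_t)((bits) - 1) << 1) /* region = 2^bits bytes */
#define RASR_AP_FULL    (3u << 24)                    /* privileged + user RW  */
#define RASR_XN         (1u << 28)                    /* execute-never         */

/* Attribute word for an executable 32 KB SRAM region: XN left clear
 * so the core may fetch instructions from it. */
uint32_t exec_sram_rasr(void)
{
    return RASR_ENABLE | RASR_SIZE(15) | RASR_AP_FULL; /* 2^15 = 32 KB */
}
```

Setting RASR_XN instead would make the same region data-only, which is the usual hardening choice for SRAM that never holds code.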
Caching and Prefetching
Caches and prefetch buffers help overcome the latency gap between processor and memory:
- Exploit locality and sequential access patterns
- Instruction caches reduce average instruction fetch time
- Prefetch buffers hide stall cycles on instruction misses
These techniques are essential to leverage full processor throughput, especially when slower memories are used.
ARMv6-M and ARMv7-M Architectures
Classic ARM Cortex-M processors implement one of two architecture variants, ARMv6-M or ARMv7-M (newer cores implement ARMv8-M). They provide different capabilities:
ARMv6-M Architecture
- Minimalist architecture optimized for low-cost MCUs
- Very compact instruction set: mostly 16-bit Thumb encodings, with only a handful of 32-bit instructions
- No DSP or floating-point extensions
- Ideal for simple embedded applications like sensors and IoT nodes
ARMv7-M Architecture
- Enhanced architecture with optional DSP and floating-point support
- Full Thumb-2 instruction set, a superset of ARMv6-M
- The ARMv7E-M variant adds DSP extensions for SIMD operations
- Optional FPU: single precision on Cortex-M4, single and double precision on Cortex-M7
- Suited for industrial control, motor drives, automation, etc.
Cortex-M Processor Variants
ARM offers a wide range of Cortex-M processor variants targeting diverse performance points:
Ultra-low Power Variants
- Cortex-M0+ – 32-bit CPU optimized for lowest cost and power
- Cortex-M1 – Variant of the Cortex-M0 generation optimized for FPGA implementation
- Cortex-M23 – ARMv8-M baseline successor to the M0+, adding TrustZone security
- Cortex-M33 – ARMv8-M mainline core with TrustZone and optional DSP/FPU, closer to the Cortex-M4 class
These target extremely cost- and power-sensitive applications such as sensors and wearables. Their very compact code footprint makes them suitable even as 8/16-bit replacements.
Mainstream Low-power Variants
- Cortex-M3 – Older mainstream 32-bit variant
- Cortex-M4 – Mainstream core adding DSP extensions and an optional single-precision FPU
- Cortex-M7 – Highest-performance Cortex-M core, with caches and an optional double-precision FPU, for demanding real-time applications
Balances power efficiency with high performance. Widely used in embedded IoT, industrial, consumer and automotive applications.
Real-Time Variants (Cortex-R Series)
- Cortex-R4 and Cortex-R5 – Real-time processors for safety-critical systems
- Cortex-R4F and Cortex-R5F – R4/R5 with hardware floating-point units
- Cortex-R52 – ARMv8-R processor designed for functional safety
Designed for the reliability and functional-safety requirements of motor control, industrial transport, robotics, etc. Dual-core lock-step configurations support safety integrity levels up to SIL 3.
Conclusion
The modified Harvard architecture adopted by Cortex-M processors provides an optimized balance of performance and flexibility for embedded systems. Dedicated code and SRAM regions keep instruction and data traffic largely independent, while the relaxed separation permits single-copy constants and, where needed, self-modifying code. Features like the MPU build on this foundation to enable customizable system implementations.
ARM has leveraged this memory architecture across an extensive range of Cortex-M processor variants from ultra low-power to high-performance. Their widespread adoption in IoT, industrial, automotive and consumer applications is a testament to the efficiency and versatility of the underlying modified Harvard architecture.