The ARM architecture refers to a family of reduced instruction set computing (RISC) processors that are widely used in embedded systems and mobile devices. The ARM processors are known for their power efficiency and performance capabilities. Over the years, ARM has introduced several microarchitecture versions with improvements and new features. The major ARM processor families include ARM7, ARM9, ARM11 and Cortex.
ARM7
The ARM7 processors were among the early ARM CPU cores, introduced in the early 1990s. The ARM7 core uses a 3-stage pipeline and typically operates at frequencies between 30 and 100 MHz. Some key features of ARM7 processors include:
- 3-stage pipeline to enable faster instruction execution
- Von Neumann architecture with a single bus for instructions and data (a unified cache is present only in cached variants such as the ARM720T)
- Thumb 16-bit instruction set to improve code density
- EmbeddedICE module for on-chip debug (breakpoints and watchpoints over JTAG)
- Support for up to 4GB physical address space
- Optional MMU for memory protection and virtual memory support
The ARM7TDMI core was one of the most popular ARM7 variants; it incorporates the Thumb instruction set and the EmbeddedICE debug module. Key applications of ARM7 processors include early mobile phones, PDAs, disk drives, routers and printers.
ARM9
Introduced in the late 1990s, the ARM9 family was designed to address the growing performance requirements of mobile and embedded applications. The ARM9 cores use a 5-stage pipeline and execute the ARMv4T instruction set (ARMv5TE in the later ARM9E cores). Major improvements of ARM9 over the ARM7 cores include:
- Higher clock speeds up to 200 MHz
- 5-stage pipeline for improved performance
- Separate instruction and data interfaces (Harvard architecture) alongside the existing 16-bit Thumb instruction set
- Enhanced DSP instructions and saturating arithmetic (ARM9E variants)
- Coprocessor interface to support multimedia extensions
- Tightly-coupled memory support
- AMBA AHB bus interface
The ARM9 family includes variants such as the ARM920T, ARM922T, ARM925T, ARM926EJ-S and ARM946E-S. ARM9 processors were used in early smartphones, feature phones, WiFi routers, portable media players and handheld gaming devices.
ARM11
Announced in 2002, the ARM11 microarchitecture implements the ARMv6 instruction set and includes Jazelle DBX Java acceleration. The ARM11 cores provided substantially higher performance through microarchitectural optimizations such as:
- 8-stage pipeline to enable higher clock speeds
- Improved branch prediction and return stack
- Faster integer and floating-point arithmetic
- Pipelined memory architecture
- Wider cache interface and higher L1 cache bandwidth
- SIMD media instructions for multimedia
- Adaptive power control for power efficiency
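The SIMD media instructions in the list above operate on several small values packed into one 32-bit register. The following is a minimal sketch of a packed byte addition in the style of the ARMv6 `UADD8` instruction. The function name `uadd8` is chosen here for illustration, and the model ignores the GE flags that the real instruction also updates.

```python
def uadd8(a: int, b: int) -> int:
    """Add four packed unsigned bytes lane by lane, wrapping each lane modulo 256."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF          # extract one byte lane from each operand
        y = (b >> shift) & 0xFF
        result |= ((x + y) & 0xFF) << shift  # carries never cross into the next lane
    return result


if __name__ == "__main__":
    # Lane values 0x01, 0xFF, 0x7F, 0x10 each incremented by 1.
    print(hex(uadd8(0x01FF7F10, 0x01010101)))
```

A single instruction of this kind replaces four separate byte additions, which is where the multimedia speedup comes from.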
Some common ARM11 family processors are the ARM1136JF-S, ARM1156T2F-S, ARM1176JZF-S and ARM11 MPCore. ARM11 processors were used in smartphones such as the original iPhone, the iPhone 3G and the Nokia N95, as well as in the first Raspberry Pi.
ARM Cortex
Introduced from 2004 onwards, ARM Cortex processors are the current ARM processor family, optimized for high performance and energy efficiency. The Cortex cores are grouped by performance and intended application into Cortex-A (Application), Cortex-R (Real-time) and Cortex-M (Microcontroller). Depending on the core, distinguishing features of Cortex processors include:
- Advanced multi-stage pipelines and superscalar execution
- SIMD and VFPv3/v4 floating-point support
- NEON media processing engine
- Thumb-2 instruction set mixing 16-bit and 32-bit encodings
- Hardware divide and floating-point divide
- TrustZone security extensions
- Accelerator Coherency Port (ACP) interface
- Multi-core configurations with hardware cache coherency (for example, the Snoop Control Unit in Cortex-A9 MPCore)
- 64-bit addressing on ARMv8-A cores, with I/O coherency
The Cortex-A series targets high-performance application processing and includes variants such as Cortex-A5, A7, A8, A9, A12, A15, A17, A32, A35, A53, A55, A57, A65, A72 and A73, followed by later cores such as Cortex-A77, A78 and A710. The Cortex-R series is optimized for real-time applications, while the Cortex-M series targets low-power microcontroller applications.
Key Differences
Here is a summary of some of the key differences between the ARM processor families:
Attribute | ARM7 | ARM9 | ARM11 | Cortex |
---|---|---|---|---|
Release timeframe | Early 1990s | Late 1990s | 2002 | 2004 onwards |
Instruction set | ARMv3, ARMv4T | ARMv4T, ARMv5TE | ARMv6 | ARMv7 and above (ARMv6-M for Cortex-M0) |
Pipeline stages | 3 | 5 | 8 | 3 (Cortex-M) to 13 or more (Cortex-A) |
Process technology | 0.35 µm to 0.18 µm | 0.18 µm to 90 nm | 130 nm to 65 nm | 45 nm to 5 nm |
Clock speed | 30-100 MHz | Up to 200 MHz | Up to 600 MHz | Up to 3 GHz+ |
Key features | Basic 3-stage pipeline, Thumb ISA for code density, MMU optional | Improved 5-stage pipeline, Harvard architecture, DSP extensions | Deeper 8-stage pipeline, Jazelle DBX, SIMD media instructions | Advanced multi-stage pipelines, Thumb-2, NEON, TrustZone, multi-core |
Applications | PDAs, routers, printers | Feature phones, smartphones, gaming devices | Smartphones, mobile devices | Smartphones, tablets, servers, IoT devices |
In summary, the ARM architecture has continued to evolve over the years with microarchitecture improvements to deliver higher performance, better power efficiency and advanced features from ARM7 to ARM11 to the latest Cortex processors.
ARM7 Core Architecture
The ARM7 core architecture is based on the ARMv3 instruction set (ARMv4T for the ARM7TDMI) with a 3-stage integer pipeline. The major components of the ARM7 core include:
- Integer pipeline – 3-stage pipeline with Fetch, Decode and Execute stages
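- Register bank – Thirty-seven 32-bit registers in total: thirty-one general-purpose registers (including the PC and the banked copies for each mode) and six status registers; sixteen general-purpose registers are visible at a time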
- Barrel shifter – For shift and rotate operations
- ALU – Arithmetic Logic Unit for arithmetic and logic operations
- Memory interface – Supports up to 4GB physical address space
- Write buffer – Present in cached variants to decouple processor writes from slower memory
- EmbeddedICE – On-chip debug support (breakpoints and watchpoints)
- Thumb decoder – To decode 16-bit Thumb instructions
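To illustrate what the barrel shifter in the list above contributes, here is a small sketch of a 32-bit rotate-right and of how ARM data-processing instructions build an immediate operand from an 8-bit value rotated right by twice a 4-bit field. The helper names `ror32` and `decode_arm_immediate` are made up for this example.

```python
MASK32 = 0xFFFFFFFF


def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def decode_arm_immediate(imm8: int, rot4: int) -> int:
    """Expand an ARM-state immediate: 8-bit value rotated right by 2 * rot4."""
    return ror32(imm8 & 0xFF, 2 * (rot4 & 0xF))


if __name__ == "__main__":
    print(hex(ror32(0x80000001, 1)))
    print(hex(decode_arm_immediate(0xFF, 4)))
```

Because the shifter sits in front of the ALU, the rotation costs no extra instruction: the shifted operand feeds the arithmetic operation directly.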
The integer pipeline comprises three stages: Fetch, Decode and Execute. There are no separate memory or writeback stages, so those operations are folded into Execute:
- Fetch – Fetches the instruction from memory (or from the unified cache in cached variants)
- Decode – Decodes the instruction into control signals
- Execute – Reads registers and performs ALU operations, address calculation, shifts and multiplies
- Memory access – Loads and stores take additional cycles in the Execute stage
- Writeback – Results are written to the register bank at the end of Execute
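The overlap of Fetch, Decode and Execute can be made concrete with a toy timing model. The sketch below assumes an ideal pipeline in which every instruction spends exactly one cycle per stage, with no branches, loads or stalls; the names `ideal_cycles` and `occupancy` are illustrative only.

```python
STAGES = ("Fetch", "Decode", "Execute")


def ideal_cycles(n_instructions: int, depth: int = len(STAGES)) -> int:
    """Cycles needed to drain n instructions through an ideal, stall-free pipeline."""
    if n_instructions <= 0:
        return 0
    return depth + n_instructions - 1


def occupancy(n_instructions: int) -> list:
    """For each cycle, list the instruction index held in each stage (None if empty)."""
    table = []
    for cycle in range(ideal_cycles(n_instructions)):
        row = []
        for stage_index in range(len(STAGES)):
            instr = cycle - stage_index  # instruction i enters stage s at cycle i + s
            row.append(instr if 0 <= instr < n_instructions else None)
        table.append(row)
    return table


if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(4)):
        cells = ", ".join(
            f"{stage}=I{instr}" if instr is not None else f"{stage}=--"
            for stage, instr in zip(STAGES, row)
        )
        print(f"cycle {cycle}: {cells}")
```

Under these assumptions four instructions finish in six cycles rather than twelve. Real code does worse, because loads, stores and taken branches add cycles that this model leaves out.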
The ARM7TDMI core added the 16-bit Thumb instruction set to improve code density. The core runs in one of two operating states:
- ARM state – 32-bit ARM instruction set
- Thumb state – 16-bit compressed Thumb instruction set
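Software switches between the two states with the branch-and-exchange instruction (`BX`), which uses bit 0 of the target address to select the state. The sketch below models only that selection rule. The function name `branch_exchange` is hypothetical, and rejecting a misaligned ARM-state target is a simplification chosen for this example.

```python
def branch_exchange(target: int):
    """Return (state, pc) selected by a BX to `target`, modelling only bit 0."""
    target &= 0xFFFFFFFF
    if target & 1:
        # Bit 0 set: enter Thumb state; the bit itself is not part of the address.
        return "Thumb", target & ~1 & 0xFFFFFFFF
    if target & 2:
        # Assumption for this sketch: treat a non-word-aligned ARM target as an error.
        raise ValueError("ARM-state branch target must be word-aligned")
    return "ARM", target


if __name__ == "__main__":
    print(branch_exchange(0x00008001))
    print(branch_exchange(0x00008000))
```

This is why function pointers to Thumb code carry an odd address: the low bit is a state marker, not part of the location.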
The EmbeddedICE module provides on-chip debug capability for application development; instruction trace requires a separate Embedded Trace Macrocell (ETM).
ARM9 Core Architecture
The ARM9 family was an evolutionary upgrade over ARM7 with key improvements:
- Higher performance 5-stage integer pipeline
- Introduction of Jazelle DBX for Java acceleration (in the ARM926EJ-S)
- Harvard architecture with separate instruction and data caches in cached variants
- DSP extensions such as saturating arithmetic (ARM9E variants)
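Saturating arithmetic clamps a result to the representable range instead of letting it wrap around, which suits audio and other signal-processing code. The following is a minimal sketch in the style of the `QADD` instruction; the Python function `qadd` and its returned flag are illustrative stand-ins for the instruction and its sticky Q flag.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def qadd(a: int, b: int):
    """Signed 32-bit saturating add. Returns (result, saturated)."""
    total = a + b
    clamped = max(INT32_MIN, min(INT32_MAX, total))
    return clamped, clamped != total


if __name__ == "__main__":
    print(qadd(INT32_MAX, 1))   # clamps at the maximum instead of wrapping negative
    print(qadd(-5, 3))
```

With ordinary wrapping addition, the first call would produce a large negative number, which in an audio sample shows up as a loud click.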
The ARM9 core has a 5-stage integer pipeline comprising:
- Fetch – Fetch instructions from I-cache or memory
- Decode – Decode instructions and read register operands
- Execute – Execute ALU operations and calculate memory addresses
- Memory – Perform load/store from data cache or memory
- Writeback – Writeback results to register bank
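Giving memory access and writeback their own stages shortens the work done per stage, which allows higher clock frequencies than the 3-stage ARM7 pipeline. Key components of the ARM9 core include:
- Register bank – Thirty-seven 32-bit registers: thirty-one general-purpose registers and six status registers (CPSR plus five SPSRs)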
- Barrel shifter – Combined shifter and logic unit for shifts/rotates
- ALU – 32-bit Arithmetic Logic Unit
- MAC – Optional 32-bit multiplier accumulator unit
- Jazelle DBX – Hardware accelerator for Java bytecodes
- Write buffer – To reduce stalls in case of writes
- System Control – Handling interrupts, exceptions and coprocessors
The ARM9 cores support the 16-bit Thumb instruction set for higher code density. Thumb-2, which mixes 16-bit and 32-bit encodings to recover most of the performance of ARM code, arrived later with ARMv6T2 in the ARM1156T2-S.
ARM11 Core Architecture
The ARM11 core architecture introduced an 8-stage pipeline implementing the ARMv6 instruction set. Some of the major improvements included:
- 8-stage pipeline for higher clock speeds
- Improved branch prediction and return stack
- 2-cycle multiplier for better arithmetic performance
- Store buffer to reduce stalls on writes
- Operand forwarding unit to avoid pipeline bubbles
- Coprocessor interface for multimedia acceleration
The integer pipeline in ARM11 (as implemented in the ARM1136) shares four front-end stages and then splits into parallel back-end pipelines, each path being eight stages deep:
- Fe1, Fe2 – Fetch instructions from the I-cache and predict branches
- De – Decode instructions into control signals
- Iss – Read register operands and issue the instruction
- Sh, ALU, Sat – Integer path: shift/rotate, ALU operation, saturation
- MAC1, MAC2, MAC3 – Multiply-accumulate path
- ADD, DC1, DC2 – Load/store path: address calculation followed by two data cache access stages
- WBex, WBls – Write results back to the register bank
Some key components of the ARM11 core include:
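- Register bank – Sixteen 32-bit registers visible at a time, plus banked copies for each mode; the optional VFP11 coprocessor adds thirty-two single-precision registers that can also be used as sixteen 64-bit double-precision registers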
- Barrel shifter – Combined shifter and logic unit
- ALU – 32-bit Arithmetic Logic Unit
- MAC – 32-bit multiplier Accumulator
- Jazelle – Acceleration for Java bytecodes
- Write buffer – 8-entry write buffer to reduce stalls
- System Control – Handling of interrupts, exceptions, coprocessors
The ARM11 core delivered substantially higher performance together with good power efficiency through microarchitecture improvements such as a deeper pipeline, improved caching, operand forwarding and branch prediction.
ARM Cortex-A Series
The ARM Cortex-A series targets high-performance application processing requirements. Let us look at the microarchitecture of Cortex-A8 as an example.
The Cortex-A8 has a 13-stage superscalar integer pipeline with features like:
- In-order, dual-issue execution for improved performance
- Program flow prediction to reduce stalls
- NEON media engine for SIMD processing
- Thumb-2 instruction set with efficient branch encoding
- Optional integrated L2 cache for lower-latency access
The integer pipeline is organized into three groups of stages:
- Fetch (F0 to F2) – Generates fetch addresses, reads the L1 I-cache and predicts branches
- Decode (D0 to D4) – Decodes instructions, checks dependencies and issues up to 2 instructions per cycle in program order
- Execute (E0 to E5) – Reads registers, then performs shifts, ALU operations, multiplies, branch resolution and writeback; the load/store pipeline accesses the L1 D-cache in these stages
Some of the key components of the Cortex-A8 core include:
- Register files – Sixteen 32-bit integer registers visible at a time, thirty-two 64-bit NEON registers (also usable as sixteen 128-bit registers)
- Execution units – 2 Integer ALUs, Load/Store unit, SIMD NEON unit
- Branch predictor – Advanced branch predictor for low misprediction
- Instruction caches – Separate L1 caches for Instruction and Data
- Cache coherence – Not provided in hardware, since the Cortex-A8 is a single-core design
- Memory management – MMU with 4KB pages (plus larger page and section sizes) and 32-bit physical addresses
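To show what 4KB pages mean for address translation, the sketch below splits a 32-bit virtual address into a page number and a page offset, and further into the first-level and second-level table indices used by the ARMv7 short-descriptor format for small pages. The function names are invented for this example, and the code only extracts bit fields; it does not walk real page tables.

```python
PAGE_SIZE = 4 * 1024   # 4KB pages, as stated in the list above
OFFSET_BITS = 12       # log2(PAGE_SIZE)


def split_virtual_address(va: int):
    """Return (virtual page number, offset within the page) for a 32-bit address."""
    va &= 0xFFFFFFFF
    return va >> OFFSET_BITS, va & (PAGE_SIZE - 1)


def table_indices(va: int):
    """Return (first-level index, second-level index, offset) for a 4KB small page."""
    va &= 0xFFFFFFFF
    l1_index = va >> 20            # bits [31:20]: one of 4096 first-level entries
    l2_index = (va >> 12) & 0xFF   # bits [19:12]: one of 256 second-level entries
    return l1_index, l2_index, va & 0xFFF


if __name__ == "__main__":
    print([hex(x) for x in split_virtual_address(0x12345678)])
    print([hex(x) for x in table_indices(0x12345678)])
    print(2 ** 32 // PAGE_SIZE)    # number of 4KB pages in a 4GB address space
```

The MMU caches recent translations in a TLB so that most accesses avoid the two table lookups entirely.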
The Cortex-A8 is a single-core design. Symmetric multiprocessing arrived in the Cortex-A series with the Cortex-A9 MPCore, which enabled multi-core designs.
Evolution of ARM Instruction Sets
The ARM instruction sets have continued to evolve with additions