The ARM Cortex-M3 is a 32-bit processor core licensed by Arm Holdings. It is part of the Cortex-M series of microcontroller cores and is designed for embedded applications that need a low-power CPU with good performance. The Cortex-M3 has a 3-stage pipeline and includes the Thumb-2 instruction set, a Nested Vectored Interrupt Controller (NVIC) and an optional Memory Protection Unit (MPU). It does not provide Single Instruction Multiple Data (SIMD) instructions; those belong to the DSP extension introduced with the Cortex-M4.
The Cortex-M3 implements the ARMv7-M architecture, which executes only the Thumb-2 instruction set. Thumb-2 extends the previous Thumb (Thumb-1) instruction set with additional 32-bit instructions while retaining all the existing 16-bit Thumb-1 instructions. This allows Thumb-2 code to achieve performance close to that of ARM code while having higher code density. The 16-bit and 32-bit instructions can be freely intermixed in Thumb-2 code.
Data Processing Instructions
The ARM Cortex-M3 and Thumb-2 instruction set provides various data processing instructions to operate on registers and constants. These include:
- Arithmetic instructions like ADD, SUB, MUL, etc
- Logical instructions like AND, ORR, EOR, etc
- Shift and rotate instructions like LSL, LSR, ASR, ROR, etc
- Move instructions like MOV, MVN
- Compare instructions like CMP, CMN, TST
- Bit field instructions like BFI, BFC
These data processing instructions allow efficient manipulation of data stored in registers. The instructions can take register operands, constant immediate operands or both. Next we’ll look at some examples of using these data processing instructions.
Arithmetic Instructions
Arithmetic instructions like ADD and SUB perform addition and subtraction on register and immediate operands. For example:
    ADD R1, R2, R3 //R1 = R2 + R3
    SUB R5, R3, #10 //R5 = R3 - 10
Integer multiplication is performed using the MUL instruction, which returns the low 32 bits of the product (the same bits for signed and unsigned operands):
    MUL R2, R1, R3 //R2 = R1 * R3
The Cortex-M3 has no SMUL or SMLA instruction. Multiply-accumulate is performed with MLA, and a full signed 64-bit product with SMULL:
    MLA R4, R1, R3, R4 //R4 = R4 + R1 * R3
    SMULL R5, R6, R2, R9 //R6:R5 = R2 * R9 (signed 64-bit result)
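As a rough illustration of what the multiply instructions compute, the following C sketch models MLA (32-bit multiply-accumulate) and SMULL (signed 64-bit product). The function names mla and smull are hypothetical helpers chosen for this example; they are not part of any Arm library.

```c
#include <stdint.h>

/* Model of MLA Rd, Rn, Rm, Ra: Rd = Ra + Rn * Rm, keeping the low 32 bits.
 * Unsigned arithmetic in C wraps modulo 2^32, which matches the register width. */
uint32_t mla(uint32_t rn, uint32_t rm, uint32_t ra)
{
    return ra + rn * rm;
}

/* Model of SMULL RdLo, RdHi, Rn, Rm: the full signed 64-bit product.
 * On the processor the result is split across two registers (RdHi:RdLo). */
int64_t smull(int32_t rn, int32_t rm)
{
    return (int64_t)rn * (int64_t)rm;
}
```

For example, mla(3, 4, 5) gives 17, and smull(-2, 3) gives -6 without losing the sign in the upper half.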
Logical Instructions
Logical AND, OR and XOR operations on registers are performed using the AND, ORR and EOR instructions. For example:
    AND R2, R5, #0xF //R2 = R5 AND 0xF
    ORR R4, R1, R8 //R4 = R1 OR R8
    EOR R7, R3, R9 //R7 = R3 XOR R9
The ANDS, ORRS and EORS forms also update the status flags based on the result.
Bitwise NOT is performed using the MVN (Move NOT) instruction:
    MVN R6, R8 //R6 = NOT R8
Shift and Rotate Instructions
Shift and rotate operations on registers are done with the shift/rotate instructions, or by applying the barrel shifter to the second operand of another data processing instruction.
Logical Shift Left (LSL) and Logical Shift Right (LSR) shift bits and fill with zeros. Arithmetic Shift Right (ASR) fills with copies of the sign bit.
    LSL R5, R3, #2 //R5 = R3 << 2
    LSR R2, R7, #5 //R2 = R7 >> 5
    ASR R4, R1, #3 //R4 = R1 >> 3, sign bit replicated
Rotate Right (ROR) rotates by the specified amount. Rotate Right with Extend (RRX) always shifts right by exactly one bit and moves the carry flag into bit 31.
    ROR R8, R2, #8 //Rotate R2 right by 8 bits
    RRX R9, R1 //R9 = (carry << 31) | (R1 >> 1)
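The difference between the shift and rotate operations is easier to see in code. The sketch below models them in portable C for shift amounts of 0 to 31; the function names (lsl32, lsr32, asr32, ror32, rrx32) are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* LSL: shift left, zeros enter at bit 0. */
uint32_t lsl32(uint32_t x, unsigned n) { assert(n < 32u); return x << n; }

/* LSR: shift right, zeros enter at bit 31. */
uint32_t lsr32(uint32_t x, unsigned n) { assert(n < 32u); return x >> n; }

/* ASR: shift right, copies of the sign bit enter at bit 31.
 * Done on unsigned values so the result does not depend on how a
 * C compiler shifts negative signed integers. */
uint32_t asr32(uint32_t x, unsigned n)
{
    assert(n < 32u);
    uint32_t shifted = x >> n;
    if ((x & 0x80000000u) != 0u && n > 0u)
        shifted |= ~(0xFFFFFFFFu >> n);   /* fill the vacated top bits with ones */
    return shifted;
}

/* ROR: bits leaving at bit 0 re-enter at bit 31. */
uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n < 32u);
    return (n == 0u) ? x : (x >> n) | (x << (32u - n));
}

/* RRX: shift right by one, the old carry flag enters at bit 31.
 * (Bit 0 of x is what would become the new carry with the S suffix.) */
uint32_t rrx32(uint32_t x, unsigned carry_in)
{
    return (x >> 1) | ((uint32_t)(carry_in & 1u) << 31);
}
```

With these models, asr32(0xFFFFFFF0, 3) yields 0xFFFFFFFE (that is, -16 becomes -2), whereas lsr32 on the same input yields 0x1FFFFFFE.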
Move Instructions
The MOV instruction copies a value from one register to another register. For example:
    MOV R4, R8 //R4 = R8
It can also move an immediate constant value into a register.
    MOV R2, #0x55 //R2 = 0x55
The MVN instruction performs a bitwise NOT operation during the move. For example:
    MVN R6, R8 //R6 = NOT R8
Compare Instructions
Compare instructions like CMP, CMN, TST are used to compare two operands. They update the status flags but don’t store result in a register.
CMP performs a subtraction, CMN an addition and TST a logical AND; in each case only the flags are updated:
    CMP R1, #5 //Flags from R1 - 5
    CMN R3, R7 //Flags from R3 + R7
    TST R2, R4 //Flags from R2 AND R4
These instructions help to perform conditional testing and branching.
Bit Field Instructions
Bit field instructions allow access to and manipulation of a specific bit field within a register. For example:
    BFI R5, R8, #3, #5 //Insert the low 5 bits of R8 into R5, starting at bit 3
    BFC R4, #7, #3 //Clear 3 bits in R4, starting at bit 7
This allows bit masks and bit flags to be created and maintained in registers for efficient bit manipulation.
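To make the bit field semantics concrete, here is a small C model of BFI and BFC. It is only a sketch: field_mask, bfi and bfc are hypothetical helper names, and the parameters follow the assembler operand order (destination, source, least significant bit, width).

```c
#include <assert.h>
#include <stdint.h>

/* Mask with `width` ones starting at bit `lsb` (1 <= width, lsb + width <= 32). */
uint32_t field_mask(unsigned lsb, unsigned width)
{
    assert(width >= 1u && lsb + width <= 32u);
    uint32_t ones = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << lsb;
}

/* Model of BFC Rd, #lsb, #width: clear the field, leave other bits alone. */
uint32_t bfc(uint32_t rd, unsigned lsb, unsigned width)
{
    return rd & ~field_mask(lsb, width);
}

/* Model of BFI Rd, Rn, #lsb, #width: copy the low `width` bits of rn
 * into rd starting at bit `lsb`, leaving the other bits of rd alone. */
uint32_t bfi(uint32_t rd, uint32_t rn, unsigned lsb, unsigned width)
{
    uint32_t mask = field_mask(lsb, width);
    return (rd & ~mask) | ((rn << lsb) & mask);
}
```

For instance, bfc(0xFFFFFFFF, 7, 3) returns 0xFFFFFC7F, and bfi(0, 0xFF, 3, 5) returns 0xF8 because only five bits of the source are inserted.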
Addressing Modes
The ARM Cortex-M3 data processing instructions support several addressing modes for specifying the operands. This includes using registers, constants and labels for the instructions.
Register Mode
In register addressing mode, a register is specified as an operand. This allows operations between CPU registers.
    ADD R1, R2, R3 //Register operands
    CMP R4, R8 //Register operands
Immediate Mode
In immediate addressing mode, a constant value is specified as an operand. This is useful for simple constant operations.
    ADD R1, R2, #10 //Immediate constant operand
    CMP R4, #0xF //Compare with immediate value
Label Mode
PC-relative addressing using labels is used by jump and branch instructions. The label refers to a memory address.
    BNE loop //Branch to label ‘loop’ if the Z flag is clear
    CBZ R1, begin //Branch if R1 is 0 (CBZ can only branch forward)
Scaled Register Mode
Load and store instructions such as LDR and STR support a scaled register offset. The offset register is shifted left by 0 to 3 bits before being added to the base register.
    LDR R5, [R2, R1, LSL #3] //Address = R2 + (R1 << 3), so R1 is scaled by 8
    STR R8, [R4, R6, LSL #2] //Address = R4 + (R6 << 2), so R6 is scaled by 4
This helps index arrays and structured data by eliminating extra shift instructions.
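The address calculation behind a scaled register offset can be sketched in C as follows. The function name scaled_address is hypothetical, and the limit of 3 on the shift reflects the Thumb-2 LDR/STR register-offset encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of LDR Rt, [Rn, Rm, LSL #shift]:
 * base register plus the index register shifted left by 0..3 bits. */
uint32_t scaled_address(uint32_t base, uint32_t index, unsigned shift)
{
    assert(shift <= 3u);
    return base + (index << shift);
}
```

With a base of 0x20000000, element 5 of an array of 32-bit words is at scaled_address(0x20000000, 5, 2), which is 0x20000014. This is the same arithmetic a C compiler performs for table[index] on an array of 4-byte elements.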
Condition Flags
Flag-setting ALU instructions (those with the S suffix, such as ADDS and SUBS, plus the compare instructions) update the 4 condition flags in the Application Program Status Register (APSR) based on the result:
- N – Negative flag
- Z – Zero flag
- C – Carry flag
- V – Overflow flag
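These flags can be tested by conditional branches such as BNE, BEQ and BMI. Some examples:
    CMP R1, R2 //Compute R1 - R2 and set the flags
    BGT label //Branch if R1 > R2, signed (tests Z, N and V)
    SUBS R3, R3, R4 //R3 = R3 - R4, the S suffix updates the flags
    BLT label //Branch if the original R3 < R4, signed (tests N and V)
To branch purely on the sign of the result (N flag only), use BMI instead of BLT.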
This allows code execution to be conditional based on results of previous arithmetic or logical instructions.
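How the flags are derived from a subtraction, and how the condition codes read them, can be modelled in C. This is a sketch under stated assumptions: the type flags_t and the function names are invented for the example, and only a few condition codes are shown.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } flags_t;

/* Flags produced by CMP a, b (or SUBS with the same operands). */
flags_t flags_after_sub(uint32_t a, uint32_t b)
{
    uint32_t result = a - b;
    flags_t f;
    f.n = (result >> 31) != 0u;                        /* result is negative          */
    f.z = (result == 0u);                              /* result is zero              */
    f.c = (a >= b);                                    /* carry set means "no borrow" */
    f.v = ((((a ^ b) & (a ^ result)) >> 31) != 0u);    /* signed overflow             */
    return f;
}

bool cond_eq(flags_t f) { return f.z; }                   /* BEQ: Z set              */
bool cond_mi(flags_t f) { return f.n; }                   /* BMI: N set              */
bool cond_lt(flags_t f) { return f.n != f.v; }            /* BLT: N differs from V   */
bool cond_gt(flags_t f) { return !f.z && (f.n == f.v); }  /* BGT: Z clear and N == V */
bool cond_hi(flags_t f) { return f.c && !f.z; }           /* BHI: unsigned higher    */
```

The case a = 0x80000000, b = 1 shows why BLT and BMI differ: the subtraction overflows, so N is clear and BMI would not branch, while N differs from V and BLT correctly reports that the most negative integer is less than 1.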
Pipelines and Performance
The Cortex-M3 uses a 3 stage pipeline – Fetch, Decode and Execute. This enables some basic parallelism and increases performance compared to sequential non-pipelined execution.
While one instruction executes (Execute stage), the next instruction is decoded (Decode stage) and a third is fetched (Fetch stage). The core still issues at most one instruction per cycle, but because the stages overlap, one instruction can complete every cycle in the best case.
The Cortex-M3 does not have a dynamic branch predictor. Instead it uses branch speculation: when a branch is decoded, the core speculatively fetches the branch target so that fewer cycles are lost if the branch is taken. The optional Memory Protection Unit (MPU) enforces access permissions on memory regions; it provides protection, not faster memory access.
Overall the Cortex-M3 pipeline along with the Thumb-2 instruction set provides good performance for embedded applications while minimizing energy consumption.
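The benefit of overlapping the three stages can be estimated with a simple idealised model. The sketch below assumes every stage takes exactly one cycle and ignores stalls, taken branches and multi-cycle instructions, so it gives a best-case figure rather than a real cycle count; the function names are hypothetical.

```c
#include <stdint.h>

enum { PIPELINE_STAGES = 3 };   /* Fetch, Decode, Execute */

/* Without pipelining, each instruction passes through all stages
 * before the next one starts. */
uint32_t cycles_unpipelined(uint32_t instructions)
{
    return instructions * PIPELINE_STAGES;
}

/* With an ideal pipeline, the first instruction needs all stages to
 * fill the pipeline, then one instruction completes per cycle. */
uint32_t cycles_pipelined(uint32_t instructions)
{
    return (instructions == 0u) ? 0u : instructions + (PIPELINE_STAGES - 1);
}
```

Under these assumptions, 10 instructions take 30 cycles unpipelined and 12 cycles pipelined.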
Instruction Set Encoding
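The Thumb-2 instruction set uses both 16-bit and 32-bit instruction encodings. A subset of the instructions is available in both formats.
16-bit:
    MOVS R5, #100 //Move an 8-bit immediate to a low register, flags updated
    CMP R1, R2 //Compare R1 and R2
32-bit:
    MOVW R8, #1000 //Move a 16-bit immediate (MOVW has no flag-setting form)
    SUBS R8, R9, R10 //Subtract using high registers, flags updated
The 16-bit format provides higher code density, while the 32-bit format allows larger immediate constants, access to the high registers (R8-R12) in most instructions, and a choice of whether the status flags are updated. Most 16-bit data processing instructions always update the flags when used outside an IT block.
Most instruction classes, including branches and load/store instructions, have both 16-bit and 32-bit encodings. Some instructions exist in only one format: CBZ and CBNZ are 16-bit only, while BL, MLA, SDIV, UDIV and the bit field instructions are 32-bit only. MUL has a 16-bit flag-setting form (MULS) for low registers.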
The unified 16-bit and 32-bit encoding allows Thumb-2 to achieve good performance and code density – making it very suitable for embedded applications.
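Because the two encodings are intermixed, a decoder has to tell them apart from the first halfword alone. In the ARMv7-M encoding, a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction; anything else is a complete 16-bit instruction. A minimal C sketch of that check, with hypothetical function names:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `halfword` is the first halfword of a 32-bit Thumb-2 instruction. */
bool is_32bit_first_halfword(uint16_t halfword)
{
    unsigned top5 = (unsigned)halfword >> 11;   /* bits [15:11] */
    return top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu;
}

/* Instruction length in bytes, decided from the first halfword only. */
unsigned instruction_length_bytes(uint16_t halfword)
{
    return is_32bit_first_halfword(halfword) ? 4u : 2u;
}
```

For example, 0x2564 (MOVS R5, #100) and 0xBF00 (NOP) are 16-bit instructions, while 0xF000 (the first halfword of a BL) starts a 32-bit instruction.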
Instruction Set Optimization
Here are some tips for optimizing code to make best use of the Cortex-M3/Thumb-2 instruction set architecture:
- Use 16-bit instructions whenever possible for better code density
- Minimize branching to avoid pipeline stalls
- Use conditional execution (IT blocks) instead of short branches where possible
- Combine addition/subtraction with status flag update to eliminate CMP
- Use scaled register offset addressing mode to avoid extra shifts
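- Do not rely on SIMD instructions; the Cortex-M3 does not implement them (they belong to the DSP extension of the Cortex-M4 and later)
- Use constant pools (literal pools) for 32-bit constants that cannot be encoded in a single MOV instruction
- Fold shifts into the second operand of a data processing instruction (for example ADD R0, R1, R2, LSL #2) instead of issuing a separate LSL/LSR
Proper register allocation and efficient bit manipulation also help improve performance. The Cortex-M3 core itself has no cache, although a microcontroller vendor may add a flash accelerator or cache around it. Compilers handle many of these optimizations automatically.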
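As one illustration of the tip about combining a subtraction with the flag update, a loop that counts down to zero lets the loop test reuse the Z flag set by the decrement (a SUBS followed by BNE) instead of needing a separate CMP. The C sketch below is written in that style; sum_words is a hypothetical function, and whether a given compiler actually emits SUBS/BNE for it depends on the toolchain and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `count` 32-bit words. The counter runs down to zero, so the
 * decrement and the loop-exit test can share one flag-setting instruction. */
uint32_t sum_words(const uint32_t *data, size_t count)
{
    assert(data != NULL || count == 0u);
    uint32_t total = 0u;
    while (count != 0u) {
        total += *data++;
        count--;
    }
    return total;
}
```

Calling sum_words on the four values 1, 2, 3, 4 returns 10.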
Instruction Latency and Throughput
Instructions have different latencies and throughputs depending on their type and pipeline implementation.
Latency is the number of cycles needed before the result of an instruction is available. Simple ALU instructions have a latency of 1 cycle.
Throughput is the number of instructions completed per cycle. The Cortex-M3 is a single-issue core, so its peak throughput is one instruction per cycle, reached when the pipeline is kept full.
On the Cortex-M3, most ALU ops take just 1 cycle for both latency and throughput. This includes:
- Additions – ADD, SUB, ADC, SBC etc.
- Logical – AND, ORR, EOR, BIC
- Move – MOV, MVN
- Compare – CMP, CMN, TST
- Shift/Rotate – LSL, LSR, ASR, ROR
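Multiplies can take more cycles. A 32-bit MUL completes in 1 cycle, a multiply-accumulate (MLA) takes 2 cycles, and the long multiplies that produce a 64-bit result (SMULL, UMULL) take 3 to 5 cycles depending on the operands. Hardware divide (SDIV, UDIV) takes 2 to 12 cycles.
A single load or store typically takes 2 cycles, and back-to-back loads and stores can be pipelined so that each additional one costs 1 cycle. A branch costs 1 cycle when it is not taken; a taken branch adds 1 to 3 cycles to refill the pipeline.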
Knowing the instruction timings allows proper scheduling to avoid potential pipeline stalls. This helps achieve maximum performance.
Instruction Set Summary
To summarize, the Thumb-2 instruction set provides a versatile set of data processing, memory access and flow control instructions for the Cortex-M3 CPU.
Key highlights:
- High performance 32-bit instructions mixed with compact 16-bit instructions
- Flexible arithmetic, logical, comparison, bit-field, and move instructions
- Efficient shift and rotate instructions
- Load/store instructions with scaled offset addressing
- Conditional branches and IT-block conditional execution
- Constant pools and PC relative branch addressing
- Pipelined implementation that overlaps fetch, decode and execute
Overall, Thumb-2 provides an excellent instruction set architecture that balances performance, code density and power efficiency for embedded system development using the Cortex-M3 processor.
Conclusion
In this article, we looked at how data processing instructions work on the ARM Cortex-M3 CPU. We covered the various arithmetic, logical, move, compare, shift/rotate instructions and their addressing modes. We also saw how the condition flags help implement conditional execution after an ALU operation. Techniques for optimizing instructions were discussed along with details about pipelines and instruction timings.
The Thumb-2 instruction set with its mix of 16-bit and 32-bit instructions provides a great combination of high performance, good code density and low power consumption. Developers can leverage the capabilities of the instruction set architecture effectively to build efficient embedded applications using the Cortex-M3 processor.