The ARM Cortex-M4 is a 32-bit processor optimized for low-power embedded applications. It implements the ARMv7E-M architecture and executes only the Thumb-2 instruction set, which extends the original 16-bit Thumb instruction set with 32-bit encodings for improved performance and functionality.
In this article, we will take a deep dive into the Thumb-2 instruction set and explain the various opcodes supported by the Cortex-M4. Understanding the opcodes is key to effectively programming and optimizing code for these processors.
Thumb-2 Instruction Set Overview
The Thumb-2 instruction set is a variable-length instruction set that combines both 16-bit and 32-bit opcodes. This allows small 16-bit opcodes to be used for common instructions, resulting in better code density compared to a traditional 32-bit only instruction set. At the same time, 32-bit opcodes are available for more complex instructions and functionality.
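To make the variable-length encoding concrete, here is a minimal C sketch of how a decoder can tell the two sizes apart. It relies on the architectural rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction. The function names are hypothetical and chosen only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if `hw` is the first halfword of a 32-bit Thumb-2 encoding.
 * Bits [15:11] equal to 0b11101, 0b11110 or 0b11111 mark a 32-bit
 * instruction; every other value is a complete 16-bit instruction. */
bool thumb_is_32bit(uint16_t hw)
{
    return (hw >> 11) >= 0x1D;
}

/* Hypothetical helper: counts the instructions in a buffer of halfwords
 * by stepping one or two halfwords at a time. */
size_t thumb_count_instructions(const uint16_t *code, size_t n_halfwords)
{
    size_t count = 0;
    size_t i = 0;

    while (i < n_halfwords) {
        i += thumb_is_32bit(code[i]) ? 2 : 1;
        count++;
    }
    return count;
}
```

For example, 0xBF00 (NOP) is a 16-bit instruction, while 0xF000 is the first halfword of a 32-bit BL.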
Broadly, the Thumb-2 instruction set can be grouped into the following categories:
- Branch and Control Flow Instructions
- Data Processing Instructions
- Load/Store Instructions
- Floating Point Instructions
- DSP and SIMD Instructions
- Supervisor Call and Coprocessor Instructions
In the rest of this article, we will examine the key opcodes in each of these instruction groups and explain their usage in Cortex-M4 programming.
Branch and Control Flow Instructions
Branch instructions alter the program flow by jumping to a different part of the code. Some common branch opcodes in Thumb-2 are:
- B – Unconditional branch
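- B<cond> – Conditional branch based on status flags (for example BEQ, BNE)
- CBZ/CBNZ – Compare and Branch on Zero/Non-Zero
- BL/BLX – Function calls
- BX – Branch to the address in a register (used for function return)
The B and BL opcodes encode a signed offset, relative to the program counter, that specifies the branch target address. On the Cortex-M4, BLX and BX take the target address from a register instead. Conditional branches check the status flags set by previous instructions and branch accordingly.
The CBZ/CBNZ opcodes compare a register value against zero and branch based on the result, without changing the status flags. They can only branch forward over a short distance and only accept the low registers (R0-R7), which makes them a good fit for null checks and loop exits. Thumb-2 has no test-bit-and-branch opcode (TBZ/TBNZ exist only in the 64-bit A64 instruction set); to branch on a single bit, use TST followed by a conditional branch.
In addition to branches, the M4 includes control and hint instructions such as If-Then (IT) for conditional execution of up to four instructions, breakpoint (BKPT), no operation (NOP), wait for interrupt/event (WFI, WFE) and the memory barriers (DMB, DSB, ISB).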
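The PC-relative offset used by B can be illustrated with a short C model. This is a sketch for the 16-bit unconditional branch only (encoding T2, bit pattern 11100 followed by an 11-bit immediate), assuming the usual Thumb rule that the PC reads as the instruction address plus 4. The function name is hypothetical.

```c
#include <stdint.h>

/* Computes the target of a 16-bit unconditional branch (B, encoding T2).
 * The low 11 bits hold a halfword offset, so the byte offset is
 * imm11 shifted left by one and sign-extended from 12 bits. */
uint32_t thumb_b_t2_target(uint32_t insn_addr, uint16_t hw)
{
    int32_t offset = (int32_t)(hw & 0x7FFu) << 1;

    if (offset & 0x800) {
        offset -= 0x1000;          /* sign-extend from bit 11 */
    }
    /* In Thumb state the PC reads as the instruction address + 4. */
    return insn_addr + 4u + (uint32_t)offset;
}
```

With this model the opcode 0xE7FE decodes to an offset of -4, so the branch targets its own address, which is why it is the familiar encoding of an infinite loop.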
Data Processing Instructions
Data processing instructions operate on register values or immediate constants. Common data processing opcodes are:
- ADD/SUB – Addition & Subtraction
- ADC/SBC – Addition & Subtraction with Carry
- AND/ORR – Logical AND & OR
- EOR – Logical Exclusive OR
- LSL/LSR – Logical Shift Left/Right
- ASR – Arithmetic Shift Right
- CMP/CMN – Compare & Compare Negative
- MOV/MVN – Move and Move Not
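These provide basic arithmetic, logical, shift and move capabilities. The status flags (N, Z, C, V) are always updated by the compare instructions, and by the other instructions when the S suffix is used (for example ADDS or SUBS). Most 16-bit encodings set the flags by default when used outside an IT block. The flags then drive conditional branches and conditional execution.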
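As an illustration of what the flags mean, the following C sketch models the result and the N, Z, C and V flags of a 32-bit flag-setting add. It is a host-side model for study, not code intended for the target, and the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t result;
    bool n;   /* negative: bit 31 of the result */
    bool z;   /* zero: result is 0 */
    bool c;   /* carry: unsigned overflow out of bit 31 */
    bool v;   /* overflow: signed result does not fit in 32 bits */
} adds_result_t;

/* Models ADDS Rd, Rn, Rm on 32-bit operands. */
adds_result_t adds_model(uint32_t a, uint32_t b)
{
    adds_result_t r;

    r.result = a + b;                        /* wraps modulo 2^32 */
    r.n = (r.result >> 31) != 0;
    r.z = (r.result == 0);
    r.c = (r.result < a);                    /* wrapped, so a carry occurred */
    /* Signed overflow: operands share a sign that the result does not. */
    r.v = ((~(a ^ b) & (a ^ r.result)) >> 31) != 0;
    return r;
}
```

A conditional branch such as BEQ tests Z, BCS tests C and BVS tests V, so adding 1 to 0xFFFFFFFF in this model sets both Z and C.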
In addition, 32-bit multiply (MUL, MLA), long multiply (SMULL, UMULL) and hardware divide (SDIV, UDIV) instructions are included for integer math, along with saturating arithmetic instructions (QADD, QSUB, QDADD, SSAT, USAT) that clamp results to the minimum or maximum representable value instead of wrapping around on overflow.
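The difference between saturating and wrapping can be shown with a small C model of the QADD result. This is a sketch of the arithmetic only: the real instruction also sets the sticky Q flag when it saturates, which is not modeled here. The function name is hypothetical.

```c
#include <stdint.h>

/* Models the result of QADD: a signed 32-bit add that clamps to
 * INT32_MIN or INT32_MAX instead of wrapping around. */
int32_t qadd_model(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */

    if (sum > INT32_MAX) {
        return INT32_MAX;
    }
    if (sum < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)sum;
}
```

On the target, the same operation is a single instruction, and CMSIS-Core exposes it to C code as the `__QADD` intrinsic.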
Load/Store Instructions
Load/store instructions move data between registers and memory. The most common load/store opcodes are:
- LDR – Load register from memory
- STR – Store register to memory
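These come in multiple flavors such as LDRB/STRB (8-bit), LDRH/STRH (16-bit) and LDRD/STRD (two 32-bit registers), alongside LDM/STM and PUSH/POP for transferring several registers at once. Addressing modes include immediate offset, register offset, pre-indexed and post-indexed.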
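The three immediate addressing modes differ only in which address is used and whether the base register is written back. The C sketch below models them over an array of words standing in for memory. It assumes word-aligned byte addresses, and all names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t value;      /* word loaded into Rt */
    uint32_t rn_after;   /* base register Rn after the instruction */
} ldr_result_t;

/* LDR Rt, [Rn, #imm]   offset: address = Rn + imm, Rn unchanged */
ldr_result_t ldr_offset(const uint32_t *mem, uint32_t rn, int32_t imm)
{
    ldr_result_t r = { mem[(rn + (uint32_t)imm) / 4u], rn };
    return r;
}

/* LDR Rt, [Rn, #imm]!  pre-indexed: address = Rn + imm, Rn = address */
ldr_result_t ldr_pre_indexed(const uint32_t *mem, uint32_t rn, int32_t imm)
{
    uint32_t addr = rn + (uint32_t)imm;
    ldr_result_t r = { mem[addr / 4u], addr };
    return r;
}

/* LDR Rt, [Rn], #imm   post-indexed: address = Rn, then Rn = Rn + imm */
ldr_result_t ldr_post_indexed(const uint32_t *mem, uint32_t rn, int32_t imm)
{
    ldr_result_t r = { mem[rn / 4u], rn + (uint32_t)imm };
    return r;
}
```

Post-indexed addressing is what a compiler typically reaches for when translating a C expression such as `*p++`, because the load and the pointer update fit in one instruction.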
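Exclusive and unprivileged load/store variants (LDREX, STREX, LDRT, STRT) are provided for exclusive access and for accesses performed with unprivileged permissions. The Cortex-M4 has no single-instruction atomic read-modify-write opcodes (LDADD and LDSET belong to the 64-bit A-profile architecture). Atomic updates are instead built from a loop that loads with LDREX, modifies the value, and retries if the matching STREX reports that the exclusive access was lost.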
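In C, the usual way to get such a retry loop is through C11 atomics rather than hand-written assembly. The sketch below increments a shared counter with a compare-exchange loop whose shape mirrors an LDREX/STREX sequence. It assumes a toolchain with `<stdatomic.h>`; compilers targeting ARMv7-M typically lower it to LDREX/STREX, though the exact code generated is up to the compiler. The counter and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter shared between an interrupt handler and main code. */
_Atomic uint32_t event_count;

/* Atomically increments the counter and returns its previous value. */
uint32_t event_count_increment(void)
{
    uint32_t old = atomic_load_explicit(&event_count, memory_order_relaxed);

    /* Retry until the store succeeds without interference, like a
     * load-exclusive / store-exclusive loop. On failure, `old` is
     * refreshed with the value currently in memory. */
    while (!atomic_compare_exchange_weak_explicit(&event_count, &old, old + 1u,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* nothing to do: loop and try again with the updated `old` */
    }
    return old;
}

/* Reads the current counter value. */
uint32_t event_count_read(void)
{
    return atomic_load_explicit(&event_count, memory_order_acquire);
}
```

A plain `atomic_fetch_add` would express the same operation more briefly; the explicit loop is used here only to make the retry structure visible.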
Floating Point Instructions
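The Cortex-M4 can optionally include a single precision floating point unit (FPU), in which case the core is often called the Cortex-M4F. The FPU implements the FPv4-SP extension and has 32 separate 32-bit FP registers (S0-S31). Key floating point opcodes are:
- VLDR/VSTR – Load/Store FP register
- VLDM/VSTM – Load/Store multiple FP registers
- VMUL/VDIV/VADD/VSUB – FP arithmetic (written with the .F32 suffix)
- VCMP – FP compare
- VCVT – Convert between float and integer
- VMOV – Move between core registers and FP registers
Older toolchains used pre-UAL names such as FLDS and FSTS for the same load/store instructions. On parts that include the FPU, these instructions provide efficient single precision math, while double precision is still handled in software.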
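To show what a float-to-integer conversion does at the edges, here is a C model of the signed 32-bit conversion performed by VCVT.S32.F32. It assumes the default behaviour of that variant: round toward zero, saturate out-of-range inputs, and return 0 for NaN. The exception flags the hardware sets in FPSCR are not modeled, and the function name is hypothetical.

```c
#include <stdint.h>

/* Models the value produced by VCVT.S32.F32 (float to signed int). */
int32_t vcvt_s32_f32_model(float x)
{
    if (x != x) {                    /* only NaN compares unequal to itself */
        return 0;
    }
    if (x >= 2147483648.0f) {        /* 2^31 and above saturate */
        return INT32_MAX;
    }
    if (x <= -2147483648.0f) {       /* -2^31 and below saturate */
        return INT32_MIN;
    }
    return (int32_t)x;               /* in range: C truncates toward zero */
}
```

The range checks matter in portable C because a plain cast of an out-of-range float is undefined behaviour, whereas the instruction has a defined saturating result.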
DSP and SIMD Instructions
SIMD (Single Instruction Multiple Data) instructions allow parallel operation on multiple data elements packed into registers. The Cortex-M4 does not implement Advanced SIMD (NEON), which belongs to the Cortex-A profile and uses dedicated wide registers. Instead, its DSP extension packs two 16-bit or four 8-bit values into the ordinary 32-bit general-purpose registers, with opcodes like:
- SADD16/UADD8 – Add packed 16-bit/8-bit integers
- QADD16/QSUB16 – Saturating add/subtract on packed 16-bit integers
- SMUAD/SMLAD – Dual 16-bit multiply with add (and accumulate)
- USAD8 – Sum of absolute differences over four bytes
- SEL – Select bytes according to the GE flags
This allows significant acceleration for audio and signal processing workloads on the Cortex-M4.
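The idea of lanes packed into a general-purpose register can be sketched in C with a model of the SADD16 result. Each 16-bit half is added independently and wraps on its own, so a carry out of the low lane never reaches the high lane. The GE flags that the real instruction also updates are not modeled, and the function name is hypothetical.

```c
#include <stdint.h>

/* Models the result of SADD16: two independent 16-bit additions
 * performed on the halves of 32-bit registers. */
uint32_t sadd16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;           /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* high lane, wraps at 16 bits */

    return hi | lo;
}
```

For example, adding 0x0001FFFF and 0x00010001 gives 0x00020000: the low lane wraps from 0xFFFF to 0x0000 while the high lane becomes 2. On the target, CMSIS-Core exposes the instruction to C code as the `__SADD16` intrinsic.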
Supervisor Call and Coprocessor Instructions
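The M4 provides the supervisor call (SVC) instruction, and the ARMv7-M architecture reserves encodings for coprocessor instructions such as CDP:
- SVC – Generate a supervisor call exception
- CDP – Coprocessor data operation
SVC raises the SVCall exception, which moves the processor from thread mode into handler mode. Handler mode is always privileged, so operating systems use SVC to let unprivileged code request services. On the Cortex-M4 the only coprocessor numbers in use are CP10 and CP11, which belong to the FPU. The core has no interface for custom coprocessors, and coprocessor instructions that target other numbers raise a UsageFault.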
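An SVC instruction carries an 8-bit service number in its low byte (the 16-bit encoding is 0xDF followed by the immediate). A handler typically recovers it by reading the halfword located two bytes before the stacked return address. The C sketch below shows only the decoding step, with a hypothetical function name, so it can be studied on any host.

```c
#include <stdint.h>

/* Extracts the service number from a 16-bit SVC opcode.
 * Returns -1 if the halfword is not an SVC instruction. */
int svc_number_from_opcode(uint16_t hw)
{
    if ((hw & 0xFF00u) != 0xDF00u) {
        return -1;                   /* top byte must be 0xDF */
    }
    return (int)(hw & 0x00FFu);      /* imm8 is the service number */
}
```

For instance, `SVC #5` assembles to 0xDF05, which this function decodes as service 5.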
Coding for the Cortex-M4
Now that we have seen the key Thumb-2 opcodes, here are some tips for coding effective Cortex-M4 assembly and C programs:
- Use 16-bit Thumb instructions whenever possible for best code density
- Use 32-bit encodings where they save instructions, such as wide immediates, high registers, divide, bit-field and DSP operations
- Take advantage of conditional execution with IT blocks to replace short forward branches
- Use the exclusive load/store instructions (LDREX/STREX) for safe shared memory access
- Use the DSP extension's SIMD instructions for parallel processing of packed 8-bit and 16-bit data
- Inline assembly or intrinsic functions can optimize key functions
Profiling tools can identify hotspots to focus optimization work. By applying these techniques, developers can fully harness the performance and functionality of the Cortex-M4 CPU.
Conclusion
The ARM Thumb-2 instruction set provides a versatile combination of 16-bit and 32-bit opcodes to balance code density and performance. Core data processing, branch and control flow, load/store, optional floating point, DSP/SIMD and other instructions give the Cortex-M4 a strong feature set for embedded applications.
We have explored the key opcodes and features of the Thumb-2 ISA. With this understanding of the instruction set, developers can write optimized Cortex-M4 code to take full advantage of the processor capabilities.