ARM Cortex-M4 microcontrollers have built-in support for unaligned memory access, allowing data to be accessed from memory addresses that are not aligned to the size of the data type. This provides flexibility for programmers and helps reduce code size by eliminating the need for explicit alignment in some situations.
What is Unaligned Memory Access?
Unaligned memory access refers to reading or writing data types such as shorts, ints and longs at memory addresses that are not evenly divisible by the size of the data type. For example, accessing a 4-byte integer at address 0x1003 instead of a 4-byte aligned address like 0x1000. This is in contrast to aligned access, where each data type is only accessed at addresses that are multiples of its size.
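The definition above can be expressed as a one-line check. This is a minimal sketch; the helper name is_aligned is hypothetical, and it assumes the size is a power of two, which holds for the usual 1, 2, 4 and 8-byte types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when addr is a multiple of size.
   Assumes size is a power of two (1, 2, 4, 8, ...). */
static bool is_aligned(uintptr_t addr, size_t size)
{
    return (addr & (uintptr_t)(size - 1u)) == 0u;
}
```

With this helper, is_aligned(0x1000, 4) is true and is_aligned(0x1003, 4) is false, matching the example addresses above.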
Many processors either fault on unaligned accesses or handle them slowly. ARM Cortex-M4 handles unaligned halfword and word accesses in hardware for most single load and store instructions, so software does not have to guarantee alignment for them.
Why Unaligned Accesses Occur
There are several common situations where unaligned memory access occurs:
- Accessing fields within packed structs: Since each field starts right after the previous one, they are often unaligned.
- Typecasting pointers to other data types: The pointer may not be aligned to the new data type.
- Accessing external data formats like network packets or filesystems where alignment is not guaranteed.
- Overlaying structs on top of contiguous buffers.
Requiring aligned access in these situations costs memory for padding and code for alignment checks or byte-by-byte copies. ARM Cortex-M4’s built-in support avoids much of this.
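The packed-struct case is easy to see from field offsets. The sketch below assumes a GCC or Clang compiler, since __attribute__((packed)) is their syntax; the struct names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: on typical ABIs the compiler pads after len
   so that addr starts on a 4-byte boundary. */
struct pkt_natural {
    uint8_t  len;
    uint32_t addr;
};

/* Packed layout (GCC/Clang attribute): no padding is inserted,
   so addr starts immediately after len and is unaligned. */
struct __attribute__((packed)) pkt_packed {
    uint8_t  len;
    uint32_t addr;
};
```

Comparing offsetof(struct pkt_natural, addr) with offsetof(struct pkt_packed, addr) shows where the padding went: the packed version saves three bytes per struct at the price of an unaligned field.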
Handling Unaligned Accesses
The ARMv7E-M architecture that Cortex-M4 implements provides hardware mechanisms to support unaligned accesses:
Unaligned Data Load/Store
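The single load and store instructions LDR, STR, LDRH, LDRSH and STRH accept unaligned addresses when CCR.UNALIGN_TRP is 0. The suffix selects the access size and sign extension, not the alignment: B is a byte, H a halfword, SB a sign-extended byte and SH a sign-extended halfword. Byte accesses such as LDRB are always aligned. Multiple and doubleword transfers (LDM, STM, LDRD, STRD, PUSH, POP) and exclusive accesses (LDREX, STREX) always require aligned addresses and fault otherwise.
For example:
    LDRH  R1, [R2] ; 16-bit load, zero-extended; R2 may be odd
    LDRSH R1, [R2] ; 16-bit load, sign-extended; R2 may be odd
Both forms load a halfword from any byte address into the register. They differ only in how the upper 16 bits of the register are filled.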
Unaligned Single Load/Store
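Single data loads and stores such as LDR and STR accept unaligned addresses as well. The B, H, SB and SH suffixes select the access size and sign extension, not an alignment mode:
    LDR  R1, [R2] ; 32-bit load, R2 need not be a multiple of 4
    LDRH R1, [R2] ; 16-bit load, R2 need not be even
In both cases R2 may hold any byte address, and the hardware handles the misalignment.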
Hardware Decomposing
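For unaligned loads and stores, the Cortex-M4 hardware automatically splits the access into two or more aligned bus transfers. For example, with little-endian memory:
    0x1000: 0x12 0x34 0x56 0x78
    0x1004: 0xAA 0xBB 0xCC 0xDD
An unaligned LDR at 0x1002 reads the four bytes at 0x1002 to 0x1005:
    1. Read 0x56 0x78 from the word at 0x1000
    2. Read 0xAA 0xBB from the word at 0x1004
    3. Combine them into the register value 0xBBAA7856
The decomposition is transparent to software for the instructions that support unaligned access, but each extra bus transfer costs additional cycles.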
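The decomposition can be modelled in C to make the byte movement concrete. This is a conceptual sketch only, assuming little-endian memory represented as an array of aligned words. The function names are hypothetical, and the real bus interface may choose different transfer sizes than the two word reads used here.

```c
#include <stdint.h>

/* Model: mem_words[i] holds the little-endian word at byte address 4*i.
   addr must be a multiple of 4. */
static uint32_t load_aligned_word(const uint32_t *mem_words, uint32_t addr)
{
    return mem_words[addr / 4u];
}

/* Emulates an unaligned 32-bit load using only aligned word reads.
   The caller must ensure the word following addr exists in the model. */
static uint32_t load_unaligned_word(const uint32_t *mem_words, uint32_t addr)
{
    uint32_t offset = addr & 3u;   /* byte offset inside the first word */
    uint32_t base   = addr & ~3u;  /* address of the first aligned word */
    uint32_t lo     = load_aligned_word(mem_words, base);

    if (offset == 0u) {
        return lo;                 /* already aligned: one transfer */
    }

    uint32_t hi    = load_aligned_word(mem_words, base + 4u);
    uint32_t shift = offset * 8u;

    /* Low bytes come from the top of the first word,
       high bytes from the bottom of the second word. */
    return (lo >> shift) | (hi << (32u - shift));
}
```

With the model memory { 0x78563412, 0xDDCCBBAA }, which holds the bytes 12 34 56 78 AA BB CC DD, a load at byte address 2 returns 0xBBAA7856.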
Unaligned Access Exceptions
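With the default configuration, unaligned halfword and word accesses by LDR, STR, LDRH, LDRSH and STRH complete normally on Cortex-M4. Two cases raise a UsageFault instead. First, instructions that always require alignment, such as LDM, STM, LDRD, STRD and the exclusive accesses, fault on an unaligned address regardless of configuration. Second, setting the UNALIGN_TRP bit in the Configuration and Control Register (CCR) makes every unaligned halfword or word access fault:
    CCR.UNALIGN_TRP = 0; // No trap (reset default)
    CCR.UNALIGN_TRP = 1; // Trap unaligned halfword and word accesses
With the trap enabled, an unaligned LDR or STR raises a UsageFault with the UNALIGNED flag set in the UsageFault Status Register (UFSR). If the UsageFault handler is not enabled, the fault escalates to HardFault. Enabling the trap is a practical way to find unintended unaligned accesses during development.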
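The trap configuration comes down to one bit in CCR and one status flag. The sketch below uses the ARMv7-M bit positions (UNALIGN_TRP is CCR bit 3; UNALIGNED is UFSR bit 8, which appears as bit 24 of the combined CFSR register). The helper names and macros are hypothetical, and the register is passed as a pointer so the logic can be tried off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks per the ARMv7-M architecture (macro names are illustrative). */
#define CCR_UNALIGN_TRP_MASK  (1u << 3)   /* CCR.UNALIGN_TRP            */
#define CFSR_UNALIGNED_MASK   (1u << 24)  /* UFSR.UNALIGNED within CFSR */

/* Hypothetical helper: enable or disable trapping of unaligned accesses. */
static void set_unalign_trap(volatile uint32_t *ccr, bool enable)
{
    if (enable) {
        *ccr |= CCR_UNALIGN_TRP_MASK;
    } else {
        *ccr &= ~CCR_UNALIGN_TRP_MASK;
    }
}

/* Hypothetical helper: true if a captured CFSR value reports
   an unaligned-access UsageFault. */
static bool fault_was_unaligned(uint32_t cfsr)
{
    return (cfsr & CFSR_UNALIGNED_MASK) != 0u;
}
```

On a real device the pointer would refer to the CCR register at 0xE000ED14 (SCB->CCR in CMSIS), and the fault handler would read CFSR at 0xE000ED28 (SCB->CFSR in CMSIS).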
Performance of Unaligned Accesses
Unaligned accesses on Cortex-M4 are usually cheap but not free: each one is split into multiple bus transfers, so it takes more cycles than the equivalent aligned access. The cost matters most in these scenarios:
- Sequential access: Aligned accesses in a tight loop avoid paying the decomposition penalty on every iteration
- Flash access: Many flash controllers expect aligned writes of a fixed width, so check the device reference manual before programming from unaligned addresses
- External bus: An unaligned access may need multiple bus transfers
So while unaligned access support removes the need for explicit alignment in much code, aligning data where possible still gives the best performance.
Using Unaligned Accesses in Code
Here are some ways unaligned access support can be leveraged during programming:
- Avoid casting pointers to integers for alignment checks
- Use packed structs for memory efficiency rather than aligning fields
- Overlay buffers without worrying about alignment
- No need to align external data before accessing
- Fetch data types from unaligned addresses directly
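Some examples:
    // Packed struct (__packed is assumed to be a toolchain macro for the packed attribute)
    struct __packed { uint8_t len; uint32_t addr; } pkt;
    // Overlay buffer (the hardware tolerates a misaligned buf, but the cast is not portable C)
    uint8_t buf[10];
    uint16_t *data = (uint16_t *)buf;
    // No alignment needed
    process_data(rx_buff);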
So unaligned access support directly helps reduce code size and complexity in many common situations.
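When fetching a value directly from a byte buffer, a common and portable way to express it in C is memcpy into a local variable. This is a sketch with hypothetical helper names. On a Cortex-M4 build that permits unaligned access, compilers typically reduce each call to a single LDR or STR, but that is a property of the compiler and its options, not something the code guarantees.

```c
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value in native byte order from any byte address. */
static uint32_t read_u32(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   /* no alignment requirement on p */
    return v;
}

/* Writes a 32-bit value in native byte order to any byte address. */
static void write_u32(uint8_t *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);   /* no alignment requirement on p */
}
```

For example, write_u32(rx_buff + 1, value) followed by read_u32(rx_buff + 1) round-trips the value even though the address is odd. Here rx_buff stands for any uint8_t buffer with at least 5 bytes.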
Effects on Code Density
A key benefit of unaligned support is reducing code size by avoiding explicit alignment in code. Here are some common code patterns that are often unnecessary on Cortex-M4:
- Pointer casts and checks for alignment
- Aligning stack variables
- Padding structs
- Intermediate copying of data to align
- Specialized aligned and unaligned versions of functions
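For example, this code fragment rounds a byte buffer pointer up to the next 4-byte boundary before accessing it:
    uint8_t buf[16];
    uint32_t *p = (uint32_t *)(((uintptr_t)buf + 3) & ~(uintptr_t)3);
    *p = 0x12345678;
With unaligned support, the word can be written at any offset without rounding the pointer:
    uint8_t buf[16];
    uint32_t value = 0x12345678;
    memcpy(&buf[1], &value, sizeof value); // typically compiled to a single unaligned STR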
By removing such alignment handling code, overall code density improves.
Use of Unaligned Accesses by Compilers
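Modern ARM compilers like armclang and gcc generate unaligned accesses for packed data when the target supports them. For example:
    struct __attribute__((packed)) s { uint8_t len; uint32_t addr; } pkt;
    pkt.addr = 0x12345; // addr is at offset 1, so this compiles to an unaligned store
Without the packed attribute the compiler pads after len, and the store is aligned.
Compilers do this with the ordinary LDR, LDRH, STR and STRH instructions; there are no special unaligned opcodes. GCC’s -munaligned-access option, which is enabled by default for ARMv7-M targets, permits the compiler to emit such accesses, and -mno-unaligned-access makes it fall back to byte-wise accesses. So code that already uses packed data gets the code density benefit without source changes.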
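One way to hand the alignment decision to the compiler is a single-member packed wrapper struct. The sketch below assumes GCC or Clang, because it relies on their packed and may_alias attributes; the type and function names are hypothetical. The compiler then picks a single unaligned access or byte-wise code according to the target and the unaligned-access option.

```c
#include <stdint.h>

/* Packed wrapper: tells the compiler the value may sit at any address.
   may_alias lets the wrapper legally overlay a byte buffer. */
struct __attribute__((packed, may_alias)) unaligned_u32 {
    uint32_t value;
};

/* Loads a 32-bit value in native byte order from any address. */
static uint32_t load_u32_packed(const void *p)
{
    return ((const struct unaligned_u32 *)p)->value;
}

/* Stores a 32-bit value in native byte order to any address. */
static void store_u32_packed(void *p, uint32_t v)
{
    ((struct unaligned_u32 *)p)->value = v;
}
```

Because the alignment information lives in the type, the same source builds correctly whether or not the target tolerates unaligned accesses.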
Conclusion
The ARM Cortex-M4 directly supports unaligned halfword and word accesses for single loads and stores. This avoids much of the manual alignment handling in code, improving code density. Hardware decomposition keeps the cost to a few extra cycles per access, while multiple, doubleword and exclusive transfers still require aligned addresses. Compilers use this support automatically for packed data, which also improves code density. Overall, unaligned access support helps reduce code size and complexity for ARM Cortex-M4-based microcontrollers, as long as the instructions that still require alignment are kept in mind.