Unaligned memory accesses refer to accessing data at memory addresses that are not multiples of the data size. For example, accessing a 4-byte integer at address 0x1003 is an unaligned access because the address is not a multiple of 4 bytes. ARMv8 processors handle unaligned accesses differently than previous ARM architectures.
In early ARM architectures (before ARMv6), unaligned accesses were generally unsupported: an unaligned load would either rotate the loaded data or raise an alignment fault exception. ARMv6 and ARMv7 added hardware support for unaligned single-register loads and stores, but multi-register transfers such as LDM/STM still required alignment, so software often had to align data explicitly before accessing it. Doing this alignment in software carries a performance cost.
ARMv8 takes a different approach and supports unaligned accesses directly in hardware. This removes the software overhead of aligning data. However, there are still some caveats to keep in mind with ARMv8 unaligned accesses:
Loads and Stores
ARMv8 allows unaligned loads and stores for ordinary (non-exclusive, non-atomic) accesses to Normal memory. For example, a 4-byte integer load from address 0x1003 executes as a single load instruction at the unaligned address. However, performance is optimal when data accesses are aligned.
Unaligned loads may cross cache line boundaries and result in more than one cache line being read. This can reduce performance compared to an aligned load within a single cache line. Unaligned stores also have a performance penalty if they cross cache line boundaries.
Atomic and Exclusive Accesses
ARMv8 requires exclusive and atomic memory accesses, such as load-exclusive/store-exclusive, to be naturally aligned. Unaligned exclusive or atomic accesses will fault. This maintains expected atomicity and exclusivity semantics.
Floating Point Loads
ARMv8 allows unaligned loads of 32-bit and 64-bit floating point data from Normal memory. Unaligned 128-bit (Q-register) loads, such as for a {double, double} vector, are also architecturally permitted to Normal memory when strict alignment checking (SCTLR_ELx.A) is disabled, but they are not guaranteed to be single-copy atomic and typically pay a larger performance penalty. With strict alignment checking enabled, or when the access targets Device memory, an unaligned 128-bit load generates an alignment fault, so performance-sensitive code should still keep such data 16-byte aligned.
SIMD Loads and Stores
SIMD loads and stores support unaligned access in ARMv8. For example, a SIMD vector load or store can start at an arbitrary byte address. However, performance is optimal when SIMD data is aligned to its natural alignment.
SIMD loads and stores may cross cache line or page boundaries and be split into multiple separate accesses. Accessing SIMD data that crosses these boundaries will impact performance. Aligning SIMD data to cache line and page boundaries can improve performance.
Instruction Fetches
ARMv8 requires instruction fetches to be aligned. In AArch64 state, instruction addresses must be 4-byte aligned; a branch or jump to a misaligned address raises a PC alignment fault. This avoids complex logic to handle unaligned instruction fetches.
TLB Mappings
ARMv8 translates virtual addresses to physical addresses through the MMU, with recently used translations cached in the Translation Lookaside Buffer (TLB). The smallest translation granule is 4KB (16KB and 64KB granules are also supported), so virtual addresses map to physical pages of at least 4KB.
If an unaligned access crosses a 4KB page boundary, it results in accesses to two separate physical pages. This requires two TLB lookups instead of one, hurting performance. Aligning data to 4KB page boundaries can avoid this.
Unaligned Faults
Even though ARMv8 supports unaligned accesses, there are cases where an unaligned access may still fault:
- Atomic or exclusive accesses must be aligned
- 128-bit loads/stores must be 16-byte aligned when strict alignment checking (SCTLR_ELx.A) is enabled or the access targets Device memory
- Instruction fetches must be 4-byte aligned
- An access that crosses a region with different memory attributes or permissions may fault
If a fault occurs, it generates an Alignment fault exception. The faulting address is captured in the Fault Address Register (FAR_ELx). Software must handle the alignment fault and, if needed, emulate the required unaligned behavior.
Performance Impact
Allowing unaligned accesses avoids software overhead to align data. However, unaligned accesses can still hurt performance in certain cases:
- Unaligned loads/stores may cross cache line boundaries and reduce cache efficiency
- Unaligned SIMD accesses may cross cache or page boundaries, requiring multiple separate memory accesses
- Unaligned accesses may require two TLB lookups instead of one if crossing 4KB page boundaries
In performance sensitive code, aligning data and accesses to match the access size, cache lines, pages, and other architecture features will provide optimal performance. Unaligned accesses should be avoided where possible in hot code paths.
Compiler Handling
Compilers generate both aligned and unaligned accesses depending on context. AArch64 has no dedicated unaligned-access opcodes: ordinary load/store instructions such as LDR/STR (and their unscaled-offset variants LDUR/STUR) accept unaligned addresses to Normal memory, so the compiler simply emits them and lets the hardware handle alignment.
For SIMD intrinsics, the compiler may generate an unaligned access or use inline logic to emulate an unaligned access using aligned vector loads/stores. This is transparent to the programmer.
The compiler may also generate unaligned accesses where it determines the cost is low. For example, it may merge several adjacent narrow loads or stores into one wider, potentially unaligned access to reduce instruction count.
Summary
ARMv8 broadly supports unaligned accesses in hardware, avoiding software alignment overhead. However, performance is optimal when memory accesses match the alignment of architecture features like cache lines. Unaligned vector and SIMD accesses in particular can hurt performance.
Compilers mitigate unaligned access performance issues in most cases, but for performance critical software, aligning data and accesses manually can help. Watch for alignment faults on atomic and exclusive accesses, and on any unaligned access when strict alignment checking is enabled or Device memory is involved. Handle faults gracefully and emulate the unaligned behavior if needed.