Unaligned memory accesses refer to accessing data at memory addresses that are not multiples of the data size. For example, accessing a 4-byte integer at address 0x1003 is an unaligned access because the address is not a multiple of 4 bytes. ARMv8 processors handle unaligned accesses differently than previous ARM architectures.
In early ARM architectures (before ARMv6), unaligned accesses were generally unsupported: an unaligned load would either rotate the loaded data or raise an alignment fault exception. ARMv6 and ARMv7 added hardware support for unaligned single-register loads and stores, but multi-register transfers such as LDM/STM still required alignment, so software often had to align data explicitly before accessing it. Doing this alignment in software carries a performance cost.
ARMv8 takes a different approach and supports unaligned accesses directly in hardware. This removes the software overhead of aligning data. However, there are still some caveats to keep in mind with ARMv8 unaligned accesses:
Loads and Stores
ARMv8 allows unaligned loads and stores for ordinary (non-exclusive, non-atomic) accesses to Normal memory. For example, a 4-byte integer load from address 0x1003 executes as a single load instruction at the unaligned address. However, performance is optimal when data accesses are aligned.
Unaligned loads may cross cache line boundaries and result in more than one cache line being read. This can reduce performance compared to an aligned load within a single cache line. Unaligned stores also have a performance penalty if they cross cache line boundaries.
Atomic and Exclusive Accesses
ARMv8 requires exclusive and atomic memory accesses, such as load-exclusive/store-exclusive, to be naturally aligned. Unaligned exclusive or atomic accesses will fault. This maintains expected atomicity and exclusivity semantics.
Floating Point Loads
ARMv8 allows unaligned loads of 32-bit and 64-bit floating point data from Normal memory. Unaligned 128-bit (Q-register) loads, such as for a {double, double} vector, are also architecturally permitted to Normal memory when strict alignment checking (SCTLR_ELx.A) is disabled, but they are not guaranteed to be single-copy atomic and typically pay a larger performance penalty. With strict alignment checking enabled, or when the access targets Device memory, an unaligned 128-bit load generates an alignment fault, so performance-sensitive code should still keep such data 16-byte aligned.
SIMD Loads and Stores
SIMD loads and stores support unaligned access in ARMv8. For example, a SIMD vector load or store can start at an arbitrary byte address. However, performance is optimal when SIMD data is aligned to its natural alignment.
SIMD loads and stores may cross cache line or page boundaries and be split into multiple separate accesses. Accessing SIMD data that crosses these boundaries will impact performance. Aligning SIMD data to cache line and page boundaries can improve performance.
Instruction Fetches
ARMv8 requires instruction fetches to be aligned. In AArch64 state, instruction addresses must be 4-byte aligned; a branch or jump to a misaligned address raises a PC alignment fault. This avoids complex logic to handle unaligned instruction fetches.
TLB Mappings
ARMv8 translates virtual addresses to physical addresses through the MMU, with recently used translations cached in the Translation Lookaside Buffer (TLB). The smallest translation granule is 4KB (16KB and 64KB granules are also supported), so virtual addresses map to physical pages of at least 4KB.
If an unaligned access crosses a 4KB page boundary, it results in accesses to two separate physical pages. This requires two TLB lookups instead of one, hurting performance. Aligning data to 4KB page boundaries can avoid this.
Unaligned Faults
Even though ARMv8 supports unaligned accesses, there are cases where an unaligned access may still fault:
- Atomic or exclusive accesses must be aligned
- 128-bit loads/stores must be 16-byte aligned when strict alignment checking (SCTLR_ELx.A) is enabled or the access targets Device memory
- Instruction fetches must be 4-byte aligned
- An access that crosses a region with different memory attributes or permissions may fault
If a fault occurs, it generates an Alignment fault exception. The faulting address is captured in the Fault Address Register (FAR_ELx). Software must handle the alignment fault and, if needed, emulate the required unaligned behavior.
Performance Impact
Allowing unaligned accesses avoids software overhead to align data. However, unaligned accesses can still hurt performance in certain cases:
- Unaligned loads/stores may cross cache line boundaries and reduce cache efficiency
- Unaligned SIMD accesses may cross cache or page boundaries, requiring multiple separate memory accesses
- Unaligned accesses may require two TLB lookups instead of one if crossing 4KB page boundaries
In performance sensitive code, aligning data and accesses to match the access size, cache lines, pages, and other architecture features will provide optimal performance. Unaligned accesses should be avoided where possible in hot code paths.
Compiler Handling
Compilers generate both aligned and unaligned accesses depending on context. AArch64 has no dedicated unaligned-access opcodes: ordinary load/store instructions such as LDR/STR (and their unscaled-offset variants LDUR/STUR) accept unaligned addresses to Normal memory, so the compiler simply emits them and lets the hardware handle alignment.
For SIMD intrinsics, the compiler may generate an unaligned access or use inline logic to emulate an unaligned access using aligned vector loads/stores. This is transparent to the programmer.
The compiler may also generate unaligned accesses where it determines the cost is low. For example, it may merge several adjacent narrow loads or stores into one wider, potentially unaligned access to reduce instruction count.
Summary
ARMv8 broadly supports unaligned accesses in hardware, avoiding software alignment overhead. However, performance is optimal when memory accesses match the alignment of architecture features like cache lines. Unaligned vector and SIMD accesses in particular can hurt performance.
Compilers mitigate unaligned access performance issues in most cases, but for performance critical software, aligning data and accesses manually can help. Watch for alignment faults on atomic and exclusive accesses, and on any unaligned access when strict alignment checking is enabled or Device memory is involved. Handle faults gracefully and emulate the unaligned behavior if needed.