Unaligned memory access refers to reading or writing data at an address that is not a multiple of the data's natural alignment, which for primitive types is usually the size of the type itself. For example, accessing a 4-byte integer at an address that is not divisible by 4 is unaligned, whether the architecture is 32-bit or 64-bit. In an aligned access, by contrast, the address is a multiple of that alignment.
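The definition above reduces to a simple test on the address. The following is a minimal sketch; `is_aligned` is a hypothetical helper name rather than a standard function, and it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if addr is a multiple of alignment.
   Assumes alignment is a nonzero power of two, so the low bits of an
   aligned address are all zero. */
static bool is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}
```

For a pointer `p`, the same check can be written as `is_aligned((uintptr_t)p, _Alignof(uint32_t))`.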
Most modern CPUs perform best with aligned memory accesses. An unaligned access can be significantly slower because the CPU may need extra operations to retrieve the data, although the size of the penalty varies widely between microarchitectures. There are several reasons why unaligned access leads to a performance penalty:
Crossing Memory Boundaries
When accessing unaligned data, the CPU often has to read from two different aligned blocks of memory and combine the parts to construct the desired data. For example, with 4-byte blocks, reading a 4-byte integer from address 0x10003 means reading the last byte of the block at 0x10000 and the first three bytes of the block at 0x10004. This requires extra internal operations and memory accesses compared to reading aligned data within a single block.
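The arithmetic behind the 0x10003 example can be sketched as follows. `split_info` and `split_access` are hypothetical names introduced for illustration, and the sketch assumes the access is no larger than one block, so it touches at most two blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how one access maps onto aligned blocks. */
typedef struct {
    uintptr_t low_block;  /* address of the aligned block holding the first byte */
    size_t    low_bytes;  /* bytes taken from low_block */
    size_t    high_bytes; /* bytes taken from the following block (0 if none) */
} split_info;

/* Assumes block is a power of two and 0 < size <= block. */
static split_info split_access(uintptr_t addr, size_t size, size_t block)
{
    assert(block != 0 && (block & (block - 1)) == 0);
    assert(size > 0 && size <= block);

    split_info s;
    size_t offset = (size_t)(addr & (block - 1)); /* position inside the block */
    size_t room = block - offset;                 /* bytes left in this block */

    s.low_block = addr - offset;
    s.low_bytes = size < room ? size : room;
    s.high_bytes = size - s.low_bytes;
    return s;
}
```

Calling `split_access(0x10003, 4, 4)` gives a low block of 0x10000 with one byte taken from it and three bytes taken from the next block, matching the example in the text.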
Unoptimized Data Paths
Most CPUs and memory subsystems are optimized for aligned access patterns, and data buses, caches, and prefetchers are typically designed around aligned addresses. Unaligned accesses can disrupt these optimizations and force slower pathways to be used. For example, an unaligned 8-byte quantity may not be transferable in a single operation over a 64-bit data path.
Additional Logic for Realignment
After reading unaligned data from memory, the CPU then has to realign it before usage. This requires shifting and masking logic to put the parts together into the expected aligned format. Similarly, storing unaligned data requires breaking apart the value and masking bits before writing the parts out to memory.
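The shifting and masking described above can be imitated in software. This is only a sketch of the idea, not a description of any particular CPU: `load32_from_words` is a hypothetical function that reads a 32-bit value at an arbitrary byte offset using only whole-word reads, and it assumes little-endian byte numbering within each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: read 32 bits starting at byte_off from an array of
   aligned 32-bit words, using only whole-word reads.
   Assumption: byte k of the data lives in bits 8*(k%4).. of words[k/4]
   (little-endian numbering). */
static uint32_t load32_from_words(const uint32_t *words, size_t nwords,
                                  size_t byte_off)
{
    size_t idx = byte_off / 4;
    unsigned shift = (unsigned)(byte_off % 4) * 8;

    assert(idx < nwords);
    uint32_t lo = words[idx];
    if (shift == 0)
        return lo; /* aligned case; also avoids an undefined shift by 32 */

    assert(idx + 1 < nwords);
    uint32_t hi = words[idx + 1];
    /* Drop the unwanted low bytes of lo, then fill the top from hi. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

The aligned case needs one read and no shifting, while every unaligned case needs two reads, two shifts, and an OR, which is the extra work the text refers to.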
Cache Line Splits
Unaligned data may straddle cache line boundaries and require an extra memory access to retrieve. Most CPUs fetch entire cache lines from memory into the cache. If the unaligned data spans two cache lines, both need to be fetched even if only part of each line is required.
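Whether a given access splits across cache lines can be checked with a little arithmetic. The sketch below assumes a 64-byte cache line, which is common but not universal; `CACHE_LINE_SIZE` and `crosses_cache_line` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check the target CPU's documentation. */
#define CACHE_LINE_SIZE 64u

/* Returns true if the bytes [addr, addr + size) span more than one line. */
static bool crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first_line = addr / CACHE_LINE_SIZE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_SIZE;
    return first_line != last_line;
}
```

Note that an access can be unaligned without splitting a line: a 4-byte read at 0x10001 stays inside one 64-byte line, while the same read at 0x1003E does not.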
Performance Penalty Examples
As an example, some benchmarks on ARM processors have shown unaligned 32-bit integer loads to be 2-3x slower than aligned loads. On x86, unaligned loads were 1.5-2x slower. Stores of unaligned data were similarly penalized. The performance impact also extends to unaligned access of larger data types like 64-bit integers and floats. Figures like these depend heavily on the specific processor and workload.
One benchmark test showed sequential unaligned 4-byte integer reads having a throughput of ~160 MB/s. The same test with aligned reads achieved ~440 MB/s, over 2.7x higher. Optimizations such as SIMD instructions may offer limited benefit for unaligned data.
Unaligned Access in C Programs
In C and C++, every type has an alignment requirement, and the compiler normally inserts padding so that struct members satisfy it. A packed struct with chars followed by ints removes that padding and can leave the ints unaligned. malloc returns memory suitably aligned for any built-in type (the alignment of max_align_t), but it does not guarantee stricter alignments such as a 64-byte cache line or a 32-byte SIMD vector.
Accessing fields within misaligned structs or indexing into buffers with unaligned pointers typically triggers unaligned accesses. This can lead to mysterious performance issues in applications. Care should be taken to enforce proper alignment where needed through techniques like padding fields and using aligned allocation functions.
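The effect of packing on member offsets can be seen with `offsetof`. This sketch uses the GCC/Clang `__attribute__((packed))` extension; the struct names are hypothetical, and the exact size of the padded struct depends on the platform ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default layout: the compiler pads after tag so value is aligned. */
struct natural_record {
    char     tag;
    uint32_t value;
};

/* Packed layout (GCC/Clang extension): no padding, so value starts at
   offset 1 and is misaligned whenever the struct itself is aligned. */
struct __attribute__((packed)) packed_record {
    char     tag;
    uint32_t value;
};

/* Compile-time checks of the two layouts. */
static_assert(offsetof(struct packed_record, value) == 1,
              "packed member follows tag directly");
static_assert(offsetof(struct natural_record, value) % _Alignof(uint32_t) == 0,
              "padded member sits on its natural alignment");
```

The packed struct saves space (5 bytes here), at the cost of every access to `value` being potentially unaligned.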
Support for Unaligned Accesses
Some architectures like x86 silently handle unaligned accesses in hardware, reducing the performance penalty. This may lower motivation for programmers to enforce aligned access patterns in software. However, it comes at a cost of increased hardware complexity in the memory subsystem.
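Other architectures have historically been stricter. Older ARM cores and some embedded profiles, along with many MIPS and SPARC implementations, do not transparently support unaligned access in hardware, and an unaligned access typically results in a processor exception or bus error. Newer ARM cores handle most unaligned loads and stores in hardware, and RISC-V leaves the choice to the implementation, which may handle the access in hardware or trap so that software can emulate it, often slowly. Enforcing aligned access is especially important for portable code intended to run on such architectures.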
Mitigating Unaligned Accesses
There are several techniques programmers can adopt to avoid unaligned memory accesses:
- Access data through objects of its own type (e.g. a uint32_t variable rather than four bytes inside a char buffer) so the compiler can place it at a suitably aligned address
- Allocate data at addresses aligned to its size, such as with aligned_alloc, posix_memalign, or memalign
- Use compiler directives like __attribute__((aligned(N))) to enforce alignments
- Pad structs to align member fields
- Avoid pointer casts that reinterpret byte buffers as wider types, since the resulting pointer may be misaligned
- Use unions to overlay differently aligned data
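As an illustration of the aligned-allocation item in the list above, here is a minimal sketch built on C11 `aligned_alloc`. The wrapper name `alloc_aligned` is hypothetical, and it assumes a platform whose C library provides `aligned_alloc` (some, such as MSVC, do not).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper around C11 aligned_alloc.
   C11 requires size to be a multiple of alignment, so round it up.
   Assumes alignment is a nonzero power of two. Release with free(). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    size_t rounded = (size + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}
```

For example, `alloc_aligned(64, 100)` requests a buffer of at least 100 bytes that starts on a 64-byte boundary. On POSIX systems, `posix_memalign` is an alternative with a different calling convention.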
Checking for alignment before accessing data and handling misalignment separately can also avoid penalizing all accesses. Some architectures provide special unaligned load/store instructions to handle misaligned data more efficiently in hardware.
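When data really can sit at any address, such as a field inside a network packet, a common portable approach in C is to copy it through `memcpy` instead of dereferencing a cast pointer. The sketch below uses hypothetical helper names `load_u32` and `store_u32`; compilers often turn these copies into a single load or store where the hardware allows it, though that is an optimization rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a uint32_t from any address, aligned or not.
   memcpy has no alignment requirement, so this never performs a
   misaligned typed access in the C sense. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write a uint32_t to any address. */
static void store_u32(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

The value is read and written in the host's byte order, so these helpers address alignment only, not endianness.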
Different architectures have varying specific alignment requirements. Some examples:
- x86 – no alignment required for most scalar accesses, although some SSE instructions (e.g. movaps) require 16-byte alignment
- ARM – words aligned to 4-byte addresses
- RISC-V – words aligned to 4-byte addresses
- PowerPC – words aligned to 4-byte addresses
- MIPS – words aligned to 4-byte addresses
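- SPARC – words aligned to 4-byte addresses, doublewords to 8-byte addresses
Larger data types like doubles and long longs usually have proportionally larger alignment requirements, typically 8 bytes, although some 32-bit ABIs (such as System V i386) align them to only 4 bytes. A struct takes the alignment of its most strictly aligned member, and an array takes the alignment of its element type.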
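The alignment a compiler actually assigns can be queried with the C11 `_Alignof` operator. In this sketch `struct sample` is a hypothetical type; the checks are written to hold across common ABIs rather than to assert specific byte counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct mixing members with different alignments. */
struct sample {
    char     tag;
    double   value;
    uint16_t count;
};

/* The struct must be at least as aligned as its strictest member. */
static_assert(_Alignof(struct sample) >= _Alignof(double),
              "struct alignment covers its double member");

/* sizeof is padded to a multiple of the alignment, so every element of
   an array of struct sample is aligned too. */
static_assert(sizeof(struct sample) % _Alignof(struct sample) == 0,
              "size is a multiple of alignment");
```

Printing `_Alignof(double)` and `sizeof(struct sample)` on different targets is a quick way to see how an ABI treats 8-byte types.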
Tools for Detecting Unaligned Accesses
There are tools available to help detect and diagnose unaligned memory access issues:
- Compiler warnings – Enable alignment warnings (e.g. -Wcast-align in GCC and Clang) and treat them as errors
- Sanitizers – UndefinedBehaviorSanitizer (-fsanitize=alignment in GCC and Clang) reports misaligned pointer use at run time
- Debuggers – Can catch alignment fault exceptions
- Disassemblers – View effective address offsets and instruction patterns
- Performance analyzers – Correlate slow functions with unaligned access
- Memory checkers – Valgrind's Memcheck mainly targets invalid and uninitialized accesses, so its help with misaligned accesses is limited
Runtime instrumentation and binary analysis tools can also help uncover unaligned access issues in production software.
Hardware Techniques for Handling Unaligned Accesses
Some techniques hardware can employ to reduce the performance impact of unaligned accesses include:
- Wider memory interfaces – Fetch extra adjacent bytes into caches
- Multi-byte extract instructions – Pick correct bytes from wider registers
- Barrel shifters – Hardware realignment before usage
- Adaptive alignment logic – Automatically detect and realign
- Multi-bank caches – Reduce conflicts from spans
- Speculative accesses – Hide realignment latency
However, these increase cost and complexity. Requiring aligned access in software is usually a better trade-off.
Unaligned memory access can significantly impact performance on many processor architectures. Accessing data at addresses that do not match the alignment of its type forces extra work under the hood in hardware, software, or both. Enforcing aligned access, where possible, lets code run faster and keeps it portable to architectures that fault on unaligned access. The cost is some padding and extra care in data layout.
Understanding alignment requirements and detecting unaligned access issues, particularly in lower level code, continues to be an important consideration for optimization on many platforms.