What is unaligned memory access?

Unaligned memory access refers to reading or writing data at an address that is not a multiple of the data's natural alignment, which is usually its size. For example, accessing a 4-byte integer at an address that is not divisible by 4 is unaligned. In an aligned access, by contrast, the address is a multiple of the data size.
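The definition above can be expressed as a small check. This is a minimal sketch; the helper name is_aligned is hypothetical, and it assumes the alignment is a power of two, which holds for the natural alignments of standard C types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when `addr` is a multiple of `alignment`.
 * `alignment` must be a nonzero power of two (1, 2, 4, 8, ...).
 * is_aligned is an illustrative name, not a standard function. */
bool is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, the low bits of an aligned address are all zero. */
    return (addr & (alignment - 1)) == 0;
}
```

With this helper, address 0x1002 counts as aligned for a 2-byte value but not for a 4-byte one.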

Contents

Crossing Memory Boundaries
Unoptimized Data Paths
Additional Logic for Realignment
Cache Line Splits
Performance Penalty Examples
Unaligned Access in C Programs
Support for Unaligned Accesses
Mitigating Unaligned Accesses
Alignment Requirements
Tools for Detecting Unaligned Accesses
Alternatives to Unaligned Accesses in Hardware
Conclusion

Most modern CPUs perform best with aligned memory access. Accessing unaligned data can slow down processing because the CPU may need extra operations to retrieve the data. There are several reasons why unaligned access leads to a performance penalty:

Crossing Memory Boundaries

When accessing unaligned data, the CPU likely has to read from two different aligned blocks of memory and combine the parts to construct the desired data. For example, to read a 4-byte integer from address 0x10003, the CPU would have to read the last byte from the block at 0x10000 and the first three bytes from 0x10004. This requires extra instructions and memory operations compared to reading aligned data within a single block.
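The arithmetic in that example can be sketched as follows. The struct and function names are hypothetical, and the block size is passed in as a parameter because the width of an aligned block depends on the hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* The aligned blocks covered by one access (illustrative type). */
struct block_span {
    uintptr_t first;  /* base address of the block holding the first byte */
    uintptr_t last;   /* base address of the block holding the last byte */
    size_t count;     /* number of aligned blocks touched */
};

/* Computes which `block`-byte aligned blocks an access of `size` bytes at
 * `addr` touches. `block` must be a power of two and `size` at least 1. */
struct block_span blocks_touched(uintptr_t addr, size_t size, size_t block)
{
    struct block_span s;
    uintptr_t mask = ~(uintptr_t)(block - 1);   /* clears the offset bits */

    s.first = addr & mask;
    s.last = (addr + size - 1) & mask;
    s.count = (size_t)((s.last - s.first) / block) + 1;
    return s;
}
```

For a 4-byte read at 0x10003 with 4-byte blocks, this reports the blocks at 0x10000 and 0x10004, matching the example; the same read at 0x10004 touches a single block.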

Unoptimized Data Paths

Most CPUs and memory subsystems are optimized for aligned access patterns. Data buses, caches, and prefetchers all assume aligned addresses. Unaligned accesses disrupt these optimizations and force slower pathways to be used. For example, certain 64-bit registers and data buses may not be usable for unaligned 8-byte quantities.

Additional Logic for Realignment

After reading unaligned data from memory, the CPU then has to realign it before usage. This requires shifting and masking logic to put the parts together into the expected aligned format. Similarly, storing unaligned data requires breaking apart the value and masking bits before writing the parts out to memory.
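To illustrate the shifting and masking involved, here is a software sketch of what a load does when it must assemble a value from two aligned words. It is an assumption-laden model, not a description of any particular CPU: the function name is hypothetical, and bytes are numbered little-endian within each word (byte 0 is the least significant).

```c
#include <stddef.h>
#include <stdint.h>

/* Reads a 32-bit value that starts `byte_offset` bytes into an array of
 * aligned 32-bit words, using only whole-word reads plus shifts.
 * Assumes little-endian byte numbering within each word, and that
 * words[byte_offset / 4 + 1] exists when byte_offset is not a multiple of 4. */
uint32_t load_u32_by_shifting(const uint32_t *words, size_t byte_offset)
{
    size_t index = byte_offset / 4;
    unsigned shift = (unsigned)(byte_offset % 4) * 8;

    if (shift == 0)
        return words[index];    /* aligned: one read, no realignment */

    /* Upper bytes of the first word become the low bytes of the result. */
    uint32_t low = words[index] >> shift;
    /* Lower bytes of the second word become the high bytes of the result. */
    uint32_t high = words[index + 1] << (32 - shift);
    return low | high;
}
```

The aligned case needs one read and no extra work, while the unaligned case needs two reads, two shifts, and an OR, which is the overhead this section describes.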

Cache Line Splits

Unaligned data may straddle cache line boundaries and require an extra memory access to retrieve. Most CPUs fetch entire cache lines from memory into the cache. If the unaligned data spans two cache lines, both need to be fetched even if only part of each line is required.
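A rough way to see how often this happens is to count the elements of an array that straddle a line boundary. The sketch below uses a hypothetical helper and takes the line size as a parameter; 64 bytes is a common value but is an assumption here, not something every CPU uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts how many of `n` consecutive elements of `elem_size` bytes, laid out
 * back to back starting at `base`, straddle a boundary between `line`-byte
 * cache lines. `line` must be a power of two and 1 <= elem_size <= line. */
size_t count_line_splits(uintptr_t base, size_t elem_size, size_t n, size_t line)
{
    uintptr_t mask = ~(uintptr_t)(line - 1);
    size_t splits = 0;

    for (size_t i = 0; i < n; i++) {
        uintptr_t start = base + i * elem_size;
        uintptr_t end = start + elem_size - 1;  /* address of the last byte */
        if ((start & mask) != (end & mask))
            splits++;                           /* first and last byte differ in line */
    }
    return splits;
}
```

With 4-byte elements and 64-byte lines, an array starting on a line boundary has no splits, while the same array shifted by one byte has one split in every 16 elements.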

Performance Penalty Examples

As an example, benchmarks on ARM processors have shown unaligned 32-bit integer loads to be 2-3x slower than aligned loads. On x86, unaligned loads were 1.5-2x slower. Stores of unaligned data were similarly penalized. The performance impact also extends to unaligned access of larger data types like 64-bit integers and floats. These figures depend heavily on the microarchitecture, and on some recent cores the penalty is small unless the access crosses a cache line.

One benchmark test showed sequential unaligned 4-byte integer reads having a throughput of ~160 MB/s. The same test with aligned reads achieved ~440 MB/s, about 2.75x higher. Optimizations such as using SIMD instructions may have limited benefits for unaligned data.

Unaligned Access in C Programs

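In languages like C/C++, each type has an alignment requirement that the compiler normally satisfies automatically. Problems arise when the program overrides the default layout. For example, a packed struct with chars followed by ints can leave the ints unaligned relative to word boundaries. Calling malloc returns memory suitably aligned for any standard type (typically 8 or 16 bytes), but it does not guarantee larger alignments such as a cache line or a SIMD vector width.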

Accessing fields within misaligned structs or indexing into buffers with unaligned pointers typically triggers unaligned accesses. This can lead to mysterious performance issues in applications. Care should be taken to enforce proper alignment where needed through techniques like padding fields and using aligned allocation functions.
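The effect of packing on field placement can be checked with offsetof. This sketch uses hypothetical struct names and relies on the GCC/Clang __attribute__((packed)) extension; other compilers spell this differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler inserts padding after `tag` so that
 * `value` starts at an offset suitable for a uint32_t. */
struct record_natural {
    char tag;
    uint32_t value;
};

/* Packed layout (GCC/Clang extension): no padding, so `value` sits at
 * offset 1 and is misaligned whenever the struct itself is 4-byte aligned. */
struct record_packed {
    char tag;
    uint32_t value;
} __attribute__((packed));

/* Small accessors so the offsets can be inspected at run time. */
size_t natural_value_offset(void) { return offsetof(struct record_natural, value); }
size_t packed_value_offset(void)  { return offsetof(struct record_packed, value); }
```

The packed struct is smaller, but every access to its value field is a candidate unaligned access; the compiler knows this and may emit byte-wise loads on targets that need them, which is slower than a single aligned load.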

Support for Unaligned Accesses

Some architectures like x86 silently handle unaligned accesses in hardware, reducing the performance penalty. This may lower motivation for programmers to enforce aligned access patterns in software. However, it comes at a cost of increased hardware complexity in the memory subsystem.

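Other architectures have stricter requirements, and support varies by core. Older ARM cores and ARMv6-M parts such as the Cortex-M0 do not support unaligned access in hardware, so an unaligned load or store raises a fault. ARMv7-M cores such as the Cortex-M3 and Cortex-M4 handle unaligned single loads and stores but still fault on unaligned multiple-register accesses such as LDM, STM, LDRD, and STRD. RISC-V leaves the behavior to the implementation, which may handle the access in hardware or trap. Where hardware support is missing, unaligned access typically results in a processor exception or bus error, so enforcing aligned access is especially important for portable code intended to run on such architectures.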

Mitigating Unaligned Accesses

There are several techniques programmers can adopt to avoid unaligned memory accesses:

Use data types that have alignment requirements matching their size (e.g. uint32_t instead of char[4])
Allocate data at addresses aligned to its size, such as with memalign or posix_memalign
Use compiler directives like __attribute__((aligned(N))) to enforce alignments
Pad structs to align member fields
Avoid casts that reinterpret buffers and break alignments
Use unions to overlay differently aligned data
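As a sketch of the aligned-allocation item, the following uses the C11 aligned_alloc function, which plays the same role as posix_memalign. The wrapper name is hypothetical, and aligned_alloc is assumed to be available; some toolchains do not provide it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for `count` floats in a buffer whose start address is a
 * multiple of `alignment` (a power of two supported by the implementation).
 * Returns NULL on failure. Release the buffer with free().
 * alloc_aligned_floats is an illustrative name. */
float *alloc_aligned_floats(size_t count, size_t alignment)
{
    size_t bytes = count * sizeof(float);
    /* C11 expects the size passed to aligned_alloc to be a multiple of the
     * alignment, so round the request up. */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}
```

A call such as alloc_aligned_floats(100, 64) would return a buffer starting on a 64-byte boundary, which suits data that should begin at the start of a typical cache line.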

Checking for alignment before accessing data and handling misalignment separately can also avoid penalizing all accesses. Some architectures provide special unaligned load/store instructions to handle misaligned data more efficiently in hardware.
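When data may sit at an arbitrary address, for example a field inside a received packet, one portable way to handle it is to copy through memcpy instead of casting the pointer. This is a minimal sketch with hypothetical helper names; it reads and writes in host byte order and does no endianness conversion.

```c
#include <stdint.h>
#include <string.h>

/* Reads a uint32_t from any address, aligned or not, in host byte order.
 * memcpy has no alignment requirement, and compilers typically lower this
 * to a single load on targets where that is safe. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Writes a uint32_t to any address, aligned or not, in host byte order. */
void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

This leaves the choice of instructions to the compiler, which knows whether the target tolerates unaligned access, and it avoids the undefined behavior of dereferencing a misaligned uint32_t pointer.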

Alignment Requirements

Different architectures have varying specific alignment requirements. Some examples:

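x86 – no alignment required for ordinary loads and stores; some SIMD instructions (such as MOVAPS) require 16-byte alignment
ARM – words aligned to 4-byte addresses
RISC-V – words aligned to 4-byte addresses
PowerPC – words aligned to 4-byte addresses
MIPS – words aligned to 4-byte addresses
SPARC – words aligned to 4-byte addresses, doublewords to 8-byte addresses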

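Larger data types like doubles and long longs have proportional alignment requirements, often 8 bytes even on 32-bit architectures, although this depends on the ABI (the i386 System V ABI, for example, aligns a double to 4 bytes inside a struct). A struct takes the alignment of its most strictly aligned member, and an array takes the alignment of its element type.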
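These rules can be inspected from C with _Alignof, sizeof, and offsetof. The struct below is a made-up example, and the exact numbers it produces depend on the compiler and ABI, so the sketch only relies on properties the language guarantees.

```c
#include <stddef.h>

/* Illustrative struct mixing members with different alignments. */
struct sample {
    char flag;       /* alignment 1 */
    double reading;  /* usually the most strictly aligned member */
    short id;
};

/* Alignment of the struct as a whole, in bytes. */
size_t sample_alignment(void)
{
    return _Alignof(struct sample);
}

/* Bytes of padding the compiler added: total size minus the sum of the
 * member sizes. */
size_t sample_padding(void)
{
    return sizeof(struct sample) - (sizeof(char) + sizeof(double) + sizeof(short));
}
```

On a typical 64-bit ABI this struct is 8-byte aligned and carries several bytes of padding; ordering members from largest to smallest usually reduces that padding without breaking alignment.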

Tools for Detecting Unaligned Accesses

There are tools available to help detect and diagnose unaligned memory access issues:

Compiler warnings – Enable alignment warnings (such as -Wcast-align in GCC and Clang) and treat them as errors
Sanitizers – UndefinedBehaviorSanitizer (-fsanitize=alignment in GCC and Clang) can detect misaligned pointer use at run time
Debuggers – Can catch alignment fault exceptions
Disassemblers – View effective address offsets and instruction patterns
Performance analyzers – Correlate slow functions with unaligned access
Memory checkers – Valgrind can flag some alignment problems, though coverage depends on the tool and version

Runtime instrumentation and binary analysis tools can also help uncover unaligned access issues in production software.

Alternatives to Unaligned Accesses in Hardware

Some techniques hardware can employ to reduce the performance impact of unaligned accesses include:

Wider memory interfaces – Fetch extra adjacent bytes into caches
Multi-byte extract instructions – Pick correct bytes from wider registers
Barrel shifters – Hardware realignment before usage
Adaptive alignment logic – Automatically detect and realign
Multi-bank caches – Reduce conflicts from spans
Speculative accesses – Hide realignment latency

However, these increase cost and complexity. Requiring aligned access in software is usually a better trade-off.

Conclusion

Unaligned memory access can significantly impact performance on many processor architectures. Accessing data at an address that does not match the alignment of its type forces extra work under the hood in hardware, in software, or both. Enforcing aligned access, where possible, lets code run faster and keeps it portable to architectures that fault on unaligned access. This needs to be balanced against other concerns, such as the extra memory that padding consumes.

Understanding alignment requirements and detecting unaligned access issues, particularly in lower level code, continues to be an important consideration for optimization on many platforms.
