The ARM Cortex-A75 and Cortex-A76 are two of ARM’s most powerful CPU cores designed for mobile devices. The Cortex-A76 succeeds the Cortex-A75 and brings significant improvements in performance and efficiency.
In summary, the key differences between the Cortex-A75 and Cortex-A76 are:
- The Cortex-A76 targets the newer 7nm manufacturing process, while the A75 targets a 10nm process.
- The A76 has improved branch prediction and prefetching capabilities.
- The A76 sustains higher memory bandwidth and is paired with faster LPDDR4X memory.
- The A76 is designed for clock speeds of up to 3 GHz, while the A75 tops out at 2.8 GHz.
- The A76 has better power efficiency and delivers 20% more performance at the same power level.
- The A76 features a larger out-of-order execution engine capable of dispatching more instructions per cycle.
- The A76 has enhancements to the floating point and NEON processing units for better machine learning performance.
Manufacturing Process
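The Cortex-A75 was designed for 10nm-class FinFET processes, while the Cortex-A76 targets the more advanced 7nm FinFET node. Both cores are licensed designs, so the actual process depends on the SoC vendor and foundry. The smaller transistors of 7nm allow for higher density and better power efficiency.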
The 7nm process results in an overall 35% reduction in power consumption at the same performance level compared to 10nm. Alternatively, it enables 20% higher performance at the same power. This gives smartphone makers more headroom to push CPU performance while retaining battery life.
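The two operating points quoted above can be turned into performance-per-watt ratios. The following is a minimal sketch of that arithmetic; the function name is hypothetical and the inputs are simply the 35% and 20% figures from the text.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative performance-per-watt of a new design versus a baseline."""
    return perf_ratio / power_ratio


# Operating point 1: same performance, 35% less power.
iso_perf = perf_per_watt_gain(perf_ratio=1.00, power_ratio=0.65)
# Operating point 2: same power, 20% more performance.
iso_power = perf_per_watt_gain(perf_ratio=1.20, power_ratio=1.00)

print(f"iso-performance efficiency gain: {iso_perf:.2f}x")
print(f"iso-power efficiency gain:       {iso_power:.2f}x")
```

The ratios come out at about 1.54x and 1.20x. They differ because power grows faster than linearly with frequency and voltage, so spending the process gain on speed buys less than spending it on power savings.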
Microarchitecture
Both the Cortex-A75 and A76 are based on ARM’s DynamIQ technology which enables heterogeneous multi-core configurations. This allows mixing different types of CPU cores like big, medium and small cores to balance performance and power efficiency.
The Cortex-A75 is a 3-wide superscalar design that decodes up to three instructions per cycle. In comparison, the Cortex-A76 widens the front end to four instructions per cycle and can dispatch up to eight micro-operations per cycle to its execution units, which gives it an inherent advantage in throughput.
Branch Prediction
An accurate branch predictor is critical to keeping the pipeline filled and minimizing stalls due to mispredictions. The Cortex-A76 incorporates a new conditional branch predictor which improves the branch prediction accuracy compared to the A75.
In addition, the A76 increases the size of the branch target address cache (BTAC) which caches branch targets. This further reduces pipeline flushes due to branch mispredictions.
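To make the idea of branch prediction concrete, here is a minimal sketch of a textbook predictor built from 2-bit saturating counters. It is not ARM's design, which is not described in this article; the table size, the indexing scheme and the branch address are all assumptions chosen for illustration.

```python
from typing import Iterable, Tuple


def prediction_accuracy(trace: Iterable[Tuple[int, bool]], table_bits: int = 4) -> float:
    """Run a table of 2-bit saturating counters over (pc, taken) pairs."""
    size = 1 << table_bits
    # Counter values 0-1 predict not taken, 2-3 predict taken.
    counters = [1] * size  # start in the "weakly not taken" state
    correct = total = 0
    for pc, taken in trace:
        index = (pc >> 2) & (size - 1)  # drop low bits of a 4-byte-aligned address
        predicted_taken = counters[index] >= 2
        correct += predicted_taken == taken
        total += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            counters[index] = min(3, counters[index] + 1)
        else:
            counters[index] = max(0, counters[index] - 1)
    return correct / total if total else 0.0


# A loop branch at a hypothetical address: taken 9 times, then not taken, 100 times over.
loop_trace = [(0x4000, i % 10 != 9) for i in range(1000)]
accuracy = prediction_accuracy(loop_trace)
print(f"accuracy on loop branch: {accuracy:.1%}")
```

On this trace the predictor is right 89.9% of the time, because it mispredicts every loop exit. Closing that kind of gap is what more elaborate, history-based predictors aim for.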
Prefetching
The Cortex-A76 also implements more aggressive prefetching to bring data into the caches before it is needed. It has a larger and more advanced prefetcher compared to Cortex-A75 and is able to prefetch wider streams of data.
This helps prevent pipeline bubbles where the CPU has to wait for data from memory. Prefetching is especially beneficial for improving performance of memory-intensive workloads.
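The effect of prefetching on a streaming access pattern can be sketched with a toy stride prefetcher. This is an idealised model and not the A76's prefetcher: the cache is unbounded, prefetched lines arrive instantly, and the function name, line size and prefetch degree are assumptions.

```python
def count_misses(addresses, line_size=64, prefetch_degree=0):
    """Count demand misses in an unbounded cache with an optional stride prefetcher."""
    cached = set()
    misses = 0
    last_line = None
    last_stride = None
    for addr in addresses:
        line = addr // line_size
        if line not in cached:
            misses += 1
            cached.add(line)
        if last_line is not None:
            stride = line - last_line
            # Prefetch only once the same non-zero stride has been seen twice in a row.
            if stride != 0 and stride == last_stride:
                for k in range(1, prefetch_degree + 1):
                    cached.add(line + k * stride)
            last_stride = stride
        last_line = line
    return misses


# Walk a hypothetical array, touching one new 64-byte line per access.
stream = [i * 64 for i in range(1000)]
print("misses without prefetch:", count_misses(stream))
print("misses with prefetch:   ", count_misses(stream, prefetch_degree=4))
```

In this model the sequential walk misses on all 1000 lines without prefetching and on only the first 3 with it. Real hardware sees a smaller benefit, since prefetches take time to arrive and compete for cache space and bandwidth.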
Out-of-Order Execution
Both the Cortex-A75 and A76 use out-of-order execution and are capable of executing instructions outside of program order to maximize performance. However, the A76 has a significantly larger reorder buffer and reservation station enabling it to have more instructions in flight.
The A76’s reorder buffer holds 128 entries, giving it a larger window of in-flight instructions than the A75. This larger out-of-order execution window enables the A76 to extract more instruction-level parallelism from code.
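A rough way to see why window size matters is to bound throughput while a long-latency load blocks retirement: the core can run ahead by at most one window's worth of instructions during the stall. The sketch below applies that textbook bound. The window sizes, the 200-cycle miss latency and the issue width are illustrative assumptions, not vendor figures.

```python
def max_ipc_under_miss(window_entries: int, miss_latency_cycles: int, issue_width: int) -> float:
    """Upper bound on IPC while one long-latency load blocks retirement."""
    # At most `window_entries` instructions can enter the window during the stall.
    return min(issue_width, window_entries / miss_latency_cycles)


for window in (64, 128, 256):  # hypothetical window sizes
    bound = max_ipc_under_miss(window, miss_latency_cycles=200, issue_width=4)
    print(f"window={window:3d} -> IPC bound during miss: {bound:.2f}")
```

Doubling the window doubles the bound until the issue width becomes the limit, which is the sense in which a larger window hides more memory latency.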
Floating Point / NEON
ARM has enhanced the floating point and NEON SIMD processing units in the Cortex-A76 to handle more computational workloads. The floating point unit has increased dispatch and retirement bandwidth, allowing more FLOPS per cycle.
The NEON unit has also been expanded to handle more workloads related to machine learning, image processing and other vector workloads common in mobile apps.
Memory Subsystem
The Cortex-A76 memory subsystem has been optimized to sustain higher memory bandwidth and reduce latency. A76-based SoCs pair the core with fast LPDDR4X-4266 memory, which delivers roughly 17 GB/s per 32-bit channel, while A75-era designs typically top out at LPDDR4-3733.
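Peak DRAM bandwidth follows from the transfer rate and the bus width, so the figure for a given memory type depends on how wide the SoC's memory interface is. The sketch below works through that arithmetic; the 32-bit and 64-bit widths are assumptions, since the text does not state a channel configuration.

```python
def peak_bandwidth_gb_s(transfers_mt_s: int, bus_width_bits: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


for name, rate in (("LPDDR4-3733", 3733), ("LPDDR4X-4266", 4266)):
    for width in (32, 64):  # assumed interface widths
        print(f"{name} @ {width}-bit: {peak_bandwidth_gb_s(rate, width):.1f} GB/s")
```

These are theoretical peaks. Sustained bandwidth seen by the CPU is lower because of refresh, bank conflicts and traffic from other blocks on the SoC.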
To enable this increased bandwidth, ARM improved the load/store queues to enable more outstanding memory requests. The L1 data cache read bandwidth has also been enhanced.
In addition, the A76 speeds critical loads and reduces latency by allowing more outstanding cache misses to DRAM. This enables it to hide memory latency more effectively.
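The link between outstanding misses and bandwidth can be sketched with Little's law: the bandwidth a core can sustain is the number of bytes in flight divided by the memory latency. The 100 ns latency, 64-byte line size and miss counts below are illustrative assumptions rather than A75 or A76 specifications.

```python
def sustained_bandwidth_gb_s(outstanding_misses: int, line_bytes: int, latency_ns: float) -> float:
    """Little's law: bytes in flight divided by latency (bytes per ns equals GB/s)."""
    return outstanding_misses * line_bytes / latency_ns


for misses in (8, 16, 32):  # hypothetical numbers of outstanding cache misses
    bw = sustained_bandwidth_gb_s(misses, line_bytes=64, latency_ns=100.0)
    print(f"{misses:2d} outstanding misses -> {bw:.2f} GB/s")
```

With a fixed latency, a single core can only pull more bandwidth from DRAM by keeping more misses in flight, which is why deeper load/store queues matter.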
Performance and Power Efficiency
In performance comparisons done by ARM, the Cortex-A76 delivers approximately 20% higher performance than the Cortex-A75 on a clock-for-clock basis. Alternatively, it can provide the same performance as the A75 while reducing power consumption by 35%.
The A76 is designed to scale to clock speeds of up to 3.0 GHz on a 7nm process, while the A75 tops out at 2.8 GHz. Combined with the microarchitectural improvements, this enables the A76 to hit higher peak performance levels.
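As a back-of-the-envelope check, the per-clock gain and the clock speed increase can be combined. The sketch below assumes the two effects multiply cleanly, which ignores memory-bound behaviour and thermal limits; the function name is hypothetical and the inputs are the figures quoted above.

```python
def peak_speedup(per_clock_gain: float, new_clock_ghz: float, old_clock_ghz: float) -> float:
    """Combined speedup, assuming per-clock and frequency gains multiply."""
    return per_clock_gain * (new_clock_ghz / old_clock_ghz)


combined = peak_speedup(per_clock_gain=1.20, new_clock_ghz=3.0, old_clock_ghz=2.8)
print(f"estimated peak uplift: {combined - 1:.1%}")
```

Under these assumptions the estimate is an uplift of about 28.6%, an upper bound that only holds when both cores actually run at their top clocks.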
For multithreaded workloads, ARM claims the A76 can achieve up to 35% higher performance over the A75. The design optimizations around higher memory bandwidth, larger out-of-order engine and improved branch prediction allow the A76 to excel at multithreaded and server-style workloads.
The Cortex-A76 widens decode to four instructions per cycle, up from three on the A75, and the 7nm process helps contain the resulting growth in core area. ARM also reduced the voltage needed to hit various frequency targets, further enhancing power efficiency.
Real World Implementations
The Cortex-A76 first appeared in TSMC 7nm form in the Kirin 980 SoC made by Huawei. The Kirin 980 features two high performance A76 cores clocked up to 2.6 GHz alongside two A76 medium cores and four A55 efficiency cores.
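The Kirin 980 layout described above can be written down as a small data structure, which makes the core counts easy to check. This is only a sketch: the `CoreGroup` type is hypothetical, and clocks the text does not state are left as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoreGroup:
    core: str
    count: int
    max_ghz: Optional[float]  # None where the text gives no clock


KIRIN_980 = (
    CoreGroup("Cortex-A76", 2, 2.6),   # high-performance pair
    CoreGroup("Cortex-A76", 2, None),  # "medium" pair, clock not stated in the text
    CoreGroup("Cortex-A55", 4, None),  # efficiency cores
)

total_cores = sum(group.count for group in KIRIN_980)
a76_cores = sum(group.count for group in KIRIN_980 if group.core == "Cortex-A76")
print(f"{total_cores} cores in total, {a76_cores} of them Cortex-A76")
```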
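Compared to the Kirin 970 with four Cortex-A73 cores, the Kirin 980 shows approximately 20% gains in both single and multi-threaded CPU benchmarks thanks to the Cortex-A76 cores. Because the Kirin 970 used A73 rather than A75 cores, this comparison spans two core generations, so it is consistent with ARM’s performance claims without measuring the A75-to-A76 step directly.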
Qualcomm also uses Cortex-A76 cores in its high-end Snapdragon 855 SoC. The Snapdragon 855 implements four customized A76-based cores called Kryo 485: one prime core clocked up to 2.84 GHz and three performance cores clocked up to 2.42 GHz. Together they deliver a 25-30% increase in CPU performance versus the Snapdragon 845 with its A75-based cores.
Samsung’s Exynos 9820 and 9825 SoCs, by contrast, pair custom Samsung cores clocked up to 2.7 GHz with Cortex-A75 rather than A76 cores; the A76 first appears in Samsung’s later Exynos 990. Overall, the real world performance uplift lines up with expectations set by ARM’s microarchitecture improvements from Cortex-A75 to A76.
Conclusion
The Cortex-A76 brings significant performance and power efficiency gains over the previous generation Cortex-A75 CPU core. With improvements spanning the manufacturing process, microarchitecture, memory subsystem and floating point performance, the A76 represents a major step forward for ARM.
In mobile SoCs like the Kirin 980 and Snapdragon 855, the Cortex-A76 cores clearly demonstrate sizable performance gains over older A73 and A75 based designs. The Cortex-A76 gives mobile chip designers headroom to continue improving CPU performance as we move forward to 7nm and smaller fabrication processes.