The ARM Cortex-A76 and Cortex-A77 are high-performance CPU cores that ARM designed for mobile, computing, and infrastructure applications. Both cores are compatible with ARM’s DynamIQ cluster technology and target 7nm-class manufacturing processes, offering significant improvements in performance and efficiency over previous ARM CPU generations.
The Cortex-A77 builds upon the Cortex-A76 design and provides further performance optimizations. The key differences between the two cores come down to microarchitecture enhancements, chiefly a wider front end, more execution resources, and better prefetching, rather than larger caches or higher clock speeds. Let’s take a deeper look at how these two ARM CPU cores compare.
Microarchitecture
The Cortex-A76 and A77 CPU cores share a similar underlying out-of-order microarchitecture optimized for sustained performance and responsiveness. However, ARM made several enhancements in the A77 for better branch prediction, instruction fetch, and instruction reordering.
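Both cores are superscalar designs with out-of-order execution and a 4-wide instruction decoder, meaning they can issue several instructions per cycle to the various execution units and reorder instructions to maximize performance. According to ARM’s public disclosures, the A76 dispatches up to 4 macro-ops (8 micro-ops) per cycle, while the A77 raises this to 6 macro-ops (10 micro-ops). The A77 also improves the peak instruction fetch bandwidth from 4 to 6 instructions per cycle and adds a fourth integer ALU and a second branch unit, so the wider back end stays fed with instructions.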
The A77 also introduces a macro-op cache of about 1.5K entries, which the A76 lacks, so that frequently executed code can bypass the decoders. A larger out-of-order window (160 entries, up from 128) and improved branch prediction further improve instruction flow in the A77 core. Overall, ARM claims roughly a 20% gain in IPC (instructions per cycle) over the A76 on integer workloads.
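The reason front-end width matters can be shown with a small sketch: a pipeline cannot sustain more instructions per cycle than its narrowest stage passes along. The function name `sustained_ipc_bound` and the stage widths below are illustrative assumptions, not vendor-published figures for either core.

```python
def sustained_ipc_bound(stage_widths):
    """Return the upper bound on sustained instructions per cycle.

    stage_widths maps a pipeline stage name to how many instructions
    that stage can pass along per cycle (all in the same unit).
    A pipeline cannot sustain more than its narrowest stage allows.
    """
    if not stage_widths:
        raise ValueError("at least one pipeline stage is required")
    return min(stage_widths.values())


# Illustrative widths only; these are assumptions, not measured values.
baseline = {"fetch": 4, "decode": 4, "dispatch": 8}
wider_fetch = {"fetch": 6, "decode": 4, "dispatch": 8}

print(sustained_ipc_bound(baseline))     # 4
print(sustained_ipc_bound(wider_fetch))  # 4: decode is now the limit
```

In this simplified model, widening fetch alone does not raise the bound, which is one reason front-end changes tend to arrive together with a cache of already-decoded operations that lets hot code skip the decoders.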
Memory Subsystem
The memory subsystem in the Cortex-A77 is tuned for sustaining higher performance, although the cache capacities themselves are unchanged. Both the A76 and the A77 use 64KB L1 instruction and 64KB L1 data caches. The A77’s gains come from keeping those caches better supplied, which reduces stalls and keeps the execution units busy.
The private L2 cache is likewise configurable at 256KB or 512KB per core on both the A76 and the A77. Choosing the larger L2 option further minimizes trips to the slower main memory. ARM also cites higher sustained memory bandwidth for the A77, enabling faster data access.
For improved data locality, the A77 relies on improved hardware prefetching algorithms alongside the software preload (prefetch) instructions already available in the ARM instruction set. This brings data closer to the execution units and hides memory latency. The A77 memory subsystem enhancements are credited with roughly a 10-15% uplift over the A76 on memory-bound workloads.
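The effect of fewer trips to main memory can be estimated with the textbook average memory access time (AMAT) formula. The sketch below is a simplified two-level model; the function name `average_access_time` and every latency and miss rate in it are hypothetical values chosen for illustration, not measurements of the A76 or A77.

```python
def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, dram):
    """Two-level average memory access time, in cycles.

    AMAT = L1 hit time
           + L1 miss rate * (L2 hit time + L2 miss rate * DRAM time)
    """
    for rate in (l1_miss_rate, l2_miss_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("miss rates must be between 0 and 1")
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * dram)


# Hypothetical latencies (cycles) and miss rates, for illustration only.
higher_l2_miss = average_access_time(4, 0.05, 10, 0.40, 200)
lower_l2_miss = average_access_time(4, 0.05, 10, 0.30, 200)

print(f"higher L2 miss rate: {higher_l2_miss:.2f} cycles")
print(f"lower L2 miss rate:  {lower_l2_miss:.2f} cycles")
```

Under these assumed numbers, a lower L2 miss rate, whether from a larger cache or more effective prefetching, shortens the average access by about a cycle. Real cores overlap many misses at once, so the model only indicates the direction of the effect.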
Floating Point/SIMD
Floating point and vector math run on NEON, ARM’s 128-bit SIMD (single instruction, multiple data) extension for multimedia, imaging, ML, and scientific workloads. The Cortex-A77 keeps the same two 128-bit NEON/floating-point pipelines as the A76 rather than adding more lanes; its extra execution resources went to the integer side.
Each 128-bit NEON register holds eight 16-bit or four 32-bit floating point values, so a single vector instruction operates on that many elements at once. The A77’s wider front end and improved memory subsystem help keep the NEON pipelines fed with data. Overall, ARM claims over 30% higher floating point performance compared to the A76.
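The element counts quoted above follow from dividing the 128-bit NEON vector width by the element width. A minimal sketch of that arithmetic is shown below; the helper names are hypothetical, and the pipeline count is passed in as a parameter rather than asserted for either core.

```python
NEON_VECTOR_BITS = 128  # width of one NEON vector register


def elements_per_vector(vector_bits, element_bits):
    """Number of elements of a given width that fit in one vector."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the vector width")
    return vector_bits // element_bits


def peak_elements_per_cycle(pipelines, element_bits):
    """Peak elements processed per cycle if every pipeline issues one
    full-width vector operation each cycle (an idealized upper bound)."""
    return pipelines * elements_per_vector(NEON_VECTOR_BITS, element_bits)


for name, bits in (("FP16", 16), ("FP32", 32), ("FP64", 64)):
    print(name, elements_per_vector(NEON_VECTOR_BITS, bits))
# FP16 8
# FP32 4
# FP64 2
```

For example, `peak_elements_per_cycle(2, 32)` returns 8 for a hypothetical core with two pipelines. Sustained throughput is lower in practice because of data dependencies and memory stalls.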
Performance and Efficiency
The microarchitecture improvements in the Cortex-A77 translate to significant gains in processor performance over the A76. ARM claims around 20% better absolute performance on SPECint benchmarks for the A77 core. On more advanced ML workloads, the gains can be even higher (up to 35% faster).
ARM positions the A77 as achieving these performance gains at roughly the same power efficiency as the A76. Both cores were designed for mobile SoCs and strict thermal budgets. Single-thread figures of around 4.3 SPECint/GHz per core are quoted for this generation, though a per-GHz score measures per-clock throughput rather than power efficiency.
However, the A77 can draw more peak power to deliver its higher performance. The additional integer ALU and branch unit add execution capacity for demanding workloads. The A77 is credited with around 15% higher multi-thread efficiency versus the A76 within the same power envelope.
Process Nodes
Process technology plays a smaller role in the Cortex-A77’s gains than microarchitecture does. Flagship A76 designs such as the Snapdragon 855 and Kirin 990 5G were already built on 7nm FinFET processes, and A77 designs stayed on refined 7nm nodes. Those refinements help sustain high clock speeds within the same power budget.
ARM designed both the A76 and A77 to scale from around 1.2 GHz to roughly 3 GHz across different process nodes. In practice, shipping clock speeds were similar between the two generations: Qualcomm’s Snapdragon 865 uses Cortex-A77 cores clocked up to 2.84 GHz on 7nm, the same peak frequency as the A76-based Snapdragon 855.
Real-World Devices
Due to the timing of the release, the Cortex-A76 ended up in more mobile chip designs from vendors like Qualcomm, Huawei, Samsung, and MediaTek. Samsung paired A76 cores with its custom M5 cores in the Exynos 990 for the Galaxy S20 series; the earlier Exynos 9820 and 9825 in the Galaxy S10 and Note 10 series used Cortex-A75 cores instead.
Qualcomm featured A76-based cores in its Snapdragon 855 (up to 2.84 GHz), powering devices like the Galaxy S10 and Pixel 4. Huawei’s Kirin 990 5G used A76 cores clocked at up to 2.86 GHz. MediaTek also adopted A76 cores for some of its 5G Dimensity chipsets.
For the next generation of flagship devices, several chip vendors switched over to the Cortex-A77 for a performance boost. Qualcomm’s Snapdragon 865 and MediaTek’s Dimensity 1000 both utilize A77 cores. Qualcomm clocked the A77 at up to 2.84 GHz on 7nm and claimed 25% higher CPU performance over the A76-based Snapdragon 855.
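Because the Snapdragon 855 and Snapdragon 865 share the same 2.84 GHz peak clock, the quoted 25% gain can be attributed to per-clock improvements rather than frequency. The sketch below shows that arithmetic under the simplifying assumption that performance is roughly per-clock throughput multiplied by frequency; the function name is hypothetical, and the model ignores memory, cache, and other SoC-level changes.

```python
def implied_per_clock_gain(performance_ratio, old_ghz, new_ghz):
    """Per-clock speedup implied by an overall performance ratio.

    Assumes performance ~ per-clock throughput * frequency, which is
    a first-order approximation only.
    """
    if old_ghz <= 0 or new_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    frequency_ratio = new_ghz / old_ghz
    return performance_ratio / frequency_ratio


# Figures taken from the text: 25% faster, both parts peak at 2.84 GHz.
gain = implied_per_clock_gain(1.25, old_ghz=2.84, new_ghz=2.84)
print(f"{(gain - 1) * 100:.0f}% per-clock gain")  # 25% per-clock gain
```

Had the newer chip also been clocked higher, the same function would report a per-clock gain smaller than the headline figure.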
In terms of battery life, phones with A77 cores were similar to A76 devices or saw modest gains. Process refinements and efficiency tuning helped offset the higher peak A77 performance. But other factors like display, modem, GPU, and software optimizations also impact battery life in actual devices.
Performance Cores
The Cortex-A76 and Cortex-A77 are performance-oriented cores designed by ARM for smartphone, laptop, and server applications requiring high single-threaded and multi-threaded CPU throughput. The cores are meant to be used along with other power efficient cores in a DynamIQ big.LITTLE configuration.
For mobile, the A76 and A77 pair with Cortex-A55 power efficient cores. For servers and infrastructure, ARM offers the Neoverse N1, a core derived from the Cortex-A76 and tuned for throughput, rather than pairing it with A76 or A77 cores.
Future Outlook
ARM continued evolving its mobile CPU core performance with the launch of the Cortex-A78 and Cortex-X1 CPUs in 2020. The Cortex-X1 provides additional gains in IPC and peak performance versus the A77 for next-generation mobile SoCs. Qualcomm used the Cortex-X1 in its Snapdragon 888 chipset, while the later Snapdragon 8 Gen 1 moved to the Cortex-X2.
ARM later announced the Cortex-A710, a big core designed for higher throughput than A76/A77-based solutions. The Cortex-A710 is one of the first cores to implement ARM’s Armv9 architecture and appears in 2022-era flagship SoCs.
While the Cortex-A77 has now been succeeded by newer CPU designs, it still delivers excellent performance and efficiency for mobile and embedded applications. The A77 and A76 will continue powering smartphones and other devices for years to come.
Summary
In summary, the key differences between the ARM Cortex-A76 and Cortex-A77 CPU cores include:
- Enhanced microarchitecture with a wider front end, a new macro-op cache, improved branch prediction and prefetching, and a larger reorder buffer, for roughly 20% higher IPC on A77
- Same cache capacities on both cores: 64KB L1 instruction and data caches, with 256KB or 512KB of L2 per core
- Same two 128-bit NEON pipelines on both cores, with ARM claiming over 30% higher floating point performance on A77
- Both cores target 7nm-class processes and ship at similar peak frequencies (up to about 2.84 GHz)
- A77 achieves its performance gains with similar power efficiency to the A76
- Qualcomm claims around 25% higher CPU performance for the A77-based Snapdragon 865 over the A76-based Snapdragon 855
In the end, the Cortex-A77 meets ARM’s goal of delivering significantly better performance and energy efficiency over the A76 generation for mobile and emerging applications. While not revolutionary, the steady improvements with each core design from ARM enable the performance and battery life smartphone users demand.