The Cortex-A76 CPU core from ARM offers a significant performance boost over previous ARM CPU cores like the Cortex-A73 and Cortex-A75. Based on ARM’s testing, the Cortex-A76 achieves up to 35% higher performance than the Cortex-A75 under the same power budget. Real-world performance will vary depending on factors like workload, system implementation, and power configuration, but overall the Cortex-A76 represents a sizable leap forward in ARM CPU performance.
Overview of the Cortex-A76
The Cortex-A76 CPU core is built for ARM’s DynamIQ cluster technology and targets 7nm-class manufacturing processes. It uses a new microarchitecture designed for significantly higher performance while maintaining power efficiency. The A76 microarchitecture introduces several optimizations and enhancements compared to prior ARM CPU cores:
- Increased instruction throughput via a 4-wide decode and rename stage, compared to the 3-wide design of the A75.
- Improved branch prediction accuracy.
- Larger out-of-order execution window capabilities, allowing more instructions to be reordered and executed in parallel.
- Enhanced floating point and vector execution units.
- Faster L1 cache access latency.
- Higher memory bandwidth into the core. The DRAM type itself, such as LPDDR4X, is determined by the SoC rather than by the CPU core.
In addition, the A76 microarchitecture was designed from the ground up for higher frequencies at the same power budget. All of these improvements translate to a CPU core with industry-leading performance for mobile and embedded devices.
Performance Compared to Cortex-A73 and A75
ARM claims the Cortex-A76 achieves up to 35% better performance than its direct predecessor, the Cortex-A75, at the same power level. Much of this gain comes from the microarchitectural enhancements and the ability to reach higher frequencies within the same power envelope.
The performance improvements are even greater when compared to the older Cortex-A73. ARM testing showed the Cortex-A76 performing up to 45% better than the Cortex-A73 in mobile-class workloads. Again, this uplift comes from both the microarchitectural optimizations and higher operating frequencies.
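As a rough illustration of how these two sources of uplift combine, single-thread performance can be modeled to first order as instructions per cycle (IPC) multiplied by clock frequency. The sketch below uses hypothetical IPC and frequency values chosen only for illustration; they are not ARM measurements, and the function name is made up for this example.

```python
def relative_performance(ipc_old, freq_old_ghz, ipc_new, freq_new_ghz):
    """First-order model: performance is proportional to IPC x frequency."""
    return (ipc_new * freq_new_ghz) / (ipc_old * freq_old_ghz)


# Hypothetical inputs: a 25% IPC gain plus a clock bump from 2.8 GHz to 3.0 GHz.
speedup = relative_performance(ipc_old=1.00, freq_old_ghz=2.8,
                               ipc_new=1.25, freq_new_ghz=3.0)
print(f"Modeled speedup: {speedup:.2f}x")  # prints "Modeled speedup: 1.34x"
```

The model ignores memory stalls and thermal throttling, so it gives an upper-level estimate rather than a prediction for any particular device.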
Real-world devices will likely see performance improvements of 25-35% on average when upgrading from Cortex-A75 or A73 to Cortex-A76 CPUs. However, actual speedups will depend on factors like:
- The system-on-chip design and implementation.
- Workload mix and optimization.
- Thermal and power budgets.
- Manufacturing process and binning.
Well-designed systems that maximize performance within acceptable thermal limits can achieve speedups approaching ARM’s quoted figures. Either way, the Cortex-A76 delivers a healthy improvement over previous high-end ARM CPU cores.
CPU Performance Relative to Desktop CPUs
The Cortex-A76 is designed specifically for power-constrained mobile and embedded applications. It is not intended to compete directly with higher power desktop and laptop x86 CPUs from Intel and AMD.
Comparing the Cortex-A76 to current desktop CPUs using common performance benchmarks shows there is still a sizable performance gap in favor of x86 chips for certain workloads:
- In single-threaded workloads, a modern Intel Core i7 or AMD Ryzen 7 CPU is about 2-3x faster than the Cortex-A76.
- In multi-threaded workloads, high core count desktop CPUs maintain a healthy lead over the Cortex-A76.
- Desktop CPUs also have support for advanced features like AVX vector extensions which can accelerate math-heavy code.
However, mobile-focused CPUs have advantages in other areas like idle/low power efficiency. The latest ARM cores are extremely competitive relative to desktop x86 when performance is viewed in the context of a restricted power budget. ARM is rapidly narrowing the performance gap with each new CPU generation as well.
Real-World Cortex-A76 Performance Examples
While vendor testing and benchmarks can provide an idea of the Cortex-A76’s capabilities, looking at real-world devices gives a practical view of performance:
- Huawei Kirin 980 – First announced SoC with Cortex-A76 cores. The Kirin 980 combines two high-performance A76 cores running at up to 2.6 GHz and two A76 cores at 1.92 GHz with four 1.8 GHz A55 cores for efficiency. Benchmark results show significant gains over the Kirin 970.
- Samsung Exynos 9820 – Powers the Galaxy S10 lineup in some regions. It is not an A76 design: its two 2.73 GHz big cores are Samsung’s custom M4, paired with two 2.31 GHz A75 cores and four A55 cores, which makes it a useful contemporary point of comparison.
- Qualcomm Snapdragon 855 – Qualcomm’s flagship SoC relies on semi-custom Kryo 485 cores derived from A76. The Snapdragon 855 outperforms most chips on Android devices while maintaining high efficiency.
The Kirin 980 and Snapdragon 855 demonstrate the performance and efficiency potential of the Cortex-A76 in mobile devices. Real-world results are broadly in line with the expectations set by ARM’s own testing.
Performance Scaling Trends
The Cortex-A76 continues ARM’s rapid cadence of performance improvements with each core generation. Some key trends include:
- Double-digit percentage performance gains per generation at the same power level.
- Average 15% power efficiency increase per generation.
- Sustained boosts to maximum operating frequencies.
- Faster overall execution from microarchitectural enhancements.
ARM expects these trends to continue with future iterations of the CPU core roadmap. For the Cortex-A77 successor, ARM quotes roughly a 20% gain in instructions per cycle (IPC) over the A76 at the same frequency.
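Per-generation gains compound multiplicatively rather than adding up. The short sketch below shows the arithmetic for a hypothetical run of three generations at 15% each; the figures and the function name are illustrative assumptions, not ARM roadmap data.

```python
def compound_gain(per_generation_gains):
    """Return the cumulative fractional gain from a list of per-generation gains."""
    total = 1.0
    for gain in per_generation_gains:
        total *= 1.0 + gain
    return total - 1.0


# Hypothetical: three successive generations, each 15% faster than the last.
cumulative = compound_gain([0.15, 0.15, 0.15])
print(f"Cumulative gain: {cumulative:.1%}")  # prints "Cumulative gain: 52.1%"
```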
This steady cadence of CPU performance scaling helps enable significant user experience improvements with each product generation. Mobile applications and use cases can leverage the expanding performance within restricted mobile power budgets.
Performance Configurations
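SoC vendors typically expose several performance “gears” for Cortex-A76 clusters through dynamic voltage and frequency scaling (DVFS), allowing the CPU to be tuned for different workloads and power scenarios. The mode names below are descriptive rather than ARM-defined: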
- Maximum Performance – Highest single-threaded and multi-threaded performance but with higher power draw.
- Mid Performance – Balanced setup offering power-efficient performance for bursts.
- Efficient Performance – Lower peak frequencies but increased energy efficiency for sustained workloads.
- Minimum Performance – Heavily restricts performance in favor of lowest possible power for intermittent tasks.
Devices can dynamically shift between these modes based on factors like workload intensity, thermal headroom, and battery level. This flexibility is critical to maximizing real-world user experience within an SoC’s constraints.
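A policy that moves between the four modes above might look like the following sketch. The thresholds, function name, and inputs are all hypothetical; real devices rely on the operating system's frequency governor and vendor firmware rather than code like this.

```python
from enum import Enum


class Mode(Enum):
    MAXIMUM = "maximum"
    MID = "mid"
    EFFICIENT = "efficient"
    MINIMUM = "minimum"


def select_mode(load, temperature_c, battery_pct,
                thermal_limit_c=80.0, low_battery_pct=15.0):
    """Pick a mode from CPU load (0..1), die temperature, and battery level.

    All thresholds are illustrative assumptions, not vendor values.
    """
    # A nearly empty battery or a nearly idle CPU favors the lowest power mode.
    if battery_pct <= low_battery_pct or load < 0.10:
        return Mode.MINIMUM
    # Without thermal headroom, fall back to the efficiency-oriented mode.
    if temperature_c >= thermal_limit_c:
        return Mode.EFFICIENT
    if load >= 0.85:
        return Mode.MAXIMUM
    if load >= 0.40:
        return Mode.MID
    return Mode.EFFICIENT


print(select_mode(load=0.95, temperature_c=50.0, battery_pct=80.0))  # Mode.MAXIMUM
print(select_mode(load=0.95, temperature_c=85.0, battery_pct=80.0))  # Mode.EFFICIENT
```

Checking battery and thermal limits before load means a heavy workload cannot force the highest mode when the device has no headroom for it.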
Power Efficiency
Despite the large performance leap over prior cores, the Cortex-A76 maintains and even improves upon efficiency. ARM targeted an average 10-15% increase in energy efficiency versus the older Cortex-A75. These gains come from both the processor design itself and the 7nm manufacturing node.
At lower performance modes, the A76 operates within a very frugal 2-3W power envelope. Even at maximum performance, it stays within reasonable 4-5W power draw for short bursts. This level of efficiency is critical for mobile devices where battery life and thermal dissipation are limited.
Area Efficiency
ARM cites only a 0-5% increase in die area for the Cortex-A76 design versus the Cortex-A75. The company achieved this by balancing additional complexity required for higher performance with tuned transistor densities afforded by the 7nm process.
Maintaining a compact core area footprint allows SoC vendors to integrate the Cortex-A76 while leaving room for additional functionality. This area efficiency is key as chip fabrication costs rise at leading-edge process nodes.
Performance Density
Performance density measures performance per mm² of die area. The Cortex-A76 sets new standards for mobile-class CPU core performance density thanks to its combination of high efficiency and high performance.
Compared to the Cortex-A75, ARM says the A76 delivers up to 20% greater performance density. This metric quantifies the potency of each mm² the core uses and showcases ARM’s technical design expertise.
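Because performance density is a ratio, a change in density follows directly from the change in performance and the change in area. The sketch below works through that arithmetic with hypothetical inputs; the numbers are illustrative and are not ARM's measurements.

```python
def performance_density_gain(perf_gain, area_gain):
    """Fractional change in performance per unit area.

    perf_gain and area_gain are fractional changes, e.g. 0.25 for +25%.
    """
    return (1.0 + perf_gain) / (1.0 + area_gain) - 1.0


# Hypothetical: 25% more performance from a core that is 5% larger.
gain = performance_density_gain(perf_gain=0.25, area_gain=0.05)
print(f"Performance density gain: {gain:.1%}")  # prints "Performance density gain: 19.0%"
```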
Comparison to Apple’s CPU Cores
In recent years, Apple’s custom-designed mobile CPU cores have set benchmarks for both CPU performance and efficiency. How does ARM’s off-the-shelf Cortex-A76 compare?
Apple’s latest A13 Bionic SoC powers the iPhone 11 series. Its two high-performance CPU cores deliver excellent performance, but a few key differences exist relative to the Cortex-A76:
- Apple’s CPUs still hold a single-threaded performance lead.
- The A13 Bionic draws roughly 20% more power at peak, so it needs more thermal headroom.
- Process differences are small: the A13 is built on an enhanced TSMC 7nm process, while A76-based chips such as the Kirin 980 and Snapdragon 855 also use TSMC 7nm.
However, the gap has narrowed substantially. The Cortex-A76 offers compelling performance while retaining a different focus on power efficiency and a standard design approach.
Summary: Extremely Fast for a Mobile CPU
In summary, the Cortex-A76 represents a new high point for off-the-shelf mobile CPU performance. It makes significant strides over prior ARM cores and narrows the gap to Apple’s custom cores. Exact speedups depend on implementation factors, but ARM positions the Cortex-A76 as delivering laptop-class performance within mobile power constraints.
Looking forward, ARM’s steady execution on its CPU roadmap inspires confidence that new Cortex-A core generations will continue driving mobile performance upwards through 2021 and beyond.
Cortex-A76 Frequency Operating Ranges
The Cortex-A76 is designed to support a wide range of operating frequencies from 1 GHz to over 3 GHz. Exact frequencies depend on the manufacturing process, SoC implementation, and device OEM tuning.
Some typical frequency ranges for the Cortex-A76 include:
- 1 GHz to 1.8 GHz for lower power operation.
- 1.8 GHz to 2.5 GHz for mainstream mobile performance.
- 2.5 GHz to 3 GHz+ for maximized performance.
Higher-end smartphone and mobile chips released in late 2018/early 2019 generally operate Cortex-A76 cores in the 2.5 GHz to 2.8 GHz range. This balances high performance, power efficiency, and stability.
Frequencies can scale above 3 GHz in specialty configurations focused solely on maximized burst performance. ARM expects the core to scale to at least 3.3 GHz on future manufacturing processes assuming adequate power and thermal headroom is available.
Voltage and Power Ranges
The Cortex-A76 voltage and power characteristics include:
- Minimum voltage around 0.65V (depends on process).
- Peak voltage from 0.9V to 1.1V for high performance.
- Ultra-low power modes can operate around 2-3W.
- Mainstream configurations typically run in the 3-5W range.
- Max performance modes can reach up to 5-7W power draw.
These voltages and power figures apply to the CPU cores themselves. Total SoC power also depends on other integrated components.
To maximize power efficiency, most mobile devices operate in the lower 1 GHz to 2 GHz range during normal use. Higher frequencies and voltages kick in temporarily for short performance bursts.
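The reason devices stay at low frequencies most of the time follows from the standard first-order model of CMOS dynamic power, in which power scales with the square of the supply voltage and linearly with frequency. The sketch below applies that model to voltage and frequency points taken from the ranges quoted above (0.65 V at 1 GHz versus 1.0 V at 2.8 GHz). It assumes equal switched capacitance and activity at both points and ignores leakage, so the result is only indicative.

```python
def dynamic_power_ratio(v_low, f_low_ghz, v_high, f_high_ghz):
    """Ratio of dynamic power between two operating points.

    Uses the first-order model P ~ C * V^2 * f with the same switched
    capacitance C at both points; leakage power is ignored.
    """
    return (v_high / v_low) ** 2 * (f_high_ghz / f_low_ghz)


# Operating points picked from the ranges quoted in this article.
ratio = dynamic_power_ratio(v_low=0.65, f_low_ghz=1.0,
                            v_high=1.0, f_high_ghz=2.8)
print(f"Dynamic power ratio: {ratio:.1f}x")  # prints "Dynamic power ratio: 6.6x"
```

In this model a 2.8x increase in frequency costs about 6.6x the dynamic power, which is why high frequencies are reserved for short bursts.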
Process Nodes
The Cortex-A76 is designed to be manufactured on advanced 7nm or smaller process nodes from foundries like TSMC and Samsung. Some key process details include:
- First implementations on 7nm processes in 2018.
- Future migration to more advanced nodes like 6nm, 5nm, and 4nm over next few years.
- 7nm provides a 20% reduction in die size versus 10nm process.
- Smaller nodes boost efficiency and allow higher frequencies.
The progress to ever-smaller transistor geometries ensures ARM CPU cores like the A76 will maintain their performance scaling and power efficiency curve for years.
SoC Integration Examples
As a licensed IP core, the Cortex-A76 must be integrated into a full system-on-chip (SoC) before it can be put into commercial use. Some example SoCs from the A76 generation include:
- HiSilicon Kirin 980 – First announced SoC integrating Cortex-A76 CPUs along with other ARM IP. Used in Huawei Mate 20 series phones.
- Samsung Exynos 9820 – Powers Samsung’s Galaxy S10 lineup. Often listed alongside A76 designs, but its big cores are Samsung’s custom M4, combined with A75 and A55 CPUs, an ARM Mali-G76 GPU, and an NPU.
- Qualcomm Snapdragon 855 – Top-tier SoC for 2019 Android flagships. Uses semi-custom Kryo 485 cores based heavily on the A76.
The Kirin 980 and Snapdragon 855 leverage the Cortex-A76’s performance to provide significant user experience improvements while maintaining high power efficiency.
Workload Performance
The Cortex-A76 provides substantial performance improvements over prior ARM cores across a wide range of mobile workloads including:
- Web browsing – 15-25% faster page loading and smoother scrolling.
- Apps & Games – Up to 2x faster frame rates and smoother UI interactions.
- Multimedia – Supports 4K HDR playback and high-bitrate streaming with minimal power.
- Productivity Apps – Significantly accelerated document editing, file compression/decompression, etc.
- Digital Assistant Tasks – Faster processing of voice commands and on-device queries.
Real-world testing shows significant user experience gains in day-to-day mobile device use. Much of this comes from sustained mid-range performance improvements rather than just peak burst speeds.
Benchmarks
Cortex-A76 performance can be quantified and compared using common computing benchmarks like:
- Geekbench – General compute benchmark covering both single-threaded and multi-threaded performance.
- AnTuTu – Mixed-workload benchmark intended to approximate real-world use cases.
- 3DMark – Intensive 3D graphics and gaming benchmark.
- MLPerf – Suite of AI/machine learning training and inference benchmarks.
Top-tier mobile SoCs integrating the Cortex-A76 like the Snapdragon 855 post excellent benchmark scores across the board, showcasing both CPU and overall platform performance.
Performance Counters
The Cortex-A76 incorporates extensive performance counters to help developers optimize software and workloads. Key events that can be profiled include:
- Cache hit/miss rates (L1/L2)
- Pipeline stalls
- Branch mispredictions
- Memory bandwidth (derived from bus and memory access counts)
- Instructions per cycle (IPC, derived from retired-instruction and cycle counts)
By leveraging these CPU performance counters in profiling tools, developers can gain insight into performance bottlenecks and memory access patterns. This helps guide optimization work.
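Most of the metrics listed above are ratios of raw event counts rather than counters in their own right. The sketch below shows how they might be derived once a profiling tool has collected the counts. The dictionary keys follow the Armv8 PMU event mnemonics (CPU_CYCLES, INST_RETIRED, L1D_CACHE, L1D_CACHE_REFILL, BR_PRED, BR_MIS_PRED); the sample values and the function name are invented for illustration.

```python
def derive_metrics(counters):
    """Turn raw PMU event counts into commonly used ratios."""
    instructions = counters["INST_RETIRED"]
    return {
        # Instructions retired per core clock cycle.
        "ipc": instructions / counters["CPU_CYCLES"],
        # Fraction of L1 data cache accesses that caused a refill.
        "l1d_miss_rate": counters["L1D_CACHE_REFILL"] / counters["L1D_CACHE"],
        # Fraction of predictable branches that were mispredicted.
        "branch_mispredict_rate": counters["BR_MIS_PRED"] / counters["BR_PRED"],
        # Branch mispredictions per thousand retired instructions.
        "branch_mpki": 1000 * counters["BR_MIS_PRED"] / instructions,
    }


# Invented sample counts, for illustration only.
sample = {
    "CPU_CYCLES": 1_000_000,
    "INST_RETIRED": 2_000_000,
    "L1D_CACHE": 500_000,
    "L1D_CACHE_REFILL": 10_000,
    "BR_PRED": 300_000,
    "BR_MIS_PRED": 6_000,
}
metrics = derive_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```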
Workload Optimization
Software developers can take advantage of the Cortex-A76 performance in their applications by optimizing for mobile workloads. Some techniques include:
- Enabling vectorization and parallelization.
- Tuning memory access patterns to improve cache locality.
- Reducing unnecessary memory copies and allocations.
- Avoiding costly, hard-to-predict branches with branchless code such as conditional-select instructions.
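To make the cache-locality point concrete, the sketch below compares the memory offsets touched when a two-dimensional array stored in row-major order is walked row by row versus column by column. It only illustrates the access pattern; Python itself will not show the speed difference that compiled code on a Cortex-A76 would. The function names are made up for this example.

```python
def flat_offsets(rows, cols, row_order=True):
    """Offsets into a row-major flat buffer, in the order the loops visit them."""
    if row_order:
        return [r * cols + c for r in range(rows) for c in range(cols)]
    return [r * cols + c for c in range(cols) for r in range(rows)]


def strides(offsets):
    """Distance between consecutive accesses, in elements."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


row_walk = flat_offsets(2, 3, row_order=True)
col_walk = flat_offsets(2, 3, row_order=False)
print(row_walk, strides(row_walk))  # [0, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
print(col_walk, strides(col_walk))  # [0, 3, 1, 4, 2, 5] [3, -2, 3, -2, 3]
```

The row-by-row walk touches adjacent elements, so each fetched cache line is fully used. The column-by-column walk jumps by a whole row each step, which on large arrays means a new cache line for almost every access.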