The Cortex-A76 processor from Arm can be built with the optional Armv8-A Cryptographic Extension, a set of instructions for accelerating cryptographic operations. These instructions provide significant performance improvements compared to implementing the same algorithms with general-purpose instructions. This article is a guide to using the Cryptographic Extension on the Cortex-A76.
Introduction to the Cortex-A76 Cryptographic Extension
The cryptographic extension in the Cortex-A76 is designed to accelerate the building blocks of encryption, decryption, hashing, and MAC generation. It provides instructions for AES, SHA-1, and SHA-2 (SHA-224 and SHA-256), plus the PMULL polynomial multiply used by GCM. Instructions for SHA-3, SHA-512, SM3, and SM4 belong to separate optional Armv8.2 extensions that the Cortex-A76 does not implement. The instruction set is fixed in hardware and cannot be extended through microcode updates; additional algorithms and modes are built in software on top of these primitives.
The cryptographic extension is an optional part of the Advanced SIMD instruction set of the Armv8-A architecture, which the Cortex-A76 implements at the Armv8.2-A level. Its instructions operate on the 128-bit SIMD registers V0 to V31 and execute in the core's own SIMD pipelines. For example, there are AES round instructions (AESE, AESD, AESMC, AESIMC), SHA-1 instructions such as SHA1H and SHA1C, and SHA-256 instructions such as SHA256H and SHA256SU0. Each instruction replaces a long sequence of general-purpose instructions, which is where the speedup comes from.
Some key benefits of using the Cortex-A76 cryptographic extension include:
- Much higher performance for cryptography workloads
- Lower energy per byte processed, because far fewer instructions are executed
- Hardware acceleration for AES, SHA-1, SHA-256, and polynomial multiplication
- Any key size or mode (CBC, CTR, GCM, and so on) can be built in software from the round primitives
Overall the cryptographic extension aims to provide significant performance and efficiency benefits for security workloads on ARM processors. Next we’ll cover how to make use of the cryptographic extension in more detail.
Enabling and Accessing the Cryptographic Extension
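The cryptographic instructions are part of the Advanced SIMD instruction set, so they can only execute when floating-point and Advanced SIMD access is enabled. An operating system or bare-metal runtime does this at EL1 by setting the FPEN field (bits [21:20]) of the CPACR_EL1 system register to 0b11, which stops FP/SIMD instructions from trapping at EL0 and EL1. MSR cannot write an immediate to this register, so the value goes through a general-purpose register. For example:
    MOV x0, #0x300000    // FPEN = 0b11; note this overwrites the other CPACR_EL1 fields
    MSR CPACR_EL1, x0    // Enable FP/Advanced SIMD at EL0 and EL1
    ISB                  // Make the change visible to the following instructions
Applications running under an operating system such as Linux do not need this step, because the kernel has already enabled SIMD access.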
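With SIMD access enabled, the cryptographic instructions execute like any other Advanced SIMD instruction. They are decoded and issued by the core itself and run in its SIMD execution pipelines; there is no separate engine to which work is routed. Because the extension is an optional feature of the Cortex-A76, software should confirm that it is present before using it, either by reading the AES, SHA1, and SHA2 fields of ID_AA64ISAR0_EL1 at EL1 or, in user space, by querying the operating system.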
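Whether the instructions are actually available can be checked by decoding the ID_AA64ISAR0_EL1 feature register. The following is a minimal sketch in C of that decoding, written as a pure function so it can be exercised on any host. The struct and function names are hypothetical; the field positions and encodings follow the Arm architecture definition of the register.

```c
#include <stdint.h>
#include <stdbool.h>

/* Feature summary decoded from an ID_AA64ISAR0_EL1 value (hypothetical type). */
struct crypto_features {
    bool aes;     /* AESE, AESD, AESMC, AESIMC */
    bool pmull;   /* 64x64 -> 128-bit polynomial multiply */
    bool sha1;    /* SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, SHA1SU1 */
    bool sha256;  /* SHA256H, SHA256H2, SHA256SU0, SHA256SU1 */
    bool sha512;  /* separate optional extension */
    bool sha3;    /* separate optional extension */
};

/* Extract the 4-bit ID field whose lowest bit is at position `shift`. */
static unsigned id_field(uint64_t reg, unsigned shift)
{
    return (unsigned)((reg >> shift) & 0xF);
}

static struct crypto_features decode_isar0(uint64_t isar0)
{
    struct crypto_features f;
    unsigned aes  = id_field(isar0, 4);   /* AES,  bits [7:4]   */
    unsigned sha1 = id_field(isar0, 8);   /* SHA1, bits [11:8]  */
    unsigned sha2 = id_field(isar0, 12);  /* SHA2, bits [15:12] */
    unsigned sha3 = id_field(isar0, 32);  /* SHA3, bits [35:32] */

    f.aes    = aes  >= 1;
    f.pmull  = aes  >= 2;   /* 0b0010 adds PMULL/PMULL2 on 64-bit elements */
    f.sha1   = sha1 >= 1;
    f.sha256 = sha2 >= 1;
    f.sha512 = sha2 >= 2;   /* 0b0010 adds the SHA-512 instructions */
    f.sha3   = sha3 >= 1;
    return f;
}
```

At EL1 the input value would come from `MRS x0, ID_AA64ISAR0_EL1`. Linux user-space code would normally call `getauxval(AT_HWCAP)` instead and test the `HWCAP_AES`, `HWCAP_PMULL`, `HWCAP_SHA1`, and `HWCAP_SHA2` bits.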
The cryptographic extension has no key, data, or control registers of its own. All operands live in the 32 Advanced SIMD registers, V0 to V31, each 128 bits wide. An AES round key, a block of plaintext, or half of a SHA-256 state each fit in one register.
Data is moved in and out with the ordinary SIMD load and store instructions. For example: LD1 {v0.16b}, [x0] // Load one 16-byte block from the address in x0
The register file is the same on every Cortex-A76; what varies between configurations is whether the cryptographic instructions are implemented at all. Keys are expanded into round keys by software and loaded into vector registers, input is loaded with LD1, and results are written back with ST1. We'll look at example usage later on.
Cryptographic Extension Programming
With SIMD access enabled, cryptographic code can be accelerated by using the new instructions in place of general-purpose instruction sequences. Here are some examples of how to use the extension in code.
AES Encryption
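AES encryption is performed with the AESE and AESMC instructions. AESE applies AddRoundKey, SubBytes, and ShiftRows for one round, and AESMC applies MixColumns; AESD and AESIMC are the corresponding decryption instructions. The round keys are expanded in software beforehand and passed in vector registers. Here is an example of AES-128 encryption in ECB mode, where x3 holds the byte count (a non-zero multiple of 16) and aes128_round_keys is the 176-byte expanded key:
    LDR   x0, =input_data               // Input plaintext
    LDR   x1, =output_data              // Output ciphertext
    LDR   x2, =aes128_round_keys        // 11 round keys, expanded in software
    LD1   {v16.16b-v19.16b}, [x2], #64  // Round keys 0-3
    LD1   {v20.16b-v23.16b}, [x2], #64  // Round keys 4-7
    LD1   {v24.16b-v26.16b}, [x2]       // Round keys 8-10
1:  LD1   {v0.16b}, [x0], #16           // Load 16 bytes of input
    .irp  k, 16, 17, 18, 19, 20, 21, 22, 23, 24
    AESE  v0.16b, v\k\().16b            // AddRoundKey + SubBytes + ShiftRows
    AESMC v0.16b, v0.16b                // MixColumns
    .endr
    AESE  v0.16b, v25.16b               // Final round has no MixColumns
    EOR   v0.16b, v0.16b, v26.16b       // Final AddRoundKey
    ST1   {v0.16b}, [x1], #16           // Store output
    SUBS  x3, x3, #16                   // Decrement count
    B.NE  1b                            // Loop until done
This performs AES-128 encryption on 16-byte blocks with the round keys held in vector registers. AES-192 and AES-256 use the same instructions with 12 and 14 rounds respectively, and modes such as CBC, CTR, and GCM are implemented in software around this block function.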
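To make the division of labour between the AES instructions concrete, here is a small software model in C. It is a sketch for understanding and testing, not how production code should implement AES. The function names `aese` and `aesmc` mirror the instruction names, while `init_sbox`, `expand_key`, and `aes128_encrypt_block` are hypothetical helpers. The model assumes the usual AES byte order, with state byte i at row i mod 4, column i / 4.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

#define ROTL8(x, n) ((uint8_t)(((x) << (n)) | ((x) >> (8 - (n)))))

/* Multiply by x in GF(2^8) modulo the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1B : 0x00));
}

/* Generate the S-box: multiplicative inverse followed by the affine map. */
static void init_sbox(void)
{
    uint8_t p = 1, q = 1;
    do {
        p = (uint8_t)(p ^ xtime(p));   /* p = p * 3 */
        q ^= (uint8_t)(q << 1);        /* q = q / 3, so q stays the inverse of p */
        q ^= (uint8_t)(q << 2);
        q ^= (uint8_t)(q << 4);
        if (q & 0x80)
            q ^= 0x09;
        sbox[p] = (uint8_t)(q ^ ROTL8(q, 1) ^ ROTL8(q, 2) ^ ROTL8(q, 3) ^ ROTL8(q, 4) ^ 0x63);
    } while (p != 1);
    sbox[0] = 0x63;                    /* zero has no inverse */
}

/* AES-128 key schedule: 16-byte key to 11 round keys (176 bytes). */
static void expand_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 1;
    memcpy(rk, key, 16);
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4];
        memcpy(t, rk + i - 4, 4);
        if (i % 16 == 0) {             /* RotWord, SubWord, Rcon */
            uint8_t first = t[0];
            t[0] = (uint8_t)(sbox[t[1]] ^ rcon);
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[first];
            rcon = xtime(rcon);
        }
        for (int j = 0; j < 4; j++)
            rk[i + j] = (uint8_t)(rk[i - 16 + j] ^ t[j]);
    }
}

/* Model of AESE: AddRoundKey, then SubBytes, then ShiftRows. */
static void aese(uint8_t s[16], const uint8_t k[16])
{
    uint8_t t[16];
    for (int i = 0; i < 16; i++)
        t[i] = sbox[s[i] ^ k[i]];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            s[4 * c + r] = t[4 * ((c + r) % 4) + r];  /* row r rotates left by r */
}

/* Model of AESMC: MixColumns on each 4-byte column. */
static void aesmc(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *p = s + 4 * c;
        uint8_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3];
        p[0] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
        p[1] = (uint8_t)(a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3);
        p[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3);
        p[3] = (uint8_t)(xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3));
    }
}

/* Same instruction sequence as the hardware version: 9 x (AESE, AESMC), AESE, EOR. */
static void aes128_encrypt_block(const uint8_t rk[176], const uint8_t in[16], uint8_t out[16])
{
    uint8_t s[16];
    memcpy(s, in, 16);
    for (int r = 0; r < 9; r++) {
        aese(s, rk + 16 * r);
        aesmc(s);
    }
    aese(s, rk + 144);
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(s[i] ^ rk[160 + i]);
}
```

Call `init_sbox()` once before the other functions. Note that AESE applies the round key before the substitution, which is why the last round key is applied with a plain EOR rather than another AESE.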
SHA256 Hashing
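SHA-256 is computed with the SHA256H, SHA256H2, SHA256SU0, and SHA256SU1 instructions. The hash state lives in two vector registers, and each step below performs four of the 64 rounds needed for one 64-byte block. The register assignments are assumptions for this example: v0 holds state words ABCD, v1 holds EFGH, v4 to v7 hold the sixteen message words (already byte-swapped with REV32), and x4 points to the table of 64 SHA-256 round constants:
    LD1       {v16.4s}, [x4], #16      // Next four round constants K
    ADD       v16.4s, v16.4s, v4.4s    // W[0..3] + K[0..3]
    SHA256SU0 v4.4s, v5.4s             // Message schedule update, part 1
    SHA256SU1 v4.4s, v6.4s, v7.4s      // Part 2: v4 now holds W[16..19]
    MOV       v17.16b, v0.16b          // Keep a copy of ABCD
    SHA256H   q0, q1, v16.4s           // Hash update part 1: new ABCD
    SHA256H2  q1, q17, v16.4s          // Hash update part 2: new EFGH
One 64-byte block takes 16 such steps, rotating the roles of v4 to v7 and omitting the schedule update in the last four, after which the state saved at the start of the block is added back to v0 and v1. Message padding and the final length field are handled in software. SHA-1 has an analogous set of instructions (SHA1C, SHA1P, SHA1M, SHA1H, SHA1SU0, SHA1SU1).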
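The SHA-256 instructions only ever see complete 64-byte blocks, so the standard padding has to be produced in software before the last block or two are hashed. The following C sketch shows one way to build that padded tail; the function name and interface are hypothetical and chosen only for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Build the padded final block(s) of a SHA-256 message.
 *   tail      - the last tail_len bytes that did not fill a whole block
 *   tail_len  - must be less than 64
 *   total_len - length of the entire message in bytes
 * Writes 64 or 128 bytes to out and returns how many were written.
 */
static size_t sha256_pad_tail(const uint8_t *tail, size_t tail_len,
                              uint64_t total_len, uint8_t out[128])
{
    /* 0x80 plus the 8-byte length need 9 bytes; otherwise spill into a second block. */
    size_t padded = (tail_len < 56) ? 64 : 128;
    uint64_t bits = total_len * 8;

    memset(out, 0, padded);
    memcpy(out, tail, tail_len);
    out[tail_len] = 0x80;                      /* mandatory single 1 bit */
    for (int i = 0; i < 8; i++)                /* big-endian bit length  */
        out[padded - 1 - i] = (uint8_t)(bits >> (8 * i));
    return padded;
}
```

The returned buffer can then be fed to the block function exactly like the preceding full blocks.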
SHA3 Hashing
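The Cortex-A76 does not implement dedicated SHA-3 instructions, so on this core SHA-3/Keccak is written with ordinary scalar or Advanced SIMD instructions. Cores that implement the optional Armv8.2 SHA3 extension add four instructions that shorten the Keccak round function. They are shown here for reference only and are undefined on the Cortex-A76:
    EOR3 v0.16b, v1.16b, v2.16b, v3.16b   // v0 = v1 ^ v2 ^ v3 (column parity)
    RAX1 v4.2d, v5.2d, v6.2d              // v4 = v5 ^ ROL64(v6, 1)
    XAR  v7.2d, v8.2d, v9.2d, #1          // v7 = ROR64(v8 ^ v9, 1)
    BCAX v10.16b, v11.16b, v12.16b, v13.16b // v10 = v11 ^ (v12 & ~v13) (chi step)
There is no single instruction that performs a whole Keccak round; the round function is composed from these operations, and the different SHA-3 variants differ only in rate, capacity, and output length, which are handled in software.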
HMAC Authentication
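HMAC message authentication builds directly on the hash instructions. HMAC-SHA256 is defined as H((K xor opad) || H((K xor ipad) || message)), so it needs two SHA-256 computations plus some key preparation in software. The sketch below assumes SHA256_hash_hw is a routine built from the SHA256 instructions that hashes the concatenation of two buffers (x0/x1 = first buffer and length, x2/x3 = second buffer and length, x4 = 32-byte output). That calling convention and the buffer names are assumptions made for illustration:
    // Inner hash: H((K xor ipad) || message)
    LDR x0, =hmac_ipad_block    // 64-byte block: key xor 0x36
    MOV x1, #64
    LDR x2, =input_data
    LDR x3, =input_len
    LDR x3, [x3]                // Message length in bytes
    LDR x4, =inner_digest
    BL  SHA256_hash_hw
    // Outer hash: H((K xor opad) || inner digest)
    LDR x0, =hmac_opad_block    // 64-byte block: key xor 0x5c
    MOV x1, #64
    LDR x2, =inner_digest
    MOV x3, #32
    LDR x4, =output_hmac
    BL  SHA256_hash_hw
Both hashes run on the same SHA-256 instructions, so the whole HMAC benefits from the extension; the key padding and concatenation are ordinary software. Other MAC algorithms such as CMAC can be built on the AES instructions in the same way.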
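The key preparation step of HMAC is pure software and is independent of how the hash itself is accelerated. A minimal sketch in C, assuming a 64-byte hash block size as used by SHA-256; the function name is hypothetical, and keys longer than one block are rejected here because the standard requires hashing them first, which is outside this sketch.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HMAC_BLOCK 64  /* SHA-256 block size in bytes */

/*
 * Derive the two 64-byte blocks that start the inner and outer hashes.
 * Returns 0 on success, -1 if the key is longer than one block
 * (such keys must first be replaced by their SHA-256 digest).
 */
static int hmac_sha256_key_blocks(const uint8_t *key, size_t key_len,
                                  uint8_t ipad_block[HMAC_BLOCK],
                                  uint8_t opad_block[HMAC_BLOCK])
{
    uint8_t k0[HMAC_BLOCK] = {0};      /* key, zero-padded to the block size */

    if (key_len > HMAC_BLOCK)
        return -1;
    memcpy(k0, key, key_len);
    for (size_t i = 0; i < HMAC_BLOCK; i++) {
        ipad_block[i] = (uint8_t)(k0[i] ^ 0x36);
        opad_block[i] = (uint8_t)(k0[i] ^ 0x5c);
    }
    return 0;
}
```

For a fixed key, both blocks can be computed once and reused for every message.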
Optimizing Performance
There are several aspects to consider when optimizing performance of cryptographic workloads with the Cortex-A76 extension:
- Process many blocks per call rather than one block or one byte at a time
- Keep round keys and hash state in vector registers across loop iterations
- Interleave several independent blocks to hide instruction latency
- Spread independent streams across the available cores
- Prefer modes whose blocks are independent, such as CTR, so blocks can be interleaved
- Consider data alignment and memory layout to maximize throughput
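Larger requests amortize the fixed costs of a call, such as loading the round keys. Within a request, each AES or SHA instruction has a latency of several cycles, so a loop that handles one block at a time leaves the SIMD pipelines idle while it waits for results. Interleaving the instructions of several independent blocks, for example four CTR-mode counter blocks, keeps the pipelines busy. Chained modes such as CBC encryption cannot be interleaved within one stream, but several independent streams can be.
The processor's memory system should be used in a way that keeps data flowing to the SIMD pipelines. This includes keeping working buffers cache-resident where possible, aligning buffers, and relying on sequential access patterns that the hardware prefetcher can follow.
Because the cryptographic instructions share the core with ordinary code, surrounding work such as key expansion, padding, and counter generation competes for the same resources. Expand each key once and reuse the round keys, and keep this kind of bookkeeping out of the inner loop.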
Cryptographic Extension vs Software-Only Implementations
While the cryptographic extension provides significant performance benefits over cryptography written with general-purpose instructions, there are still tradeoffs to consider in utilizing it effectively:
- Setup cost is low, so even single blocks benefit; the main fixed cost is key expansion
- Requires a core that implements the extension, so portable code needs a fallback path
- Uses the SIMD register file, which kernels and interrupt handlers must save and restore
- The issuing core is busy while the instructions run; nothing is offloaded
- Supports a limited set of primitives in hardware; everything else stays in software
For small requests or single blocks there is no transfer overhead, because the instructions work directly on registers. The per-call costs are key expansion, loading round keys, and, in privileged code, saving and restoring SIMD state. Larger requests spread these costs over more data and give more room for interleaving.
The cryptographic instructions run on the core that issues them, so that core is not free for other work while they execute. To overlap cryptography with other processing, run it on another core or thread, and batch requests so that each call has enough data to reach steady-state throughput.
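The hardware provides primitives, not complete modes. Modes are always composed in software: GCM, for example, combines AES in counter mode (AESE and AESMC) with a GHASH authentication step built on the PMULL polynomial multiply. Algorithms with no matching instructions on the Cortex-A76 must be implemented entirely with general-purpose or Advanced SIMD instructions.
Overall the cryptographic extension is highly beneficial for large volumes of cryptography processing. But careful integration with the surrounding software is needed for optimal utilization and performance.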
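GCM is a good example of a mode assembled from primitives: its authentication step relies on carry-less (polynomial) multiplication, which the extension's PMULL instruction performs on 64-bit operands. The C function below is a bit-by-bit software model of that 64 x 64 to 128-bit operation, intended only to illustrate what the instruction computes. The type and function names are hypothetical.

```c
#include <stdint.h>

/* 128-bit result split into two 64-bit halves. */
struct u128 {
    uint64_t lo;
    uint64_t hi;
};

/* Carry-less multiply: like schoolbook multiplication, but partial products are XORed. */
static struct u128 clmul64(uint64_t a, uint64_t b)
{
    struct u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)                 /* avoid an undefined shift by 64 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

A full GHASH implementation would follow each multiplication with a reduction modulo the GCM polynomial, which is not shown here.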
Comparison to Other ARM Processors
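The Cortex-A76 cryptographic extension provides the same instructions as earlier Armv8-A processors such as the Cortex-A72 and Cortex-A73:
- Cortex-A72: AES, SHA-1, SHA-256, and PMULL instructions
- Cortex-A73: the same instruction set
- Cortex-A76: the same instruction set, with higher throughput
The Cortex-A76 improves cryptographic throughput over these earlier cores through its wider microarchitecture and higher clock speeds rather than through new instructions. The gain depends on the algorithm and on how well the code is scheduled, so it should be measured on the target device rather than assumed. Instruction latency also differs between cores, which affects how many blocks are worth interleaving.
The Cortex-A76 does not add instructions for further algorithms. SHA-512, SHA-3, SM3, and SM4 instructions exist as optional Armv8.2 extensions but are not implemented by this core, and the architecture defines no instructions for DES, Kasumi, or SNOW 3G. Modes such as AES-XTS, AES-CCM, and AES-GCM are software constructions built on the AES and PMULL instructions.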
Future ARM processor generations will likely continue improving cryptographic extension performance and capabilities.
Conclusion
The cryptographic extension in the Cortex-A76 provides substantial benefits for security workloads. Dedicated instructions deliver significant throughput improvements over cryptography written with general-purpose instructions.
To utilize the extension effectively, algorithms should use the cryptographic instructions for their inner loops and keep keys and state in vector registers. Code should interleave independent blocks to keep the SIMD pipelines busy at steady state.
There are still tradeoffs to consider versus software-only implementations, such as the need for a fallback path on cores without the extension. But for large volumes of cryptography, the Cortex-A76 cryptographic extension delivers excellent performance and efficiency.
As cryptography becomes more ubiquitous, processor optimizations like the Cortex-A76 extension will become increasingly important. Dedicated hardware support will enable security processing to scale across everything from Internet of Things devices up to cloud data center servers.