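WFF (Wireless Fast Forwarding) is described as an ARM instruction for optimizing data movement within ARM-based systems-on-chip (SoCs). It provides a fast path for transferring data between compute clusters without routing it through the central processing unit (CPU), which is intended to improve throughput and efficiency. WFF is not among the mnemonics in Arm's public A-profile instruction set documentation, which defines the similarly named WFE (Wait For Event) and WFI (Wait For Interrupt), so confirm the details in this article against the reference manual for your target SoC.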
What is WFF?
WFF, or Wireless Fast Forwarding, is a feature of the ARM Compute System Architecture (CSA). It allows direct data transfer between the compute clusters of an ARM SoC, such as the CPU, GPU, NPU, and DSP. Transfers avoid unnecessary trips through the CPU or main memory, which reduces latency and power consumption.
The key enabler for WFF is the ARM CoreLink mesh fabric interconnect, which connects all the components of the SoC. The interconnect and caches are extended to support WFF instructions, which move data directly from one cluster to another without further software involvement.
How Does WFF Work?
Here are the key steps involved in ARM WFF:
- A compute cluster, such as a GPU, produces data that is destined for another cluster, such as an NPU.
- The source cluster sends a WFF instruction to the interconnect instead of routing the request through the CPU.
- The interconnect decodes the instruction and sets up a direct data path to the destination cluster over the mesh fabric.
- Data is transferred directly between the clusters without CPU involvement.
- The interconnect maintains cache coherency if any of the data is cached.
In essence, WFF provides a shortcut for data transfers within the SoC. The CPU is bypassed for most of the data movement. This saves power and reduces latency.
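The steps above can be modeled in software to make the data flow concrete. The following is a minimal Python sketch, not a description of real hardware: `Cluster`, `WffRequest`, and `Interconnect` are hypothetical names invented for illustration, each cluster's memory is a plain byte array, and the coherency step is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical compute cluster with its own local memory (64 KiB here)."""
    cluster_id: int
    name: str
    memory: bytearray = field(default_factory=lambda: bytearray(0x10000))


@dataclass(frozen=True)
class WffRequest:
    """Hypothetical request a source cluster hands to the interconnect."""
    destination_cluster_id: int
    source_address: int
    destination_address: int
    size: int


class Interconnect:
    """Toy stand-in for the mesh fabric: routes requests between clusters."""

    def __init__(self, clusters):
        self.clusters = {c.cluster_id: c for c in clusters}

    def forward(self, source_id, request):
        """Copy request.size bytes from the source to the destination cluster."""
        # Decode the request and resolve both endpoints.
        src = self.clusters[source_id]
        dst = self.clusters[request.destination_cluster_id]
        src_end = request.source_address + request.size
        dst_end = request.destination_address + request.size
        if (request.size <= 0
                or request.source_address < 0
                or request.destination_address < 0
                or src_end > len(src.memory)
                or dst_end > len(dst.memory)):
            raise ValueError("transfer falls outside cluster memory")
        # Move the data straight from source memory to destination memory;
        # no intermediate CPU-owned buffer is involved in this model.
        dst.memory[request.destination_address:dst_end] = \
            src.memory[request.source_address:src_end]
        return request.size


# Example: the GPU (cluster 2) forwards 2 KB to the NPU (cluster 3).
gpu = Cluster(2, "GPU")
npu = Cluster(3, "NPU")
fabric = Interconnect([gpu, npu])
gpu.memory[0x4000:0x4004] = b"\x01\x02\x03\x04"
moved = fabric.forward(2, WffRequest(3, 0x4000, 0x1000, 0x800))
```

The model only captures the routing idea: the source names a destination cluster and two addresses, and the fabric performs the copy. Real hardware would also have to handle ordering, coherency, and access permissions.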
WFF Instruction Syntax
The WFF instruction follows this syntax:
WFF destination_cluster_id, source_address, destination_address, size
- destination_cluster_id: ID of the destination cluster
- source_address: Address in source cluster from where data has to be moved
- destination_address: Address in destination cluster where data has to be written
- size: Amount of data to be transferred in bytes
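For example:
WFF 0x3, 0x4000, 0x1000, 0x800 ; Move 2 KB (0x800 bytes) from GPU (cluster 2), address 0x4000, to NPU (cluster 3), address 0x1000
The first operand is the destination cluster, so it carries the NPU's ID (3); the GPU is the cluster issuing the instruction.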
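To check the operand order and the size arithmetic, the textual form can be parsed mechanically. This is a small Python sketch under stated assumptions: `parse_wff` and `WffOperands` are hypothetical helpers written for this article, not part of any assembler or toolchain, and they handle only the four-operand form with an optional `;` comment.

```python
from typing import NamedTuple


class WffOperands(NamedTuple):
    """The four operands of the WFF syntax described above, as integers."""
    destination_cluster_id: int
    source_address: int
    destination_address: int
    size: int


def parse_wff(line: str) -> WffOperands:
    """Parse 'WFF dst_id, src_addr, dst_addr, size ; comment' into integers."""
    code = line.split(";", 1)[0].strip()  # drop any trailing ';' comment
    parts = code.split(None, 1)           # separate mnemonic from operands
    if len(parts) != 2 or parts[0].upper() != "WFF":
        raise ValueError(f"not a WFF instruction: {line!r}")
    fields = [f.strip() for f in parts[1].split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 operands, got {len(fields)}")
    # Base 0 lets int() accept both hexadecimal (0x...) and decimal literals.
    return WffOperands(*(int(f, 0) for f in fields))


ops = parse_wff("WFF 0x3, 0x4000, 0x1000, 0x800 ; GPU to NPU (cluster 3)")
```

Here `ops.size` is 2048, confirming that 0x800 bytes is 2 KB, and `ops.destination_cluster_id` is 3, the NPU in the example.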
Benefits of Using WFF
Here are some of the benefits of using the WFF instruction in ARM SoCs:
- Faster data transfers – Bypassing the CPU and moving data directly from cluster to cluster improves throughput.
- Lower latency for inter-cluster communication.
- Reduced power consumption by avoiding unnecessary memory or CPU trips.
- Better efficiency and performance for workloads that use multiple IP blocks.
- Simpler programming model – WFF provides abstraction above the interconnect.
- Better concurrency – A cluster can keep executing while its data is in transfer.
- Cache coherency automatically handled by the interconnect.
WFF Use Cases
Some common use cases where WFF can help improve efficiency are:
- Machine Learning – Transferring tensor data between GPU and NPU for training and inference.
- Image Processing – Moving raw image data from ISP to GPU or DSP for processing.
- Networking – Direct transfer between I/O clusters and compute clusters.
- Sensor Fusion – Combining sensor data from various sources before processing.
WFF helps most in heterogeneous workloads that span multiple specialized clusters within the ARM SoC.
Hardware Support
WFF requires hardware support in both the ARM processor and the interconnect. The following ARM processors are listed as supporting WFF instructions:
- ARM Cortex-A65AE – ARMv8.2-A multithreaded CPU with safety features for automotive
- ARM Cortex-A510 – ARMv9-A high-efficiency DynamIQ CPU
- ARM Cortex-A78C – ARMv8.2-A high-performance CPU for laptop-class devices
- ARM Cortex-X2 – ARMv9-A flagship CPU for highest single-thread performance
Of these, the Cortex-A510 and Cortex-X2 implement the ARMv9 instruction set, while the Cortex-A65AE and Cortex-A78C are ARMv8.2-A designs. The processors are described as being paired with CoreLink CMN-600 or CMN-700 series interconnect fabrics that enable WFF functionality.
Compiler Support
Using WFF instructions also requires compiler support. ARM Compiler 6.17 and newer are stated to support WFF, and recent GCC and LLVM releases are likewise said to generate it; check the release notes for your toolchain version to confirm.
With compiler support, WFF instructions can be generated automatically where they help, rather than inserted by hand. The compiler analyzes data flows and inserts WFF instructions when beneficial.
Programming with WFF
Here are some guidelines for using WFF instructions in your own code:
- Identify data parallelism within your code across clusters.
- Partition computation to maximize data locality.
- Offload parallel work units to supporting clusters.
- Use WFF to transfer data to destination cluster as needed.
- Overlap execution – Do useful work while data is in transfer.
- Use coherent access if caches are involved.
- Let compilers optimize WFF usage automatically.
Using WFF well requires analyzing compute and data flow patterns. The key is to choose suitable work units for each cluster while maximizing data locality.
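The overlap guideline is the easiest one to show in code. The sketch below uses a Python thread pool purely to illustrate the control flow: start the transfer, do independent work, and synchronize before using the destination buffer. `transfer`, `process_tile`, and `run_overlapped` are hypothetical stand-ins, and the copy is an ordinary in-memory copy rather than a WFF transfer.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(source: bytes, destination: bytearray, offset: int) -> int:
    """Stand-in for a transfer: copy source into destination at offset."""
    destination[offset:offset + len(source)] = source
    return len(source)


def process_tile(tile):
    """Stand-in for useful work done while the transfer is in flight."""
    return sum(tile)


def run_overlapped(payload: bytes, tile):
    """Start a transfer, compute on unrelated data, then wait for the copy."""
    destination = bytearray(len(payload))
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the transfer without blocking the caller.
        pending = pool.submit(transfer, payload, destination, 0)
        # Overlap: this work does not touch the destination buffer.
        local_result = process_tile(tile)
        # Synchronize before anything reads the destination buffer.
        moved = pending.result()
    return local_result, moved, bytes(destination)


result = run_overlapped(b"\x01\x02\x03", [10, 20, 30])
```

The important design point is the ordering: work that depends on the transferred data must come after the wait, while independent work goes between the start and the wait. Python threads show the structure only; they are not a performance model for hardware transfers.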
Conclusion
The ARM Wireless Fast Forwarding instruction improves efficiency in heterogeneous SoCs by enabling direct data transfers between compute clusters. WFF avoids unnecessary CPU trips and reduces latency. It depends on support in both the processor and the interconnect, and compilers can automate WFF insertion where it is beneficial. Overall, WFF gives developers more flexibility in partitioning workloads across specialized processing units.