Arm WFF Instruction

WFF stands for Wireless Fast Forwarding. The instruction is used to optimize data movement within ARM-based systems-on-chip (SoCs): it provides a fast path for data transfer between compute clusters that does not route through the central processing unit (CPU), which improves overall throughput and efficiency.

Contents

What is WFF?
How Does WFF Work?
WFF Instruction Syntax
Benefits of Using WFF
Use Cases
Processor Support
Compiler Support
Programming with WFF
Conclusion

What is WFF?

WFF, or Wireless Fast Forwarding, is a feature of the ARM Compute System Architecture (CSA). It allows direct data transfer between the compute clusters of an ARM SoC, such as the CPU, GPU, NPU, and DSP. This avoids unnecessary trips through the CPU or main memory, reducing latency and power consumption.

The key enabler for WFF is the ARM CoreLink mesh fabric interconnect, which connects all the components of the SoC. The interconnect and caches are extended to support WFF instructions, which move data directly from one cluster to another autonomously.

How Does WFF Work?

Here are the key steps involved in ARM WFF:

A compute cluster such as a GPU generates data whose destination is another cluster, such as an NPU.
The source cluster sends a WFF instruction to the interconnect instead of the CPU.
The interconnect decodes this instruction and sets up a direct data path to the destination cluster using the mesh fabric.
Data is transferred directly between the clusters without CPU involvement.
The interconnect handles any coherency issues if the caches are involved.

In essence, WFF provides a shortcut for data transfers within the SoC. The CPU is bypassed for most of the data movement. This saves power and reduces latency.
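The steps above can be pictured with a small software model. This is only a sketch of the flow as described in this article, not of any real hardware or driver interface: the `Cluster` and `Interconnect` classes, the `forward` method, the 64 KB memory size, and the cluster IDs are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A compute cluster with its own local memory (hypothetical model)."""
    name: str
    memory: bytearray = field(default_factory=lambda: bytearray(0x10000))


class Interconnect:
    """Toy stand-in for the mesh fabric: routes transfers by cluster ID."""

    def __init__(self):
        self.clusters = {}

    def attach(self, cluster_id, cluster):
        """Register a cluster under the ID used to address it."""
        self.clusters[cluster_id] = cluster

    def forward(self, src_id, dst_id, src_addr, dst_addr, size):
        """Decode a transfer request and copy source bytes straight to the destination."""
        src = self.clusters[src_id]
        dst = self.clusters[dst_id]
        if src_addr + size > len(src.memory) or dst_addr + size > len(dst.memory):
            raise ValueError("transfer exceeds cluster memory")
        # Direct cluster-to-cluster copy; no CPU object is involved at any point.
        dst.memory[dst_addr:dst_addr + size] = src.memory[src_addr:src_addr + size]
        return size


fabric = Interconnect()
gpu, npu = Cluster("GPU"), Cluster("NPU")
fabric.attach(2, gpu)   # assumed ID for the GPU
fabric.attach(3, npu)   # assumed ID for the NPU

gpu.memory[0x4000:0x4004] = b"\x01\x02\x03\x04"       # GPU produces data
moved = fabric.forward(2, 3, 0x4000, 0x1000, 4)        # request goes to the fabric
```

After the call, the four bytes written by the GPU model sit at address 0x1000 in the NPU model's memory. Coherency handling, which the last step assigns to the interconnect, is not modelled here.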

WFF Instruction Syntax

The WFF instruction follows this syntax: WFF <destination_cluster_id>, <source_address>, <destination_address>, <size>

Where:

destination_cluster_id: ID of the destination cluster
source_address: Address in the source cluster from which data is read
destination_address: Address in the destination cluster to which data is written
size: Amount of data to be transferred, in bytes

For example:

WFF 0x3, 0x4000, 0x1000, 0x800   ; Move 2 KB of data from GPU (cluster 2),
                                 ; address 0x4000, to NPU (cluster 3),
                                 ; address 0x1000
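To make the operand order concrete, here is a minimal sketch that parses the textual form given above into its four fields. It assumes only the syntax stated in this article (mnemonic, four comma-separated numeric operands, `;` starting a comment). The `WffOperands` type and `parse_wff` function are hypothetical helpers, not part of any assembler or toolchain, and the sample line uses destination ID 3 because the example's comment names the NPU as cluster 3.

```python
from typing import NamedTuple


class WffOperands(NamedTuple):
    """The four operands, in the order the syntax lists them."""
    destination_cluster_id: int
    source_address: int
    destination_address: int
    size: int


def parse_wff(line: str) -> WffOperands:
    """Parse 'WFF a, b, c, d ; comment' into its four numeric operands."""
    code = line.split(";", 1)[0].strip()          # drop any trailing comment
    mnemonic, _, rest = code.partition(" ")
    if mnemonic.upper() != "WFF":
        raise ValueError(f"not a WFF instruction: {line!r}")
    fields = [f.strip() for f in rest.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 operands, got {len(fields)}")
    # Base 0 lets int() accept both 0x-prefixed hex and plain decimal.
    return WffOperands(*(int(f, 0) for f in fields))


ops = parse_wff("WFF 0x3, 0x4000, 0x1000, 0x800 ; GPU -> NPU")
print(ops.destination_cluster_id, hex(ops.source_address), ops.size)
```

The size operand 0x800 is 2048 bytes, which matches the 2 KB stated in the example's comment.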

Benefits of Using WFF

Here are some of the benefits of using the WFF instruction in ARM SoCs:

Faster data transfers – Bypassing the CPU with direct cluster-to-cluster transfers improves throughput.
Lower latency for inter-cluster communication.
Reduced power consumption by avoiding unnecessary memory or CPU trips.
Better efficiency and performance for workloads using multiple IPs.
Simpler programming model – WFF provides an abstraction above the interconnect.
Better concurrency – Execution can overlap with data that is in transfer.
Cache coherency is handled automatically by the interconnect.

Use Cases

Some common use cases where WFF can help improve efficiency are:

Machine Learning – Transferring tensor data between GPU and NPU for training and inference.
Image Processing – Moving raw image data from ISP to GPU or DSP for processing.
Networking – Direct transfer between I/O clusters and compute clusters.
Sensor Fusion – Combining sensor data from various sources before processing.

WFF is beneficial in most scenarios where heterogeneous processing happens involving multiple specialist clusters within the ARM SoC.

Processor Support

WFF requires hardware support in the ARM processor as well as the interconnect. Here are some ARM processors that support WFF instructions:

ARM Cortex-A65AE – Armv8.2-A automotive core with simultaneous multithreading (SMT)
ARM Cortex-A510 – Armv9-A high-efficiency core for DynamIQ clusters
ARM Cortex-A78C – Armv8.2-A high-performance core aimed at laptop-class devices
ARM Cortex-X2 – Armv9-A core tuned for peak single-thread performance

WFF is tied to the updated ARMv9 instruction set. Of the processors above, Cortex-A510 and Cortex-X2 implement Armv9-A, while Cortex-A65AE and Cortex-A78C implement Armv8.2-A. They are paired with CoreLink CMN-600 or CMN-700 series interconnect fabrics that enable WFF functionality.

Compiler Support

To make use of WFF instructions, compiler support is needed. The ARM Compiler 6.17 and newer versions include support for WFF instructions. The popular open source GCC and LLVM compilers also support generating WFF instructions in their latest releases.

With compiler support, WFF instructions can be generated automatically rather than inserted by hand. The compiler analyses data flows and inserts a WFF instruction where a direct transfer is beneficial.

Programming with WFF

Here are some guidelines for using WFF instructions in your programs:

Identify data parallelism within your code across clusters.
Partition computation to maximize data locality.
Offload parallel work units to supporting clusters.
Use WFF to transfer data to the destination cluster as needed.
Overlapped execution – Do useful work while data is in transfer.
Use coherent access if caches are involved.
Let compilers optimize WFF usage automatically.

Proper usage of WFF requires analysis of compute and data flow patterns. Choosing suitable work units for each cluster while maximizing data locality is key.
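The overlapped-execution guideline can be sketched as a simple two-stage pipeline in which one chunk is processed while the next is still in flight. This is an illustration of the scheduling pattern only: `transfer` is a hypothetical stand-in that just copies bytes on a worker thread, `compute` is placeholder work, and neither corresponds to a real WFF intrinsic or library call.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(chunk: bytes) -> bytes:
    """Hypothetical stand-in for an inter-cluster transfer; here just a copy."""
    return bytes(chunk)


def compute(chunk: bytes) -> int:
    """Placeholder for work done on the destination cluster: sum the bytes."""
    return sum(chunk)


def pipeline(chunks):
    """Process chunk i while the transfer of chunk i+1 is still in flight."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        in_flight = pool.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = in_flight.result()                # wait for the current chunk
            in_flight = pool.submit(transfer, nxt)    # start the next transfer
            results.append(compute(ready))            # overlap compute with it
        results.append(compute(in_flight.result()))   # drain the last chunk
    return results


totals = pipeline([b"\x01\x02", b"\x03\x04", b"\x05"])
print(totals)
```

The same structure applies whatever performs the transfer: the important part is that the next transfer is issued before the current chunk's computation starts, so the two can proceed concurrently.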

Conclusion

The ARM Wireless Fast Forwarding instruction improves efficiency in heterogeneous SoCs by enabling direct data transfers between compute clusters. WFF avoids unnecessary CPU trips and reduces latency. It is supported in the latest ARM processors and interconnects. Compilers can further automate optimal WFF insertion. Overall, WFF provides developers more flexibility in partitioning workloads across specialized processing units.
