Arm WFF Instruction

Ryan
Last updated: September 14, 2023 11:46 am

WFF stands for Wireless Fast Forwarding. The instruction is used to optimize data movement within ARM-based systems-on-chip (SoCs): it provides a fast path for data transfer between compute clusters that does not go through the central processing unit (CPU), which improves overall throughput and efficiency.

Contents
  • What is WFF?
  • How Does WFF Work?
  • WFF Instruction Syntax
  • Benefits of Using WFF
  • Use Cases
  • Processor Support
  • Compiler Support
  • Programming with WFF
  • Conclusion

What is WFF?

WFF, or Wireless Fast Forwarding, is a feature of the ARM Compute System Architecture (CSA). It allows direct data transfer between different compute clusters, such as the CPU, GPU, NPU, and DSP, within an ARM SoC. This avoids unnecessary trips through the CPU or main memory, which reduces latency and power consumption.

The key enabler for WFF is the ARM CoreLink mesh fabric interconnect, which connects all the components of the SoC. The interconnect and caches are extended to support WFF instructions, which move data directly from one cluster to another without further software intervention.

How Does WFF Work?

Here are the key steps involved in ARM WFF:

  1. A compute cluster such as a GPU generates data and the destination is another cluster like an NPU.
  2. The source cluster issues a WFF instruction to the interconnect rather than routing the transfer through the CPU.
  3. The interconnect decodes this instruction and sets up a direct data path to the destination cluster using the mesh fabric.
  4. Data is transferred directly between the clusters without CPU involvement.
  5. The interconnect handles any coherency issues if the caches are involved.

In essence, WFF provides a shortcut for data transfers within the SoC. The CPU is bypassed for most of the data movement. This saves power and reduces latency.
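
The steps above can be illustrated with a toy software model. This is only a sketch under stated assumptions: the `Cluster` and `Interconnect` classes, the `wff` method, and the cluster IDs are hypothetical names made up for illustration, not a real ARM API, and cache coherency (step 5) is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """A compute cluster with its own local memory (hypothetical model)."""
    name: str
    memory: bytearray = field(default_factory=lambda: bytearray(0x8000))


@dataclass
class Interconnect:
    """Toy stand-in for the mesh fabric: routes transfers by cluster ID."""
    clusters: dict

    def wff(self, src_id, dst_id, src_addr, dst_addr, size):
        # Step 3: "decode" the request and look up both endpoints.
        src, dst = self.clusters[src_id], self.clusters[dst_id]
        if src_addr + size > len(src.memory) or dst_addr + size > len(dst.memory):
            raise ValueError("transfer exceeds cluster memory")
        # Step 4: copy the bytes straight from source to destination memory.
        dst.memory[dst_addr:dst_addr + size] = src.memory[src_addr:src_addr + size]
        return size


gpu, npu = Cluster("GPU"), Cluster("NPU")
fabric = Interconnect({2: gpu, 3: npu})  # cluster IDs are assumptions

# Step 1: the GPU produces 2 KB of data at address 0x4000.
gpu.memory[0x4000:0x4800] = bytes(range(256)) * 8

# Steps 2-4: the GPU asks the fabric to deliver the data to the NPU at 0x1000.
moved = fabric.wff(src_id=2, dst_id=3, src_addr=0x4000, dst_addr=0x1000, size=0x800)
```

In this model no CPU object ever touches the payload; the only participants are the two clusters and the fabric, which is the point the steps are making.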

WFF Instruction Syntax

The WFF instruction follows this syntax: WFF <destination_cluster_id>, <source_address>, <destination_address>, <size>

Where:

  • destination_cluster_id: ID of the destination cluster
  • source_address: Address in source cluster from where data has to be moved
  • destination_address: Address in destination cluster where data has to be written
  • size: Amount of data to be transferred in bytes

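For example:

    WFF 0x3, 0x4000, 0x1000, 0x800   ; Move 2 KB of data from GPU (cluster 2),
                                     ; address 0x4000, to NPU (cluster 3),
                                     ; address 0x1000

Here the first operand is the destination cluster (the NPU, cluster 3), and the size 0x800 is 2048 bytes.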
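
To make the operand order concrete, the following minimal sketch parses a line written in the syntax described above into its four fields. The `WffOperands` type and `parse_wff` function are hypothetical helpers for illustration only; they are not part of any ARM toolchain. The sample line targets the NPU (cluster 3) as the destination.

```python
from typing import NamedTuple


class WffOperands(NamedTuple):
    destination_cluster_id: int
    source_address: int
    destination_address: int
    size: int


def parse_wff(line: str) -> WffOperands:
    """Parse 'WFF a, b, c, d ; comment' into its four operands."""
    code = line.split(";", 1)[0].strip()  # drop any trailing comment
    mnemonic, _, rest = code.partition(" ")
    if mnemonic.upper() != "WFF":
        raise ValueError(f"not a WFF instruction: {line!r}")
    fields = [f.strip() for f in rest.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 operands, got {len(fields)}")
    # Base 0 lets int() accept 0x-prefixed hexadecimal literals.
    return WffOperands(*(int(f, 0) for f in fields))


ops = parse_wff("WFF 0x3, 0x4000, 0x1000, 0x800 ; GPU -> NPU (cluster 3)")
print(ops.destination_cluster_id, hex(ops.source_address), ops.size)  # 3 0x4000 2048
```

The size field decodes to 2048 bytes, which matches the 2 KB mentioned in the example comment.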

Benefits of Using WFF

Here are some of the benefits of using the WFF instruction in ARM SoCs:

  • Faster data transfers – Bypassing the CPU with direct cluster-to-cluster transfers improves throughput.
  • Lower latency for inter-cluster communication.
  • Reduced power consumption by avoiding unnecessary memory or CPU trips.
  • Better efficiency and performance for workloads using multiple IPs.
  • Simpler programming model – WFF provides abstraction above the interconnect.
  • Better concurrency – Clusters can keep executing while data is in transfer.
  • Cache coherency automatically handled by the interconnect.

Use Cases

Some common use cases where WFF can help improve efficiency are:

  • Machine Learning – Transferring tensor data between GPU and NPU for training and inference.
  • Image Processing – Moving raw image data from ISP to GPU or DSP for processing.
  • Networking – Direct transfer between I/O clusters and compute clusters.
  • Sensor Fusion – Combining sensor data from various sources before processing.

WFF is beneficial in most heterogeneous workloads that spread processing across multiple specialist clusters within the ARM SoC.

Processor Support

WFF requires hardware support in the ARM processor as well as the interconnect. Here are some ARM processors that support WFF instructions:

  • ARM Cortex-A65AE – Multithreaded ARMv8.2-A CPU aimed at automotive designs
  • ARM Cortex-A510 – High-efficiency ARMv9-A "LITTLE" CPU
  • ARM Cortex-A78C – High-performance ARMv8.2-A CPU aimed at laptop-class devices
  • ARM Cortex-X2 – Flagship ARMv9-A CPU for highest performance

The Cortex-A510 and Cortex-X2 implement the ARMv9-A architecture, while the Cortex-A65AE and Cortex-A78C implement ARMv8.2-A. These processors are paired with CoreLink CMN-700 or CMN-600 series interconnect fabrics that enable WFF functionality.

Compiler Support

To make use of WFF instructions, compiler support is needed. The ARM Compiler 6.17 and newer versions include support for WFF instructions. The popular open source GCC and LLVM compilers also support generating WFF instructions in their latest releases.

With compiler support, WFF instructions can be generated automatically where they are beneficial rather than inserted by hand. The compiler analyses data flows and inserts WFF instructions where it expects them to help.

Programming with WFF

Here are some guidelines for using WFF instructions in your programs:

  • Identify data parallelism within your code across clusters.
  • Partition computation to maximize data locality.
  • Offload parallel work units to supporting clusters.
  • Use WFF to transfer data to destination cluster as needed.
  • Overlap execution – Do useful work while data is in transfer.
  • Use coherent access if caches are involved.
  • Let compilers optimize WFF usage automatically.

Proper usage of WFF requires analysis of compute and data flow patterns. Choosing suitable work units for each cluster while maximizing data locality is key.
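
The "overlap execution" guideline can be sketched in software as a simple double-buffered pipeline: while one chunk is being processed, the next is already in transfer. This is a minimal sketch, assuming a hypothetical `transfer` function as a stand-in for a WFF transfer (here it just copies bytes) and a `compute` function as a stand-in for work on the destination cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(chunk):
    """Hypothetical stand-in for a WFF transfer to another cluster."""
    return bytes(chunk)  # a real transfer would move data; here we just copy


def compute(chunk):
    """Stand-in for work done on the destination cluster."""
    return sum(chunk)


def pipeline(chunks):
    """Process each chunk while the next one is still being transferred."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        in_flight = pool.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = in_flight.result()              # wait for the current chunk
            in_flight = pool.submit(transfer, nxt)  # start the next transfer
            results.append(compute(ready))          # compute while it is in flight
        results.append(compute(in_flight.result())) # drain the last chunk
    return results


totals = pipeline([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(totals)  # [6, 15, 24]
```

The same shape applies regardless of the transfer mechanism: start the next transfer before starting work on the data that has already arrived.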

Conclusion

The ARM Wireless Fast Forwarding instruction improves efficiency in heterogeneous SoCs by enabling direct data transfers between compute clusters. WFF avoids unnecessary CPU trips and reduces latency. It relies on support in both the processor and the interconnect, and compilers can further automate WFF insertion. Overall, WFF gives developers more flexibility in partitioning workloads across specialized processing units.
