The Cortex-M1 processor has an instruction tightly coupled memory (ITCM) that allows lower-latency access than external RAM. Code and data held in ITCM can be accessed in a single cycle, whereas accesses to external RAM incur higher latency from factors such as bus contention, wait states, and caching.
ITCM Overview
The ITCM on the Cortex-M1 is on-chip SRAM tightly integrated with the processor core. Rather than sitting on the shared AMBA AHB system bus, it connects to the core through a dedicated interface and serves as high-speed instruction and data memory. The key characteristics of ITCM are:
- Size configurable when the core is implemented, commonly a few KB to tens of KB
- Single-cycle access latency
- Zero wait state memory
- Dedicated bus interface to core
- Enables deterministic real-time task execution
The tight coupling means there is a dedicated bus interface between the core and ITCM, separate from the system bus. This removes the need for arbitration and buffering, enabling single-cycle, zero-wait-state access. Code and data stored in ITCM can therefore be accessed quickly and with no timing variability.
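As a sketch of how code is commonly pinned into ITCM with a GNU toolchain (the .itcm_code section name is an assumption that the project's linker script would map onto the ITCM address range, not a fixed part of Cortex-M1 tooling):

#include <stdint.h>

/* Hypothetical placement of a hot function into ITCM (GNU toolchain).
   Assumes the linker script maps the ".itcm_code" input section onto
   the ITCM address range; the section name is illustrative. */
__attribute__((section(".itcm_code")))
void process_samples(int32_t *buf, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        buf[i] >>= 1;   /* per-sample work, fetched from ITCM */
}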
External RAM Access Latency
In contrast to ITCM, external RAM connected to the Cortex-M1 via the AMBA AHB bus experiences higher and more variable latency. The main sources of added latency are:
- Bus contention – Core competes with DMA and peripherals
- Wait states – External RAM slower than processor clock
- Buffering – Bus transfers broken into bursts
- Caching – Cache misses induce stall cycles
The AHB bus can service only one master at a time, so the core may sit through wait cycles while a DMA controller or peripheral master holds the bus. Wait states are added by the memory controller to compensate for slow external RAM timings. Buffering and caching add latency to individual accesses but can improve overall throughput.
Bus Contention
As a shared resource, the AHB bus may be busy servicing requests from other bus masters when the Cortex-M1 core tries to access external memory. For example, a DMA transfer to move data between peripherals and RAM would occupy the bus during the transfer. This blocks the core from using it, adding latency.
Arbitration decides which master gets access when several request the bus simultaneously. The arbiter follows a fixed-priority scheme, with the core usually given the highest priority, but even the highest-priority master must wait for any in-progress transfer to complete, which delays its access to external RAM.
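Fixed-priority selection can be modeled in a few lines of C; this is a toy illustration of the scheme, not the bus hardware, and the bit assignments are assumptions:

#include <stdint.h>

/* Toy model of fixed-priority arbitration: each bit of 'requests' is a
   bus master, with bit 0 (assumed to be the core) the highest priority. */
static uint32_t grant(uint32_t requests)
{
    return requests & (~requests + 1u);  /* isolate the lowest set bit */
}

/* grant(0x6) == 0x2: master 1 beats master 2; the loser simply waits. */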
Wait States
To compensate for slower external RAM timings, wait states are added during transfers over the AHB bus. For example, the Cortex-M1's 3-stage pipeline can produce a new bus request every clock cycle, but the RAM may require a 50ns access time while the CPU clock period is only 10ns.
So the bus controller inserts idle cycles (wait states) after the address and control signals are sent, giving the RAM time to respond. Each wait state costs one clock cycle per access; in this example roughly four are needed, turning a nominally single-cycle access into five cycles. Slower memories need proportionally more wait states.
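The arithmetic generalizes as a ceiling division; a minimal sketch (the function name is illustrative):

#include <stdint.h>

/* Wait states = total cycles needed to cover the memory access time,
   minus the one cycle a zero-wait-state access would take. */
static uint32_t wait_states(uint32_t t_access_ns, uint32_t t_clk_ns)
{
    return (t_access_ns + t_clk_ns - 1) / t_clk_ns - 1;  /* ceil, then -1 */
}

/* Example from the text: wait_states(50, 10) == 4, i.e. five cycles per access. */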
Buffering
Buffering here refers to breaking a bus transfer into a sequence of smaller burst transactions. Instead of one full-width aligned access per read or write, narrower transfers are issued. This adds latency to each access but improves bus efficiency and memory throughput.
For example, a 32-bit word fetch over an 8-bit external memory interface would be carried out as four 8-bit transfers. That means initiating four bus transactions instead of one, adding overhead cycles, but the narrower transfers let other traffic interleave between them, avoiding wasted bandwidth.
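A sketch of what that split looks like in software terms, assuming a memory-mapped 8-bit interface (the function name and base pointer are illustrative):

#include <stdint.h>

/* Compose one 32-bit value from four narrow reads, little-endian;
   models a word fetch split into four byte accesses on an 8-bit bus. */
static uint32_t read32_via_byte_bus(volatile const uint8_t *base)
{
    return (uint32_t)base[0]
         | ((uint32_t)base[1] << 8)
         | ((uint32_t)base[2] << 16)
         | ((uint32_t)base[3] << 24);
}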
Caching
The Cortex-M1 core itself includes no caches; where instruction caching is used, it is added at the system level, for example as a small cache in front of external memory. When present, a cache reduces average access latency in exchange for more variability: a hit returns the cached word with low latency, while a miss stalls the core pipeline during the fetch from external memory.
If the instruction working set fits in the cache, hit rates can be high. But conflict misses will occur as cache lines are evicted by new fetches, and cache performance also depends heavily on software behavior such as loops and branches.
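The expected cost follows the standard average-latency model; a minimal sketch, where the 95% hit rate and 12-cycle miss penalty are assumed illustrative numbers:

/* Average access latency = hit_rate * t_hit + (1 - hit_rate) * t_miss */
static double avg_latency(double hit_rate, double t_hit, double t_miss)
{
    return hit_rate * t_hit + (1.0 - hit_rate) * t_miss;
}

/* e.g. avg_latency(0.95, 1.0, 12.0) yields about 1.55 cycles on average
   (95% hits at 1 cycle, 5% misses at an assumed 12 cycles). */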
Measuring Latency
To compare ITCM and external RAM access times, test software can measure read latency directly by timing individual memory accesses. A minimal C sketch, assuming the Cortex-M1's optional SysTick timer has been configured to count at the core clock:
/* SysTick->VAL (CMSIS) is a free-running 24-bit down-counter, so
   start - end, masked to 24 bits, gives the elapsed cycles. */
uint32_t start = SysTick->VAL;
volatile uint32_t data = *(volatile uint32_t *)address;  /* timed read */
uint32_t end = SysTick->VAL;
uint32_t latency = (start - end) & 0x00FFFFFFu;
Repeating the measurement and averaging many samples gives good precision. The timer must have enough resolution to expose single-cycle differences; on the Cortex-M1, the optional SysTick timer clocked at the core frequency makes this possible.
Another approach is to load a small software benchmark into ITCM and into external RAM separately. The benchmark should perform deterministic operations, such as tight loops with counter increments. Comparing the execution times then gives the overall latency difference, including instruction fetches.
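A minimal sketch of such a benchmark, assuming the linker places each copy in the named section (the section names match the earlier placement sketch and are illustrative):

#include <stdint.h>

/* Identical deterministic loops, linked into different memories so the
   only difference is where instruction fetches come from. */
__attribute__((section(".itcm_code"), noinline))
static uint32_t bench_itcm(uint32_t iterations)
{
    uint32_t counter = 0;
    while (iterations--)
        counter++;              /* fetch-bound: memory latency dominates */
    return counter;
}

__attribute__((section(".extram_code"), noinline))
static uint32_t bench_extram(uint32_t iterations)
{
    uint32_t counter = 0;
    while (iterations--)
        counter++;
    return counter;
}

/* Time each call with SysTick as above; cycles per iteration approximate
   the per-fetch latency of each memory. */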
Typical Results
Exact latency numbers will depend on the Cortex-M1 clock speed and external RAM characteristics. But relative differences highlight the ITCM advantage:
- ITCM: 1 cycle access latency
- External RAM: 10+ cycles latency
The 10x lower ITCM access time is significant for performance. Loading code into ITCM avoids wait states and stalling for cache misses. Tight loops with many iterations will benefit the most from single-cycle ITCM fetches.
For data accesses, ITCM can similarly provide 5-10x lower latency than cached external RAM. This reduces pipeline stalls for load/store instructions. DMA transfers will also see faster access times when using ITCM buffers.
Optimizing for Low Latency
To leverage the low ITCM latency, developers need to carefully structure their Cortex-M1 software. Tips include:
- Place performance-critical code and data in ITCM
- Use ITCM for DMA buffers (see the sketch after this list)
- Reduce working set size to fit in ITCM
- Manually control any caching
- Minimize wait states when accessing external RAM
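As a sketch of how a DMA buffer can be pinned into ITCM (the .itcm_data section name is an assumption the linker script would map onto the ITCM range, and this presumes a system where the DMA engine can reach the ITCM, for example through the second port of an FPGA block RAM):

#include <stdint.h>

/* Hypothetical DMA receive buffer placed in ITCM via a linker-mapped
   section; 4-byte alignment assumed for the DMA engine. */
__attribute__((section(".itcm_data"), aligned(4)))
static volatile uint8_t dma_rx_buf[512];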
The Cortex-M1 is an early ARM processor aimed at embedded and FPGA applications. Introduced in 2007 as the first Cortex-M core designed specifically for FPGA implementation, it belongs to the Cortex-M series now used extensively in IoT and edge devices. Understanding the latency characteristics of its ITCM and external RAM accesses lets developers optimize software performance on this foundational MCU architecture.