Hard Fault behavior - timing, randomness, root causes

A Hard Fault on an ARM Cortex-M processor is the catch-all fault exception: it is taken when the core detects an error that no other enabled fault handler can service. HardFault has a fixed priority of -1, so it preempts everything except NMI and reset. The handler is ordinary code and can in principle return, but in practice most firmware treats a Hard Fault as fatal and records diagnostics before resetting. Understanding the timing, apparent randomness, and root causes of Hard Faults is critical for debugging Cortex-M systems.

Contents

When Do Hard Faults Occur?
Hardware vs Software Triggered Hard Faults
When Do They NOT Occur?
Root Causes of Hard Faults
Invalid Memory Accesses
Unaligned Memory Accesses
Divide By Zero
FPU Errors
Unhandled Exceptions
Stack Overflow
Critical System Errors
Identifying the Root Cause
Timing and Randomness Factors
Asynchronous Nature
Intermittent Fault Conditions
Complex Hardware Interactions
Heisenbugs
Strategies for Dealing with Timing and Randomness
Summary

When Do Hard Faults Occur?

A Hard Fault can occur at any point during program execution on a Cortex-M CPU. Most faults are synchronous: they are raised by the instruction that performed the illegal operation. The exception is the imprecise bus fault, typically caused by a buffered write, which is reported some instructions after the offending access. Faults look random because they surface only when a particular combination of data, interrupt timing, and hardware state occurs. When the fault is taken, the processor pushes a context frame onto the active stack and vectors to the Hard Fault handler. From the software perspective, the result is an abrupt and unexpected break in program flow, much like an unexpected reset.
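The context frame mentioned above has a fixed layout, so a handler can read the interrupted program counter from it. The following is a minimal sketch of that layout for the basic frame (no floating point state). The names fault_frame_t and fault_frame_read are hypothetical, and on a real target the stacked stack pointer (MSP or PSP, selected by bit 2 of EXC_RETURN) would be passed in from a small assembly stub that is not shown here.

```c
#include <stdint.h>

/* Basic exception frame, in the order the core pushes it
 * (lowest address first). Hypothetical type name. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address of the interrupted instruction */
    uint32_t xpsr;  /* program status register */
} fault_frame_t;

/* Copy the eight stacked words that start at the saved stack pointer. */
fault_frame_t fault_frame_read(const uint32_t *stacked_sp)
{
    fault_frame_t f;
    f.r0   = stacked_sp[0];
    f.r1   = stacked_sp[1];
    f.r2   = stacked_sp[2];
    f.r3   = stacked_sp[3];
    f.r12  = stacked_sp[4];
    f.lr   = stacked_sp[5];
    f.pc   = stacked_sp[6];
    f.xpsr = stacked_sp[7];
    return f;
}
```

For a precise fault, the pc field points at the faulting instruction; for an imprecise bus fault it may point a few instructions later.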

Hardware vs Software Triggered Hard Faults

Hard Faults originate from both hardware and software causes. Hardware issues like clock errors, power glitches, and electrical noise can corrupt processor state and trigger a Hard Fault. These occurrences are rare and random in nature. Software bugs such as dereferencing invalid pointers, calling through corrupted function pointers, and overflowing the stack are much more common sources of Hard Faults. While the root causes differ, the processor handles hardware and software triggered faults the same way: it stops the current instruction stream and takes the Hard Fault exception.

When Do They NOT Occur?

Hard Faults are not deferred behind ordinary interrupts. HardFault has a fixed priority of -1, higher than every configurable exception, so a fault inside an interrupt handler preempts that handler immediately. Only NMI and reset rank higher. If a fault occurs while the core is already running the Hard Fault or NMI handler, no handler can take it and the core enters lockup instead. Hard Faults also do not occur while the processor is asleep, because no instructions are executing; a fault can only arise once the core wakes and resumes execution.
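The priority rules that decide whether a fault runs its own handler, escalates to a Hard Fault, or locks up the core can be written as a small decision function. This is a simplified model for synchronous faults on ARMv7-M, not a description of any register interface; the names fault_outcome, PRIO_THREAD, and the enum values are hypothetical. Lower numbers mean higher priority, as in the architecture.

```c
#include <stdbool.h>

typedef enum {
    TAKE_FAULT_HANDLER,     /* MemManage/BusFault/UsageFault handler runs */
    ESCALATE_TO_HARDFAULT,  /* fault is handled by HardFault instead */
    LOCKUP                  /* no handler can run; the core locks up */
} fault_outcome_t;

#define PRIO_HARDFAULT (-1)   /* fixed by the architecture */
#define PRIO_NMI       (-2)   /* fixed by the architecture */
#define PRIO_THREAD    256    /* assumed stand-in for "no exception active" */

/* current_prio: execution priority when the fault occurs.
 * handler_enabled / handler_prio: state of the configurable fault handler. */
fault_outcome_t fault_outcome(int current_prio, bool handler_enabled,
                              int handler_prio)
{
    if (current_prio <= PRIO_HARDFAULT) {
        return LOCKUP;  /* already in HardFault or NMI */
    }
    if (!handler_enabled || handler_prio >= current_prio) {
        return ESCALATE_TO_HARDFAULT;  /* handler disabled or cannot preempt */
    }
    return TAKE_FAULT_HANDLER;
}
```

The model leaves out special cases such as imprecise bus faults and the BFHFNMIGN control bit, so treat it as an aid to reasoning rather than a full specification.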

Root Causes of Hard Faults

There are several common root causes of Hard Faults in Cortex-M systems:

Invalid Memory Accesses

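On Cortex-M3 and later cores, an access to an address with no memory or peripheral behind it raises a BusFault, and an access that violates an MPU region raises a MemManage fault; either one escalates to a Hard Fault when its handler is not enabled. On Cortex-M0/M0+, which have no separate fault handlers, every such error is a Hard Fault. Typical culprits are wild or dangling pointers, out of bounds array accesses, and stack overflows. A NULL pointer read does not always fault, because address 0 is usually mapped (often to the vector table). Enabling the MPU to restrict the accessible regions helps catch such accesses.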
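Where a pointer comes from an untrusted source, such as a command parser or a message buffer, a range check before the access can turn a would-be fault into an ordinary error. The sketch below assumes a single RAM region; RAM_BASE and RAM_SIZE are placeholder values that must be replaced with the figures from the device's memory map, and range_in_ram is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory map, for illustration only. */
#define RAM_BASE 0x20000000u
#define RAM_SIZE 0x00020000u   /* 128 KiB */

/* Return true when [addr, addr + len) lies entirely inside RAM.
 * Written so that no intermediate value can wrap around. */
bool range_in_ram(uint32_t addr, uint32_t len)
{
    if (len == 0u || len > RAM_SIZE) {
        return false;
    }
    if (addr < RAM_BASE) {
        return false;
    }
    return (addr - RAM_BASE) <= (RAM_SIZE - len);
}
```

A check like this complements the MPU rather than replacing it: it only covers the call sites where it is used.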

Unaligned Memory Accesses

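ARMv6-M cores (Cortex-M0/M0+/M1) do not support unaligned accesses at all: any unaligned halfword or word access causes a Hard Fault. ARMv7-M cores (Cortex-M3/M4/M7) handle unaligned single word and halfword loads and stores in hardware unless the UNALIGN_TRP bit in the CCR is set, but multi-register and doubleword instructions such as LDM, STM, LDRD, and STRD always fault when the address is not word aligned. On these cores the error is a UsageFault, which escalates to a Hard Fault if the UsageFault handler is disabled.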
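A common way to avoid alignment faults when reading fields out of a byte buffer is to copy the bytes with memcpy instead of casting the pointer. The helper below is a minimal sketch with a hypothetical name, read_u32_unaligned; compilers typically turn the fixed-size memcpy into whatever access sequence is safe for the target.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value in native byte order from an address with
 * any alignment, without dereferencing a misaligned uint32_t pointer. */
uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Casting a byte pointer to uint32_t * and dereferencing it is the pattern this replaces; it may work on one core and fault on another.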

Divide By Zero

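Integer division by zero faults only when trapping is enabled. On ARMv7-M cores, SDIV and UDIV return 0 for a zero divisor unless the DIV_0_TRP bit in the CCR is set, in which case a UsageFault is raised and escalates to a Hard Fault if the UsageFault handler is disabled. Cortex-M0/M0+ have no hardware divide instruction, so the outcome there depends on the compiler's runtime library. Either way, software should check divisors for 0 before performing divide operations.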
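The divisor check can be wrapped in a helper so that callers must handle the failure case. This is a minimal sketch; safe_div_i32 is a hypothetical name. It also rejects INT32_MIN / -1, which is undefined behaviour in C even though it is not a divide by zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Divide num by den. Returns false and leaves *out untouched when the
 * division is not representable (zero divisor or signed overflow). */
bool safe_div_i32(int32_t num, int32_t den, int32_t *out)
{
    if (den == 0) {
        return false;
    }
    if (num == INT32_MIN && den == -1) {
        return false;
    }
    *out = num / den;
    return true;
}
```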

FPU Errors

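Executing a floating point instruction while the FPU is disabled raises a UsageFault with the NOCP flag set, which escalates to a Hard Fault if the UsageFault handler is not enabled. This applies to FPU-equipped cores such as the Cortex-M4F and Cortex-M7. Make sure to enable the FPU through the CPACR before the first floating point instruction runs, including any that the compiler emits in startup code.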
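Enabling the FPU means granting full access to coprocessors CP10 and CP11 in the CPACR, which sits at address 0xE000ED88 on ARMv7-M. The sketch below takes the register as a pointer so the bit manipulation can be exercised off-target; fpu_enable and fpu_is_enabled are hypothetical names, and vendor headers normally provide an equivalent through CMSIS.

```c
#include <stdbool.h>
#include <stdint.h>

/* CP10 access is bits [21:20], CP11 is bits [23:22]; 0b11 = full access. */
#define CPACR_CP10_CP11_FULL (0xFu << 20)

/* On target, pass (volatile uint32_t *)0xE000ED88u. */
void fpu_enable(volatile uint32_t *cpacr)
{
    *cpacr |= CPACR_CP10_CP11_FULL;
    /* On real hardware, issue DSB and ISB barriers here so the change
     * takes effect before the next floating point instruction. */
}

bool fpu_is_enabled(const volatile uint32_t *cpacr)
{
    return (*cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}
```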

Unhandled Exceptions

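The configurable fault exceptions (MemManage, BusFault, and UsageFault) are disabled after reset. While one of them is disabled, any fault it would have handled escalates to a Hard Fault. Enabling these handlers in the SHCSR, and giving each a real implementation rather than a default infinite loop, reports faults by type and narrows the search. NMI is always enabled and does not escalate, but it still needs a meaningful handler.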
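On ARMv7-M the three configurable fault handlers are switched on through enable bits in the SHCSR at address 0xE000ED24. As a sketch, with the register passed as a pointer so the logic can be checked off-target (fault_handlers_enable is a hypothetical name):

```c
#include <stdint.h>

#define SHCSR_MEMFAULTENA (1u << 16)
#define SHCSR_BUSFAULTENA (1u << 17)
#define SHCSR_USGFAULTENA (1u << 18)

/* On target, pass (volatile uint32_t *)0xE000ED24u.
 * Other bits in the register are preserved. */
void fault_handlers_enable(volatile uint32_t *shcsr)
{
    *shcsr |= SHCSR_MEMFAULTENA | SHCSR_BUSFAULTENA | SHCSR_USGFAULTENA;
}
```

Calling this early in startup means later faults arrive at the specific handler instead of being folded into a Hard Fault. Cortex-M0/M0+ do not have these handlers, so the call does not apply there.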

Stack Overflow

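Stack overflows from runaway recursion, large stack frames, or deeply nested interrupts overwrite whatever lies below the stack. The Hard Fault may come much later, when the corrupted data is finally used, or immediately if the stack pointer leaves valid RAM. Monitoring stack usage and limiting stack depth can help avoid overflows.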
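One widely used way to monitor stack usage is stack painting: fill the stack with a known pattern at startup and later count how much of the pattern survives. The sketch below assumes a full-descending stack, so the unused part is at the low addresses; STACK_FILL, stack_paint, and stack_unused_words are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu   /* arbitrary, unlikely-to-occur pattern */

/* Fill a stack region with the pattern. On target, paint only the part
 * below the current stack pointer, early in startup. */
void stack_paint(uint32_t *low, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        low[i] = STACK_FILL;
    }
}

/* Count untouched words from the low end: the remaining headroom. */
size_t stack_unused_words(const uint32_t *low, size_t words)
{
    size_t n = 0;
    while (n < words && low[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A headroom that shrinks toward zero during testing is a warning sign before any fault occurs. The result is a high-water mark, so it cannot detect an overflow that skips over the painted area.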

Critical System Errors

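Hardware-level errors such as RAM parity or ECC errors, clock failures, and memory protection violations can also end in a Hard Fault. How they are reported is device specific: depending on the vendor, they may arrive as a bus fault, an NMI, or a reset instead. These are usually complex hardware related faults.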

Identifying the Root Cause

Identifying the specific root cause of a Hard Fault often requires a debugger such as GDB or the one built into an IDE, analysis of crash dumps, and added logging and assertions in code. Some techniques include:

Inspecting the stacked PC value to identify the fault location
Checking the CFSR and HFSR for fault status flags
Enabling the MemManage, BusFault, and UsageFault handlers so faults are reported by type
Tracing instruction execution to replay crashes
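Checking the fault status flags is easier with a small decoder. On ARMv7-M the CFSR is at 0xE000ED28 and the HFSR at 0xE000ED2C; the sketch below works on values already read from those registers, so it can also be run on a host against logged values. The function and table names are hypothetical, and only the most commonly seen flags are listed.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} fault_bit_t;

/* Selected CFSR flags (MemManage, BusFault, UsageFault fields). */
static const fault_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"    },
    { 1u << 1,  "DACCVIOL"    },
    { 1u << 3,  "MUNSTKERR"   },
    { 1u << 4,  "MSTKERR"     },
    { 1u << 8,  "IBUSERR"     },
    { 1u << 9,  "PRECISERR"   },
    { 1u << 10, "IMPRECISERR" },
    { 1u << 11, "UNSTKERR"    },
    { 1u << 12, "STKERR"      },
    { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"    },
    { 1u << 18, "INVPC"       },
    { 1u << 19, "NOCP"        },
    { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"   },
};

#define HFSR_VECTTBL (1u << 1)    /* fault while reading the vector table */
#define HFSR_FORCED  (1u << 30)   /* escalated from a configurable fault */

/* Store the names of the set flags in out[] and return how many. */
size_t cfsr_decode(uint32_t cfsr, const char **out, size_t max)
{
    size_t n = 0;
    size_t count = sizeof cfsr_bits / sizeof cfsr_bits[0];
    for (size_t i = 0; i < count && n < max; i++) {
        if (cfsr & cfsr_bits[i].mask) {
            out[n++] = cfsr_bits[i].name;
        }
    }
    return n;
}

/* True when the Hard Fault is an escalated fault, so CFSR has the cause. */
int hardfault_was_escalated(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0u;
}
```

When FORCED is set, the CFSR flags identify the original fault. Cortex-M0/M0+ do not implement these status registers, so on those parts the stacked PC is the main clue.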

Locating the first point of failure helps narrow down the root cause. A stack overflow, for example, may first manifest as a MemManage fault (when an MPU guard region is in place) before escalating to a Hard Fault. Having good debugging tools, crash logs, and diagnostic handlers set up is crucial for effective root cause analysis.

Timing and Randomness Factors

From a troubleshooting perspective, the two most challenging attributes of Hard Faults are their timing and apparent randomness. Several factors explain them:

Asynchronous Nature

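Only some faults are truly asynchronous. Illegal memory accesses, undefined instructions, and trapped divide-by-zero operations are synchronous: they are raised by the instruction that caused them. Imprecise bus faults and hardware error signals are asynchronous, so the stacked PC may point past the real culprit. Interrupts are asynchronous too, which is why a bug that depends on an interrupt arriving at a particular moment produces faults at seemingly random times.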

Intermittent Fault Conditions

Issues like power supply noise, marginal RAM, and temperature fluctuations can cause intermittent faults. The same code may run fine billions of times before a unique combination of conditions triggers a fault.

Complex Hardware Interactions

In complex SoCs, hardware blocks like the memory controller, bus interconnects, clocking, and power domains all interact in ways that are deterministic in principle but impractical to predict. Small differences in timing or initial state can lead to very different outcomes, which adds to the apparent randomness.

Heisenbugs

“Heisenbugs” are problems that seem to disappear when debugging tools are applied. Observation changes the system: breakpoints, added trace code, lower clock speeds, and benign lab conditions can all shift timing enough to mask the underlying issue.

Strategies for Dealing with Timing and Randomness

Despite the challenges posed by the timing and randomness aspects, a systematic approach can help uncover the root causes of Hard Faults:

Log fault stack frames, error codes, and runtime trace data
Stress test components like RAM and Flash to force latent faults
Increase code assertions, traces, and defensive checks
Reproduce on real production hardware, not just simulators
Evaluate firmware, libraries, stacks, and compilers for defects
Consider statistical and probability analysis of failure conditions
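Logging the fault stack frame is only useful if the record survives the reset that usually follows. One approach is a small record in RAM that the startup code does not clear, guarded by a magic number and a checksum so that random power-up contents are not mistaken for a real fault. The sketch below shows the record logic only; all names are hypothetical, and placing the variable in a no-init section depends on the toolchain and linker script.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_MAGIC 0xFA17C0DEu   /* arbitrary marker value */

typedef struct {
    uint32_t magic;
    uint32_t pc;
    uint32_t lr;
    uint32_t cfsr;
    uint32_t hfsr;
    uint32_t check;   /* XOR of all preceding words */
} fault_record_t;

static uint32_t fault_record_sum(const fault_record_t *r)
{
    return r->magic ^ r->pc ^ r->lr ^ r->cfsr ^ r->hfsr;
}

/* Called from the fault handler just before resetting. */
void fault_record_store(fault_record_t *r, uint32_t pc, uint32_t lr,
                        uint32_t cfsr, uint32_t hfsr)
{
    r->magic = FAULT_MAGIC;
    r->pc    = pc;
    r->lr    = lr;
    r->cfsr  = cfsr;
    r->hfsr  = hfsr;
    r->check = fault_record_sum(r);
}

/* Called after boot: true when a genuine record is present. */
bool fault_record_valid(const fault_record_t *r)
{
    return r->magic == FAULT_MAGIC && r->check == fault_record_sum(r);
}

/* Called once the record has been reported. */
void fault_record_clear(fault_record_t *r)
{
    r->magic = 0u;
}
```

After boot, the application can check fault_record_valid, report the stored values over whatever channel it has, and then clear the record. Collecting these records across many units also supplies the data for the statistical analysis mentioned in the list.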

Thorough testing under varied operating conditions along with sufficient instrumentation code and debug visibility is key to overcoming the apparent randomness of these faults during development.

Summary

To summarize, a Hard Fault on an ARM Cortex-M processor is the catch-all exception taken when the core hits an error that no other enabled handler can service. Most faults are synchronous, but their timing appears random because the triggering conditions depend on data, interrupt timing, and hardware state. Root causes range from invalid memory accesses, trapped divide-by-zero, disabled fault handlers, and stack corruption to complex hardware failures. Identifying the specific cause requires rigorous debugging techniques and methodologies. A combination of test instrumentation, failure analysis, stress testing, and hardware debugging helps overcome these challenges during the development process.
