A hard fault on an ARM Cortex processor is an unrecoverable error that causes the processor to enter an exception state and halt normal program execution. Hard faults indicate serious problems like hardware failures, memory faults, or invalid instruction execution that cannot be handled gracefully by the system. Identifying the root cause of a hard fault is key to resolving issues and restoring proper functionality.
There are several potential causes of hard faults on ARM Cortex chips:
Invalid memory access
One major cause of hard faults is invalid memory accesses. This occurs when code attempts to read or write to restricted regions of memory or access memory using invalid addresses. Examples include:
- Accessing null pointer addresses
- Reading or writing outside array bounds
- Executing code from invalid addresses
- Stack overflow errors corrupting the stack memory
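On Cortex-M cores, an access that violates a Memory Protection Unit (MPU) region raises a MemManage fault, while an access to an address that no memory or peripheral responds to typically raises a BusFault. Either one escalates to a hard fault if its handler is not enabled or cannot run. Cortex-M processors have an MPU rather than a Memory Management Unit (MMU); enabling the MPU and programming its regions correctly helps catch invalid accesses close to their source.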
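Null pointers and out-of-range indices from the list above can be rejected in software before any access happens. The following is a minimal sketch; `checked_read_u8` is a hypothetical helper name, not part of any ARM or CMSIS API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy one byte out of a buffer only when the
 * pointers are non-null and the index is inside the buffer. */
bool checked_read_u8(const uint8_t *buf, size_t len, size_t idx, uint8_t *out)
{
    if (buf == NULL || out == NULL || idx >= len) {
        return false;   /* reject instead of touching a bad address */
    }
    *out = buf[idx];
    return true;
}
```

Returning a status instead of dereferencing lets the caller decide how to recover, which is usually easier to debug than a fault taken later.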
Unaligned memory access
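Unaligned memory accesses attempt to read or write data on addresses that are not integer multiples of the data size. For example, a 32-bit read from address 0x123 would be unaligned. The Cortex-M0 and Cortex-M0+ do not support unaligned accesses at all, and any such access causes a hard fault. The Cortex-M3 and Cortex-M4 handle unaligned single loads and stores (such as LDR and STR) in hardware, but multi-word instructions such as LDM, STM, LDRD, and STRD still require word alignment and raise a UsageFault otherwise.
Aligning data structures properly and avoiding casts of byte buffers to wider pointer types can prevent this issue. Setting the UNALIGN_TRP bit in SCB->CCR makes every unaligned access raise a UsageFault, which helps locate such accesses during development; it does not prevent the fault.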
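A common way to avoid unaligned word loads is to copy bytes with `memcpy` instead of casting a byte pointer to `uint32_t *`. The sketch below assumes nothing about the target; `is_aligned` and `load_u32` are illustrative names.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True when addr is a multiple of align. align must be a power of two. */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1u)) == 0u;
}

/* Read a 32-bit value (in native byte order) from any address.
 * memcpy leaves it to the compiler to pick accesses that are legal
 * for the target, so no unaligned word load is forced. */
uint32_t load_u32(const void *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}
```

With these helpers, `is_aligned(0x123, 4)` is false, matching the example address above, and `load_u32` can read a field at any offset in a packed receive buffer.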
Integer divide by zero
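By default, the SDIV and UDIV instructions on Cortex-M3 and later cores return zero when the divisor is zero and do not fault. When the DIV_0_TRP bit in SCB->CCR is set, they raise a UsageFault instead, which escalates to a hard fault if the UsageFault handler is not enabled. The Cortex-M0 and Cortex-M0+ have no hardware divide instructions, so division there is performed by a library routine. Checking divisors before dividing avoids both the silent zero result and the fault.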
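The operand check can be wrapped in a small helper. This is a sketch with a hypothetical name, `checked_div_i32`; it also rejects `INT32_MIN / -1`, which is undefined behavior in C even though it does not involve a zero divisor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Divide num by den, storing the quotient in *quot.
 * Returns false (and leaves *quot untouched) when the division
 * is not safe to perform. */
bool checked_div_i32(int32_t num, int32_t den, int32_t *quot)
{
    if (quot == NULL || den == 0) {
        return false;               /* zero divisor or no destination */
    }
    if (num == INT32_MIN && den == -1) {
        return false;               /* quotient does not fit in int32_t */
    }
    *quot = num / den;
    return true;
}
```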
Invalid instructions and opcode issues
Execution of undefined or invalid opcodes can generate a UsageFault exception that escalates to a hard fault. Potential causes include:
- Memory corruption changing instruction opcodes
- Jumping to non-executable memory addresses
- Improper code modifications via JTAG/SWD
- Unsupported coprocessor instruction exceptions
- Code built for extensions that the core lacks or has not enabled, such as DSP or FPU instructions
Enabling the MPU to limit instruction execution to verified memory regions can mitigate invalid opcode related hard faults.
Stack overflows
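The processor stack contains return addresses, function parameters, and local variables allocated on subroutine calls. Stack overflows due to excessive nesting, recursive calls, large stack allocations etc. can overwrite other memory regions. If the overflow runs into an MPU guard region or past the end of RAM, it raises a MemManage fault or BusFault that escalates to a hard fault. Otherwise the corruption is silent, and the fault appears later, often far from the real cause.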
Stack overflows can be avoided by:
- Increasing the stack size appropriately
- Profiling stack usage to catch overflow issues
- Minimizing large stack allocations
- Avoiding infinite loops and runaway recursion
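Stack usage can be profiled by "painting" the stack region with a known pattern at startup and later counting how many words still hold it. The sketch below works on any word array so it can be tried on a host; the fill value and function names are assumptions made for illustration. On Cortex-M the stack grows toward lower addresses, so untouched words sit at the low end of the region.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* arbitrary pattern chosen for this sketch */

/* Fill the whole stack region with the pattern (call before the
 * region is in use, e.g. early in startup or at task creation). */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* Count words, starting at the lowest address, that still hold the
 * pattern. A small result means the stack came close to overflowing. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

The count is a high-water mark rather than an exact measurement: a stack word that happens to be written with the fill value is indistinguishable from an unused one.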
Floating point exceptions
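The Cortex-M4 and Cortex-M7 cores can include an optional hardware floating point unit (FPU). Floating point conditions such as divide-by-zero, underflow, overflow, and invalid operation do not trap on these cores. They set cumulative status flags in the FPSCR register, and the operation returns a default result such as infinity or NaN. A fault arises instead when FP instructions execute before the FPU has been enabled through the CPACR register, which raises a UsageFault (NOCP) that escalates to a hard fault if unhandled.
Proper input validation and checking the FPSCR flags after FP operations can catch these conditions early, before bad values propagate.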
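On the Cortex-M4 and Cortex-M7 FPU, floating point conditions are recorded as cumulative flags in the low bits of FPSCR. The sketch below decodes a raw FPSCR value passed in as a parameter so that it can be exercised off-target. On a device the value would typically come from the CMSIS `__get_FPSCR()` intrinsic. The function name and the choice to ignore the inexact and underflow flags are assumptions of this example.

```c
#include <stdint.h>

/* Cumulative exception flags in FPSCR (ARMv7-M floating point extension). */
#define FPSCR_IOC (1u << 0)   /* invalid operation */
#define FPSCR_DZC (1u << 1)   /* division by zero  */
#define FPSCR_OFC (1u << 2)   /* overflow          */
#define FPSCR_UFC (1u << 3)   /* underflow         */
#define FPSCR_IXC (1u << 4)   /* inexact result    */
#define FPSCR_IDC (1u << 7)   /* input denormal    */

/* Return only the flags this sketch treats as errors. Inexact and
 * underflow are left out because ordinary arithmetic sets them often. */
uint32_t fp_error_flags(uint32_t fpscr)
{
    return fpscr & (FPSCR_IOC | FPSCR_DZC | FPSCR_OFC);
}
```

Because the flags are cumulative, they stay set until software clears them, so a check after a block of calculations still sees a condition raised earlier in that block.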
Bus faults
Bus faults indicate an error occurred during instruction or data bus transactions. These could arise from:
- External memory errors – ECC errors, timing violations
- Flash memory errors – ECC errors, access timing issues
- System bus contention with peripherals leading to wait state violations
- Memory controller configuration issues – incorrect timing parameters
Bus faults can be debugged by checking memory interfaces and buses for electrical or timing issues. ARM CoreSight components such as the ETM can record the instruction flow leading up to the fault.
Undefined exceptions
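Undefined instruction faults occur on an attempt to execute an instruction that is undefined for the current processor state. On Cortex-M cores they are reported as a UsageFault with the UNDEFINSTR or INVSTATE bit set in the UFSR. For example:
- Attempting to switch to ARM state on a Thumb-only core, for instance by branching to a function pointer whose bit 0 is clear
- Executing an encoding that the core does not implement (a conditional instruction that merely fails its condition check is skipped and does not fault)
- Returning from an exception with a corrupted stack frame in which the Thumb bit of the stacked xPSR is clear
Because Cortex-M cores execute only Thumb code, building every object and library for the correct Thumb target and keeping bit 0 set in function pointers and vector table entries prevents these faults.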
Debug events
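The debug module can trigger debug events like breakpoints, watchpoints, vector catches etc. When a debugger is attached, these events halt the core. When no debugger is attached and the DebugMonitor exception is not enabled, a BKPT instruction escalates to a hard fault with the DEBUGEVT bit set in the HFSR. Removing leftover breakpoint instructions, including semihosting calls, before code release prevents these. Setting FAULTMASK is not a fix: it raises the execution priority to -1, so a fault taken while it is set can lead to lockup rather than being suppressed.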
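One way to keep a leftover software breakpoint from escalating is to execute it only when a debugger has enabled halting debug, which is reported by the C_DEBUGEN bit (bit 0) of the Debug Halting Control and Status Register. The sketch below takes the register value as a parameter so it can be tested on a host; `debugger_enabled` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

#define DHCSR_C_DEBUGEN (1u << 0)   /* halting debug enabled by a debugger */

/* True when the given DHCSR value shows that halting debug is enabled. */
bool debugger_enabled(uint32_t dhcsr)
{
    return (dhcsr & DHCSR_C_DEBUGEN) != 0u;
}

/* On target, assuming CMSIS headers are available, usage might look like:
 *
 *     if (debugger_enabled(CoreDebug->DHCSR)) {
 *         __BKPT(0);
 *     }
 */
```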
OS task switching errors
In RTOS based systems, task switching can sometimes trigger hard faults. Common causes include:
- Stack overflow during task switch corrupting stack memory
- Requesting a context switch with an SVC instruction while interrupts are disabled, which escalates the SVCall to a hard fault
- Task priorities causing deadlock and stalling the scheduler
- Trying to switch to invalid or non-existing tasks
Analysis of the task switching patterns and scheduler state helps isolate OS related hard faults.
Power, clock and EMI issues
Incorrect power or clock configurations can also lead to hard faults. Examples include:
- Brownout issues corrupting processor state during voltage drops
- PLL losing lock due to board noise or poor layout
- Clock glitches during system state changes
- Excessive Electromagnetic Interference (EMI) disrupting processor operation
Careful review of power supply stability, clock trees, and board layout is needed to identify potential faults from these sources.
Identifying Root Cause
When a hard fault exception occurs, the ARM Cortex processor pushes the current context onto the stack and enters the hard fault handler. Register and stack contents provide crucial clues on the fault origin:
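- HFSR – HardFault Status Register indicates the source of the hard fault; the FORCED bit means a configurable fault was escalated
- CFSR – Configurable Fault Status Register combines the MMFSR, BFSR, and UFSR status fields
- MMFAR – Memory Manage Fault Address Register indicates the fault address for memory management faults, valid only when MMARVALID is set
- BFAR – Bus Fault Address Register indicates the fault address for bus faults, valid only when BFARVALID is set
- PC – the stacked Program Counter indicates the instruction that triggered the fault, or a later one for imprecise bus faults
- LR – the stacked Link Register holds the interrupted function's return address; inside the handler, LR itself holds an EXC_RETURN value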
- Stacked registers and local variables help recreate full context
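The registers listed above can be turned into a first diagnosis with a few bit tests. The sketch below assumes the ARMv7-M layout (Cortex-M3, M4, M7): an eight-word basic exception frame and the HFSR and CFSR bit positions shown in the comments. Type and function names are hypothetical. The code takes raw values as parameters so it can run off-target; on a device they would be read from SCB->HFSR, SCB->CFSR, and from the stack pointer (MSP or PSP) selected by bit 2 of the EXC_RETURN value in LR.

```c
#include <stdint.h>

/* Basic exception frame, in the order the core pushes it (lowest address first). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

typedef enum {
    FAULT_NONE, FAULT_DEBUG_EVENT, FAULT_VECTOR_TABLE,
    FAULT_MEMMANAGE, FAULT_BUS, FAULT_USAGE
} fault_class_t;

#define HFSR_VECTTBL    (1u << 1)     /* bus error on vector table read  */
#define HFSR_DEBUGEVT   (1u << 31)    /* debug event with debug disabled */
#define CFSR_MMFSR_MASK 0x000000FFu   /* MemManage status, bits 7:0      */
#define CFSR_BFSR_MASK  0x0000FF00u   /* BusFault status, bits 15:8      */
#define CFSR_UFSR_MASK  0xFFFF0000u   /* UsageFault status, bits 31:16   */

/* Copy the eight stacked words into a named structure. */
fault_frame_t read_fault_frame(const uint32_t *sp)
{
    fault_frame_t f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Pick the first fault class indicated by the status registers. */
fault_class_t classify_fault(uint32_t hfsr, uint32_t cfsr)
{
    if (hfsr & HFSR_DEBUGEVT)   return FAULT_DEBUG_EVENT;
    if (hfsr & HFSR_VECTTBL)    return FAULT_VECTOR_TABLE;
    if (cfsr & CFSR_MMFSR_MASK) return FAULT_MEMMANAGE;
    if (cfsr & CFSR_BFSR_MASK)  return FAULT_BUS;
    if (cfsr & CFSR_UFSR_MASK)  return FAULT_USAGE;
    return FAULT_NONE;
}
```

When the FPU context is active, the core pushes additional floating point registers above these eight words, so the basic frame fields keep the same offsets. Several CFSR bits can be set at once; this sketch reports only the first class it finds, so the raw register values are still worth logging.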
Trace outputs from CoreSight components like Embedded Trace Macrocell (ETM) or Data Watchpoint and Trace (DWT) unit can also provide detailed history of program flow, data access, bus transactions etc. leading up to the fault event.
For hard faults during development, debuggers like Segger Ozone, Eclipse IDE, and proprietary IDEs provide debug, inspection and tracing tools. For faults after deployment, on-chip tracing via the CoreSight System Trace Macrocell (STM) can prove invaluable.
With the root cause identified, developers can apply fixes like firmware upgrades, hardware design changes, or software patches to resolve underlying issues and prevent future hard fault occurrences.
In summary, hard faults on ARM Cortex processors can arise from a range of software and hardware issues. Thoughtful programming and robust system design can eliminate many common causes. Recurring hard faults point to systemic underlying problems that require dedicated investigation, analysis and remediation to address.