How to Track Down Hard Faults Caused by Context Switching?

Context switching is the process where the processor switches from one thread to another. This involves storing the state of the current thread and loading the state of the next thread to be executed. While context switching is essential for multitasking, it can sometimes lead to hard faults that crash the system if not handled properly.

Understanding Context Switching

In a multitasking operating system, the processor needs to switch between various threads and processes running on the system. For example, one thread may be waiting for user input while another thread is doing file I/O in the background. Context switching allows the processor to pause execution of one thread and resume execution of another when needed.

During a context switch, the processor saves the state of the current thread: the general-purpose registers, the program counter, the stack pointer and the status register. It then loads the saved register values of the next thread so that thread resumes exactly where it left off. Saving and restoring this state is what makes switching between threads appear seamless.
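The save-and-restore step can be modelled on a host machine. The sketch below is only an illustration: `cpu_state_t`, `thread_t` and `context_switch` are hypothetical names, and the register file is a plain struct rather than real hardware.

```c
#include <stdint.h>

/* Simulated register file: a host-side model, not real hardware access. */
typedef struct {
    uint32_t r[13];   /* general-purpose registers R0-R12 */
    uint32_t sp;      /* stack pointer */
    uint32_t lr;      /* link register */
    uint32_t pc;      /* program counter */
    uint32_t psr;     /* program status register */
} cpu_state_t;

/* Hypothetical thread control block holding one saved context. */
typedef struct {
    cpu_state_t saved;
} thread_t;

/* Save the running thread's registers, then load the next thread's. */
void context_switch(cpu_state_t *cpu, thread_t *from, thread_t *to)
{
    from->saved = *cpu;   /* store the outgoing state */
    *cpu = to->saved;     /* restore the incoming state */
}
```

If any field were left out of the copy, the thread would resume with a stale value in that register, which is the failure mode discussed later under incomplete context saving.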

Common Causes of Hard Faults

Context switching is usually handled automatically by the operating system, but when the switch or the memory it relies on is mishandled, the result is often a hard fault. Here are some common cases:

Stack Overflow

Each thread has its own stack, which stores local variables, function parameters and return addresses. A stack overflow occurs when the stack grows beyond its allocated memory and starts to overwrite other memory areas. This most commonly happens due to infinite recursion, deeply nested function calls or large local buffers.

Stack Collision

If two stacks grow into each other, the result is a stack collision. This happens when stack sizes are underestimated during allocation and the stacks sit next to each other in memory. Each thread then corrupts the other’s data, and the damage usually shows up only after a context switch to the affected thread.

Corrupted Stack

A stack can get corrupted if an array overrun or buffer overflow occurs within a thread. This can overwrite the stored stack frames with incorrect data. When the scheduler switches back to this thread, the processor restores register values and a return address from the corrupted stack and may branch to an invalid location.

Incomplete Context Saving

All of the thread’s registers and processor state need to be saved during a context switch. If the context saving code misses certain registers (floating point registers are a frequent omission), their values are lost, leading to unexpected errors when switching back.

Debugging Hard Faults

When a hard fault occurs on ARM Cortex-M devices, the processor switches to Handler mode and executes the HardFault_Handler. We can leverage this handler to debug the root cause of the hard fault.
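A common first step inside the handler is to locate and decode the exception stack frame that the hardware pushed when the fault occurred. The sketch below is host-side C that works on plain values; `select_frame_sp` and `decode_fault_frame` are hypothetical helper names, and on a real target a short assembly stub would pass in LR (the EXC_RETURN value), MSP and PSP. It assumes the basic eight-word frame without floating point state.

```c
#include <stdint.h>

/* Registers stacked by hardware on exception entry (basic frame). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

/* EXC_RETURN bit 2 tells which stack holds the frame: 0 = MSP, 1 = PSP. */
uint32_t select_frame_sp(uint32_t exc_return, uint32_t msp, uint32_t psp)
{
    return (exc_return & (1u << 2)) ? psp : msp;
}

/* Copy the eight stacked words into named fields for inspection. */
fault_frame_t decode_fault_frame(const uint32_t *stacked)
{
    fault_frame_t f;
    f.r0   = stacked[0];
    f.r1   = stacked[1];
    f.r2   = stacked[2];
    f.r3   = stacked[3];
    f.r12  = stacked[4];
    f.lr   = stacked[5];   /* return address of the interrupted function */
    f.pc   = stacked[6];   /* address of the faulting instruction (or near it) */
    f.xpsr = stacked[7];
    return f;
}
```

The stacked `pc` and `lr` values are usually the quickest route to the code that was running when the fault hit.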

Check Stack Limits

One of the first things to check is whether the stack pointer of each thread is within its expected limits. For threads that run on the process stack, the PSP value saved at the last context switch shows whether the thread had already run past the end of its stack.
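The limit check itself is simple arithmetic. The following is a minimal sketch, assuming a descending stack and a hypothetical `thread_stack_t` record that stores the lowest stack address, the size in bytes and the saved stack pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread stack record; field names are illustrative. */
typedef struct {
    uint32_t stack_base;  /* lowest address of the stack area */
    uint32_t stack_size;  /* size of the stack area in bytes */
    uint32_t saved_sp;    /* stack pointer saved at the last switch */
} thread_stack_t;

/* A descending stack is valid while base <= SP <= base + size. */
bool sp_within_limits(const thread_stack_t *t)
{
    uint32_t top = t->stack_base + t->stack_size;
    return t->saved_sp >= t->stack_base && t->saved_sp <= top;
}

/* Bytes still available below the saved SP; 0 if the SP is out of range. */
uint32_t stack_bytes_free(const thread_stack_t *t)
{
    if (!sp_within_limits(t)) {
        return 0;
    }
    return t->saved_sp - t->stack_base;
}
```

A saved stack pointer below `stack_base` means the thread has already written into whatever lies under its stack.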

Inspect Stack Contents

The contents of the stack for each active thread can be inspected to check for errors. Look for intact stack frames, return addresses pointing to valid code regions and parameter values that make sense.

Verify Saved Context

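Step through the context saving code and verify that all the required registers are saved before a switch. On Cortex-M, the hardware stacks R0-R3, R12, LR, PC and xPSR on exception entry, so the switch code must save the remaining registers R4-R11 and record the thread’s stack pointer. Pay close attention to the program counter, stack pointer and link register.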
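One way to check the saving code is to compare what it leaves on the stack with a known layout. The sketch below builds the initial frame for a new thread under an assumed layout used by many Cortex-M ports: the software-saved R4-R11 sit below the eight hardware-stacked words. The function name, the parameters and the exact ordering are assumptions, so compare them with the actual port before relying on them.

```c
#include <stdint.h>

enum { HW_FRAME_WORDS = 8, SW_FRAME_WORDS = 8 };
#define XPSR_THUMB_BIT 0x01000000u

/*
 * Build an initial context on a descending stack.
 * Assumed layout from the returned pointer upward:
 *   [0..7]  R4-R11 (software-saved)
 *   [8..15] R0, R1, R2, R3, R12, LR, PC, xPSR (hardware-stacked)
 * stack_top is assumed to be 8-byte aligned.
 */
uint32_t *init_thread_stack(uint32_t *stack_top, uint32_t entry,
                            uint32_t arg, uint32_t exit_addr)
{
    uint32_t *sp = stack_top - (HW_FRAME_WORDS + SW_FRAME_WORDS);

    for (int i = 0; i < HW_FRAME_WORDS + SW_FRAME_WORDS; i++) {
        sp[i] = 0;                 /* known values make a dump easy to read */
    }
    sp[8]  = arg;                  /* R0: first argument of the thread */
    sp[13] = exit_addr;            /* LR: where the thread function returns */
    sp[14] = entry & ~1u;          /* PC: entry point, bit 0 cleared */
    sp[15] = XPSR_THUMB_BIT;       /* xPSR: Thumb state bit must be set */
    return sp;
}
```

A thread created this way that faults on its very first switch-in usually points to a mismatch between this layout and the order in which the switch code restores registers.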

Check I/O and Peripherals

Faulty hardware and incorrect use of peripherals such as UART or I2C can also lead to hard faults, for example when code accesses an address that the bus rejects. Examine the status registers of the peripherals involved to detect errors.
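Alongside the peripheral status registers, ARMv7-M cores (Cortex-M3, M4 and M7) record the cause of a fault in the Configurable Fault Status Register (CFSR) and the HardFault Status Register (HFSR). The sketch below classifies raw register values passed in as arguments; the bit positions follow the ARMv7-M architecture, while the enum and function names are hypothetical and the grouping into classes is a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

#define HFSR_FORCED       (1u << 30)  /* a configurable fault escalated */
#define CFSR_MUNSTKERR    (1u << 3)   /* MemManage fault on unstacking */
#define CFSR_MSTKERR      (1u << 4)   /* MemManage fault on stacking */
#define CFSR_PRECISERR    (1u << 9)   /* precise data bus error */
#define CFSR_IMPRECISERR  (1u << 10)  /* imprecise data bus error */
#define CFSR_UNSTKERR     (1u << 11)  /* bus fault on unstacking */
#define CFSR_STKERR       (1u << 12)  /* bus fault on stacking */
#define CFSR_INVSTATE     (1u << 17)  /* invalid EPSR state, e.g. Thumb bit clear */
#define CFSR_INVPC        (1u << 18)  /* invalid EXC_RETURN or PC load */

typedef enum {
    FAULT_NONE,
    FAULT_STACKING,     /* exception entry could not push the frame */
    FAULT_UNSTACKING,   /* exception return could not pop the frame */
    FAULT_BUS_ACCESS,   /* data access rejected by the bus */
    FAULT_BAD_RETURN,   /* corrupted return state */
    FAULT_OTHER
} fault_class_t;

bool is_escalated(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0;
}

fault_class_t classify_cfsr(uint32_t cfsr)
{
    if (cfsr == 0) return FAULT_NONE;
    if (cfsr & (CFSR_MSTKERR | CFSR_STKERR)) return FAULT_STACKING;
    if (cfsr & (CFSR_MUNSTKERR | CFSR_UNSTKERR)) return FAULT_UNSTACKING;
    if (cfsr & (CFSR_INVPC | CFSR_INVSTATE)) return FAULT_BAD_RETURN;
    if (cfsr & (CFSR_PRECISERR | CFSR_IMPRECISERR)) return FAULT_BUS_ACCESS;
    return FAULT_OTHER;
}
```

Stacking and unstacking errors point toward a bad stack pointer, a bad-return result points toward a corrupted saved context, and a bus access error is more typical of a peripheral or pointer problem.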

Monitor for Memory Corruption

A memory protection unit (MPU) can trap out-of-bounds accesses at the moment they happen, for example by placing a no-access guard region at the end of each stack. A memory profiler can help identify memory leaks and fragmentation issues.

Preventative Measures

Some best practices to avoid context switch related hard faults:

Allocating Proper Stack Size

Underestimating stack usage is a common reason for stack overflows. Determine the worst-case stack usage for each thread, including the exception frames pushed when interrupts arrive, and allocate stack sizes accordingly.
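Worst-case usage can also be measured at run time by filling each stack with a known pattern and later counting how much of the pattern survives. This is a minimal sketch of that "stack painting" idea, assuming a descending stack; the fill value and function names are arbitrary choices.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u

/* Fill the whole stack area with the pattern before the thread starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/*
 * Count untouched words from the low end of the area. On a descending
 * stack these are the words the thread has never reached.
 */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

The measurement only reflects the code paths that have actually run, so it complements static estimates rather than replacing them.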

Avoid Nested Calls

Deeply nested function calls quickly eat up stack space. Refactor code to limit nesting where possible, and replace recursion with iteration when the depth is not bounded.

Using Stack Canaries

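Stack canaries are known values placed at the edge of a stack frame or at the end of a thread’s stack. If a canary has been overwritten, the stack has been corrupted. Checking each thread’s canary at every context switch catches an overflow soon after it happens.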
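A per-thread canary check is small enough to run on every switch. The sketch below assumes a descending stack with the canary words at its lowest addresses; the canary value, the word count and the function names are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_CANARY  0xDEADC0DEu
#define CANARY_WORDS  4

/* Write the canary at the low end of the stack, where an overflow lands first. */
void stack_canary_init(uint32_t *stack_low)
{
    for (size_t i = 0; i < CANARY_WORDS; i++) {
        stack_low[i] = STACK_CANARY;
    }
}

/* Intended to be called by the scheduler before switching a thread out. */
bool stack_canary_intact(const uint32_t *stack_low)
{
    for (size_t i = 0; i < CANARY_WORDS; i++) {
        if (stack_low[i] != STACK_CANARY) {
            return false;
        }
    }
    return true;
}
```

A canary only detects an overflow after the fact, and a write that jumps over the canary words goes unnoticed, which is why it is often combined with an MPU guard region.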

Limiting Critical Sections

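Long critical sections where interrupts are disabled delay the scheduler tick and every pending interrupt, which can lead to missed deadlines and timing faults that are hard to reproduce. Keep critical sections short and use them only around the data that really needs protection.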

Adding Assertions

Add assertions to the code to catch violations such as out-of-bounds accesses before they corrupt the stack or overflow a buffer.
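On an embedded target it is often more useful for a failed check to record where it happened than to abort. The sketch below shows one possible shape; the `CHECK` macro, `assert_record_t` and `buffer_store` are hypothetical names, and a real system might log the record or halt in a debugger instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Location of the first failed check, kept for later inspection. */
typedef struct {
    const char *file;
    int line;
    bool tripped;
} assert_record_t;

assert_record_t g_assert_record;

void record_failure(const char *file, int line)
{
    if (!g_assert_record.tripped) {      /* keep only the first failure */
        g_assert_record.file = file;
        g_assert_record.line = line;
        g_assert_record.tripped = true;
    }
}

/* Evaluates to true when cond holds; otherwise records the location. */
#define CHECK(cond) \
    ((cond) ? true : (record_failure(__FILE__, __LINE__), false))

/* Bounds-checked store: refuses the write instead of overrunning buf. */
bool buffer_store(uint8_t *buf, size_t len, size_t index, uint8_t value)
{
    if (!CHECK(buf != NULL && index < len)) {
        return false;
    }
    buf[index] = value;
    return true;
}
```

Because the write is refused, the overrun never reaches the stack, and the recorded file and line identify the offending call path.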

Static Analysis

Use static analysis tools to detect issues like null pointer dereferences and use of uninitialized variables that could lead to faults.

Context Switching with Preemption

Preemptive context switching allows higher priority threads to interrupt lower priority ones. This stops a low priority thread from holding the processor while more urgent work is waiting, which improves responsiveness.

ARM Cortex-M processors have the SysTick timer, which can generate a periodic tick that triggers preemptive context switches. With time slicing, this gives threads of equal priority a fair share of execution time.

With preemption, critical threads can be assigned a higher priority so that they run as soon as they become ready. On Cortex-M, the exception that performs the switch (typically PendSV) is usually given the lowest exception priority so that a context switch never interrupts another interrupt handler.

However, preemption also means threads get interrupted more often, which increases the number of context switches. The processor spends more time saving and restoring contexts.
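The scheduling decision made on each tick can be sketched as a small selection function. This is a simplified model rather than any particular kernel’s algorithm: `sched_thread_t` and `pick_next` are hypothetical, a larger number means a higher thread priority (the opposite of Cortex-M exception priority numbering), and threads of equal priority are served round robin.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t priority;   /* larger value = more urgent (assumption) */
    bool ready;         /* true if the thread can run */
} sched_thread_t;

/*
 * Return the index of the thread to run next, or -1 if none is ready.
 * The scan starts just after `current`, so among equal priorities the
 * running thread is considered last and the others get their turn.
 */
int pick_next(const sched_thread_t *t, size_t count, int current)
{
    if (count == 0) {
        return -1;
    }

    size_t start = (current >= 0) ? ((size_t)current + 1) % count : 0;
    int best = -1;

    for (size_t k = 0; k < count; k++) {
        size_t i = (start + k) % count;
        if (t[i].ready && (best < 0 || t[i].priority > t[best].priority)) {
            best = (int)i;
        }
    }
    return best;
}
```

If the function returns the index of the current thread, no context switch is needed and the save and restore cost is avoided for that tick.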

Leveraging Coprocessors to Reduce Overhead

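Context switching usually involves saving the CPU core registers. Many ARM processors also have an optional floating point unit (FPU) with its own register bank, which Cortex-M cores expose through the coprocessor interface. If a thread uses the FPU, those registers are part of its context as well, and some operating systems additionally reprogram memory protection unit regions for each thread.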

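With lazy stacking (lazy floating point state preservation), the processor reserves space for the FPU registers in the exception stack frame but does not write them on exception entry. This saves time whenever the handler does not touch the FPU.

Lazy stacking is controlled by the ASPEN and LSPEN bits of the FPCCR register rather than by the linker file, and both bits are set out of reset on Cortex-M4F. The reserved space is filled in only when the handler executes its first floating point instruction. The context switch code still has to save S16-S31 in software for threads that use the FPU.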
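Whether a given exception frame contains floating point state can be read from the EXC_RETURN value in LR. The helpers below are a host-side sketch with hypothetical names that operate on raw register values; the bit positions and frame sizes follow the ARMv7-M architecture with the FP extension, and the sizes exclude any alignment padding word.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXC_RETURN_FTYPE  (1u << 4)   /* 1 = basic frame, 0 = extended frame */
#define FPCCR_LSPEN       (1u << 30)  /* lazy state preservation enable */
#define FPCCR_ASPEN       (1u << 31)  /* automatic state preservation enable */

/* True if the frame reserves space for S0-S15 and FPSCR. */
bool frame_has_fp_state(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_FTYPE) == 0;
}

/* Basic frame: 8 words. Extended frame: 26 words. */
uint32_t hw_frame_bytes(uint32_t exc_return)
{
    return frame_has_fp_state(exc_return) ? 26u * 4u : 8u * 4u;
}

/* Lazy stacking is in effect when both enable bits are set. */
bool lazy_stacking_enabled(uint32_t fpccr)
{
    return (fpccr & (FPCCR_ASPEN | FPCCR_LSPEN)) == (FPCCR_ASPEN | FPCCR_LSPEN);
}
```

Context switch code that ignores this bit and always assumes the 32-byte frame miscalculates the stack pointer for threads that use the FPU, which is a classic source of hard faults on exception return.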

Task State Segment for Faster Switching

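The task state segment (TSS) is a hardware task switching mechanism of the x86 architecture, not of ARM. It is included here for comparison. A TSS holds the context of a task in memory, including the general purpose registers, segment registers, flags, instruction pointer and the stack pointers for each privilege level.

When a hardware task switch occurs, the x86 processor saves the current context to the outgoing TSS and loads the new one from the incoming TSS without any software save and restore code. Most modern operating systems nevertheless switch contexts in software, and hardware task switching is not available in 64-bit mode.

The TSS descriptor contains a busy flag that is set while the task is running or suspended inside a nested task chain. This prevents a task from being switched to recursively.

An invalid TSS exception is raised if the processor detects an inconsistent TSS during a task switch. This protects the integrity of the saved contexts.

The closest Cortex-M equivalent is the automatic stacking of R0-R3, R12, LR, PC and xPSR on exception entry; the rest of the context is saved by software.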

Conclusion

Context switching enables efficient multitasking on ARM devices but can also lead to hard faults if it is not done properly. By understanding the causes (stack overflows, stack collisions, stack corruption and incomplete context saving), developers can debug and fix these faults. Best practices around stack allocation, limiting nested calls, assertions and static analysis help prevent context switch related faults. Preemptive scheduling improves responsiveness, and lazy stacking reduces switching overhead on Cortex-M, while hardware mechanisms such as the x86 TSS show how other architectures approach the same problem.