Core lockup refers to a situation where a CPU core enters an unrecoverable stuck state and stops functioning correctly. This can happen due to bugs in the CPU design or firmware, unanticipated hardware conditions, or software errors. A core lockup causes the affected core to stop responding to interrupts and executing instructions, rendering it unusable until the core or the whole system is reset.
Causes of Core Lockup
There are several potential causes for a CPU core lockup:
- Hardware design flaws – Bugs in the CPU core logic or supporting hardware like caches, bus interconnects etc. can cause a core to enter an invalid state.
- Firmware bugs – Bugs in the low-level firmware code that initializes and configures the CPU can also lead to lockups.
- Undefined instructions – Attempting to execute an invalid or undefined opcode can lock up a core.
- Access violations – Illegal memory accesses, invalid cache operations etc. can crash the core.
- Resource deadlocks – Simultaneous requests for shared hardware resources like buses, registers etc. can cause deadlocks.
- Power issues – Voltage droops, noise or other power delivery flaws can destabilize CPU operation.
- Overheating – Thermal issues can trigger emergency shutdown of cores to prevent damage.
- Excessive interrupts – Interrupts arriving at a rate above hardware limits can overwhelm and freeze a core.
- Software bugs – Programming errors like infinite loops, stack overflows, race conditions etc. in OS or application code can hang a core.
Symptoms of Core Lockup
A locked-up CPU core exhibits the following observable symptoms:
- The affected core stops executing any instructions or handling interrupts.
- Kernel tasks and threads remain blocked on that core.
- Applications freeze if their processes were running on the locked core.
- The core stops appearing in OS status outputs like task manager.
- Performance counters for the core stop updating.
- Heartbeat or watchdog timeouts are triggered for the unresponsive core.
- Error logs or panic messages indicate core failure.
From a user perspective, symptoms include system freezes, crashes, a frozen display, and unresponsive input in applications using the locked core. Multi-core systems may continue functioning on unaffected cores unless the locked core held critical OS resources.
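The symptom that per-core counters stop updating can be checked from software. The following is a minimal sketch, assuming a Linux-style /proc/stat text format in which each "cpuN" line carries cumulative tick counters. The function names parse_cpu_ticks and stalled_cores and the sample data are illustrative assumptions, not part of any library. A core flagged this way is only a lockup candidate and needs confirmation by other means.

```python
def parse_cpu_ticks(stat_text):
    """Return {core_id: total_ticks} from /proc/stat-style text."""
    ticks = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 ..."; the aggregate "cpu" line is skipped.
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            ticks[int(fields[0][3:])] = sum(int(v) for v in fields[1:])
    return ticks


def stalled_cores(before, after):
    """Cores present in both samples whose tick total did not advance."""
    return sorted(c for c in before if c in after and after[c] == before[c])


# Made-up samples: core 1 shows identical counters in both snapshots.
SAMPLE_T0 = "cpu 300 0 300 600\ncpu0 100 0 100 200\ncpu1 200 0 200 400\n"
SAMPLE_T1 = "cpu 310 0 305 625\ncpu0 110 0 105 225\ncpu1 200 0 200 400\n"

suspects = stalled_cores(parse_cpu_ticks(SAMPLE_T0), parse_cpu_ticks(SAMPLE_T1))
```

On a live system the two snapshots would be taken some interval apart; the interval has to be long enough that a healthy core, even an idle one, accumulates at least one tick.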
Debugging Core Lockups
Debugging CPU core lockups requires gaining visibility into the hardware state when the lockup occurred. Methods include:
- Hardware counters – Performance monitoring counters and event trackers can provide insight into code execution flow.
- Machine state – The OS can capture the architectural state of the locked core, such as register contents and instruction pointers.
- Log buffers – On-chip log buffers track hardware events and exceptions leading up to lockup.
- Post-mortem dumping – Dumping core memory and state after a lockup can support offline debugging.
- Watchdog timers – Configuring watchdog timeouts forces a core reset upon lockup.
- Logic analyzers – External debuggers attached to processor buses can trace instruction flow in real time.
Vendors also implement specialized hardware debug modes and interfaces to inspect internal CPU state for identifying root causes of core lockups.
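The log-buffer idea above can be illustrated in software. This is a minimal sketch of a fixed-capacity ring buffer that keeps only the most recent events, so the entries leading up to a lockup survive for post-mortem dumping. The class name EventRing and the event names are hypothetical; real on-chip buffers are hardware structures with vendor-specific formats.

```python
from collections import deque


class EventRing:
    """Fixed-capacity log that keeps only the most recent events."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)
        self._seq = 0  # monotonically increasing sequence number

    def record(self, core, event):
        self._events.append((self._seq, core, event))
        self._seq += 1

    def dump(self):
        """Return surviving events, oldest first, for post-mortem analysis."""
        return list(self._events)


ring = EventRing(capacity=3)
for name in ["irq_enter", "cache_miss", "bus_error", "retry", "timeout"]:
    ring.record(core=2, event=name)
```

The sequence numbers show how many earlier events were overwritten, which helps when correlating the dump with other logs.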
Mitigating Core Lockups
Software and hardware techniques can help mitigate or limit the impact of CPU core lockups:
- Lockstep cores – Lockstepped redundant cores with lockstep monitors can auto-recover from transient lockups.
- Heartbeat monitoring – Heartbeat or watchdog timers per core to detect lockups and reset cores.
- Firmware resilience – Robust firmware initialization and configuration to avoid lockups.
- Undefined instruction handling – Trapping undefined opcode execution instead of locking up.
- Lockup-tolerant kernel – OS structures that prevent lockups from crashing entire system.
- Language mitigations – Programming language techniques like memory safety, bounds checking etc. to minimize bugs.
- Resource virtualization – Virtualization and arbitration of shared hardware resources.
- Overengineering margins – Guardbands and conservative design to tolerate corner cases.
However, fundamental hardware bugs may still require silicon fixes in newer revisions, or microcode updates that work around them.
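Heartbeat monitoring, listed above, can be sketched as follows. This is an illustrative model rather than a real kernel or firmware interface: the class name HeartbeatMonitor and its methods are assumptions, and timestamps are passed in explicitly so the logic stays deterministic. A real monitor would read a monotonic clock and run on a core or controller independent of the ones it watches.

```python
class HeartbeatMonitor:
    """Flags cores whose last heartbeat is older than `timeout` seconds."""

    def __init__(self, core_ids, timeout, now=0.0):
        self.timeout = timeout
        self.last_beat = {core: now for core in core_ids}

    def beat(self, core, now):
        # Called by (or on behalf of) a core that is still making progress.
        self.last_beat[core] = now

    def unresponsive(self, now):
        """Return the sorted ids of cores that missed their deadline."""
        return sorted(
            core for core, t in self.last_beat.items() if now - t > self.timeout
        )


monitor = HeartbeatMonitor(core_ids=[0, 1, 2], timeout=1.0)
monitor.beat(0, now=0.9)
monitor.beat(2, now=0.9)
late = monitor.unresponsive(now=1.5)  # core 1 never sent a heartbeat
```

The timeout trades detection latency against false positives: too short and a core that is merely busy with interrupts disabled gets flagged, too long and a real lockup goes unnoticed.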
Recovering from Core Lockups
Depending on system capabilities, core lockup recovery may involve:
- Hardware reset – Forcing a hard reset of the locked core via watchdog timers or external resets.
- Core reboot – Firmware or OS capability to selectively reboot individual cores.
- Core disabling – Isolating the locked core and continuing with fewer cores.
- Full system restart – Complete system reboot if selective core reset is not possible.
- Core swap – In multi-socket systems, migrating OS sessions to cores on an alternate socket.
- Microcode update – Applying microcode patches to work around hardware flaws.
Recovery re-initializes the CPU core and allows the OS to resume scheduling tasks or processes on it. But data or state associated with the locked core may be lost unless saved earlier.
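The recovery options above are often tried from least to most disruptive. The sketch below shows one such escalation order. Every function here is a hypothetical stand-in, since the actual hooks depend on the platform, firmware, and OS, and the chosen ordering is an assumption rather than a prescribed policy.

```python
def recover(core, actions):
    """Try recovery actions in order; return the name of the first that succeeds.

    `actions` is a list of (name, callable) pairs. Each callable takes the
    core id and returns True on success. Returns None if every action fails.
    """
    for name, action in actions:
        if action(core):
            return name
    return None


# Stand-in platform hooks, for illustration only.
def reset_core(core):
    return False  # pretend per-core reset is unsupported or failed


def disable_core(core):
    return True  # pretend the core was taken offline successfully


def restart_system(core):
    return True


ESCALATION = [
    ("hardware reset", reset_core),
    ("core disabling", disable_core),
    ("full system restart", restart_system),
]

outcome = recover(3, ESCALATION)
```

Keeping the policy as data makes it easy to reorder or drop steps on platforms that lack a given capability.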
Core Lockup Prevention
Some best practices to help avoid core lockups are:
- Thoroughly validating the CPU design, firmware, OS, and applications via pre-silicon verification, simulation, prototypes, static validation, stress testing etc.
- Enabling lockup detection and mitigation capabilities in hardware and software.
- Following secure coding guidelines and performing code reviews to minimize bugs.
- Using supported configurations and workloads within validated operating parameters.
- Applying necessary microcode, firmware, OS, driver, application updates and patches.
- Monitoring for CPU errata and promptly applying available microcode updates.
- Detecting and resolving performance and thermal hotspots.
- Exercising fault injection and corner case testing as part of validation.
However, real-world variability means that some residual lockup risk persists, so systems still need runtime detection and handling.
Performance Impact of Core Lockups
Core lockups can significantly degrade system performance and responsiveness. A locked core reduces available CPU capacity, increasing utilization on remaining cores. Lockups in cores handling critical processes like OS kernel, IO buses, etc. can stall the entire system.
Frequent core lockups leading to resets and reboots also diminish performance until the underlying cause is fixed. Lockup overhead depends on reset latency, recovery time, and capability to isolate the failed core.
In mission-critical infrastructure, lockups can violate service level objectives. Performance impact can range from negligible in overprovisioned systems to total outage in single-threaded systems lacking redundancy.
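The capacity effect can be made concrete with a small calculation. The sketch below assumes the workload is constant and can be spread evenly over the surviving cores, which is an idealization; the function name utilization_after_lockup is illustrative.

```python
def utilization_after_lockup(total_cores, locked_cores, utilization):
    """Average utilization of surviving cores if the same load is spread evenly.

    `utilization` is the pre-lockup average per-core utilization (0.0 to 1.0).
    Returns None when no cores remain or the remaining cores cannot absorb
    the load (required utilization above 100%).
    """
    remaining = total_cores - locked_cores
    if remaining <= 0:
        return None
    new_util = utilization * total_cores / remaining
    return new_util if new_util <= 1.0 else None


# 8 cores at 70% lose one core: the other 7 must run at about 80%.
eight_core = utilization_after_lockup(8, 1, 0.70)
# A single-core system has nothing left to absorb the load.
single_core = utilization_after_lockup(1, 1, 0.50)
```

This matches the range described above: an overprovisioned system absorbs the loss with a modest utilization increase, while a system without spare capacity cannot serve the load at all.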
Reliability and Availability Risks
Core lockups present significant reliability and availability risks for systems and services relying on affected processors:
- System crashes, hangs, and freezes due to unexpected lockups can cause outages, impacting service uptime.
- Data loss or corruption if a lockup occurs during reads or writes to disk or memory.
- Malfunctions, degraded performance, and safety issues in real-time automotive, industrial, and aerospace systems.
- Security vulnerabilities, such as denial of service or exposure of privileged information.
- Inconsistent computing results if core lockups are not deterministic.
Sensitivity to lockup risks depends on workload redundancy, availability of spares, fallback systems, and backup mechanisms. Safety-critical systems may require extensive lockup prevention via hardware redundancy and software validation.
Diagnosing Lockup Causes
A systematic methodology is required to diagnose root causes of core lockups:
- Reproduce the lockup reliably on an affected system.
- Collect all available pre-lockup and post-lockup information like logs, register dumps, hardware counters etc.
- Determine whether the issue is triggered by hardware or software by testing with and without various software components.
- Analyze collected data to narrow down to specific hardware blocks or software modules involved.
- Identify if any firmware/microcode, hardware errata, or software bugs relate to the observed failure mode.
- Attempt lockup mitigation fixes like microcode updates to confirm diagnosis.
- If it is a hardware bug, identify the design flaw based on internal design visibility and pinpoint the correction needed.
- For software issues, debug and fix offending application or OS code.
Vendor design teams follow a similar process to diagnose and correct any core lockup issues escaping pre-silicon validation.
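The step of testing with and without various software components can be organized as a bisection when the lockup reproduces reliably. The sketch below assumes exactly one component triggers the lockup and that a non-empty component list is given; the component names and the fake_harness function are hypothetical stand-ins for a real reproduction setup.

```python
def find_trigger(components, locks_up):
    """Bisect `components` to find the single one that triggers the lockup.

    `locks_up(enabled)` must return True if the system locks up with only
    the `enabled` components loaded. Assumes exactly one culprit and a
    reproducible failure.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the first half alone reproduces the lockup, the culprit is in it;
        # otherwise it must be in the second half.
        candidates = half if locks_up(half) else candidates[len(half):]
    return candidates[0]


# Hypothetical component names and a fake reproduction harness.
COMPONENTS = ["gpu_driver", "net_driver", "power_governor", "fs_module", "app"]


def fake_harness(enabled):
    return "power_governor" in enabled


culprit = find_trigger(COMPONENTS, fake_harness)
```

Bisection needs roughly log2(n) reproduction runs instead of n, but it breaks down if the lockup is intermittent or requires two components to interact, in which case each configuration has to be tested repeatedly or in combination.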
Core Lockups in ARM-based Systems
ARM processor cores implement multiple strategies to detect and recover from core lockups:
- Watchdog timers – Each core has a local watchdog timer to detect lockups.
- Heartbeat signaling – Cores signal heartbeats for aliveness monitoring.
- Lockstep cores – Lockstepped cores allow faulty core reset without system impact.
- Parity/ECC – Memory error detection helps identify lockups caused by faulty memory accesses.
- Bus error handling – Bus fabric responds to illegal bus access patterns.
ARM cores also include extensive debug capabilities, such as embedded trace macrocells and microcode patches, which help in analyzing potential lockup sources. Architectural and microarchitectural mitigations further limit lockup risks.
Nevertheless, ARM-based systems remain susceptible to lockups from firmware, power management, interconnect, memory hierarchy, IO peripheral, and software issues. Proper validation, stress testing, failure analysis, and firmware-hardware co-debug help minimize such risks during product development.