Debugging the Disappearing Service Processor: A Tale of Mismatched Memory Attributes


Explore Oxide's challenging journey debugging an intermittent Service Processor network disconnection. This deep dive uncovers a subtle hardware-software interaction involving mismatched memory attributes on a Cortex-M7 STM32H7, highlighting the complexities of modern CPU debugging.

Designing the Oxide rack involves careful consideration of component accessibility. The rack is intended for data centers where access is almost exclusively over the network, and physical intervention is mainly for replacing failing hardware such as disks. Our Service Processor (SP) is therefore designed to be managed over the network.

During initial deployments of our next-generation Cosmo sled into an Oxide rack, the Service Processor intermittently disconnected from the network. This presented a significant debugging challenge, as loss of network access severely limited our ability to diagnose the SP's state. Initial debugging efforts focused on other system indicators:

The AMD host CPU remained powered, indicating overall system integrity.
The SP ceased broadcasting its presence over the management network.
Network traffic counters for the SP stopped increasing.
Fans ran at a consistently elevated speed, suggesting the SP's fan controller had reverted to an emergency full-power mode.
Crucially, this issue was not reproducible on sleds outside the rack environment.

The SP operates on our custom OS, Hubris, which structures system functions (networking, thermal control, updates) as distinct tasks. While not a true RTOS, Hubris incorporates task priorities. An early hypothesis suggested a software bug causing task starvation, where a hung or crash-looping task consumed all CPU cycles, preventing the networking task from executing. To investigate, we increased task restart delays and modified the chassis LED from 'always on' to blinking, providing a visual indicator of SP activity even without network access.
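The starvation hypothesis can be illustrated with a toy model. The sketch below is not Hubris code: the `Task` struct, `pick_next`, `run_counts`, the task names, and the convention that a lower number means a more important task are all assumptions made for illustration. It only shows how an always-runnable task that outranks the networking task would leave the networking task with no CPU time under strict priority scheduling.

```rust
/// Toy model of a strict-priority scheduler. Hypothetical; not Hubris code.
#[derive(Debug, Clone)]
struct Task {
    name: &'static str,
    /// Assumption for this sketch: a lower number means a more important task.
    priority: u8,
    /// Whether the task currently wants the CPU.
    runnable: bool,
}

/// Pick the most important runnable task, if any.
fn pick_next(tasks: &[Task]) -> Option<&Task> {
    tasks.iter().filter(|t| t.runnable).min_by_key(|t| t.priority)
}

/// Count how many of `ticks` scheduling decisions each task wins,
/// assuming no task ever changes its runnable state.
fn run_counts(tasks: &[Task], ticks: u32) -> Vec<(&'static str, u32)> {
    let mut counts: Vec<(&'static str, u32)> = tasks.iter().map(|t| (t.name, 0)).collect();
    for _ in 0..ticks {
        if let Some(next) = pick_next(tasks) {
            if let Some(entry) = counts.iter_mut().find(|c| c.0 == next.name) {
                entry.1 += 1;
            }
        }
    }
    counts
}

/// A hypothetical task set: a crash-looping task that is always runnable
/// and outranks the networking task.
fn demo_tasks() -> Vec<Task> {
    vec![
        Task { name: "crash_looper", priority: 1, runnable: true },
        Task { name: "net", priority: 2, runnable: true },
    ]
}

fn main() {
    for (name, ticks) in run_counts(&demo_tasks(), 1000) {
        println!("{name}: {ticks} ticks");
    }
}
```

In this model the networking task never runs, which matches the symptom of an SP that is powered but silent on the network. Adding a restart delay corresponds to making the crash-looping task non-runnable for a while, which gives lower-priority tasks a chance to execute.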

These debug modifications allowed us to reproduce the issue, though with perplexing results: the LED would sometimes be stuck on, other times stuck off. Given the LED blinking task's high priority, this narrowed the potential sources of a stuck task. While Rust significantly reduces bug classes like buffer overflows in Hubris, stack overflows remain a challenge due to manual stack sizing requirements. Although task-level stack overflows result in safe restarts, a kernel-level stack overflow could mimic the observed stalled behavior. However, the kernel's substantial stack margins (512 bytes) made this less probable.

To gain deeper insights, we resorted to using SWD debug headers, typically reserved for manufacturing and not intended for production systems, especially within a live rack. Attaching these required creative cable routing with team assistance.

Fortunately, this yielded a breakthrough: we reproduced the issue with the debug probe connected. However, the probe was unable to halt the CPU (an STM32H7, built around a Cortex-M7 core), which severely restricted the diagnostic data we could extract and indicated that the system was in a badly wedged state.

Our investigation shifted towards system components that could induce such a state. A key difference from our previous Gimlet system was the integration of an FPGA to manage elements like host flash. This FPGA interfaces with the STM32H7 via a legacy parallel bus, resembling a RAM interface, and is managed by the Flexible Memory Controller (FMC). According to the reference manual (RM0433, Section 22.1), the FMC's primary roles are:

Translating AXI transactions into the appropriate external device protocol.
Meeting the access time requirements of external memory devices.

A CPU can stall if it fails to receive a bus acknowledgment from an external device. For instance, an FPGA timing error could cause the CPU to indefinitely hang when attempting a register read. To validate this, we developed an FPGA test image featuring a register designed to intentionally hang the FMC bus upon read. This experiment replicated the observed behavior closely, strongly implicating the FMC bus as the source of the problem.
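The stall mechanism can be sketched as a small simulation. Everything here is hypothetical: the `BusDevice` trait, the two register types, and `read_with_budget` are illustrative names, not part of the FMC, the FPGA design, or Hubris. The cycle budget exists only so the simulation terminates; the scenario described above has no such limit, which is why the CPU waits forever.

```rust
/// Hypothetical model of a device behind a bus that uses an acknowledge handshake.
trait BusDevice {
    /// Called once per bus cycle. Returns `Some(data)` when the device
    /// acknowledges the read and `None` while the master must keep waiting.
    fn poll_read(&mut self) -> Option<u32>;
}

/// A register that acknowledges after a fixed number of wait cycles.
struct HealthyRegister {
    wait_cycles: u32,
    value: u32,
}

impl BusDevice for HealthyRegister {
    fn poll_read(&mut self) -> Option<u32> {
        if self.wait_cycles == 0 {
            Some(self.value)
        } else {
            self.wait_cycles -= 1;
            None
        }
    }
}

/// A register that never acknowledges, like the one in the FPGA test image.
struct HangingRegister;

impl BusDevice for HangingRegister {
    fn poll_read(&mut self) -> Option<u32> {
        None
    }
}

/// Wait for an acknowledgment for at most `max_cycles` cycles.
/// The budget is a simulation aid only; it is not a feature of the real bus.
fn read_with_budget(dev: &mut dyn BusDevice, max_cycles: u32) -> Option<u32> {
    for _ in 0..max_cycles {
        if let Some(data) = dev.poll_read() {
            return Some(data);
        }
    }
    None
}

fn main() {
    let mut ok = HealthyRegister { wait_cycles: 3, value: 0xCAFE };
    let mut hung = HangingRegister;
    println!("healthy: {:?}", read_with_budget(&mut ok, 1_000));
    println!("hanging: {:?}", read_with_budget(&mut hung, 1_000));
}
```

A load that never completes also explains why the debug probe could not halt the core: the processor cannot take the halt while it is still waiting for the outstanding bus transaction to finish.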

While full system dumps are standard for Hubris debugging, they require CPU halting, which was impossible. We then utilized ARM CPU 'vector catch' functionality, configuring the CPU to halt immediately after a reset, before executing the first instruction. This successfully unstuck the CPU, preserving most Hubris RAM state, though the program counter and running register state were lost. Analysis revealed no active Hubris task accessing the FMC.
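For readers unfamiliar with vector catch, the sketch below computes the register writes a debug probe would typically issue to arm reset vector catch on an ARMv7-M core. The register addresses and bit positions come from the ARMv7-M architecture; the `RegWrite` type and `vector_catch_sequence` function are hypothetical names for this illustration. The post does not say which tooling was used or how the reset was triggered, so the final `AIRCR.SYSRESETREQ` write is an assumption (a reset pin would serve the same purpose).

```rust
/// Core debug and system control register addresses defined by ARMv7-M.
const DHCSR: u32 = 0xE000_EDF0; // Debug Halting Control and Status Register
const DEMCR: u32 = 0xE000_EDFC; // Debug Exception and Monitor Control Register
const AIRCR: u32 = 0xE000_ED0C; // Application Interrupt and Reset Control Register

const DHCSR_DBGKEY: u32 = 0xA05F << 16; // key required for DHCSR writes
const DHCSR_C_DEBUGEN: u32 = 1 << 0; // enable halting debug
const DEMCR_VC_CORERESET: u32 = 1 << 0; // halt on the reset vector
const AIRCR_VECTKEY: u32 = 0x05FA << 16; // key required for AIRCR writes
const AIRCR_SYSRESETREQ: u32 = 1 << 2; // request a system reset

/// One 32-bit write performed by the probe (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
struct RegWrite {
    addr: u32,
    value: u32,
}

/// Writes that arm reset vector catch and then request a reset.
/// A real probe would read-modify-write DEMCR to preserve its other bits.
fn vector_catch_sequence() -> Vec<RegWrite> {
    vec![
        RegWrite { addr: DHCSR, value: DHCSR_DBGKEY | DHCSR_C_DEBUGEN },
        RegWrite { addr: DEMCR, value: DEMCR_VC_CORERESET },
        // Assumption: the reset is requested through AIRCR.
        RegWrite { addr: AIRCR, value: AIRCR_VECTKEY | AIRCR_SYSRESETREQ },
    ]
}

fn main() {
    for w in vector_catch_sequence() {
        println!("write {:#010X} <- {:#010X}", w.addr, w.value);
    }
}
```

Because the core halts before executing the first instruction after reset, SRAM contents that the reset does not clear are still available for inspection, while the pre-reset register state is gone.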

Hardware engineers reviewed the FPGA timings and identified potential violations of the memory interface constraints, and a fix was implemented. The first vector catch dumps taken after this fix appeared inconsistent, likely due to cache effects. Disabling the cache yielded consistent dumps, but we were unable to reproduce the original issue.

Hubris development continued. A key change involved our measured boot work, in which the Root of Trust (RoT) hashes the SP flash at boot for higher-level software. To meet the security requirements, the SP resets itself several times during initial boot. Testing this change unexpectedly brought back the original symptoms: the Cosmo SP vanished from the network. This was a breakthrough, cutting the time needed to reproduce the issue from over 24 hours to 10-20 minutes. Even with faster reproduction, the dumps still showed no clear culprit, though the FMC bus remained the prime suspect because so few scenarios could cause this behavior.

The faster reproduction let us run numerous experiments, none of which resolved the issue, including:

Adjusting reset rates and sequences before normal boot.
Performing additional FPGA bit stream clearing.
Restricting task access to the FMC bus.
Removing seemingly unrelated tasks.

A deep dive into the STM32H7 manual finally provided crucial insight: the processor itself might be performing unexpected accesses on the FMC bus. Modern CPUs manage significant internal state not directly visible to programmers, making it difficult to predict cache operations. A CPU writing cached data to memory constitutes a memory access, potentially to addresses unrelated to the current program counter.

Hubris employs a Memory Protection Unit (MPU) for task isolation and privilege enforcement. While unprivileged tasks use the MPU, the privileged kernel relies on the default memory map. We had configured the FMC for tasks as Uncached Device Memory. However, the STM32H7 manual revealed that our chosen FMC base address had a default memory type of Normal Cached. This discrepancy meant the FMC had different memory attributes depending on whether it was accessed by a task or the kernel.
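The discrepancy can be made concrete with a small lookup of the ARMv7-M default memory map, which assigns a memory type to each 512 MB region of the address space. This is a simplified sketch: the `MemType` enum, both functions, and `ASSUMED_FMC_BASE` are hypothetical names, and the base address `0x6000_0000` is an assumption for illustration, since the post does not state the address that was used.

```rust
/// Simplified memory types from the ARMv7-M default memory map.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemType {
    /// Normal memory: cacheable in the default map.
    Normal,
    /// Device memory: not cached.
    Device,
    /// System region (private peripheral bus and vendor space), lumped together here.
    System,
}

/// Memory type the default map assigns to `addr`, by 512 MB region.
fn default_memory_type(addr: u32) -> MemType {
    match addr >> 29 {
        0 | 1 => MemType::Normal, // 0x0000_0000..=0x3FFF_FFFF: code and SRAM
        2 => MemType::Device,     // 0x4000_0000..=0x5FFF_FFFF: peripherals
        3 | 4 => MemType::Normal, // 0x6000_0000..=0x9FFF_FFFF: external RAM
        5 | 6 => MemType::Device, // 0xA000_0000..=0xDFFF_FFFF: external devices
        _ => MemType::System,     // 0xE000_0000..=0xFFFF_FFFF
    }
}

/// Returns `Some((task_view, kernel_view))` when a task's MPU attribute for
/// `addr` differs from what the kernel sees through the default map.
fn attribute_mismatch(addr: u32, task_attr: MemType) -> Option<(MemType, MemType)> {
    let kernel_attr = default_memory_type(addr);
    if kernel_attr != task_attr {
        Some((task_attr, kernel_attr))
    } else {
        None
    }
}

/// Assumption for illustration only; the post does not give the address.
const ASSUMED_FMC_BASE: u32 = 0x6000_0000;

fn main() {
    match attribute_mismatch(ASSUMED_FMC_BASE, MemType::Device) {
        Some((task, kernel)) => println!("mismatch: task sees {task:?}, kernel sees {kernel:?}"),
        None => println!("attributes match"),
    }
}
```

A check of this kind, run over every MPU region at build time, would flag any address that a task maps with one memory type while the privileged default map assigns it another.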

Section A3.5.7 of the ARMv7-M reference manual details issues arising from mismatched memory attributes. Hardware engineers pinpointed 'Preservation of the size of accesses' as a critical concern, as our FPGA interface expected 32-bit accesses, making smaller 16-bit or 8-bit accesses problematic.
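The access-size concern can be sketched as a check over bus transactions. The `Transaction` type, the functions, and both traces below are hypothetical; in particular, the split trace is an invented example of what losing size preservation could look like, not a capture of what the processor actually emitted.

```rust
/// One bus access as seen by the external device (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Transaction {
    addr: u32,
    size_bytes: u8,
}

/// The FPGA interface in this model accepts only aligned 32-bit accesses.
fn fpga_accepts(t: Transaction) -> bool {
    t.size_bytes == 4 && t.addr % 4 == 0
}

/// Return every transaction in `trace` that the FPGA model would not accept.
fn violations(trace: &[Transaction]) -> Vec<Transaction> {
    trace.iter().copied().filter(|t| !fpga_accepts(*t)).collect()
}

fn main() {
    // A 32-bit store through a Device mapping keeps its size.
    let device_trace = [Transaction { addr: 0x10, size_bytes: 4 }];
    // Invented example: the same bytes arriving as two 16-bit accesses.
    let split_trace = [
        Transaction { addr: 0x10, size_bytes: 2 },
        Transaction { addr: 0x12, size_bytes: 2 },
    ];
    println!("device mapping violations: {}", violations(&device_trace).len());
    println!("split access violations: {}", violations(&split_trace).len());
}
```

The point of the model is that the device's correctness depends on a property of the access, its size, that only the Device memory type promises to preserve.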

Crucially, the kernel was never intentionally accessing the FMC via the Normal Cached mapping. The most probable sequence of events was:

An unprivileged task accessing the FMC issues a store, which enters the processor's store buffer.
An interrupt occurs, transitioning the system to privileged mode, which uses the default memory map.
The store hits the cache because the default memory map designated that address as cached.
The cache then attempts to write to memory in a manner inconsistent with the expected Device Memory attributes.

ARMv7-M Section A3.5.7 explicitly recommends against using mismatched attributes for aliases of the same location. The default ARM memory map, utilized by the kernel, includes a section specifically configured for uncached device memory – precisely what was needed. The STM32H7 FMC supports relocating its base address to this section, likely to mitigate this exact type of problem.

The ultimate solution involved changing the FMC base address to this section with matching attributes. Since implementing this fix, the issue has not recurred.

Transparency remains a core value at Oxide. Debugging modern CPUs often delves into opaque areas, making questions like 'Under what circumstances will memory bus access fail?' particularly challenging. In this instance, comprehensive documentation from ARM and STMicroelectronics was instrumental in resolving the issue. Given the complexity of this debug process, we believe emphasizing such potential problems in vendor documentation would greatly benefit all customers. Oxide encourages all hardware vendors to maintain thorough documentation for the advantage of their user base.