k_sem_give faults (cache access error) from IRAM ISR: z_sem_pop_waiter is flash-resident #53

@swoisz

Summary

k_sem_give() is marked K_ISR_SAFE (i.e. IRAM_ATTR) so it can be called
from ESP-IDF interrupts that may run while the flash cache is disabled. However,
on its hot path it calls the static helper z_sem_pop_waiter()
(components/zkernel/src/k_sem.c), which is not K_ISR_SAFE and therefore
lands in flash-mapped (cached) text. When k_sem_give() is invoked from an IRAM
ISR during a window where the cache is disabled (e.g. a concurrent SPI
flash/NVS write or erase), fetching z_sem_pop_waiter() faults with a cache
access error and the chip panics.

Why this matters

The inline comment on k_sem_give() documents the exact contract this breaks:
it is K_ISR_SAFE so that an interrupt allocated with ESP_INTR_FLAG_IRAM can
give a semaphore safely while the cache is off. A very common path that hits
this is an IRAM-registered GPIO interrupt whose handler calls k_work_submit()
-- work submission wakes the target work queue via k_sem_give(). The boreas
GPIO driver installs its ISR service with ESP_INTR_FLAG_IRAM
(components/zdevice/src/gpio_dt.c), so any such handler that submits work and
happens to fire during a flash operation will fault.

Observed

The chip panics with Guru Meditation Error: ... Cache error / Cache access
error. Symbolized backtrace (innermost first):

z_sem_pop_waiter            k_sem.c            <-- faulted here (cache error)
k_sem_give                  k_sem.c
k_work_submit_internal      k_work.c
k_work_submit_to_queue      k_work.c
k_work_submit               k_work.c
<IRAM GPIO ISR handler> -> k_work_submit
gpio_esp32_isr              gpio_dt.c
<esp_driver_gpio ISR dispatch>

MEPC resolves into the flash-mapped region (0x4200_0000+) at
z_sem_pop_waiter; every other frame on the stack is in IRAM (0x4080_0000+).
The fault is intermittent by nature -- it requires the interrupt to land inside
a cache-disabled flash window.
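
The address check described above can be sketched as a small host-side helper. This is an illustration only: the two region bases are the ones quoted in this report, the upper bounds are not stated here and are deliberately left out, and the names classify_pc and first_flash_frame are hypothetical (they are not part of zkernel or ESP-IDF).

```c
#include <stdint.h>

/* Region bases as quoted in this report. Upper limits are chip-specific and
 * not given in the report, so this sketch compares against the bases only. */
#define FLASH_TEXT_BASE 0x42000000u /* flash-mapped (cached) text */
#define IRAM_TEXT_BASE  0x40800000u /* internal RAM text */

enum text_region { REGION_OTHER, REGION_IRAM, REGION_FLASH };

/* Hypothetical helper: classify a program counter by region base. */
static enum text_region classify_pc(uint32_t pc)
{
    if (pc >= FLASH_TEXT_BASE) {
        return REGION_FLASH;
    }
    if (pc >= IRAM_TEXT_BASE) {
        return REGION_IRAM;
    }
    return REGION_OTHER;
}

/* Hypothetical helper: index of the first flash-resident frame in a
 * backtrace (innermost first), or -1 if every frame is outside flash. */
static int first_flash_frame(const uint32_t *pcs, int n)
{
    for (int i = 0; i < n; i++) {
        if (classify_pc(pcs[i]) == REGION_FLASH) {
            return i;
        }
    }
    return -1;
}
```

Fed the frame addresses from a panic dump, a result of 0 corresponds to the situation reported here: the innermost frame is the only one outside IRAM.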

Root cause

z_sem_pop_waiter() (the wake-target popper called on k_sem_give()'s hot
path, and also by k_sem_reset()) lacks K_ISR_SAFE, so it is flash-resident
while its ISR-safe caller is in IRAM. An IRAM function must only call
IRAM-resident code when the cache may be disabled; this one link in the
k_sem_give call graph violates that.

Proposed fix

Mark z_sem_pop_waiter() K_ISR_SAFE. The function is pure list-walking over
the caller-owned waiter list with no FreeRTOS calls, so it is safe to place in
IRAM. Its other caller, k_sem_reset(), is flash-resident, but flash code
calling an IRAM function is fine.

static K_ISR_SAFE struct z_sem_waiter *z_sem_pop_waiter(struct k_sem *sem)

Verified on an esp32c5 target: the symbol relocates from the flash-mapped
region into IRAM, the cache-error panic no longer reproduces, the host test
suite is unaffected, and clang-format stays clean. IRAM cost is a single
small list-walk function (negligible).
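
For context, a minimal sketch of what the annotated helper might look like, buildable off-target. The struct layouts and the function body are assumptions made for illustration (the real definitions live in k_sem.c and are not shown in this report); only the signature line is taken from the proposed fix. The empty K_ISR_SAFE fallback exists solely so the sketch compiles on a host.

```c
#include <stddef.h>

/* Host fallback only. On target, the project's own K_ISR_SAFE (described in
 * this report as IRAM_ATTR) places the function in IRAM. */
#ifndef K_ISR_SAFE
#define K_ISR_SAFE
#endif

/* Hypothetical layouts: the real structs are not shown in this report. */
struct z_sem_waiter {
    struct z_sem_waiter *next;
};

struct k_sem {
    struct z_sem_waiter *waiters; /* caller-owned singly linked waiter list */
    unsigned int count;
};

/* Signature as proposed in the fix. The body is an assumed head-pop: it only
 * touches the waiter list and makes no calls, so nothing it executes can
 * pull in flash-resident code. */
static K_ISR_SAFE struct z_sem_waiter *z_sem_pop_waiter(struct k_sem *sem)
{
    struct z_sem_waiter *w = sem->waiters;

    if (w != NULL) {
        sem->waiters = w->next;
        w->next = NULL;
    }
    return w;
}
```

The property that matters is that the body has no outgoing calls; any helper it did call would need the same attribute.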

Suggested follow-up

Audit the rest of the ISR-reachable call graph from K_ISR_SAFE entry points
(k_sem_give, k_sem_take with K_NO_WAIT, the k_work_submit chain) for any
other static helpers that are flash-resident. The same class of bug -- an
IRAM_ATTR function calling a non-IRAM_ATTR helper -- would be latent
anywhere a helper was factored out without carrying the attribute.
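
One way such an audit could be partly mechanized, as a rough sketch: dump the firmware's symbols with nm, then flag any text symbol that is on a hand-maintained list of ISR-reachable functions but sits at or above the flash-mapped base quoted in this report. The function name flags_flash_helper and the reachable list are hypothetical, and the input is assumed to be in default nm format ("address type name").

```c
#include <stdio.h>
#include <string.h>

/* Flash-mapped text base as quoted in this report. */
#define FLASH_TEXT_BASE 0x42000000ul

/* Hypothetical check: returns 1 if nm_line describes a text symbol that is
 * in the reachable list and is located in the flash-mapped region. */
static int flags_flash_helper(const char *nm_line,
                              const char *const *reachable,
                              size_t n_reachable)
{
    unsigned long addr;
    char type;
    char name[128];

    /* Lines without an address (e.g. undefined symbols) fail to parse. */
    if (sscanf(nm_line, "%lx %c %127s", &addr, &type, name) != 3) {
        return 0;
    }
    if (type != 't' && type != 'T') {
        return 0; /* not a text symbol */
    }
    if (addr < FLASH_TEXT_BASE) {
        return 0; /* below the flash-mapped base */
    }
    for (size_t i = 0; i < n_reachable; i++) {
        if (strcmp(name, reachable[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

The reachable list still has to be built by reading the call graph from each K_ISR_SAFE entry point, so this only catches helpers that someone has already identified as ISR-reachable.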
