Fix OpenBLAS atfork SIGSEGV during Kit startup #5693
NumPy 2.x ships a bundled OpenBLAS (libscipy_openblas64_) that spawns worker threads at import time and registers blas_thread_shutdown_ as a pthread_atfork child handler. When test files (or standalone scripts) import torch at the top level before AppLauncher is instantiated, the thread pool is already live when Kit's libomni.platforminfo.plugin calls fork() during startup. In the child process the atfork handler tries to pthread_join the (now non-existent) worker threads, causing SIGSEGV.

Set OPENBLAS_NUM_THREADS=1 via os.environ.setdefault in two places:
- tools/conftest.py: injected into every test-subprocess environment so CI tests are protected regardless of their top-level imports.
- app_launcher.py: set at module scope (before isaacsim is imported) so standalone scripts that import torch before AppLauncher are also safe.

setdefault is used so that an explicit user or CI setting is never overridden.

Verified on 8×L40 (torch 2.10, numpy 2.3.5, scipy 1.17.1):
- import torch loads libscipy_openblas64_-fdde5778.so (exact crash lib)
- Default: 64 OS threads spawned; OPENBLAS_NUM_THREADS=1: stays at 1
- test_ray_caster_patterns.py: 90/90 passed with the fix applied
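The thread-count check in the verification notes can be approximated with a small script. This is a minimal sketch, not the procedure the author used: it assumes a Linux host (it counts entries in /proc/self/task), and the helper name os_threads_after_import is hypothetical. It imports a module in a fresh interpreter, with or without OPENBLAS_NUM_THREADS set, and reports how many OS threads exist afterwards.

```python
import os
import subprocess
import sys
from typing import Optional

# Code run in the child interpreter: import the requested module, then count
# the OS threads of that process. Assumes Linux, where each thread appears as
# one entry under /proc/self/task.
_CHILD_SNIPPET = (
    "import importlib, os, sys\n"
    "importlib.import_module(sys.argv[1])\n"
    "print(len(os.listdir('/proc/self/task')))\n"
)


def os_threads_after_import(module: str, openblas_threads: Optional[str] = None) -> int:
    """Return the OS thread count of a fresh interpreter after importing `module`.

    Hypothetical helper for illustration. If `openblas_threads` is None, the
    variable is removed from the child environment so the library default applies.
    """
    env = dict(os.environ)
    if openblas_threads is None:
        env.pop("OPENBLAS_NUM_THREADS", None)
    else:
        env["OPENBLAS_NUM_THREADS"] = openblas_threads
    result = subprocess.run(
        [sys.executable, "-c", _CHILD_SNIPPET, module],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    # Use the last line in case the imported module prints to stdout.
    return int(result.stdout.strip().splitlines()[-1])
```

Comparing os_threads_after_import("torch") with os_threads_after_import("torch", "1") on a machine with torch installed should show the difference the commit message reports (64 versus 1 on the author's setup); the numbers on other hardware will differ.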
Greptile Summary
This PR fixes an intermittent
Confidence Score: 4/5
The change is a targeted, one-line env-var guard in two files with no logic branching; the crash being fixed is well understood and the mechanism is correct. Both changes are minimal and mechanically sound. The
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Script as Standalone Script / Test
participant ALpy as app_launcher.py (module scope)
participant OB as OpenBLAS (via NumPy/torch)
participant Kit as Kit / libomni.platforminfo
participant Child as Forked Child Process
Note over Script,Child: WITHOUT fix (crash path)
Script->>OB: import torch → OpenBLAS spawns 64 worker threads
Script->>ALpy: from isaaclab.app import AppLauncher
ALpy->>Kit: SimulationApp.__init__
Kit->>Child: fork()
Child->>OB: blas_thread_shutdown_ (atfork handler)
OB-->>Child: pthread_join(missing threads) → SIGSEGV 💥
Note over Script,Child: WITH fix (safe path)
ALpy->>ALpy: os.environ.setdefault("OPENBLAS_NUM_THREADS","1")
Script->>OB: import torch → OpenBLAS starts with 0 worker threads
ALpy->>Kit: SimulationApp.__init__
Kit->>Child: fork()
Child->>OB: blas_thread_shutdown_ (atfork handler)
OB-->>Child: no threads to join → safe no-op ✅
Reviews (1): Last reviewed commit: "fix: prevent OpenBLAS atfork SIGSEGV dur..."
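The crash path in the diagram rests on a POSIX property: fork() duplicates only the calling thread, so worker threads that exist in the parent are absent in the child. The following is a small sketch of that property using plain Python threads rather than OpenBLAS workers. It assumes Linux (thread counting via /proc/self/task), and both function names are hypothetical.

```python
import os
import threading
from typing import Tuple


def count_os_threads() -> int:
    """Count this process's OS threads (Linux only: one entry per thread)."""
    return len(os.listdir("/proc/self/task"))


def threads_before_and_after_fork(num_workers: int = 4) -> Tuple[int, int]:
    """Return (thread count in the parent, thread count seen by a forked child)."""
    stop = threading.Event()
    # Idle workers standing in for a library thread pool; they block on the event.
    workers = [threading.Thread(target=stop.wait, daemon=True) for _ in range(num_workers)]
    for worker in workers:
        worker.start()

    parent_count = count_os_threads()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() exists here, so any handler
        # that tries to join the parent's workers has nothing valid to join.
        os.write(write_fd, str(count_os_threads()).encode())
        os._exit(0)

    # Parent: read the child's count, then clean up the child and the workers.
    os.close(write_fd)
    child_count = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    stop.set()
    for worker in workers:
        worker.join()
    return parent_count, child_count
```

Calling threads_before_and_after_fork() returns a parent count that includes the workers and a child count that does not. Python 3.12 and later emit a DeprecationWarning when os.fork() is called while other threads are running, which reflects the same hazard.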
# thread count to 1 *before* the library is loaded avoids the crash because
# no worker threads are created and the atfork handler becomes a no-op.
# Uses setdefault so that an explicit user/CI setting is respected.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
Single-threaded BLAS applied globally to all users
OPENBLAS_NUM_THREADS=1 is set unconditionally at module scope, so every process that imports app_launcher — including users running batch physics computations, inverse-kinematics solves, or any heavy NumPy/SciPy workload — silently loses multi-threaded BLAS performance. The fork-safety hazard only materialises during startup when Kit calls fork(), so on hardware where the issue does not reproduce, users pay the single-thread tax with no benefit. A narrower alternative would be to reset the env var (or the pool) only when a fork is about to occur, e.g. via os.register_at_fork, but if the performance cost is accepted for Isaac Lab's GPU-first workloads this is fine as-is.
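For reference, os.register_at_fork, which the comment mentions, behaves as sketched below. This only illustrates the hook ordering and is not a proposed patch; the function names are hypothetical. One caveat worth stating: the Python documentation notes that fork() calls made by third-party C code may not invoke these hooks, so whether they would fire for the fork issued inside Kit is an open assumption here.

```python
import os
from typing import List, Tuple

# Records which hooks ran, in order, in the current process.
events: List[str] = []


def install_fork_hooks() -> None:
    """Register hooks that log each fork phase (hypothetical helper).

    Hooks cannot be unregistered, so call this once per process.
    """
    os.register_at_fork(
        before=lambda: events.append("before"),
        after_in_parent=lambda: events.append("after_in_parent"),
        after_in_child=lambda: events.append("after_in_child"),
    )


def fork_and_collect() -> Tuple[List[str], List[str]]:
    """Fork through os.fork() and return (parent events, child events)."""
    events.clear()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the hooks observed on this side of the fork, then exit
        # without running the parent's cleanup handlers.
        os.write(write_fd, ",".join(events).encode())
        os._exit(0)

    os.close(write_fd)
    child_events = os.read(read_fd, 256).decode().split(",")
    os.close(read_fd)
    os.waitpid(pid, 0)
    return list(events), child_events
```

A "before" hook is where a narrower fix would have to quiesce or resize the BLAS thread pool. That requires a way to control the pool at runtime, which this sketch does not attempt.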
# pthread_join threads that no longer exist → SIGSEGV. Limiting
# OpenBLAS to a single thread before the subprocess starts avoids the
# crash because no worker threads are created and the handler is a no-op.
env.setdefault("OPENBLAS_NUM_THREADS", "1")
Coverage gap for directly-invoked test runs
The guard is injected only when tests are dispatched through run_individual_tests, which spawns each test file as a child process. Developers who run pytest tests/test_foo.py directly on their workstation (a common local workflow) do not get this protection — the parent process has no guarantee that OPENBLAS_NUM_THREADS is set before import torch at the top of a test module fires. The crash is an intermittent race, so this may go unnoticed for long stretches and then surface unexpectedly. Placing the same os.environ.setdefault("OPENBLAS_NUM_THREADS", "1") at the top of this conftest.py module (so pytest applies it before collecting test files) would close the gap for all invocation paths.
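The suggestion above could look roughly like the following. This is a sketch under stated assumptions, not the actual contents of tools/conftest.py: build_test_env is a hypothetical stand-in for whatever code prepares the subprocess environment in run_individual_tests.

```python
import os
from typing import Dict, Mapping, Optional

# Module-level guard: takes effect as soon as conftest.py is imported, which
# must happen before any test module imports torch or numpy. setdefault leaves
# an explicit user or CI value untouched.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")


def build_test_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment for a test subprocess (hypothetical helper).

    Mirrors the existing subprocess guard: the default is applied only when
    the variable is not already set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("OPENBLAS_NUM_THREADS", "1")
    return env
```

With both guards in place, a direct pytest invocation and a dispatched subprocess get the same default, while a value such as OPENBLAS_NUM_THREADS=8 exported by the user still wins in either path.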
Description
Fixes an intermittent `SIGSEGV` crash during Kit startup caused by NumPy's bundled OpenBLAS `pthread_atfork` handler.
Root cause
NumPy 2.x ships a bundled OpenBLAS (`libscipy_openblas64_`) that spawns worker threads at import time and registers `blas_thread_shutdown_` as a child-side `pthread_atfork` handler. When test files (or standalone scripts) `import torch` at the top level before `AppLauncher` is instantiated, the OpenBLAS thread pool is already live by the time Kit's `libomni.platforminfo.plugin` calls `fork()` during startup. In the child process, the atfork handler tries to `pthread_join` worker threads that were not carried across the fork → SIGSEGV.
This is a different variant from the crash fixed by deferring `import torch` inside `_resolve_device_settings()` (which protects the `app_launcher.py`-internal import path). The new crash occurs when test files or standalone scripts import torch at module scope before `AppLauncher.__init__` runs.
Fix
Set `OPENBLAS_NUM_THREADS=1` via `os.environ.setdefault` in two places:
- `tools/conftest.py`: injected into every test-subprocess environment, so CI tests are protected regardless of their top-level imports.
- `app_launcher.py`: set at module scope (before `isaacsim` is imported), so standalone scripts that import torch before `AppLauncher` are also safe.
`setdefault` is used so that an explicit user or CI override is never clobbered.
With `OPENBLAS_NUM_THREADS=1`, OpenBLAS starts with zero worker threads. The `blas_thread_shutdown_` atfork handler is still registered but becomes a safe no-op: there are no threads to join.
Verification
Tested on 8×L40 (torch 2.10.0, numpy 2.3.5, scipy 1.17.1, Isaac Sim 6.0.0-rc.40):
- `import torch` loads `libscipy_openblas64_-fdde5778.so`
- `import torch` (default): 64 OS threads spawned
- `import torch` with `OPENBLAS_NUM_THREADS=1`: stays at 1 OS thread
- `test_ray_caster_patterns.py` with fix: 90/90 passed
- `test_ray_caster_patterns.py` without fix
Type of change
Checklist
`pre-commit` checks with `./isaaclab.sh --format`