Fix OpenBLAS atfork SIGSEGV during Kit startup #5693

Open
fatimaanes wants to merge 1 commit into isaac-sim:develop from fatimaanes:fix/openblas-fork-crash
Conversation

@fatimaanes
Collaborator

Description

Fixes an intermittent SIGSEGV crash during Kit startup caused by NumPy's bundled OpenBLAS pthread_atfork handler.

Root cause

NumPy 2.x ships a bundled OpenBLAS (libscipy_openblas64_) that spawns worker threads at import time and registers blas_thread_shutdown_ as a child-side pthread_atfork handler. When test files (or standalone scripts) import torch at the top level before AppLauncher is instantiated, the OpenBLAS thread pool is already live by the time Kit's libomni.platforminfo.plugin calls fork() during startup. In the child process, the atfork handler tries to pthread_join worker threads that were not carried across the fork → SIGSEGV.

This is a different variant from the crash fixed by deferring import torch inside _resolve_device_settings() (which protects the app_launcher.py-internal import path). The new crash occurs when test files or standalone scripts import torch at module scope before AppLauncher.__init__ runs.

Fix

Set OPENBLAS_NUM_THREADS=1 via os.environ.setdefault in two places:

  • tools/conftest.py: injected into every test-subprocess environment, so CI tests are protected regardless of their top-level imports.
  • app_launcher.py: set at module scope (before isaacsim is imported), so standalone scripts that import torch before AppLauncher are also safe.

setdefault is used so that an explicit user or CI override is never clobbered.

With OPENBLAS_NUM_THREADS=1, OpenBLAS starts with zero worker threads. The blas_thread_shutdown_ atfork handler is still registered but becomes a safe no-op — there are no threads to join.
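
The ordering and override behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual diff: build_test_env is a hypothetical helper name standing in for the environment setup inside run_individual_tests.

```python
import os

# Per the PR description, the variable has to be set before anything loads
# the bundled OpenBLAS (i.e. before numpy/torch are imported).
# setdefault leaves an explicit user or CI value untouched.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")


def build_test_env(base_env=None):
    """Return a copy of the environment for a test subprocess.

    Hypothetical helper: the name and signature are illustrative only.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("OPENBLAS_NUM_THREADS", "1")
    return env
```

With this shape, a caller that exports OPENBLAS_NUM_THREADS=8 before launching keeps that value, while an unset variable falls back to 1.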

Verification

Tested on 8×L40 (torch 2.10.0, numpy 2.3.5, scipy 1.17.1, Isaac Sim 6.0.0-rc.40):

| Check | Result |
| --- | --- |
| import torch loads libscipy_openblas64_-fdde5778.so | ✅ Confirmed (exact library from crash stack) |
| Default thread count after import torch | 64 OS threads |
| Thread count with OPENBLAS_NUM_THREADS=1 | 1 OS thread (no workers) |
| test_ray_caster_patterns.py with fix | ✅ 90/90 passed |
| test_ray_caster_patterns.py without fix | ✅ 90/90 passed (crash is an intermittent race) |
| Pre-commit checks | ✅ All passed |
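
One way to reproduce the thread-count rows is sketched below. It is an assumption about how such a check could be done, not the script used for this PR: os_thread_count is a hypothetical helper, and it relies on the Linux-only /proc/self/status file.

```python
import os
import subprocess
import sys

# Child program: optionally import a module, then print the kernel's
# thread count for the process (Linux /proc interface only).
_CHILD = """
import sys
if len(sys.argv) > 1:
    __import__(sys.argv[1])
with open("/proc/self/status") as status:
    for line in status:
        if line.startswith("Threads:"):
            print(line.split()[1])
"""


def os_thread_count(module=None, openblas_threads=None):
    """Count OS threads in a fresh interpreter after importing `module`."""
    env = os.environ.copy()
    if openblas_threads is not None:
        # Set in the child's environment so it is visible at library load time.
        env["OPENBLAS_NUM_THREADS"] = str(openblas_threads)
    args = [sys.executable, "-c", _CHILD]
    if module is not None:
        args.append(module)
    result = subprocess.run(args, env=env, capture_output=True, text=True, check=True)
    return int(result.stdout.strip())
```

Comparing os_thread_count("torch") with os_thread_count("torch", openblas_threads=1) on the affected machine would correspond to the two thread-count rows in the table.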

Type of change

  • Bug fix (non-breaking change that fixes an issue)

Checklist

  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • I have updated the changelog and added my name to the list of contributors

NumPy 2.x ships a bundled OpenBLAS (libscipy_openblas64_) that spawns
worker threads at import time and registers blas_thread_shutdown_ as a
pthread_atfork child handler.  When test files (or standalone scripts)
import torch at the top level before AppLauncher is instantiated, the
thread pool is already live when Kit's libomni.platforminfo.plugin calls
fork() during startup.  In the child process the atfork handler tries to
pthread_join the (now non-existent) worker threads, causing SIGSEGV.

Set OPENBLAS_NUM_THREADS=1 via os.environ.setdefault in two places:

- tools/conftest.py: injected into every test-subprocess environment so
  CI tests are protected regardless of their top-level imports.
- app_launcher.py: set at module scope (before isaacsim is imported) so
  standalone scripts that import torch before AppLauncher are also safe.

setdefault is used so that an explicit user or CI setting is never
overridden.

Verified on 8×L40 (torch 2.10, numpy 2.3.5, scipy 1.17.1):
- import torch loads libscipy_openblas64_-fdde5778.so (exact crash lib)
- Default: 64 OS threads spawned; OPENBLAS_NUM_THREADS=1: stays at 1
- test_ray_caster_patterns.py: 90/90 passed with the fix applied
@github-actions github-actions Bot added the bug (Something isn't working), isaac-lab (Related to Isaac Lab team), and infrastructure labels May 19, 2026
@greptile-apps
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR fixes an intermittent SIGSEGV crash during Kit startup by setting OPENBLAS_NUM_THREADS=1 (via setdefault) in two places: at module scope in app_launcher.py and in the subprocess environment dict inside conftest.py's run_individual_tests. The root cause is NumPy's bundled OpenBLAS spawning a worker-thread pool at import time and registering a pthread_atfork child handler that attempts to join those threads after Kit's fork() call — a safe no-op only when no workers were created.

  • app_launcher.py: env var is injected before isaacsim is imported, protecting standalone scripts where AppLauncher is the first import; setdefault preserves explicit user overrides.
  • tools/conftest.py: env var is added to the subprocess env dict, ensuring CI test processes inherit the constraint from process start, regardless of their top-level imports.
  • changelog.d/fix-openblas-fork-crash.rst: new changelog entry clearly documents the root cause, mechanism, and fix.

Confidence Score: 4/5

The change is a targeted, one-line env-var guard in two files with no logic branching; the crash being fixed is well understood and the mechanism is correct.

Both changes are minimal and mechanically sound. The setdefault idiom correctly preserves explicit overrides. The open question is whether the blanket single-thread BLAS limit is an acceptable trade-off for all workloads that touch app_launcher, and whether the coverage gap for direct pytest invocations in developer workflows matters enough to close before merging.

tools/conftest.py — the guard only fires when tests are launched through run_individual_tests; direct pytest invocations from a developer's shell are not covered. app_launcher.py — the module-scope OPENBLAS_NUM_THREADS=1 affects every user, not just those who trigger the fork-safety path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| source/isaaclab/isaaclab/app/app_launcher.py | Adds os.environ.setdefault("OPENBLAS_NUM_THREADS", "1") at module scope before the isaacsim import; fix is correctly ordered for the typical AppLauncher-first import pattern but caps BLAS to one thread globally for all users |
| tools/conftest.py | Adds env.setdefault("OPENBLAS_NUM_THREADS", "1") to the subprocess environment dict inside run_individual_tests; protects CI test subprocesses but does not cover direct pytest invocations from the command line |
| source/isaaclab/changelog.d/fix-openblas-fork-crash.rst | New changelog entry accurately documents the root cause and fix; no issues |

Sequence Diagram

sequenceDiagram
    participant Script as Standalone Script / Test
    participant ALpy as app_launcher.py (module scope)
    participant OB as OpenBLAS (via NumPy/torch)
    participant Kit as Kit / libomni.platforminfo
    participant Child as Forked Child Process

    Note over Script,Child: WITHOUT fix (crash path)
    Script->>OB: import torch → OpenBLAS spawns 64 worker threads
    Script->>ALpy: from isaaclab.app import AppLauncher
    ALpy->>Kit: SimulationApp.__init__
    Kit->>Child: fork()
    Child->>OB: blas_thread_shutdown_ (atfork handler)
    OB-->>Child: pthread_join(missing threads) → SIGSEGV 💥

    Note over Script,Child: WITH fix (safe path)
    ALpy->>ALpy: os.environ.setdefault("OPENBLAS_NUM_THREADS","1")
    Script->>OB: import torch → OpenBLAS starts with 0 worker threads
    ALpy->>Kit: SimulationApp.__init__
    Kit->>Child: fork()
    Child->>OB: blas_thread_shutdown_ (atfork handler)
    OB-->>Child: no threads to join → safe no-op ✅

Reviews (1): Last reviewed commit: "fix: prevent OpenBLAS atfork SIGSEGV dur..."

# thread count to 1 *before* the library is loaded avoids the crash because
# no worker threads are created and the atfork handler becomes a no-op.
# Uses setdefault so that an explicit user/CI setting is respected.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
Contributor

P2 Single-threaded BLAS applied globally to all users

OPENBLAS_NUM_THREADS=1 is set unconditionally at module scope, so every process that imports app_launcher — including users running batch physics computations, inverse-kinematics solves, or any heavy NumPy/SciPy workload — silently loses multi-threaded BLAS performance. The fork-safety hazard only materialises during startup when Kit calls fork(), so on hardware where the issue does not reproduce, users pay the single-thread tax with no benefit. A narrower alternative would be to reset the env var (or the pool) only when a fork is about to occur, e.g. via os.register_at_fork, but if the performance cost is accepted for Isaac Lab's GPU-first workloads this is fine as-is.
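
For reference, the os.register_at_fork mechanism mentioned above works as sketched below (Unix only; the hook names and the fork_and_report helper are hypothetical). One caveat from the Python documentation: these hooks may not run when fork() is called from third-party C code, which is the situation described in this PR, so whether this route would actually help here is an open question.

```python
import os

fork_events = []


def _note_before_fork():
    # Runs in the parent immediately before a Python-level os.fork().
    fork_events.append("before")


def _note_after_in_child():
    # Runs in the child immediately after a Python-level os.fork().
    fork_events.append("child")


os.register_at_fork(before=_note_before_fork, after_in_child=_note_after_in_child)


def fork_and_report():
    """Fork once; return the hook events seen by the parent and by the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: send its view of the events back through the pipe, then exit.
        os.close(read_fd)
        os.write(write_fd, ",".join(fork_events).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_events = reader.read().split(",")
    os.waitpid(pid, 0)
    return list(fork_events), child_events
```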

Comment thread tools/conftest.py
# pthread_join threads that no longer exist → SIGSEGV. Limiting
# OpenBLAS to a single thread before the subprocess starts avoids the
# crash because no worker threads are created and the handler is a no-op.
env.setdefault("OPENBLAS_NUM_THREADS", "1")
Contributor

P2 Coverage gap for directly-invoked test runs

The guard is injected only when tests are dispatched through run_individual_tests, which spawns each test file as a child process. Developers who run pytest tests/test_foo.py directly on their workstation (a common local workflow) do not get this protection — the parent process has no guarantee that OPENBLAS_NUM_THREADS is set before import torch at the top of a test module fires. The crash is an intermittent race, so this may go unnoticed for long stretches and then surface unexpectedly. Placing the same os.environ.setdefault("OPENBLAS_NUM_THREADS", "1") at the top of this conftest.py module (so pytest applies it before collecting test files) would close the gap for all invocation paths.
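
A module-scope guard of this kind only helps if it runs before the libraries that load OpenBLAS. The sketch below is an illustration, not code from the PR: set_openblas_guard and the _BLAS_IMPORTERS tuple are hypothetical names, and the list of modules is an assumption. It sets the default and warns when one of those modules is already imported.

```python
import os
import sys
import warnings

# Assumption: modules whose import may load the bundled OpenBLAS.
_BLAS_IMPORTERS = ("numpy", "scipy", "torch")


def set_openblas_guard(loaded_modules=None, env=None):
    """Set the default thread limit and report modules imported too early.

    Hypothetical helper; returns the names that were already imported.
    """
    loaded = sys.modules if loaded_modules is None else loaded_modules
    env = os.environ if env is None else env
    already_loaded = [name for name in _BLAS_IMPORTERS if name in loaded]
    # Respect an explicit user or CI value, as the PR does.
    env.setdefault("OPENBLAS_NUM_THREADS", "1")
    if already_loaded:
        warnings.warn(
            "OPENBLAS_NUM_THREADS set after importing: " + ", ".join(already_loaded)
        )
    return already_loaded
```

Calling set_openblas_guard() as the first statement of a conftest.py would apply the default before pytest collects test modules, and the warning flags runs where the ordering was already lost.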

Labels

bug (Something isn't working), infrastructure, isaac-lab (Related to Isaac Lab team)
