Skip to content

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728

Draft
pv-nvidia wants to merge 6 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-full-stack
Draft

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728
pv-nvidia wants to merge 6 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-full-stack

Conversation

@pv-nvidia
Copy link
Copy Markdown
Contributor

Summary

Consolidated PR bringing the full Fabric acceleration stack to FabricFrameView. This builds on the merged mGPU support (#5514) and adds:

  1. Indexed Fabric transform kernels — Warp kernels for local↔world matrix propagation using IndexedFabricArray
  2. Factory-based dispatch — Move Fabric/USD selection from internal FabricFrameView fallback to the FrameView factory (eliminates misleading abstraction)
  3. Fused compose kernel — Single kernel launch for set_world_poses + set_scales instead of separate calls
  4. FabricStageCache — Shared hierarchy handles across views for faster scene creation (~8× improvement)
  5. Fabric-accelerated local posesget_local_poses() / set_local_poses() operate directly on Fabric localMatrix, with automatic world→local recomposition
  6. Device gate fix — Allow FabricFrameView on any cuda:<N> device (was incorrectly restricted to cuda:0 only)

Motivation

Piotr's wrist camera shimmering on multi-GPU (#5717 context): parent prim local transforms change on cuda:1 but camera world pose doesn't get recomposed → stale poses. The local→world composition pipeline in this PR fixes camera pose propagation on non-primary GPUs.

Performance

On 2x L40 (4 envs, wrist camera benchmark):

  • Scene creation: 12.2s → 1.5s (8× faster from FabricStageCache)
  • FPS parity at low env counts (camera-render-dominated); gains at 64+ envs
  • cuda:1 now gets full Fabric acceleration (was silently falling back to USD)

Testing

  • Existing test_views_xform_prim_fabric.py extended with local pose tests, topology rebuild tests, and multi-GPU roundtrip tests
  • New test_fabric_stage_cache.py for the stage cache service
  • New test_fabric_kernels.py for indexed transform kernels

Individual PRs (now superseded by this consolidated PR)

pv-nvidia added 6 commits May 21, 2026 14:07
…pagation

Add Warp kernels that operate on wp.indexedfabricarray for direct
local↔world matrix propagation without round-tripping through USD:

- decompose_indexed_fabric_transforms: extract pos/quat/scale from ifa
- compose_indexed_fabric_transforms: write pos/quat/scale into ifa
- update_indexed_local_matrix_from_world: local = inv(parent) * world
- update_indexed_world_matrix_from_local: world = parent * local

Also refactor existing kernels to use wp.where (branchless) instead of
if/else for broadcast index selection.

These kernels are the foundation for Fabric-accelerated get/set_local_poses
in FabricFrameView.
…factory

FabricFrameView had an internal _use_fabric flag that fell back to
UsdFrameView when Fabric was disabled or the device was unsupported.
This violated single-responsibility: FabricFrameView pretended to be
one class but sometimes behaved as another.

Now the FrameView factory handles all dispatch:
- PhysX + Fabric enabled + supported device → FabricFrameView
- PhysX without Fabric (or unsupported device) → UsdFrameView
- Newton → NewtonSiteFrameView

FabricFrameView no longer checks _use_fabric or _fabric_supported_devices.
It assumes Fabric is available (the factory guarantees this).

UsdFrameView is eagerly registered on the factory since it lives in
isaaclab (not a backend package), so FactoryBase's dynamic import
(isaaclab_{backend}.sim.views) can't discover it.
Extract _compose_fabric_transform() to deduplicate the kernel-launch
logic shared by set_world_poses and set_scales. The initial USD->Fabric
sync now composes position, orientation, and scale in one call, so
PrepareForReuse is invoked exactly once per logical update.

Also replace assert with RuntimeError in _rebuild_fabric_arrays so the
topology-change guard survives python -O.
Introduces FabricStageCache — a lightweight cache for the usdrt stage
attachment and IFabricHierarchy handles, registered as a service on
SimulationContext via the service locator (set_service/get_service).

Multiple FabricFrameView instances now share a single hierarchy handle
instead of each creating its own. The cache is automatically closed on
SimulationContext.clear_instance().

Also replaces the assert device check with a proper RuntimeError and
removes the now-unused isaaclab.sim import from fabric_frame_view.py.
- get_local_poses / set_local_poses via indexed fabric kernels
- get_scales / set_scales with full Fabric acceleration
- Per-view dirty tracking with automatic re-propagation
- Topology-change recovery via _rebuild_fabric_arrays
- 13 new integration tests (49 total pass)
The factory gate hardcoded cuda:0 as the only supported CUDA device.
Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple
with a function that accepts any valid cuda:<N> index.
@github-actions github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team labels May 21, 2026
@pv-nvidia pv-nvidia marked this pull request as draft May 21, 2026 14:19
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This is an impressive, well-architected PR that delivers significant performance improvements through Fabric GPU acceleration. The ~8× scene creation speedup and 150-260× faster pose operations (at 1024 prims) are substantial gains for large-scale simulations.

✅ Strengths

  1. Clean architecture: The FabricStageCache service pattern elegantly shares hierarchy handles across views without coupling.

  2. Correctness under non-trivial transforms: The transpose-convention handling in update_indexed_local_matrix_from_world and update_indexed_world_matrix_from_local correctly leverages (A·B)ᵀ = Bᵀ·Aᵀ to avoid explicit transposes on Fabric's column-transposed storage.

  3. Robust dirty-flag isolation: Per-view _world_dirty tracking prevents multi-view race conditions that would occur with a per-stage flag.

  4. Comprehensive test coverage: 13+ new integration tests cover rotated parents, scaled parents, multi-view isolation, and topology recovery scenarios.

  5. Factory dispatch refinement: Moving Fabric/USD dispatch from FabricFrameView internals to FrameView._get_backend() eliminates the misleading abstraction.

🔍 Findings & Suggestions


1. Potential race condition in concurrent PrepareForReuse calls
📍 fabric_frame_view.py_get_world_rw_array() / _get_local_rw_array() / _get_world_ro_array()

Each accessor independently calls PrepareForReuse() on its selection. In a scenario where set_world_poses() and set_local_poses() are called in rapid succession (e.g., from different threads or coroutines), there's a potential for:

  • _world_sel_rw.PrepareForReuse() returns True (topology change)
  • Rebuild starts but before completing, _local_sel_rw.PrepareForReuse() is called
  • Both rebuilds race on shared state

Suggestion: Consider adding a simple lock or consolidating rebuild logic to ensure atomicity during topology changes, especially if this class may be used concurrently.


2. Missing validation for _fabric_empty_2d_array_sentinel shape
📍 fabric_frame_view.py:126 & fabric.py kernels

The sentinel is wp.zeros((0, 0), ...) with shape (0, 0). The kernels check shape[0] > 0 which works, but the docstring states the inner dim "is never indexed" while the sentinel has inner dim 0. This works but is semantically confusing.

Suggestion: Consider wp.zeros((0, 3), ...) or wp.zeros((0, 4), ...) to match the expected shapes for positions/quaternions respectively, making intent clearer.


3. Device validation could be more robust
📍 frame_view.py:25-36_is_fabric_supported_device()

if device.startswith("cuda:"):
    try:
        int(device.split(":", 1)[1])
        return True
    except (ValueError, IndexError):
        pass

This accepts "cuda:999" even if only 2 GPUs exist. Consider validating against torch.cuda.device_count() or wp.cuda_device_count().

Severity: Low — USDRT/Warp will fail downstream with a clearer error, but early validation would improve UX.


4. Silent skip for invalid parent prims in _initialize_fabric()
📍 fabric_frame_view.py:521-528

rt_prim = self._stage.GetPrimAtPath(path)
if not rt_prim.IsValid():
    continue  # Silent skip

If a parent prim is invalid, initialization silently skips it, but later _compute_parent_fabric_indices() will raise a RuntimeError. The skip-then-fail pattern could be confusing during debugging.

Suggestion: Either raise early with context, or log a warning when skipping invalid prims.


5. Unused _FABRIC_SUPPORTED_DEVICES module variable
📍 frame_view.py:19

_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None  # None = compute dynamically

This variable is declared but never used — _is_fabric_supported_device() hardcodes the logic instead.

Suggestion: Remove the unused variable or implement the promised "compute dynamically" behavior.


6. Test test_clear_instance_closes_service may have timing assumptions
📍 test_fabric_stage_cache.py:77-83

The test assumes SimulationContext.clear_instance() synchronously calls close() on services. If this happens asynchronously or is order-dependent, the assertion could flake.

Severity: Low — likely fine given IsaacLab's synchronous teardown pattern.


📊 Summary

Category Assessment
Correctness ✅ Excellent — transform math is correct, edge cases well-handled
Performance ✅ Excellent — 8× scene creation, 150-260× pose ops
Test Coverage ✅ Comprehensive — 13+ new tests including edge cases
Architecture ✅ Clean — service pattern, factory dispatch, per-view state
Documentation ✅ Good — changelog entries, docstrings, inline comments
Maintainability ⚠️ Good — some complexity in multi-selection management

Verdict

Approve — This is high-quality work that delivers substantial performance improvements with careful attention to correctness. The findings above are minor suggestions; none are blocking.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 21, 2026

Greptile Summary

This PR consolidates five individual PRs into a full Fabric acceleration stack for FabricFrameView, adding indexed transform kernels, a shared FabricStageCache, and Fabric-native get_local_poses/set_local_poses. The factory-level dispatch in FrameView is refactored so that Fabric vs. USD selection happens at creation time instead of inside the view.

  • Fabric-native local poses are implemented by writing/reading omni:fabric:localMatrix directly via IndexedFabricArray, with a per-view _world_dirty flag to trigger lazy world-from-local recomposition on the next world read.
  • FabricStageCache is added as a ServiceLocator-managed singleton that shares one IFabricHierarchy handle across all views on the same stage, reducing scene-creation time ~8×.
  • Fused compose kernel (compose_indexed_fabric_transforms) replaces the separate set_world_poses + set_scales kernel launches, and the assert prim-count guard is promoted to RuntimeError so it fires at every optimization level.

Confidence Score: 4/5

The core Fabric kernel math and the world ↔ local consistency protocol are both correct. The main risks are cosmetic: a dead import, a dead helper method, and an API inconsistency in get_scales. None of these affect correctness at runtime.

The transpose-convention proof in the new kernels is correct, the per-view dirty-flag isolation is properly tested with a multi-view scenario, and the FabricStageCache lifecycle is wired correctly through ServiceLocator. The three style-level findings (unused import, dead method, get_scales return type) are real cleanup work but do not change observable behavior. The unused _FABRIC_SUPPORTED_DEVICES variable in frame_view.py is misleading but inert.

fabric_frame_view.py warrants a second look to remove the unused SettingsManager import and the dead _get_parent_world_ro_array method before merging.

Important Files Changed

Filename Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py Major refactor: removes internal USD fallback, adds Fabric-native local poses, three persistent selections, indexed fabric arrays, and per-view dirty tracking. Contains an unused import (SettingsManager) and a dead helper method (_get_parent_world_ro_array) that are leftovers from the refactor.
source/isaaclab/isaaclab/utils/warp/fabric.py Adds four new Warp kernels (decompose/compose indexed, update_indexed_local/world_from_world/local) and replaces if/else branching with wp.where. Transpose convention math is correct; kernel signatures are consistent.
source/isaaclab/isaaclab/sim/views/frame_view.py Factory dispatch moved from FabricFrameView internal fallback to this factory. Adds _is_fabric_supported_device, but declares _FABRIC_SUPPORTED_DEVICES module-level variable that is never read. No functional bugs found.
source/isaaclab_physx/isaaclab_physx/sim/fabric_stage_cache.py New lightweight singleton cache for usdrt Stage attachment and IFabricHierarchy handle. Lifecycle is correctly managed by ServiceLocator (close() called on SimulationContext teardown). No issues found.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py Removes xfail override for set_world_updates_local (now fixed), adds tests for Fabric local poses, rotated-parent transpose convention, scaled-parent seed, and per-view dirty-flag isolation. Topology-change test no longer exercises real PrepareForReuse=True path.
source/isaaclab_physx/test/sim/test_fabric_stage_cache.py New test suite for FabricStageCache; covers register/retrieve, hierarchy caching, close, and clear_instance lifecycle. Complete and correct.
source/isaaclab/test/utils/warp/test_fabric_kernels.py New unit tests for decompose/compose math using plain wp.array (no Fabric runtime required). Covers identity, translation-only, scale, and round-trip cases. Solid coverage.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FrameView as FrameView (factory)
    participant FFV as FabricFrameView
    participant Cache as FabricStageCache
    participant Fabric as Fabric (GPU)
    participant USD as UsdFrameView

    Caller->>FrameView: FrameView(path, device)
    FrameView->>FrameView: _get_backend() checks fabric_enabled + device
    FrameView-->>FFV: instantiate FabricFrameView

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _initialize_fabric() [lazy, first call only]
    FFV->>Cache: services[FabricStageCache] (get or create)
    Cache-->>FFV: stage + hierarchy handle
    FFV->>USD: get_world_poses() / get_local_poses() / get_scales()
    USD-->>FFV: USD-seeded pos/ori/scale
    FFV->>Fabric: compose_indexed_fabric_transforms (seed worldMatrix + localMatrix)
    FFV->>FFV: "_world_dirty = True"
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    Fabric-->>FFV: child world matrices updated
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: ProxyArray(positions), ProxyArray(orientations)

    Caller->>FFV: set_local_poses(translations, orientations)
    FFV->>Fabric: compose_indexed_fabric_transforms to localMatrix
    FFV->>FFV: "_world_dirty = True"

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: consistent world poses
Loading

Reviews (1): Last reviewed commit: "fix: allow FabricFrameView on any cuda:<..." | Re-trigger Greptile

from pxr import Gf, Usd, UsdGeom

import isaaclab.sim as sim_utils
from isaaclab.app.settings_manager import SettingsManager
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 SettingsManager is imported at the module level but is no longer referenced anywhere in this file after the refactor moved the fabricEnabled check to FrameView._get_backend. The dead import adds a spurious load-time dependency.

Suggested change
from isaaclab.app.settings_manager import SettingsManager
# SettingsManager check moved to FrameView._get_backend; not needed here.

Comment on lines +474 to +478
def _get_parent_world_ro_array(self):
# Built and refreshed alongside the trans_ro selection (parents share that selection).
if self._parent_world_ifa_ro is None or self._trans_sel_ro.PrepareForReuse():
self._rebuild_trans_ro_arrays()
return self._parent_world_ifa_ro
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _get_parent_world_ro_array is defined but never called anywhere in the class. Every internal consumer of _parent_world_ifa_ro accesses it directly (after calling _rebuild_trans_ro_arrays as needed). Because the method calls _trans_sel_ro.PrepareForReuse(), leaving it around risks a stale caller inadvertently triggering a topology rebuild out-of-sequence. If it is intended for future use, a # noqa comment or a TODO note would clarify intent; otherwise it should be removed.

Comment on lines 351 to 382
@@ -290,168 +365,421 @@ def get_scales(self, indices=None):
scales_wp = wp.zeros((count, 3), dtype=wp.float32, device=self._device)

wp.launch(
kernel=fabric_utils.decompose_fabric_transformation_matrix_to_warp_arrays,
kernel=fabric_utils.decompose_indexed_fabric_transforms,
dim=count,
inputs=[
self._fabric_world_matrices,
self._fabric_dummy_buffer,
self._fabric_dummy_buffer,
self._get_world_ro_array(),
self._fabric_empty_2d_array_sentinel,
self._fabric_empty_2d_array_sentinel,
scales_wp,
indices_wp,
self._view_to_fabric,
],
device=self._fabric_device,
device=self._device,
)

if use_cached:
wp.synchronize()
return scales_wp
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 get_scales is inconsistent with get_world_poses and get_local_poses in two ways. First, when indices is supplied (non-cached path), the method returns without calling wp.synchronize(), while still returning a raw wp.array — callers that immediately call .numpy() on it race against the running kernel. Second, the class docstring states "Pose getters return ProxyArray", but get_scales returns a bare wp.array. Wrapping the return in ProxyArray and adding a wp.synchronize() for the non-cached branch would align the API with the other getters.

Comment on lines +26 to +28
# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).
# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.
_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The module-level variable _FABRIC_SUPPORTED_DEVICES is declared with a comment that implies it will be used as a lazy cache, but _is_fabric_supported_device never reads or writes it. The dead variable is misleading — it suggests a caching strategy that is not implemented.

Suggested change
# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).
# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.
_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically
# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).
# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant