feat: Full Fabric acceleration stack — local poses, stage cache, fused compose by pv-nvidia · Pull Request #5728 · isaac-sim/IsaacLab

pv-nvidia · 2026-05-21T14:18:36Z

Summary

Consolidated PR bringing the full Fabric acceleration stack to FabricFrameView. This builds on the merged mGPU support (#5514) and adds:

Indexed Fabric transform kernels — Warp kernels for local↔world matrix propagation using IndexedFabricArray
Factory-based dispatch — Move Fabric/USD selection from internal FabricFrameView fallback to the FrameView factory (eliminates misleading abstraction)
Fused compose kernel — Single kernel launch for set_world_poses + set_scales instead of separate calls
FabricStageCache — Shared hierarchy handles across views for faster scene creation (~8× improvement)
Fabric-accelerated local poses — get_local_poses() / set_local_poses() operate directly on Fabric localMatrix, with automatic world→local recomposition
Device gate fix — Allow FabricFrameView on any cuda:<N> device (was incorrectly restricted to cuda:0 only)

Motivation

Piotr's wrist camera shimmering on multi-GPU (#5717 context): parent prim local transforms change on cuda:1 but camera world pose doesn't get recomposed → stale poses. The local→world composition pipeline in this PR fixes camera pose propagation on non-primary GPUs.

Performance

On 2x L40 (4 envs, wrist camera benchmark):

Scene creation: 12.2s → 1.5s (8× faster from FabricStageCache)
FPS parity at low env counts (camera-render-dominated); gains at 64+ envs
cuda:1 now gets full Fabric acceleration (was silently falling back to USD)

Testing

Existing test_views_xform_prim_fabric.py extended with local pose tests, topology rebuild tests, and multi-GPU roundtrip tests
New test_fabric_stage_cache.py for the stage cache service
New test_fabric_kernels.py for indexed transform kernels

Individual PRs (now superseded by this consolidated PR)

feat: add indexed fabric transform kernels for local/world matrix propagation #5675 pv/indexed-fabric-kernels
refactor: move Fabric/USD dispatch from FabricFrameView to FrameView factory #5673 pv/fabric-view-no-fallback
refactor: fuse set_world_poses/set_scales into single _compose_fabric_transform #5674 pv/fabric-fused-compose
feat: add FabricStageCache service for shared hierarchy handles #5676 pv/fabric-stage-cache
feat: Fabric-accelerated get/set_local_poses via indexedfabricarray #5677 pv/fabric-local-poses

…pagation Add Warp kernels that operate on wp.indexedfabricarray for direct local↔world matrix propagation without round-tripping through USD: - decompose_indexed_fabric_transforms: extract pos/quat/scale from ifa - compose_indexed_fabric_transforms: write pos/quat/scale into ifa - update_indexed_local_matrix_from_world: local = inv(parent) * world - update_indexed_world_matrix_from_local: world = parent * local Also refactor existing kernels to use wp.where (branchless) instead of if/else for broadcast index selection. These kernels are the foundation for Fabric-accelerated get/set_local_poses in FabricFrameView.

…factory FabricFrameView had an internal _use_fabric flag that fell back to UsdFrameView when Fabric was disabled or the device was unsupported. This violated single-responsibility: FabricFrameView pretended to be one class but sometimes behaved as another. Now the FrameView factory handles all dispatch: - PhysX + Fabric enabled + supported device → FabricFrameView - PhysX without Fabric (or unsupported device) → UsdFrameView - Newton → NewtonSiteFrameView FabricFrameView no longer checks _use_fabric or _fabric_supported_devices. It assumes Fabric is available (the factory guarantees this). UsdFrameView is eagerly registered on the factory since it lives in isaaclab (not a backend package), so FactoryBase's dynamic import (isaaclab_{backend}.sim.views) can't discover it.

Extract _compose_fabric_transform() to deduplicate the kernel-launch logic shared by set_world_poses and set_scales. The initial USD->Fabric sync now composes position, orientation, and scale in one call, so PrepareForReuse is invoked exactly once per logical update. Also replace assert with RuntimeError in _rebuild_fabric_arrays so the topology-change guard survives python -O.

Introduces FabricStageCache — a lightweight cache for the usdrt stage attachment and IFabricHierarchy handles, registered as a service on SimulationContext via the service locator (set_service/get_service). Multiple FabricFrameView instances now share a single hierarchy handle instead of each creating its own. The cache is automatically closed on SimulationContext.clear_instance(). Also replaces the assert device check with a proper RuntimeError and removes the now-unused isaaclab.sim import from fabric_frame_view.py.

- get_local_poses / set_local_poses via indexed fabric kernels - get_scales / set_scales with full Fabric acceleration - Per-view dirty tracking with automatic re-propagation - Topology-change recovery via _rebuild_fabric_arrays - 13 new integration tests (49 total pass)

The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.

isaaclab-review-bot

Code Review Summary

This is an impressive, well-architected PR that delivers significant performance improvements through Fabric GPU acceleration. The ~8× scene creation speedup and 150-260× faster pose operations (at 1024 prims) are substantial gains for large-scale simulations.

✅ Strengths

Clean architecture: The FabricStageCache service pattern elegantly shares hierarchy handles across views without coupling.
Correctness under non-trivial transforms: The transpose-convention handling in update_indexed_local_matrix_from_world and update_indexed_world_matrix_from_local correctly leverages (A·B)ᵀ = Bᵀ·Aᵀ to avoid explicit transposes on Fabric's column-transposed storage.
Robust dirty-flag isolation: Per-view _world_dirty tracking prevents multi-view race conditions that would occur with a per-stage flag.
Comprehensive test coverage: 13+ new integration tests cover rotated parents, scaled parents, multi-view isolation, and topology recovery scenarios.
Factory dispatch refinement: Moving Fabric/USD dispatch from FabricFrameView internals to FrameView._get_backend() eliminates the misleading abstraction.

🔍 Findings & Suggestions

1. Potential race condition in concurrent PrepareForReuse calls
📍 fabric_frame_view.py — _get_world_rw_array() / _get_local_rw_array() / _get_world_ro_array()

Each accessor independently calls PrepareForReuse() on its selection. In a scenario where set_world_poses() and set_local_poses() are called in rapid succession (e.g., from different threads or coroutines), there's a potential for:

_world_sel_rw.PrepareForReuse() returns True (topology change)
Rebuild starts but before completing, _local_sel_rw.PrepareForReuse() is called
Both rebuilds race on shared state

Suggestion: Consider adding a simple lock or consolidating rebuild logic to ensure atomicity during topology changes, especially if this class may be used concurrently.

2. Missing validation for _fabric_empty_2d_array_sentinel shape
📍 fabric_frame_view.py:126 & fabric.py kernels

The sentinel is wp.zeros((0, 0), ...) with shape (0, 0). The kernels check shape[0] > 0 which works, but the docstring states the inner dim "is never indexed" while the sentinel has inner dim 0. This works but is semantically confusing.

Suggestion: Consider wp.zeros((0, 3), ...) or wp.zeros((0, 4), ...) to match the expected shapes for positions/quaternions respectively, making intent clearer.

3. Device validation could be more robust
📍 frame_view.py:25-36 — _is_fabric_supported_device()

if device.startswith("cuda:"):
    try:
        int(device.split(":", 1)[1])
        return True
    except (ValueError, IndexError):
        pass

This accepts "cuda:999" even if only 2 GPUs exist. Consider validating against torch.cuda.device_count() or wp.cuda_device_count().

Severity: Low — USDRT/Warp will fail downstream with a clearer error, but early validation would improve UX.

4. Silent skip for invalid parent prims in _initialize_fabric()
📍 fabric_frame_view.py:521-528

rt_prim = self._stage.GetPrimAtPath(path)
if not rt_prim.IsValid():
    continue  # Silent skip

If a parent prim is invalid, initialization silently skips it, but later _compute_parent_fabric_indices() will raise a RuntimeError. The skip-then-fail pattern could be confusing during debugging.

Suggestion: Either raise early with context, or log a warning when skipping invalid prims.

5. Unused _FABRIC_SUPPORTED_DEVICES module variable
📍 frame_view.py:19

_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None  # None = compute dynamically

This variable is declared but never used — _is_fabric_supported_device() hardcodes the logic instead.

Suggestion: Remove the unused variable or implement the promised "compute dynamically" behavior.

6. Test test_clear_instance_closes_service may have timing assumptions
📍 test_fabric_stage_cache.py:77-83

The test assumes SimulationContext.clear_instance() synchronously calls close() on services. If this happens asynchronously or is order-dependent, the assertion could flake.

Severity: Low — likely fine given IsaacLab's synchronous teardown pattern.

📊 Summary

Category	Assessment
Correctness	✅ Excellent — transform math is correct, edge cases well-handled
Performance	✅ Excellent — 8× scene creation, 150-260× pose ops
Test Coverage	✅ Comprehensive — 13+ new tests including edge cases
Architecture	✅ Clean — service pattern, factory dispatch, per-view state
Documentation	✅ Good — changelog entries, docstrings, inline comments
Maintainability	⚠️ Good — some complexity in multi-selection management

Verdict

Approve — This is high-quality work that delivers substantial performance improvements with careful attention to correctness. The findings above are minor suggestions; none are blocking.

greptile-apps · 2026-05-21T14:33:45Z

Greptile Summary

This PR consolidates five individual PRs into a full Fabric acceleration stack for FabricFrameView, adding indexed transform kernels, a shared FabricStageCache, and Fabric-native get_local_poses/set_local_poses. The factory-level dispatch in FrameView is refactored so that Fabric vs. USD selection happens at creation time instead of inside the view.

Fabric-native local poses are implemented by writing/reading omni:fabric:localMatrix directly via IndexedFabricArray, with a per-view _world_dirty flag to trigger lazy world-from-local recomposition on the next world read.
FabricStageCache is added as a ServiceLocator-managed singleton that shares one IFabricHierarchy handle across all views on the same stage, reducing scene-creation time ~8×.
Fused compose kernel (compose_indexed_fabric_transforms) replaces the separate set_world_poses + set_scales kernel launches, and the assert prim-count guard is promoted to RuntimeError so it fires at every optimization level.

Confidence Score: 4/5

The core Fabric kernel math and the world ↔ local consistency protocol are both correct. The main risks are cosmetic: a dead import, a dead helper method, and an API inconsistency in get_scales. None of these affect correctness at runtime.

The transpose-convention proof in the new kernels is correct, the per-view dirty-flag isolation is properly tested with a multi-view scenario, and the FabricStageCache lifecycle is wired correctly through ServiceLocator. The three style-level findings (unused import, dead method, get_scales return type) are real cleanup work but do not change observable behavior. The unused _FABRIC_SUPPORTED_DEVICES variable in frame_view.py is misleading but inert.

fabric_frame_view.py warrants a second look to remove the unused SettingsManager import and the dead _get_parent_world_ro_array method before merging.

Important Files Changed

Filename	Overview
source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py	Major refactor: removes internal USD fallback, adds Fabric-native local poses, three persistent selections, indexed fabric arrays, and per-view dirty tracking. Contains an unused import (SettingsManager) and a dead helper method (_get_parent_world_ro_array) that are leftovers from the refactor.
source/isaaclab/isaaclab/utils/warp/fabric.py	Adds four new Warp kernels (decompose/compose indexed, update_indexed_local/world_from_world/local) and replaces if/else branching with wp.where. Transpose convention math is correct; kernel signatures are consistent.
source/isaaclab/isaaclab/sim/views/frame_view.py	Factory dispatch moved from FabricFrameView internal fallback to this factory. Adds _is_fabric_supported_device, but declares _FABRIC_SUPPORTED_DEVICES module-level variable that is never read. No functional bugs found.
source/isaaclab_physx/isaaclab_physx/sim/fabric_stage_cache.py	New lightweight singleton cache for usdrt Stage attachment and IFabricHierarchy handle. Lifecycle is correctly managed by ServiceLocator (close() called on SimulationContext teardown). No issues found.
source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py	Removes xfail override for set_world_updates_local (now fixed), adds tests for Fabric local poses, rotated-parent transpose convention, scaled-parent seed, and per-view dirty-flag isolation. Topology-change test no longer exercises real PrepareForReuse=True path.
source/isaaclab_physx/test/sim/test_fabric_stage_cache.py	New test suite for FabricStageCache; covers register/retrieve, hierarchy caching, close, and clear_instance lifecycle. Complete and correct.
source/isaaclab/test/utils/warp/test_fabric_kernels.py	New unit tests for decompose/compose math using plain wp.array (no Fabric runtime required). Covers identity, translation-only, scale, and round-trip cases. Solid coverage.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant FrameView as FrameView (factory)
    participant FFV as FabricFrameView
    participant Cache as FabricStageCache
    participant Fabric as Fabric (GPU)
    participant USD as UsdFrameView

    Caller->>FrameView: FrameView(path, device)
    FrameView->>FrameView: _get_backend() checks fabric_enabled + device
    FrameView-->>FFV: instantiate FabricFrameView

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _initialize_fabric() [lazy, first call only]
    FFV->>Cache: services[FabricStageCache] (get or create)
    Cache-->>FFV: stage + hierarchy handle
    FFV->>USD: get_world_poses() / get_local_poses() / get_scales()
    USD-->>FFV: USD-seeded pos/ori/scale
    FFV->>Fabric: compose_indexed_fabric_transforms (seed worldMatrix + localMatrix)
    FFV->>FFV: "_world_dirty = True"
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    Fabric-->>FFV: child world matrices updated
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: ProxyArray(positions), ProxyArray(orientations)

    Caller->>FFV: set_local_poses(translations, orientations)
    FFV->>Fabric: compose_indexed_fabric_transforms to localMatrix
    FFV->>FFV: "_world_dirty = True"

    Caller->>FFV: get_world_poses()
    FFV->>FFV: _sync_world_from_local_if_dirty()
    FFV->>Fabric: update_indexed_world_matrix_from_local
    FFV->>Fabric: decompose_indexed_fabric_transforms
    FFV-->>Caller: consistent world poses

_{Reviews (1): Last reviewed commit: "fix: allow FabricFrameView on any cuda:<..." | Re-trigger Greptile}

greptile-apps · 2026-05-21T14:33:49Z

+from pxr import Gf, Usd, UsdGeom

 import isaaclab.sim as sim_utils
 from isaaclab.app.settings_manager import SettingsManager


SettingsManager is imported at the module level but is no longer referenced anywhere in this file after the refactor moved the fabricEnabled check to FrameView._get_backend. The dead import adds a spurious load-time dependency.

Suggested change

from isaaclab.app.settings_manager import SettingsManager

# SettingsManager check moved to FrameView._get_backend; not needed here.

greptile-apps · 2026-05-21T14:33:50Z

+    def _get_parent_world_ro_array(self):
+        # Built and refreshed alongside the trans_ro selection (parents share that selection).
+        if self._parent_world_ifa_ro is None or self._trans_sel_ro.PrepareForReuse():
+            self._rebuild_trans_ro_arrays()
+        return self._parent_world_ifa_ro


_get_parent_world_ro_array is defined but never called anywhere in the class. Every internal consumer of _parent_world_ifa_ro accesses it directly (after calling _rebuild_trans_ro_arrays as needed). Because the method calls _trans_sel_ro.PrepareForReuse(), leaving it around risks a stale caller inadvertently triggering a topology rebuild out-of-sequence. If it is intended for future use, a # noqa comment or a TODO note would clarify intent; otherwise it should be removed.

greptile-apps · 2026-05-21T14:33:52Z

@@ -290,168 +365,421 @@ def get_scales(self, indices=None):
            scales_wp = wp.zeros((count, 3), dtype=wp.float32, device=self._device)

        wp.launch(
-            kernel=fabric_utils.decompose_fabric_transformation_matrix_to_warp_arrays,
+            kernel=fabric_utils.decompose_indexed_fabric_transforms,
            dim=count,
            inputs=[
-                self._fabric_world_matrices,
-                self._fabric_dummy_buffer,
-                self._fabric_dummy_buffer,
+                self._get_world_ro_array(),
+                self._fabric_empty_2d_array_sentinel,
+                self._fabric_empty_2d_array_sentinel,
                scales_wp,
                indices_wp,
-                self._view_to_fabric,
            ],
-            device=self._fabric_device,
+            device=self._device,
        )

        if use_cached:
            wp.synchronize()
        return scales_wp


get_scales is inconsistent with get_world_poses and get_local_poses in two ways. First, when indices is supplied (non-cached path), the method returns without calling wp.synchronize(), while still returning a raw wp.array — callers that immediately call .numpy() on it race against the running kernel. Second, the class docstring states "Pose getters return ProxyArray", but get_scales returns a bare wp.array. Wrapping the return in ProxyArray and adding a wp.synchronize() for the non-cached branch would align the API with the other getters.

greptile-apps · 2026-05-21T14:33:53Z

+# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).
+# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.
+_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None  # None = compute dynamically


The module-level variable _FABRIC_SUPPORTED_DEVICES is declared with a comment that implies it will be used as a lazy cache, but _is_fabric_supported_device never reads or writes it. The dead variable is misleading — it suggests a caching strategy that is not implemented.

Suggested change

# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).

# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.

_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically

# Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp).

# Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc.

pv-nvidia added 6 commits May 21, 2026 14:07

fix: allow FabricFrameView on any cuda:<N> device

2b03346

The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.

pv-nvidia requested review from Mayankm96, jtigue-bdai, kellyguo11 and pascal-roth as code owners May 21, 2026 14:18

github-actions Bot added documentation Improvements or additions to documentation isaac-lab Related to Isaac Lab team labels May 21, 2026

pv-nvidia marked this pull request as draft May 21, 2026 14:19

isaaclab-review-bot Bot reviewed May 21, 2026

View reviewed changes

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728

feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728
pv-nvidia wants to merge 6 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-full-stack

pv-nvidia commented May 21, 2026

Uh oh!

isaaclab-review-bot Bot left a comment

Uh oh!

greptile-apps Bot commented May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

greptile-apps Bot May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	from isaaclab.app.settings_manager import SettingsManager
	# SettingsManager check moved to FrameView._get_backend; not needed here.

Conversation

pv-nvidia commented May 21, 2026

Summary

Motivation

Performance

Testing

Individual PRs (now superseded by this consolidated PR)

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

Code Review Summary

✅ Strengths

🔍 Findings & Suggestions

📊 Summary

Verdict

Uh oh!

greptile-apps Bot commented May 21, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant