feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728
feat: Full Fabric acceleration stack — local poses, stage cache, fused compose#5728pv-nvidia wants to merge 6 commits into
Conversation
…pagation Add Warp kernels that operate on wp.indexedfabricarray for direct local↔world matrix propagation without round-tripping through USD: - decompose_indexed_fabric_transforms: extract pos/quat/scale from ifa - compose_indexed_fabric_transforms: write pos/quat/scale into ifa - update_indexed_local_matrix_from_world: local = inv(parent) * world - update_indexed_world_matrix_from_local: world = parent * local Also refactor existing kernels to use wp.where (branchless) instead of if/else for broadcast index selection. These kernels are the foundation for Fabric-accelerated get/set_local_poses in FabricFrameView.
…factory
FabricFrameView had an internal _use_fabric flag that fell back to
UsdFrameView when Fabric was disabled or the device was unsupported.
This violated single-responsibility: FabricFrameView pretended to be
one class but sometimes behaved as another.
Now the FrameView factory handles all dispatch:
- PhysX + Fabric enabled + supported device → FabricFrameView
- PhysX without Fabric (or unsupported device) → UsdFrameView
- Newton → NewtonSiteFrameView
FabricFrameView no longer checks _use_fabric or _fabric_supported_devices.
It assumes Fabric is available (the factory guarantees this).
UsdFrameView is eagerly registered on the factory since it lives in
isaaclab (not a backend package), so FactoryBase's dynamic import
(isaaclab_{backend}.sim.views) can't discover it.
Extract _compose_fabric_transform() to deduplicate the kernel-launch logic shared by set_world_poses and set_scales. The initial USD->Fabric sync now composes position, orientation, and scale in one call, so PrepareForReuse is invoked exactly once per logical update. Also replace assert with RuntimeError in _rebuild_fabric_arrays so the topology-change guard survives python -O.
Introduces FabricStageCache — a lightweight cache for the usdrt stage attachment and IFabricHierarchy handles, registered as a service on SimulationContext via the service locator (set_service/get_service). Multiple FabricFrameView instances now share a single hierarchy handle instead of each creating its own. The cache is automatically closed on SimulationContext.clear_instance(). Also replaces the assert device check with a proper RuntimeError and removes the now-unused isaaclab.sim import from fabric_frame_view.py.
- get_local_poses / set_local_poses via indexed fabric kernels - get_scales / set_scales with full Fabric acceleration - Per-view dirty tracking with automatic re-propagation - Topology-change recovery via _rebuild_fabric_arrays - 13 new integration tests (49 total pass)
The factory gate hardcoded cuda:0 as the only supported CUDA device. Multi-GPU setups need cuda:1, cuda:2, etc. Replace the static tuple with a function that accepts any valid cuda:<N> index.
There was a problem hiding this comment.
Code Review Summary
This is an impressive, well-architected PR that delivers significant performance improvements through Fabric GPU acceleration. The ~8× scene creation speedup and 150-260× faster pose operations (at 1024 prims) are substantial gains for large-scale simulations.
✅ Strengths
-
Clean architecture: The
FabricStageCacheservice pattern elegantly shares hierarchy handles across views without coupling. -
Correctness under non-trivial transforms: The transpose-convention handling in
update_indexed_local_matrix_from_worldandupdate_indexed_world_matrix_from_localcorrectly leverages(A·B)ᵀ = Bᵀ·Aᵀto avoid explicit transposes on Fabric's column-transposed storage. -
Robust dirty-flag isolation: Per-view
_world_dirtytracking prevents multi-view race conditions that would occur with a per-stage flag. -
Comprehensive test coverage: 13+ new integration tests cover rotated parents, scaled parents, multi-view isolation, and topology recovery scenarios.
-
Factory dispatch refinement: Moving Fabric/USD dispatch from
FabricFrameViewinternals toFrameView._get_backend()eliminates the misleading abstraction.
🔍 Findings & Suggestions
1. Potential race condition in concurrent PrepareForReuse calls
📍 fabric_frame_view.py — _get_world_rw_array() / _get_local_rw_array() / _get_world_ro_array()
Each accessor independently calls PrepareForReuse() on its selection. In a scenario where set_world_poses() and set_local_poses() are called in rapid succession (e.g., from different threads or coroutines), there's a potential for:
_world_sel_rw.PrepareForReuse()returnsTrue(topology change)- Rebuild starts but before completing,
_local_sel_rw.PrepareForReuse()is called - Both rebuilds race on shared state
Suggestion: Consider adding a simple lock or consolidating rebuild logic to ensure atomicity during topology changes, especially if this class may be used concurrently.
2. Missing validation for _fabric_empty_2d_array_sentinel shape
📍 fabric_frame_view.py:126 & fabric.py kernels
The sentinel is wp.zeros((0, 0), ...) with shape (0, 0). The kernels check shape[0] > 0 which works, but the docstring states the inner dim "is never indexed" while the sentinel has inner dim 0. This works but is semantically confusing.
Suggestion: Consider wp.zeros((0, 3), ...) or wp.zeros((0, 4), ...) to match the expected shapes for positions/quaternions respectively, making intent clearer.
3. Device validation could be more robust
📍 frame_view.py:25-36 — _is_fabric_supported_device()
if device.startswith("cuda:"):
try:
int(device.split(":", 1)[1])
return True
except (ValueError, IndexError):
passThis accepts "cuda:999" even if only 2 GPUs exist. Consider validating against torch.cuda.device_count() or wp.cuda_device_count().
Severity: Low — USDRT/Warp will fail downstream with a clearer error, but early validation would improve UX.
4. Silent skip for invalid parent prims in _initialize_fabric()
📍 fabric_frame_view.py:521-528
rt_prim = self._stage.GetPrimAtPath(path)
if not rt_prim.IsValid():
continue # Silent skipIf a parent prim is invalid, initialization silently skips it, but later _compute_parent_fabric_indices() will raise a RuntimeError. The skip-then-fail pattern could be confusing during debugging.
Suggestion: Either raise early with context, or log a warning when skipping invalid prims.
5. Unused _FABRIC_SUPPORTED_DEVICES module variable
📍 frame_view.py:19
_FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamicallyThis variable is declared but never used — _is_fabric_supported_device() hardcodes the logic instead.
Suggestion: Remove the unused variable or implement the promised "compute dynamically" behavior.
6. Test test_clear_instance_closes_service may have timing assumptions
📍 test_fabric_stage_cache.py:77-83
The test assumes SimulationContext.clear_instance() synchronously calls close() on services. If this happens asynchronously or is order-dependent, the assertion could flake.
Severity: Low — likely fine given IsaacLab's synchronous teardown pattern.
📊 Summary
| Category | Assessment |
|---|---|
| Correctness | ✅ Excellent — transform math is correct, edge cases well-handled |
| Performance | ✅ Excellent — 8× scene creation, 150-260× pose ops |
| Test Coverage | ✅ Comprehensive — 13+ new tests including edge cases |
| Architecture | ✅ Clean — service pattern, factory dispatch, per-view state |
| Documentation | ✅ Good — changelog entries, docstrings, inline comments |
| Maintainability |
Verdict
Approve — This is high-quality work that delivers substantial performance improvements with careful attention to correctness. The findings above are minor suggestions; none are blocking.
Greptile SummaryThis PR consolidates five individual PRs into a full Fabric acceleration stack for
Confidence Score: 4/5The core Fabric kernel math and the world ↔ local consistency protocol are both correct. The main risks are cosmetic: a dead import, a dead helper method, and an API inconsistency in get_scales. None of these affect correctness at runtime. The transpose-convention proof in the new kernels is correct, the per-view dirty-flag isolation is properly tested with a multi-view scenario, and the FabricStageCache lifecycle is wired correctly through ServiceLocator. The three style-level findings (unused import, dead method, get_scales return type) are real cleanup work but do not change observable behavior. The unused _FABRIC_SUPPORTED_DEVICES variable in frame_view.py is misleading but inert. fabric_frame_view.py warrants a second look to remove the unused SettingsManager import and the dead _get_parent_world_ro_array method before merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant FrameView as FrameView (factory)
participant FFV as FabricFrameView
participant Cache as FabricStageCache
participant Fabric as Fabric (GPU)
participant USD as UsdFrameView
Caller->>FrameView: FrameView(path, device)
FrameView->>FrameView: _get_backend() checks fabric_enabled + device
FrameView-->>FFV: instantiate FabricFrameView
Caller->>FFV: get_world_poses()
FFV->>FFV: _initialize_fabric() [lazy, first call only]
FFV->>Cache: services[FabricStageCache] (get or create)
Cache-->>FFV: stage + hierarchy handle
FFV->>USD: get_world_poses() / get_local_poses() / get_scales()
USD-->>FFV: USD-seeded pos/ori/scale
FFV->>Fabric: compose_indexed_fabric_transforms (seed worldMatrix + localMatrix)
FFV->>FFV: "_world_dirty = True"
FFV->>FFV: _sync_world_from_local_if_dirty()
FFV->>Fabric: update_indexed_world_matrix_from_local
Fabric-->>FFV: child world matrices updated
FFV->>Fabric: decompose_indexed_fabric_transforms
FFV-->>Caller: ProxyArray(positions), ProxyArray(orientations)
Caller->>FFV: set_local_poses(translations, orientations)
FFV->>Fabric: compose_indexed_fabric_transforms to localMatrix
FFV->>FFV: "_world_dirty = True"
Caller->>FFV: get_world_poses()
FFV->>FFV: _sync_world_from_local_if_dirty()
FFV->>Fabric: update_indexed_world_matrix_from_local
FFV->>Fabric: decompose_indexed_fabric_transforms
FFV-->>Caller: consistent world poses
Reviews (1): Last reviewed commit: "fix: allow FabricFrameView on any cuda:<..." | Re-trigger Greptile |
| from pxr import Gf, Usd, UsdGeom | ||
|
|
||
| import isaaclab.sim as sim_utils | ||
| from isaaclab.app.settings_manager import SettingsManager |
There was a problem hiding this comment.
SettingsManager is imported at the module level but is no longer referenced anywhere in this file after the refactor moved the fabricEnabled check to FrameView._get_backend. The dead import adds a spurious load-time dependency.
| from isaaclab.app.settings_manager import SettingsManager | |
| # SettingsManager check moved to FrameView._get_backend; not needed here. |
| def _get_parent_world_ro_array(self): | ||
| # Built and refreshed alongside the trans_ro selection (parents share that selection). | ||
| if self._parent_world_ifa_ro is None or self._trans_sel_ro.PrepareForReuse(): | ||
| self._rebuild_trans_ro_arrays() | ||
| return self._parent_world_ifa_ro |
There was a problem hiding this comment.
_get_parent_world_ro_array is defined but never called anywhere in the class. Every internal consumer of _parent_world_ifa_ro accesses it directly (after calling _rebuild_trans_ro_arrays as needed). Because the method calls _trans_sel_ro.PrepareForReuse(), leaving it around risks a stale caller inadvertently triggering a topology rebuild out-of-sequence. If it is intended for future use, a # noqa comment or a TODO note would clarify intent; otherwise it should be removed.
| @@ -290,168 +365,421 @@ def get_scales(self, indices=None): | |||
| scales_wp = wp.zeros((count, 3), dtype=wp.float32, device=self._device) | |||
|
|
|||
| wp.launch( | |||
| kernel=fabric_utils.decompose_fabric_transformation_matrix_to_warp_arrays, | |||
| kernel=fabric_utils.decompose_indexed_fabric_transforms, | |||
| dim=count, | |||
| inputs=[ | |||
| self._fabric_world_matrices, | |||
| self._fabric_dummy_buffer, | |||
| self._fabric_dummy_buffer, | |||
| self._get_world_ro_array(), | |||
| self._fabric_empty_2d_array_sentinel, | |||
| self._fabric_empty_2d_array_sentinel, | |||
| scales_wp, | |||
| indices_wp, | |||
| self._view_to_fabric, | |||
| ], | |||
| device=self._fabric_device, | |||
| device=self._device, | |||
| ) | |||
|
|
|||
| if use_cached: | |||
| wp.synchronize() | |||
| return scales_wp | |||
There was a problem hiding this comment.
get_scales is inconsistent with get_world_poses and get_local_poses in two ways. First, when indices is supplied (non-cached path), the method returns without calling wp.synchronize(), while still returning a raw wp.array — callers that immediately call .numpy() on it race against the running kernel. Second, the class docstring states "Pose getters return ProxyArray", but get_scales returns a bare wp.array. Wrapping the return in ProxyArray and adding a wp.synchronize() for the non-cached branch would align the API with the other getters.
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | ||
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. | ||
| _FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically |
There was a problem hiding this comment.
The module-level variable
_FABRIC_SUPPORTED_DEVICES is declared with a comment that implies it will be used as a lazy cache, but _is_fabric_supported_device never reads or writes it. The dead variable is misleading — it suggests a caching strategy that is not implemented.
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | |
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. | |
| _FABRIC_SUPPORTED_DEVICES: tuple[str, ...] | None = None # None = compute dynamically | |
| # Devices supported by Fabric GPU acceleration (USDRT SelectPrims + Warp). | |
| # Any CUDA device is supported — multi-GPU setups use cuda:1, cuda:2, etc. |
Summary
Consolidated PR bringing the full Fabric acceleration stack to
FabricFrameView. This builds on the merged mGPU support (#5514) and adds:IndexedFabricArrayFabricFrameViewfallback to theFrameViewfactory (eliminates misleading abstraction)set_world_poses+set_scalesinstead of separate callsget_local_poses()/set_local_poses()operate directly on FabriclocalMatrix, with automatic world→local recompositionFabricFrameViewon anycuda:<N>device (was incorrectly restricted tocuda:0only)Motivation
Piotr's wrist camera shimmering on multi-GPU (#5717 context): parent prim local transforms change on
cuda:1but camera world pose doesn't get recomposed → stale poses. The local→world composition pipeline in this PR fixes camera pose propagation on non-primary GPUs.Performance
On 2x L40 (4 envs, wrist camera benchmark):
cuda:1now gets full Fabric acceleration (was silently falling back to USD)Testing
test_views_xform_prim_fabric.pyextended with local pose tests, topology rebuild tests, and multi-GPU roundtrip teststest_fabric_stage_cache.pyfor the stage cache servicetest_fabric_kernels.pyfor indexed transform kernelsIndividual PRs (now superseded by this consolidated PR)
pv/indexed-fabric-kernelspv/fabric-view-no-fallbackpv/fabric-fused-composepv/fabric-stage-cachepv/fabric-local-poses