fix(gpu): add Tegra/Jetson GPU support#625

Open
elezar wants to merge 10 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
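The fallback described in this commit can be sketched as follows (a sketch with a hypothetical helper name, not the test's actual code):

```rust
use std::path::Path;

/// Hypothetical sketch of the nvidia-smi lookup fallback: prefer the
/// bare command if it resolves in any of the given PATH directories,
/// otherwise fall back to the Jetson/Tegra install location
/// /usr/sbin/nvidia-smi, which is not on the default PATH.
fn resolve_nvidia_smi(path_dirs: &[&str]) -> String {
    let on_path = path_dirs
        .iter()
        .any(|dir| Path::new(dir).join("nvidia-smi").is_file());
    if on_path {
        "nvidia-smi".to_string()
    } else {
        "/usr/sbin/nvidia-smi".to_string()
    }
}

fn main() {
    // Which path wins depends on the host; no output is assumed here.
    println!("{}", resolve_nvidia_smi(&["/usr/bin", "/usr/local/bin"]));
}
```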
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to get this in too.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it, and I tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.
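The ownership-based alternative could be sketched like this (hypothetical helper; the real prepare_filesystem() differs, and the path list is an assumption):

```rust
use std::path::Path;

/// Hypothetical sketch of the #710 approach: before dropping
/// privileges, chown each GPU device path listed in
/// filesystem.read_write to the sandbox user, so DAC access comes
/// from ownership rather than from preserved supplemental groups.
/// Paths that don't exist or can't be chowned are skipped; returns
/// the paths that were successfully chowned.
fn chown_gpu_paths(paths: &[&str], uid: u32, gid: u32) -> Vec<String> {
    let mut chowned = Vec::new();
    for path in paths {
        if Path::new(path).exists() {
            // std::os::unix::fs::chown is stable since Rust 1.73.
            if std::os::unix::fs::chown(path, Some(uid), Some(gid)).is_ok() {
                chowned.push(path.to_string());
            }
        }
    }
    chowned
}

fn main() {
    // As non-root this typically fails (EPERM) and returns an empty
    // list; as root it would chown the device nodes in place.
    let owned = chown_gpu_paths(&["/dev/nvidiactl", "/dev/nvmap"], 1000, 1000);
    println!("chowned: {owned:?}");
}
```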

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will add it here and test again. I'm getting a lease on colossus for a Jetson-based system.

Some policy updates will very likely be required as well: #677 is now merged, and before that, landlock policies were not correctly applied in many contexts.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 2, 2026
Comment on lines +605 to +610
const HOST_FILES_DIR: &str = "/etc/nvidia-container-runtime/host-files-for-container.d";
if std::path::Path::new(HOST_FILES_DIR).is_dir() {
    let mut binds = host_config.binds.take().unwrap_or_default();
    binds.push(format!("{HOST_FILES_DIR}:{HOST_FILES_DIR}:ro"));
    host_config.binds = Some(binds);
}
Collaborator


For context, without this mount the failure is:

CDI --device-list-strategy options are only supported on NVML-based systems

The device plugin can't detect the GPU via NVML (Tegra uses a different driver model), so it refuses to start in CDI mode.

The bind mount is needed. Without it, the NVIDIA toolkit inside the gateway can't recognize this as a Tegra platform with GPU capabilities, CDI spec generation fails, and the device plugin crashes.

- Bump nvidia-container-toolkit from 1.18.2 to 1.19.0 to support the
  -host-cuda-version flag used by newer CDI spec generation.
- Replace local filesystem check for host-files-for-container.d with
  Docker API kernel version detection (contains "tegra"). This fixes
  remote SSH deploys where the CLI machine may not have the directory.
- Only perform the Tegra check when GPU devices are requested.
@pimlock
Collaborator

pimlock commented Apr 7, 2026

Testing on Jetson Thor (NVIDIA Thor GPU, driver 580.00, CUDA 13.0)

Validated the following on a physical Jetson Thor device:

Container Toolkit bump (1.18.2 → 1.19.0)

  • Required. The custom device plugin (k8s-device-plugin PR #1675) generates CDI specs with the --host-cuda-version flag. Toolkit 1.18.2's nvidia-cdi-hook doesn't recognize this flag, causing RunContainerError on GPU sandbox pods. 1.19.0 supports it.

host-files-for-container.d bind mount

  • Required. Without it, the device plugin cannot discover Tegra GPU devices and fails with "CDI options are only supported on NVML-based systems". The CSVs (devices.csv, drivers.csv) are inputs to CDI spec generation — they tell the toolkit which Tegra-specific device nodes and host libraries to inject.
  • Fixed the detection: replaced local Path::is_dir() check with docker.info() kernel version detection (contains("tegra")). The previous approach broke remote SSH deploys (CLI machine doesn't have the directory). Now gated on !device_ids.is_empty() so it's only checked for GPU gateways.

Custom device plugin (k8s-device-plugin PR #1675)

  • Required. Stock device plugin 0.18.2 can detect the Tegra platform and register with kubelet, but generates a nearly empty CDI spec (no device nodes, no library mounts). The custom build with driver-root-aware CSV resolution is needed for functional GPU access.
  • Tested: stock plugin → torch.cuda.is_available() returns False (no /dev/nvidia* in sandbox). Custom plugin → full PyTorch CUDA 13.0 matrix multiply succeeds on Thor.

GID merge in process.rs

  • Not needed today but harmless. CDI spec is v0.5.0 (no additionalGids support) and all device nodes are 0666. The code would become relevant with CDI v0.7.0+ and toolkit 1.19.1+ (nvidia-container-toolkit PR #1745).

Other findings

  • br_netfilter kernel module must be loaded on Tegra for k3s DNS/service networking. Without it, pods can't reach CoreDNS via ClusterIP. Good candidate for a pre-flight check.
  • Device plugin warnings about missing V4L2/GStreamer/legacy-tegra files are harmless — they're display/video codec libraries, not needed for CUDA compute.
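The br_netfilter pre-flight check mentioned above could look like this (a hypothetical sketch, not part of this PR):

```shell
# Pre-flight check sketch (hypothetical): k3s DNS/service networking
# on Tegra needs br_netfilter; without it, pods cannot reach CoreDNS
# via its ClusterIP.
if grep -q '^br_netfilter ' /proc/modules 2>/dev/null; then
    echo "br_netfilter: loaded"
else
    echo "br_netfilter: missing; load it with: sudo modprobe br_netfilter"
    # Persist across reboots via systemd's modules-load.d mechanism:
    echo "persist with: echo br_netfilter | sudo tee /etc/modules-load.d/k3s.conf"
fi
```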

…ID preservation

- Log when Tegra platform is detected and host-files bind mount is added,
  including the kernel version from the Docker daemon.
- Extract CDI GID snapshot logic into `snapshot_cdi_gids()` function that
  only activates when GPU devices are present (/dev/nvidiactl exists).
- Log preserved CDI-injected GIDs when they are restored after initgroups.
- Fix cargo fmt formatting issue in docker.rs.
Comment on lines +616 to +620
const HOST_FILES_DIR: &str = "/etc/nvidia-container-runtime/host-files-for-container.d";
tracing::info!(
    kernel_version = info.kernel_version.as_deref().unwrap_or("unknown"),
    "Detected Tegra platform, bind-mounting {HOST_FILES_DIR} for CDI spec generation"
);
Collaborator


@elezar some context on this change: the previous version, which checked whether that file exists, wouldn't work for remote deployments, as the file would be checked against the local filesystem and not the remote one.

By default, the gateway starts on the local system, but it can be deployed remotely with a command like this:

openshell gateways start --gpu --remote pmlocek@remote-host

In this case, the CLI is installed on my Mac and the gateway is deployed over SSH on the remote-host.

This approach works for both local and remote deployments. It is based on the kernel version containing "tegra", since at this point we don't have access to the filesystem on the remote host. We could add that, but it would be a bigger change.
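The detection described above can be sketched as a small predicate (hypothetical function name):

```rust
/// Hypothetical sketch of the Tegra check: inspect the kernel version
/// reported by the Docker daemon (docker info), which works for both
/// local and remote (SSH) deployments, unlike a local filesystem check.
fn is_tegra_host(kernel_version: Option<&str>) -> bool {
    kernel_version
        .map(|v| v.to_ascii_lowercase().contains("tegra"))
        .unwrap_or(false)
}

fn main() {
    assert!(is_tegra_host(Some("5.15.148-tegra")));
    assert!(!is_tegra_host(Some("6.8.0-generic")));
    assert!(!is_tegra_host(None));
    println!("ok");
}
```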

Collaborator


I was also wondering whether it would make sense to include the /etc/nvidia-container-runtime/host-files-for-container.d path in the CDI spec generated on that host. Could the container toolkit running on Tegra (or other systems that need these CSVs) add that mount to the CDI spec itself? Then this manual mount wouldn't be necessary.

