fix(gpu): add Tegra/Jetson GPU support#625

Open
elezar wants to merge 10 commits into main from fix/tegra-gpu-support

Conversation

@elezar
Member

@elezar elezar commented Mar 26, 2026

Summary

Adds GPU support for NVIDIA Tegra/Jetson platforms by bind-mounting the
host-files configuration directory, updating the device plugin image, and
preserving CDI-injected GIDs across privilege drop.

Related Issue

Part of #398 (CDI injection). Depends on #568 (Tegra system support). Should be merged after #495 and #503.

Upstream PRs:

Changes

  • Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d (read-only) into the gateway container when present, so the nvidia runtime inside k3s applies the same host-file injection config as the host — required for Jetson/Tegra CDI spec generation
  • Pin k8s-device-plugin to an image that supports host-files bind-mounts and generates additionalGids in the CDI spec (GID 44 / video, required for /dev/nvmap access on Tegra)
  • Preserve CDI-injected supplemental GIDs across initgroups() during privilege drop, so exec'd processes retain access to GPU devices
  • Fall back to /usr/sbin/nvidia-smi in the GPU e2e test for Tegra systems where nvidia-smi is not on the default PATH

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits March 26, 2026 14:44
Bind-mount /etc/nvidia-container-runtime/host-files-for-container.d
(read-only) into the gateway container when it exists, so the nvidia
runtime running inside k3s can apply the same host-file injection
config as on the host — required for Jetson/Tegra platforms.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Use ghcr.io/nvidia/k8s-device-plugin:2ab68c16 which includes support for
mounting /etc/nvidia-container-runtime/host-files-for-container.d into the
device plugin pod, required for correct CDI spec generation on Tegra-based
systems.

Also included is an nvcdi API bump that ensures that additional GIDs are
included in the generated CDI spec.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
initgroups(3) replaces all supplemental groups with the user's entries
from /etc/group, discarding GIDs injected by the container runtime via
CDI (e.g. GID 44/video needed for /dev/nvmap on Tegra). Snapshot the
container-level GIDs before initgroups runs and merge them back
afterwards, excluding GID 0 (root) to avoid privilege retention.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
On Jetson/Tegra platforms nvidia-smi is installed at /usr/sbin/nvidia-smi
rather than /usr/bin/nvidia-smi and may not be on PATH inside the sandbox.
Fall back to the full path when the bare command is not found.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
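The fallback described in this commit can be sketched as follows (a sketch with a hypothetical helper name, not the test's actual code):

```rust
use std::path::Path;

/// Hypothetical sketch of the nvidia-smi lookup fallback: prefer the
/// bare command if it resolves in any of the given PATH directories,
/// otherwise fall back to the Jetson/Tegra install location
/// /usr/sbin/nvidia-smi, which is not on the default PATH.
fn resolve_nvidia_smi(path_dirs: &[&str]) -> String {
    let on_path = path_dirs
        .iter()
        .any(|dir| Path::new(dir).join("nvidia-smi").is_file());
    if on_path {
        "nvidia-smi".to_string()
    } else {
        "/usr/sbin/nvidia-smi".to_string()
    }
}

fn main() {
    // Which path wins depends on the host; no output is assumed here.
    println!("{}", resolve_nvidia_smi(&["/usr/bin", "/usr/local/bin"]));
}
```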
@elezar elezar self-assigned this Mar 26, 2026
@elezar
Member Author

elezar commented Mar 26, 2026

cc @johnnynunez

@johnnynunez

johnnynunez commented Mar 26, 2026

LGTM @elezar
ready to merge @johntmyers

@elezar
Member Author

elezar commented Mar 27, 2026

This was only tested in conjunction with #495 and #503. Once those are in, there should be no reason not to get this in too.

@elezar elezar marked this pull request as ready for review March 27, 2026 07:16
@elezar elezar requested a review from a team as a code owner March 27, 2026 07:16
@johnnynunez

Yes, I know. I was tracking it, and I tested it.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

I dug into the GID-preservation change here and I think PR #710 may make it unnecessary.

What I verified locally:

  • Inside a running sandbox, the GPU device nodes are owned by sandbox:sandbox after supervisor setup.
  • The corresponding host and k3s-container device nodes remain root:root 666, so the sandbox-side chown() does not appear to mutate the host devices.
  • That suggests these are container-local CDI-created device nodes, not direct host bind mounts.

If that holds generally, then once #710 adds the needed GPU device paths to filesystem.read_write, prepare_filesystem() will chown(path, uid, gid) before privilege drop and DAC access should come from ownership rather than from preserving CDI-injected supplemental groups.

So I think we should re-check whether the drop_privileges() GID merge is still needed after #710 lands. It may be removable if all required GPU paths (including Tegra-specific ones like /dev/nvmap if applicable) are present and successfully chowned.
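The ownership-based alternative could be sketched like this (hypothetical helper; the real prepare_filesystem() differs, and the path list is an assumption):

```rust
use std::path::Path;

/// Hypothetical sketch of the #710 approach: before dropping
/// privileges, chown each GPU device path listed in
/// filesystem.read_write to the sandbox user, so DAC access comes
/// from ownership rather than from preserved supplemental groups.
/// Paths that don't exist or can't be chowned are skipped; returns
/// the paths that were successfully chowned.
fn chown_gpu_paths(paths: &[&str], uid: u32, gid: u32) -> Vec<String> {
    let mut chowned = Vec::new();
    for path in paths {
        if Path::new(path).exists() {
            // std::os::unix::fs::chown is stable since Rust 1.73.
            if std::os::unix::fs::chown(path, Some(uid), Some(gid)).is_ok() {
                chowned.push(path.to_string());
            }
        }
    }
    chowned
}

fn main() {
    // As non-root this typically fails (EPERM) and returns an empty
    // list; as root it would chown the device nodes in place.
    let owned = chown_gpu_paths(&["/dev/nvidiactl", "/dev/nvmap"], 1000, 1000);
    println!("chowned: {owned:?}");
}
```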

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Follow-up: I removed the checked-in custom ghcr.io/nvidia/k8s-device-plugin:2ab68c16 image override from this branch.

If someone still needs that image on a live gateway for testing, they can patch the running cluster in place:

openshell doctor exec -- kubectl -n kube-system patch helmchart nvidia-device-plugin --type merge -p '{
  "spec": {
    "valuesContent": "image:\n  repository: ghcr.io/nvidia/k8s-device-plugin\n  tag: \"2ab68c16\"\nruntimeClassName: nvidia\ndeviceListStrategy: cdi-cri\ndeviceIDStrategy: index\ncdi:\n  nvidiaHookPath: /usr/bin/nvidia-cdi-hook\nnvidiaDriverRoot: \"/\"\ngfd:\n  enabled: false\nnfd:\n  enabled: false\naffinity: null\n"
  }
}'
openshell doctor exec -- kubectl -n nvidia-device-plugin rollout status ds/nvidia-device-plugin
openshell doctor exec -- kubectl -n nvidia-device-plugin get ds nvidia-device-plugin -o jsonpath='{.spec.template.spec.containers[0].image}{"\\n"}'

That only affects the running gateway. Recreating the gateway reapplies the checked-in manifest.

@pimlock
Collaborator

pimlock commented Apr 1, 2026

Once #710 is reviewed and merged, I will add it here and test again. I'm getting a lease on colossus for a Jetson-based system.

Some policy updates will very likely be required as well: #677 is now merged, and before that, landlock policies were not correctly applied in many contexts.

@pimlock pimlock added the test:e2e Requires end-to-end coverage label Apr 2, 2026
Comment on lines +605 to +610
const HOST_FILES_DIR: &str = "/etc/nvidia-container-runtime/host-files-for-container.d";
if std::path::Path::new(HOST_FILES_DIR).is_dir() {
    let mut binds = host_config.binds.take().unwrap_or_default();
    binds.push(format!("{HOST_FILES_DIR}:{HOST_FILES_DIR}:ro"));
    host_config.binds = Some(binds);
}
Collaborator


For context, without this mount the failure is:

CDI --device-list-strategy options are only supported on NVML-based systems

The device plugin can't detect the GPU via NVML (Tegra uses a different driver model), so it refuses to start in CDI mode.

The bind mount is needed. Without it, the NVIDIA toolkit inside the gateway can't recognize this as a Tegra platform with GPU capabilities, CDI spec generation fails, and the device plugin crashes.

- Bump nvidia-container-toolkit from 1.18.2 to 1.19.0 to support the
  -host-cuda-version flag used by newer CDI spec generation.
- Replace local filesystem check for host-files-for-container.d with
  Docker API kernel version detection (contains "tegra"). This fixes
  remote SSH deploys where the CLI machine may not have the directory.
- Only perform the Tegra check when GPU devices are requested.
@pimlock
Collaborator

pimlock commented Apr 7, 2026

Testing on Jetson Thor (NVIDIA Thor GPU, driver 580.00, CUDA 13.0)

Validated the following on a physical Jetson Thor device:

Container Toolkit bump (1.18.2 → 1.19.0)

  • Required. The custom device plugin (k8s-device-plugin PR #1675) generates CDI specs with the --host-cuda-version flag. Toolkit 1.18.2's nvidia-cdi-hook doesn't recognize this flag, causing RunContainerError on GPU sandbox pods. 1.19.0 supports it.

host-files-for-container.d bind mount

  • Required. Without it, the device plugin cannot discover Tegra GPU devices and fails with "CDI options are only supported on NVML-based systems". The CSVs (devices.csv, drivers.csv) are inputs to CDI spec generation — they tell the toolkit which Tegra-specific device nodes and host libraries to inject.
  • Fixed the detection: replaced local Path::is_dir() check with docker.info() kernel version detection (contains("tegra")). The previous approach broke remote SSH deploys (CLI machine doesn't have the directory). Now gated on !device_ids.is_empty() so it's only checked for GPU gateways.

Custom device plugin (k8s-device-plugin PR #1675)

  • Required. Stock device plugin 0.18.2 can detect the Tegra platform and register with kubelet, but generates a nearly empty CDI spec (no device nodes, no library mounts). The custom build with driver-root-aware CSV resolution is needed for functional GPU access.
  • Tested: stock plugin → torch.cuda.is_available() returns False (no /dev/nvidia* in sandbox). Custom plugin → full PyTorch CUDA 13.0 matrix multiply succeeds on Thor.

GID merge in process.rs

  • Not needed today but harmless. CDI spec is v0.5.0 (no additionalGids support) and all device nodes are 0666. The code would become relevant with CDI v0.7.0+ and toolkit 1.19.1+ (nvidia-container-toolkit PR #1745).

Other findings

  • br_netfilter kernel module must be loaded on Tegra for k3s DNS/service networking. Without it, pods can't reach CoreDNS via ClusterIP. Good candidate for a pre-flight check.
  • Device plugin warnings about missing V4L2/GStreamer/legacy-tegra files are harmless — they're display/video codec libraries, not needed for CUDA compute.
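The br_netfilter pre-flight check mentioned above could look like this (a hypothetical sketch, not part of this PR):

```shell
# Pre-flight check sketch (hypothetical): k3s DNS/service networking
# on Tegra needs br_netfilter; without it, pods cannot reach CoreDNS
# via its ClusterIP.
if grep -q '^br_netfilter ' /proc/modules 2>/dev/null; then
    echo "br_netfilter: loaded"
else
    echo "br_netfilter: missing; load it with: sudo modprobe br_netfilter"
    # Persist across reboots via systemd's modules-load.d mechanism:
    echo "persist with: echo br_netfilter | sudo tee /etc/modules-load.d/k3s.conf"
fi
```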

…ID preservation

- Log when Tegra platform is detected and host-files bind mount is added,
  including the kernel version from the Docker daemon.
- Extract CDI GID snapshot logic into `snapshot_cdi_gids()` function that
  only activates when GPU devices are present (/dev/nvidiactl exists).
- Log preserved CDI-injected GIDs when they are restored after initgroups.
- Fix cargo fmt formatting issue in docker.rs.
Comment on lines +616 to +620
const HOST_FILES_DIR: &str = "/etc/nvidia-container-runtime/host-files-for-container.d";
tracing::info!(
    kernel_version = info.kernel_version.as_deref().unwrap_or("unknown"),
    "Detected Tegra platform, bind-mounting {HOST_FILES_DIR} for CDI spec generation"
);
Collaborator


@elezar some context on this change: the previous version, which checked whether that file exists, wouldn't work for remote deployments, as the file would be checked against the local filesystem and not the remote one.

By default, the gateway starts on the local system, but it can be deployed remotely with a command like this:

openshell gateways start --gpu --remote pmlocek@remote-host

In this case, the CLI is installed on my Mac and the gateway is deployed over SSH on the remote-host.

This approach works for both local and remote deployments. It is based on the kernel version containing "tegra", since at this point we don't have access to the filesystem on the remote host. We could add that, but it would be a bigger change.
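The detection described above can be sketched as a small predicate (hypothetical function name):

```rust
/// Hypothetical sketch of the Tegra check: inspect the kernel version
/// reported by the Docker daemon (docker info), which works for both
/// local and remote (SSH) deployments, unlike a local filesystem check.
fn is_tegra_host(kernel_version: Option<&str>) -> bool {
    kernel_version
        .map(|v| v.to_ascii_lowercase().contains("tegra"))
        .unwrap_or(false)
}

fn main() {
    assert!(is_tegra_host(Some("5.15.148-tegra")));
    assert!(!is_tegra_host(Some("6.8.0-generic")));
    assert!(!is_tegra_host(None));
    println!("ok");
}
```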

Collaborator


I was also wondering whether it would make sense to include the /etc/nvidia-container-runtime/host-files-for-container.d path in the CDI spec generated on that host. Could the container toolkit running on Tegra (or other systems that need these CSVs) add that mount to the CDI spec itself? Then this manual mount wouldn't be necessary.

