[24.04_linux-nvidia-6.17-next] NVIDIA: VR: SAUCE: cxl: guard unlinked endpoints #465

Closed
nirmoy wants to merge 1 commit into NVIDIA:24.04_linux-nvidia-6.17-next from nirmoy:codex/nvbug6274048-cxl-endpoint-guard-6.17-next

Conversation


@nirmoy nirmoy commented Jun 15, 2026

Collaborator

Summary

Fix NVBug 6274048 for 24.04_linux-nvidia-6.17-next.

Launchpad: https://bugs.launchpad.net/bugs/2143032

cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint port registration completes. Guard CXL helper paths with IS_ERR_OR_NULL() before dereferencing it.
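
The guard pattern described above can be sketched in userspace C. This is a minimal illustration, not the patch itself: the `ERR_PTR()` and `IS_ERR_OR_NULL()` definitions are stand-ins modeled on the kernel's `<linux/err.h>`, the two structs are reduced to a single field each, and `memdev_endpoint_depth()` is a hypothetical helper invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Simplified, hypothetical models of the CXL structures. */
struct cxl_port {
	int depth;
};

struct cxl_memdev {
	struct cxl_port *endpoint;	/* ERR_PTR(-ENXIO) until linked */
};

/*
 * Hypothetical helper: return the endpoint depth, or -ENXIO when the
 * memdev has not been linked to a real port yet.
 */
static int memdev_endpoint_depth(const struct cxl_memdev *cxlmd)
{
	assert(cxlmd != NULL);

	/* Covers both NULL and ERR_PTR() values before any dereference. */
	if (IS_ERR_OR_NULL(cxlmd->endpoint))
		return -ENXIO;

	return cxlmd->endpoint->depth;
}
```

A plain `if (!cxlmd->endpoint)` check would not be enough in this model, because `ERR_PTR(-ENXIO)` is a non-NULL value that still must not be dereferenced.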

BOS note

I did not find evidence that this 6.17-next boot/probe NULL dereference has reproduced on BOS. BOS CXL Type-2/reset coverage is tracked separately:

Those SHAs are on upstream/26.04_linux-nvidia-bos. If the same early cxlmd->endpoint access is reproduced on BOS, backport this guard there too.

Testing

  • git diff --check
  • scripts/checkpatch.pl --strict --no-tree: clean
  • Focused CXL object build passed in a clean temp worktree with CXL options enabled; CONFIG_WERROR was disabled for an existing unrelated enum cxl_regloc_type warning.


nirmoy commented Jun 15, 2026

Collaborator Author

BaseOS Kernel Review

Summary

No issues found across the reviewed commits.

Findings: no problems found

Latest watcher review: open review

Kernel deb build: failed (failure log, build artifacts)

Head: 2be47e06e588

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.


github-actions Bot commented Jun 15, 2026

Contributor

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ✅ All checks passed

Details
Checking 1 commits...

Cherry-pick digest:
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2be47e06e588 │ [SAUCE] cxl: guard unlinked memdev endpoints                     │ N/A        │ N/A     │ nirmoyd                   │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@nirmoy nirmoy force-pushed the codex/nvbug6274048-cxl-endpoint-guard-6.17-next branch from 5dd7359 to a945fdb Compare June 15, 2026 15:07
@nirmoy nirmoy changed the title from "[24.04_linux-nvidia-6.17-next] cxl: guard unlinked memdev endpoints" to "[24.04_linux-nvidia-6.17-next] NVIDIA: VR: SAUCE: cxl: guard unlinked endpoints" Jun 15, 2026
@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch 2 times, most recently from 7a62271 to 51267da Compare June 19, 2026 12:02
@nvidia-bfigg nvidia-bfigg force-pushed the 24.04_linux-nvidia-6.17-next branch from 4a7f97e to 2333f65 Compare June 25, 2026 12:07
cxlmd->endpoint starts as ERR_PTR(-ENXIO) until endpoint port registration
links the memdev to a real cxl_port.

Treat NULL and error pointers as "endpoint not linked" before dereferencing
cxlmd->endpoint in CXL helper paths.

Fixes: eb61834 ("cxl/mem: Introduce cxl_memdev_attach for CXL-dependent operation")
Signed-off-by: Nirmoy Das <nirmoyd@nvidia.com>
@nirmoy nirmoy force-pushed the codex/nvbug6274048-cxl-endpoint-guard-6.17-next branch from a945fdb to 2be47e0 Compare June 25, 2026 16:18
@nirmoy nirmoy marked this pull request as ready for review June 25, 2026 16:34
@nirmoy nirmoy requested review from clsotog and nvmochs June 25, 2026 16:35
@nirmoy nirmoy added the help wanted label Jun 25, 2026

nvmochs commented Jun 25, 2026

Collaborator

No issues with this patch.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@clsotog clsotog left a comment

Collaborator

Acked-by: Carol L Soto <csoto@nvidia.com>

@nirmoy nirmoy added has_2_acks and removed help wanted, has_1_ack labels Jun 25, 2026

@sforshee sforshee left a comment

Collaborator

Patch looks good.

Acked-by: Seth Forshee <sforshee@nvidia.com>


nvmochs commented Jun 25, 2026

Collaborator

Merged, closing PR.

aff4ccee4530 NVIDIA: VR: SAUCE: cxl: Guard unlinked memdev endpoints

@nvmochs nvmochs closed this Jun 25, 2026
nvidia-bfigg pushed a commit that referenced this pull request Jun 27, 2026
The current code updates the tail call counter (TCC) using a pre-increment
approach: it stores the incremented value back to memory before performing
any boundary or target validation checks.

This causes two major issues:
1. When a tail call fails because the target program is NULL, the TCC is
   incorrectly incremented and saved in memory anyway.
2. This dummy increment implicitly consumes one slot of the allowed tail
   call budget. As a result, the subsequent loop reaches the maximum limit
   prematurely, leading to a test failure where the actual loop count is
   32 instead of the expected 33.

Fix this by deferring the counter update. Change the branch condition to
BPF_JSGE (greater or equal) so that we check the boundary first. The TCC
is only incremented and stored back to memory after the boundary check
and the NULL-target check both pass.
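
The difference between the two orderings can be modeled in plain C. This is a rough userspace sketch, not the JIT code: it assumes the old sequence compared the already-incremented counter with a greater-than test, takes the limit of 33 from the expected count in the test output, and uses a hypothetical driver (`count_taken()`) that makes one failed NULL-target attempt before looping.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAIL_CALL_CNT 33

/* Hypothetical stand-in for a tail call target program. */
struct prog {
	int id;
};

typedef int (*tail_call_fn)(int *tcc, const struct prog *target);

/* Old ordering (assumed): store the incremented counter, then validate. */
static int tail_call_pre_increment(int *tcc, const struct prog *target)
{
	assert(tcc != NULL);

	*tcc += 1;			/* written back before any check */
	if (*tcc > MAX_TAIL_CALL_CNT)
		return 0;
	if (!target)
		return 0;		/* budget slot already consumed */
	return 1;
}

/* New ordering: boundary check, NULL-target check, then the update. */
static int tail_call_deferred(int *tcc, const struct prog *target)
{
	assert(tcc != NULL);

	if (*tcc >= MAX_TAIL_CALL_CNT)
		return 0;
	if (!target)
		return 0;		/* counter left untouched */
	*tcc += 1;
	return 1;
}

/* One failed attempt on an empty slot, then count the successful calls. */
static int count_taken(tail_call_fn fn)
{
	struct prog next = { 1 };
	int tcc = 0;
	int taken = 0;

	fn(&tcc, NULL);
	while (fn(&tcc, &next))
		taken++;
	return taken;
}
```

In this model `count_taken(tail_call_pre_increment)` yields 32 and `count_taken(tail_call_deferred)` yields 33, the same off-by-one shape as the test failure quoted below.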

Before:

  $ sudo ./test_progs -t tailcalls/tailcall_3
  ...
  test_tailcall_count:FAIL:tailcall count unexpected tailcall count: actual 32 != expected 33
  ...
  #465/3   tailcalls/tailcall_3:FAIL
  #465     tailcalls:FAIL

After:

  $ sudo ./test_progs -t tailcalls/tailcall_3
  #465/3   tailcalls/tailcall_3:OK
  #465     tailcalls:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

Fixes: c0fcc95 ("LoongArch: BPF: Fix the tailcall hierarchy")
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

4 participants