[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: Reduce SMT contention on Vera by arighi · Pull Request #480 · NVIDIA/NV-Kernels

arighi · 2026-06-30T19:25:35Z

On Vera systems, concurrent task wakeups can place CPU-intensive work on sibling threads of the same SMT core while other physical cores remain idle. This SMT contention can reduce performance for workloads intended to run one thread per core.

We can prevent such contention stabilizing the idle-core CPU selection, so that concurrent wakeups consistently select the same SMT-core representative, allowing the scheduler to avoid unintended sibling placement.

This approach has been proposed upstream:
https://lore.kernel.org/all/20260630152747.128746-1-arighi@nvidia.com/

We may have more follow-ups on this, but for now applying this patch seems to fix the issue and can significantly improve performance of CPU-bound workloads.

LP: https://bugs.launchpad.net/ubuntu/+bug/2158811

github-actions · 2026-06-30T19:36:56Z

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details

Checking 2 commits...

Cherry-pick digest:
E: 7c08b4b6f01f ("NVIDIA: VR: SAUCE: sched/fair: Stabilize"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)
E: 2bc6393f5d13 ("topology: Introduce cpu_smt_mask for CON"): cannot resolve upstream SHA 8d3ad4bd922e
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 7c08b4b6f01f │ sched/fair: stabilize idle smt core selection with asym-capacity │ MISMATCH   │ found   │ ok, backporter: arighi    │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2bc6393f5d13 │ 8d3ad4bd922e                                                     │ ERROR      │ ERROR   │ ERROR                     │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

nirmoy · 2026-06-30T20:43:40Z

BaseOS Kernel Review

Summary

Only low-severity documentation issues were found. Commit 2bc6393 has a misleading comment placement suggesting cpu_smt_mask always returns cpumask_of(cpu), when the CONFIG_SCHED_SMT=y branch actually returns topology_sibling_cpumask(cpu), plus a minor grammar typo. No functional issues were identified.

Findings: Critical: 0, High: 0, Medium: 0, Low: 2

Latest watcher review: open review

Generated test plan: open test plan

Kernel deb build: successful (download debs, 4 files)

Head: 7c08b4b6f01f

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

jamieNguyenNVIDIA · 2026-06-30T22:28:02Z

+ */
+static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
+{
+	int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);


Claude finding:

Does this build with CONFIG_SCHED_SMT=n?

cpu_smt_mask() is only declared when CONFIG_SCHED_SMT is set. In
include/linux/topology.h the definition is guarded:

#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask) static inline const struct cpumask *cpu_smt_mask(int cpu) { return topology_sibling_cpumask(cpu); } #endif

select_idle_core_cpu() is added after the file's
#endif /* !CONFIG_SCHED_SMT */ block, so it is compiled for every
CONFIG_SMP build regardless of CONFIG_SCHED_SMT. is_core_idle(), called
just below in the same scan, wraps its own cpu_smt_mask() use in
#ifdef CONFIG_SCHED_SMT for this reason.

On arm64 CONFIG_SCHED_SMT is "default y" but user-selectable; it is not
force-selected the way x86 does with "select SCHED_SMT if SMP", so
CONFIG_SMP=y with CONFIG_SCHED_SMT=n is a valid configuration. Building
arm64 defconfig with CONFIG_SCHED_SMT disabled stops at:

kernel/sched/fair.c: In function 'select_idle_core_cpu': error: implicit declaration of function 'cpu_smt_mask'; did you mean 'cpu_all_mask'? [-Werror=implicit-function-declaration]

The same configuration builds on the parent commit.

Would guarding the cpu_smt_mask() use with CONFIG_SCHED_SMT, returning
cpu when SMT is not built (mirroring is_core_idle()), avoid this?

Ah yes, this looks legit and also the patch I sent upstream is affected. I'll send a v2 and update this PR as well. Thanks!

Actually the upstream patch isn't affected because of: 815c5cb76a3e (topology: Introduce cpu_smt_mask for CONFIG_SCHED_SMT=n). We should probably apply this commit as well, so we keep this patch aligned with the upstream one and we reduce the risk of conflicts.

nvmochs · 2026-06-30T22:38:33Z

@arighi In addition to the config finding that Jamie reported, I also found this with codex:

• P2 is about a mismatch between what the scan validates and what the new helper may return.

In select_idle_capacity(), each scanned cpu is accepted only after this check:

  if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
          continue;

So for the scanned CPU, the code proves either:

the rq is truly available for enqueueing via available_idle_cpu(cpu), which includes !vcpu_is_preempted(cpu), or
the rq only has SCHED_IDLE tasks via sched_idle_cpu(cpu).

With the patch, when that CPU is on an idle SMT core, the function no longer necessarily returns that validated CPU. It calls:

  select_idle_core_cpu(cpu, cpus)

which returns:

  cpumask_first_and(cpu_smt_mask(cpu), cpus)

That sibling is only constrained by affinity/domain membership. It is not rechecked with available_idle_cpu() or sched_idle_cpu().

Why that matters: idle_core comes from is_core_idle(cpu), and is_core_idle() checks siblings with idle_cpu(), not available_idle_cpu(). idle_cpu() is weaker: it does not reject vcpu_is_preempted(cpu). So on architectures where vcpu_is_preempted() can return true, a sibling can be “idle” by is_core_idle() but still not be considered available by the normal wakeup selection predicate.

A concrete shape:

CPU 5 is scanned.
CPU 5 passes available_idle_cpu(5) and util_fits_cpu(...).
Its SMT sibling CPU 4 is in cpus and has an idle rq, so is_core_idle(5) is true.
But CPU 4 is vCPU-preempted, so available_idle_cpu(4) would be false.
select_idle_core_cpu(5, cpus) returns CPU 4 because it is the first sibling in the mask.
The wakeup path returns CPU 4 even though the scan would have skipped CPU 4 if it had evaluated it directly.

That is why I marked it P2 rather than P1: on arm64 in this tree, vcpu_is_preempted() is a stub returning false, so it likely does not affect the stated Vera Rubin target. But the generic scheduler code still has a cross-arch predicate mismatch, especially for x86/s390/loongarch/powerpc-style environments with real vCPU preemption reporting.

The low-risk fix is to make the representative selection use the same acceptance predicate, for example choose the first sibling in cpu_smt_mask(cpu) & cpus that also passes available_idle_cpu() or sched_idle_cpu(), falling back to the already validated cpu.

Define cpu_smt_mask in case of CONFIG_SCHED_SMT=n as cpumask_of that CPU. With that config, it is expected that kernel treats each CPU as individual core. Using cpumask_of(cpu) reflects that. This would help to get rid of the ifdeffery that is spread across the codebase since cpu_smt_mask is defined only in case of CONFIG_SCHED_SMT=y. Note: There is no arch today which defines cpu_smt_mask unconditionally. So likely defining the cpu_smt_mask shouldn't lead redefinition errors. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260515172456.542799-2-sshegde@linux.ibm.com (cherry picked from commit 8d3ad4bd922e94834d5a5e1c46df625fd21a4acb) Signed-off-by: Andrea Righi <arighi@nvidia.com>

… asym-capacity select_idle_capacity() scans all logical CPUs also when it is looking for a fully idle SMT core. Two concurrent wakeups can therefore observe the same core as idle, encounter different siblings first, and place one task on each sibling while another core remains unused. Make every logical CPU of a selected idle core resolve to the same stable CPU representative within the scan's existing affinity and scheduling-domain mask. If the first task is enqueued before the next scan examines the core, that scan rejects the now-busy core. If both scans observe the core as idle, they select the same runqueue even if the first enqueue becomes visible before the second scan finishes, exposing the imbalance to the load balancer. The symmetric CPU idle selection path is subject to the same race, but normally returns as soon as select_idle_core() finds a fully idle core, reducing the conflict window. The per-CPU capacity scan can retain an idle-core candidate while evaluating other CPUs, giving concurrent wakeups more opportunity to select different siblings of the same SMT core. Therefore, limit the normalization to the asym-capacity path, where this behavior has a measurable impact. On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per core) showed a consistent 23% increase in mean throughput across multiple runs. For comparison, DCPerf MediaWiki running at system saturation (with all SMT siblings busy) showed neither a benefit nor a regression: throughput and Nginx request latency remained within measurement error. Likewise, schbench under partially idle conditions showed no material change in wakeup latency, request latency, or throughput (within 0.1%). Tail wakeup latency was more consistent across runs with this change applied. Signed-off-by: Andrea Righi <arighi@nvidia.com> [ arighi: adjust context for v7.0 ] (backported from https://lore.kernel.org/all/20260630152747.128746-1-arighi@nvidia.com) Signed-off-by: Andrea Righi <arighi@nvidia.com>

arighi · 2026-07-01T06:00:12Z

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

arighi · 2026-07-01T09:17:57Z

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

After giving it some more thought, I think the patch is already in good shape as it is respect to the issue found by codex.

Here's why. It's true that we should consider also vcpu_is_preempted() to determine the "idleness" of a core, but relying on a check like "runqueue is SCHED_IDLE && !vcpu_is_preempted()" would just re-introduce the race that we're trying to solve (together with extra overhead for all the other cases).

What we need, instead, is a fast way to get a stable representative of a "presumed" idle core. The logic doesn't need to be perfect, it just needs to be efficient to prevent most of the SMT contention scenarios without introducing too much overhead. Even if we end up selecting a vCPU preempted sibling, the load balancer can subsequently correct the placement if it creates a visible imbalance. This looks more efficient than trying to make a perfect decision on each task wakeup, checking SCHED_IDLE && vCPU not preempted.

So, yes, Codex is correct, but also wrong. :) The goal isn't to make the logic 100% formally correct, it's to prevent most SMT contention scenarios in a reasonably efficient way.

That said, let's see what the other scheduler maintainers think. For now, though, I'd keep the logic as it is.

arighi · 2026-07-01T09:24:52Z

PR updated, including 8d3ad4bd922e ("topology: Introduce cpu_smt_mask for CONFIG_SCHED_SMT=n") to fix the potential build failure with CONFIG_SCHED_SMT=n.

jamieNguyenNVIDIA · 2026-07-01T15:34:31Z

Acked-by: Jamie Nguyen <jamien@nvidia.com>

nvmochs · 2026-07-01T16:06:36Z

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

After giving it some more thought, I think the patch is already in good shape as it is respect to the issue found by codex.

Here's why. It's true that we should consider also vcpu_is_preempted() to determine the "idleness" of a core, but relying on a check like "runqueue is SCHED_IDLE && !vcpu_is_preempted()" would just re-introduce the race that we're trying to solve (together with extra overhead for all the other cases).

What we need, instead, is a fast way to get a stable representative of a "presumed" idle core. The logic doesn't need to be perfect, it just needs to be efficient to prevent most of the SMT contention scenarios without introducing too much overhead. Even if we end up selecting a vCPU preempted sibling, the load balancer can subsequently correct the placement if it creates a visible imbalance. This looks more efficient than trying to make a perfect decision on each task wakeup, checking SCHED_IDLE && vCPU not preempted.

So, yes, Codex is correct, but also wrong. :) The goal isn't to make the logic 100% formally correct, it's to prevent most SMT contention scenarios in a reasonably efficient way.

That said, let's see what the other scheduler maintainers think. For now, though, I'd keep the logic as it is.

Thanks Andrea...I agree with this approach.

nvmochs · 2026-07-01T16:08:24Z

Re-reviewed and confirmed the new patch addresses the kconfig finding.

No further issues/concerns from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@arighi FYI - I added a row to track this in the SS.

nirmoy added the help wanted Extra attention is needed label Jun 30, 2026

arighi requested a review from nvidia-bfigg June 30, 2026 19:31

arighi assigned nvmochs and jamieNguyenNVIDIA and unassigned jamieNguyenNVIDIA and nvmochs Jun 30, 2026

arighi requested review from jamieNguyenNVIDIA and nvmochs June 30, 2026 19:31

jamieNguyenNVIDIA reviewed Jun 30, 2026

View reviewed changes

Shrikanth Hegde and others added 2 commits July 1, 2026 07:40

arighi force-pushed the bos-7.0-next-asym-cpu-stabilize-smt branch from a387869 to 7c08b4b Compare July 1, 2026 09:23

nirmoy added the has_1_ack label Jul 1, 2026

nirmoy added has_2_acks and removed help wanted Extra attention is needed has_1_ack labels Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: Reduce SMT contention on Vera#480

[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: Reduce SMT contention on Vera#480
arighi wants to merge 2 commits into
NVIDIA:26.04_linux-nvidia-bosfrom
arighi:bos-7.0-next-asym-cpu-stabilize-smt

arighi commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026 •

edited

Loading

Uh oh!

nirmoy commented Jun 30, 2026 •

edited

Loading

Uh oh!

jamieNguyenNVIDIA Jun 30, 2026

Uh oh!

arighi Jul 1, 2026

Uh oh!

arighi Jul 1, 2026

Uh oh!

nvmochs commented Jun 30, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

jamieNguyenNVIDIA commented Jul 1, 2026

Uh oh!

nvmochs commented Jul 1, 2026

Uh oh!

nvmochs commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

arighi commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Validation Report

Patchscan ✅ No Missing Fixes

PR Lint ❌ Errors found

Uh oh!

nirmoy commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

BaseOS Kernel Review

Summary

Uh oh!

jamieNguyenNVIDIA Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

arighi Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

arighi Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

nvmochs commented Jun 30, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

arighi commented Jul 1, 2026

Uh oh!

jamieNguyenNVIDIA commented Jul 1, 2026

Uh oh!

nvmochs commented Jul 1, 2026

Uh oh!

nvmochs commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 30, 2026 •

edited

Loading

nirmoy commented Jun 30, 2026 •

edited

Loading