Skip to content

[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: Reduce SMT contention on Vera#480

Open
arighi wants to merge 2 commits into
NVIDIA:26.04_linux-nvidia-bosfrom
arighi:bos-7.0-next-asym-cpu-stabilize-smt
Open

[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: Reduce SMT contention on Vera#480
arighi wants to merge 2 commits into
NVIDIA:26.04_linux-nvidia-bosfrom
arighi:bos-7.0-next-asym-cpu-stabilize-smt

Conversation

@arighi

@arighi arighi commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

On Vera systems, concurrent task wakeups can place CPU-intensive work on sibling threads of the same SMT core while other physical cores remain idle. This SMT contention can reduce performance for workloads intended to run one thread per core.

We can prevent such contention stabilizing the idle-core CPU selection, so that concurrent wakeups consistently select the same SMT-core representative, allowing the scheduler to avoid unintended sibling placement.

This approach has been proposed upstream:
https://lore.kernel.org/all/20260630152747.128746-1-arighi@nvidia.com/

We may have more follow-ups on this, but for now applying this patch seems to fix the issue and can significantly improve performance of CPU-bound workloads.


LP: https://bugs.launchpad.net/ubuntu/+bug/2158811

@nirmoy nirmoy added the help wanted Extra attention is needed label Jun 30, 2026
@arighi arighi requested a review from nvidia-bfigg June 30, 2026 19:31
@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details
Checking 2 commits...

Cherry-pick digest:
E: 7c08b4b6f01f ("NVIDIA: VR: SAUCE: sched/fair: Stabilize"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)
E: 2bc6393f5d13 ("topology: Introduce cpu_smt_mask for CON"): cannot resolve upstream SHA 8d3ad4bd922e
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 7c08b4b6f01f │ sched/fair: stabilize idle smt core selection with asym-capacity │ MISMATCH   │ found   │ ok, backporter: arighi    │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2bc6393f5d13 │ 8d3ad4bd922e                                                     │ ERROR      │ ERROR   │ ERROR                     │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@nirmoy

nirmoy commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

BaseOS Kernel Review

Summary

Only low-severity documentation issues were found. Commit 2bc6393 has a misleading comment placement suggesting cpu_smt_mask always returns cpumask_of(cpu), when the CONFIG_SCHED_SMT=y branch actually returns topology_sibling_cpumask(cpu), plus a minor grammar typo. No functional issues were identified.

Findings: Critical: 0, High: 0, Medium: 0, Low: 2

Latest watcher review: open review

Generated test plan: open test plan

Kernel deb build: successful (download debs, 4 files)

Head: 7c08b4b6f01f

This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review.

Comment thread kernel/sched/fair.c
*/
static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
{
int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finding:

Does this build with CONFIG_SCHED_SMT=n?

cpu_smt_mask() is only declared when CONFIG_SCHED_SMT is set. In
include/linux/topology.h the definition is guarded:

#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask)
static inline const struct cpumask *cpu_smt_mask(int cpu)
{
	return topology_sibling_cpumask(cpu);
}
#endif

select_idle_core_cpu() is added after the file's
#endif /* !CONFIG_SCHED_SMT */ block, so it is compiled for every
CONFIG_SMP build regardless of CONFIG_SCHED_SMT. is_core_idle(), called
just below in the same scan, wraps its own cpu_smt_mask() use in
#ifdef CONFIG_SCHED_SMT for this reason.

On arm64 CONFIG_SCHED_SMT is "default y" but user-selectable; it is not
force-selected the way x86 does with "select SCHED_SMT if SMP", so
CONFIG_SMP=y with CONFIG_SCHED_SMT=n is a valid configuration. Building
arm64 defconfig with CONFIG_SCHED_SMT disabled stops at:

kernel/sched/fair.c: In function 'select_idle_core_cpu':
error: implicit declaration of function 'cpu_smt_mask';
did you mean 'cpu_all_mask'?
[-Werror=implicit-function-declaration]

The same configuration builds on the parent commit.

Would guarding the cpu_smt_mask() use with CONFIG_SCHED_SMT, returning
cpu when SMT is not built (mirroring is_core_idle()), avoid this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, this looks legit and also the patch I sent upstream is affected. I'll send a v2 and update this PR as well. Thanks!

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the upstream patch isn't affected because of: 815c5cb76a3e (topology: Introduce cpu_smt_mask for CONFIG_SCHED_SMT=n). We should probably apply this commit as well, so we keep this patch aligned with the upstream one and we reduce the risk of conflicts.

@nvmochs

nvmochs commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

@arighi In addition to the config finding that Jamie reported, I also found this with codex:

• P2 is about a mismatch between what the scan validates and what the new helper may return.

In select_idle_capacity(), each scanned cpu is accepted only after this check:

  if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
          continue;

So for the scanned CPU, the code proves either:

  • the rq is truly available for enqueueing via available_idle_cpu(cpu), which includes !vcpu_is_preempted(cpu), or
  • the rq only has SCHED_IDLE tasks via sched_idle_cpu(cpu).

With the patch, when that CPU is on an idle SMT core, the function no longer necessarily returns that validated CPU. It calls:

  select_idle_core_cpu(cpu, cpus)

which returns:

  cpumask_first_and(cpu_smt_mask(cpu), cpus)

That sibling is only constrained by affinity/domain membership. It is not rechecked with available_idle_cpu() or sched_idle_cpu().

Why that matters: idle_core comes from is_core_idle(cpu), and is_core_idle() checks siblings with idle_cpu(), not available_idle_cpu(). idle_cpu() is weaker: it does not reject vcpu_is_preempted(cpu). So on architectures where vcpu_is_preempted() can return true, a sibling can be “idle” by is_core_idle() but still not be considered available by the normal wakeup selection predicate.

A concrete shape:

  1. CPU 5 is scanned.
  2. CPU 5 passes available_idle_cpu(5) and util_fits_cpu(...).
  3. Its SMT sibling CPU 4 is in cpus and has an idle rq, so is_core_idle(5) is true.
  4. But CPU 4 is vCPU-preempted, so available_idle_cpu(4) would be false.
  5. select_idle_core_cpu(5, cpus) returns CPU 4 because it is the first sibling in the mask.
  6. The wakeup path returns CPU 4 even though the scan would have skipped CPU 4 if it had evaluated it directly.

That is why I marked it P2 rather than P1: on arm64 in this tree, vcpu_is_preempted() is a stub returning false, so it likely does not affect the stated Vera Rubin target. But the generic scheduler code still has a cross-arch predicate mismatch, especially for x86/s390/loongarch/powerpc-style environments with real vCPU preemption reporting.

The low-risk fix is to make the representative selection use the same acceptance predicate, for example choose the first sibling in cpu_smt_mask(cpu) & cpus that also passes available_idle_cpu() or sched_idle_cpu(), falling back to the already validated cpu.

Shrikanth Hegde and others added 2 commits July 1, 2026 07:40
Define cpu_smt_mask in case of CONFIG_SCHED_SMT=n as cpumask_of that
CPU. With that config, it is expected that kernel treats each CPU
as individual core. Using cpumask_of(cpu) reflects that.

This would help to get rid of the ifdeffery that is spread across
the codebase since cpu_smt_mask is defined only in case of
CONFIG_SCHED_SMT=y.

Note: There is no arch today which defines cpu_smt_mask unconditionally.
So likely defining the cpu_smt_mask shouldn't lead redefinition errors.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260515172456.542799-2-sshegde@linux.ibm.com
(cherry picked from commit 8d3ad4bd922e94834d5a5e1c46df625fd21a4acb)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
… asym-capacity

select_idle_capacity() scans all logical CPUs also when it is looking
for a fully idle SMT core. Two concurrent wakeups can therefore observe
the same core as idle, encounter different siblings first, and place one
task on each sibling while another core remains unused.

Make every logical CPU of a selected idle core resolve to the same
stable CPU representative within the scan's existing affinity and
scheduling-domain mask. If the first task is enqueued before the next
scan examines the core, that scan rejects the now-busy core. If both
scans observe the core as idle, they select the same runqueue even if
the first enqueue becomes visible before the second scan finishes,
exposing the imbalance to the load balancer.

The symmetric CPU idle selection path is subject to the same race, but
normally returns as soon as select_idle_core() finds a fully idle core,
reducing the conflict window. The per-CPU capacity scan can retain an
idle-core candidate while evaluating other CPUs, giving concurrent
wakeups more opportunity to select different siblings of the same SMT
core. Therefore, limit the normalization to the asym-capacity path,
where this behavior has a measurable impact.

On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
core) showed a consistent 23% increase in mean throughput across
multiple runs.

For comparison, DCPerf MediaWiki running at system saturation (with all
SMT siblings busy) showed neither a benefit nor a regression: throughput
and Nginx request latency remained within measurement error.

Likewise, schbench under partially idle conditions showed no material
change in wakeup latency, request latency, or throughput (within 0.1%).
Tail wakeup latency was more consistent across runs with this change
applied.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
[ arighi: adjust context for v7.0 ]
(backported from https://lore.kernel.org/all/20260630152747.128746-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
@arighi

arighi commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

@arighi

arighi commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

After giving it some more thought, I think the patch is already in good shape as it is respect to the issue found by codex.

Here's why. It's true that we should consider also vcpu_is_preempted() to determine the "idleness" of a core, but relying on a check like "runqueue is SCHED_IDLE && !vcpu_is_preempted()" would just re-introduce the race that we're trying to solve (together with extra overhead for all the other cases).

What we need, instead, is a fast way to get a stable representative of a "presumed" idle core. The logic doesn't need to be perfect, it just needs to be efficient to prevent most of the SMT contention scenarios without introducing too much overhead. Even if we end up selecting a vCPU preempted sibling, the load balancer can subsequently correct the placement if it creates a visible imbalance. This looks more efficient than trying to make a perfect decision on each task wakeup, checking SCHED_IDLE && vCPU not preempted.

So, yes, Codex is correct, but also wrong. :) The goal isn't to make the logic 100% formally correct, it's to prevent most SMT contention scenarios in a reasonably efficient way.

That said, let's see what the other scheduler maintainers think. For now, though, I'd keep the logic as it is.

@arighi arighi force-pushed the bos-7.0-next-asym-cpu-stabilize-smt branch from a387869 to 7c08b4b Compare July 1, 2026 09:23
@arighi

arighi commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

PR updated, including 8d3ad4bd922e ("topology: Introduce cpu_smt_mask for CONFIG_SCHED_SMT=n") to fix the potential build failure with CONFIG_SCHED_SMT=n.

@jamieNguyenNVIDIA

Copy link
Copy Markdown
Collaborator

Acked-by: Jamie Nguyen <jamien@nvidia.com>

@nvmochs

nvmochs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

@nvmochs this one looks legit also for the upstream patch, we should properly handle the vcpu_is_preempted() scenario. I'll send a v2 upstream and update this PR. Thanks!

After giving it some more thought, I think the patch is already in good shape as it is respect to the issue found by codex.

Here's why. It's true that we should consider also vcpu_is_preempted() to determine the "idleness" of a core, but relying on a check like "runqueue is SCHED_IDLE && !vcpu_is_preempted()" would just re-introduce the race that we're trying to solve (together with extra overhead for all the other cases).

What we need, instead, is a fast way to get a stable representative of a "presumed" idle core. The logic doesn't need to be perfect, it just needs to be efficient to prevent most of the SMT contention scenarios without introducing too much overhead. Even if we end up selecting a vCPU preempted sibling, the load balancer can subsequently correct the placement if it creates a visible imbalance. This looks more efficient than trying to make a perfect decision on each task wakeup, checking SCHED_IDLE && vCPU not preempted.

So, yes, Codex is correct, but also wrong. :) The goal isn't to make the logic 100% formally correct, it's to prevent most SMT contention scenarios in a reasonably efficient way.

That said, let's see what the other scheduler maintainers think. For now, though, I'd keep the logic as it is.

Thanks Andrea...I agree with this approach.

@nvmochs

nvmochs commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Re-reviewed and confirmed the new patch addresses the kconfig finding.

No further issues/concerns from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>


@arighi FYI - I added a row to track this in the SS.

@nirmoy nirmoy added has_2_acks and removed help wanted Extra attention is needed has_1_ack labels Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants