[24.04-linux-nvidia-6.17] MPAM: Please pull arm_mpam: Consider overflow in bandwidth counter state#446
Conversation
PR Validation ReportPatchscan ✅ No Missing FixesAll cherry-picked commits checked — no missing upstream fixes found. PR Lint ❌ Errors foundDetailsChecking 2 commits...
Cherry-pick digest:
E: 196342878e56 ("arm_mpam: Use long MBWU counters if supp"): patch-ID mismatch with upstream 9e5afb7c3283
E: 7667ba684641 ("arm_mpam: Consider overflow in bandwidth"): patch-ID mismatch with upstream b35363793291
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local │ Referenced upstream / Patch subject │ Patch-ID │ Subject │ SoB chain │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 196342878e56 │ 9e5afb7c3283 arm_mpam: Use long MBWU counters if supported │ MISMATCH │ match │ preserved + fenghuay adde │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 7667ba684641 │ b35363793291 arm_mpam: Consider overflow in bandwidth counter st │ MISMATCH │ match │ preserved + fenghuay adde │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘
Lint: all checks passed.
|
BaseOS Kernel ReviewSummaryIn commit 7667ba6, mpam_msmon_overflow_val() always returns the 31-bit width constant regardless of counter type, so the overflow correction for 44- and 63-bit MBWU counters is too small and corrupts bandwidth accounting on wrap. The companion long-counter commit had no findings. Findings: Critical: 0, High: 1, Medium: 0, Low: 0 Latest watcher review: open review Kernel deb build: failed (failure log, build artifacts) Head: This comment is maintained by nv-pr-bot. It is updated when the GitHub watcher publishes a newer review. |
|
To clarify, this is a patch from v6.19 (not v6.18) and was part of the "[PATCH v6 00/34] arm_mpam: Add basic mpam driver". @fyu1 Are there other patches from this series that are missing from the 6.17-HWE kernel? |
|
@fyu1 Codex is calling out these findings... Line numbers below are from remotes/fyu1/24.04_linux-nvidia-6.17-next.mpam.extras.fixes3 at 65dbf0f.
At drivers/resctrl/mpam_devices.c:1383-1392: case mpam_feat_msmon_mbwu_63counter: Those are maximum counter values: 2^63 - 1, 2^44 - 1, 2^31 - 1. But after this patch, the value is added directly on overflow at :1488-1489: if (overflow) The upstream commit adds the counter modulus instead: for the 31-bit counter, upstream computes BIT_ULL(31), i.e. 2^31, not GENMASK_ULL(30, 0). So every detected overflow undercounts by one counter unit. For 44/63-bit
Existing downstream code scales the sampled MBWU value at :1480-1481: if (mpam_has_quirk(T241_MBW_COUNTER_SCALE_64, msc)) Before this backport, overflow correction also applied the same scale: overflow_val = mpam_msmon_overflow_val(m->type); The patch removed that path and now adds the unscaled helper result directly. That mixes units: now is bytes on T241, while correction is raw counter ticks. A 31-bit overflow should add roughly 2^31 * 64 bytes, but the The fix should make the overflow correction use the same unit as now.
This branch has long MBWU counter support and defines two relevant status bits: MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L /* bit 15 / clean_msmon_ctl_val() already knows about the long-counter bit and clears it for config comparison at :1337-1343: *cur_ctl &= ~MSMON_CFG_x_CTL_OFLOW_STATUS; if (FIELD_GET(MSMON_CFG_x_CTL_TYPE, *cur_ctl) == MSMON_CFG_MBWU_CTL_TYPE_MBWU) But the new overflow detection only checks bit 26 at :1435: overflow = cur_ctl & MSMON_CFG_x_CTL_OFLOW_STATUS; So a long-counter overflow signaled only by MSMON_CFG_MBWU_CTL_OFLOW_STATUS_L will not increment correction, and because overflow is false, the code also will not write the cleaned control value back to clear the sticky The backport needs to treat OFLOW_STATUS_L as overflow for the 44/63-bit MBWU paths.
At :1405: u64 now, overflow_val = 0; overflow_val is no longer used after the patch. That should produce an unused-variable warning. Also, struct msmon_mbwu_state still says prev_val is “Used to detect overflow” at drivers/resctrl/mpam_internal.h:325-326, and write_msmon_ctl_flt_vals() still resets it at mpam_devices.c:1374-1375. But this patch removed |
Claude missed 1, but also flagged 2-4. Additionally, the commit message ought to have your SOB go after the backport notes: |
fe6c6f6 to
5ecb1af
Compare
Fixed.
|
Fixed. |
10c5aaa to
d841a88
Compare
|
Rebased. No code change. |
|
This is a finding at codex: |
|
In addition to the initialization issue that Carol pointed out, my review also popped up these findings:
More detail... › In the final branch, for T241 MBWU reads: That scales now for all MBWU counter widths, including mpam_feat_msmon_mbwu_63counter. But overflow correction does: So for 63-bit counters specifically: now = scaled by 64
Lastly, it appears Jamie's comment about the ordering of the annotation notes in relation to your SOB have not been addressed - I'm still seeing the notes come after the SOB, when they should be after the pick tag, but before the SOB. Example of the ordering I was anticipating: |
One more patch is added to help the main patch. Other patches are cosmetic patches and no need to be ported. |
d841a88 to
8e479a9
Compare
Fixed. |
Fixed.
Fixed.
Fixed. |
|
The updated PR addressed comments from Mat and Carol. Please review and merge it if it's good. Thanks. -Fenghua |
|
@fyu1 Can you rebase the PR? |
|
@fyu1 All of the coding issues have been addressed. The remaining issue is commit-message ordering in the second commit 8e479a9. It still has: That should be: The first commit 3d2e100 has the ordering fixed correctly. |
8e479a9 to
1bdbe45
Compare
Rebased. |
Fixed. |
|
Thanks @fyu1! No further issues from me.
|
|
|
|
|
7a62271 to
51267da
Compare
|
@fyu1 Please rebase this branch and then we can get these applied. |
Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected. This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation. Cc: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (backported from commit b353637) [fenghuay: Differences from upstream b353637: - Retains NV long-counter read paths, reset_on_next_read, and T241 now *= 64. - Applies T241 overflow correction * 64 inline in __ris_msmon_read() (moved to mpam_msmon_overflow_val() wrapper in the 9e5afb7 follow-up commit). - Initialize overflow to false. ] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
Now that the larger counter sizes are probed, make use of them. Callers of mpam_msmon_read() may not know (or care!) about the different counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the driver pick the counter to use. Only 32bit accesses to the MSC are required to be supported by the spec, but these registers are 64bits. The lower half may overflow into the higher half between two 32bit reads. To avoid this, use a helper that reads the top half multiple times to check for overflow. Signed-off-by: Rohit Mathew <rohit.mathew@arm.com> [morse: merged multiple patches from Rohit, added explicit counter selection ] Signed-off-by: James Morse <james.morse@arm.com> Cc: Peter Newman <peternewman@google.com> Reviewed-by: Ben Horgan <ben.horgan@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Fenghua Yu <fenghuay@nvidia.com> Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Tested-by: Gavin Shan <gshan@redhat.com> Tested-by: Zeng Heng <zengheng4@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Ben Horgan <ben.horgan@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (backported from commit 9e5afb7) [fenghuay: Partial port; this tree already has long-counter read paths, mpam_msc_read_mbwu_l(), and T241-MPAM-6 workaround. This commit ports per-width overflow correction, OFLOW_STATUS_L overflow detection, and T241 overflow correction scaling via mpam_msmon_overflow_val(). Apply T241-MPAM-6 * 64 scaling to 31/44-bit counters only; skip 63-bit counters for both the counter read and overflow correction. ] Signed-off-by: Fenghua Yu <fenghuay@nvidia.com>
1bdbe45 to
1963428
Compare
|
Merged, closing PR. |
Backport this commit to fix a bug: https://nvbugspro.nvidia.com/bug/6207279
Since the commit is in 6.19 upstream already, 7.0 bos and lts don't have this issue.
Use the overflow status bit to track overflow on each bandwidth counter read and add the counter size to the correction when overflow is detected.
This assumes that only a single overflow has occurred since the last read of the counter. Overflow interrupts, on hardware that supports them could be used to remove this limitation.
Cc: Zeng Heng zengheng4@huawei.com
Reviewed-by: Gavin Shan gshan@redhat.com
Reviewed-by: Zeng Heng zengheng4@huawei.com
Reviewed-by: Jonathan Cameron jonathan.cameron@huawei.com
Reviewed-by: Shaopeng Tan tan.shaopeng@jp.fujitsu.com
Reviewed-by: Fenghua Yu fenghuay@nvidia.com
Tested-by: Carl Worth carl@os.amperecomputing.com
Tested-by: Gavin Shan gshan@redhat.com
Tested-by: Zeng Heng zengheng4@huawei.com
Tested-by: Shaopeng Tan tan.shaopeng@jp.fujitsu.com
Tested-by: Hanjun Guo guohanjun@huawei.com
(backported from commit b353637)
[fenghuay: Fix mem bw monitoring counter overflow issue.
LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2157912