Avoid reporting crash results if it is like GPU problem #7

Open
Menkib64 wants to merge 1 commit into LeelaChessZero:master from Menkib64:avoid_reporting_hw_crashes

Conversation

Menkib64 commented May 7, 2026

No description provided.

Copilot AI review requested due to automatic review settings May 7, 2026 21:03

Copilot AI left a comment

Pull request overview

This PR aims to reduce false “engine crash” reporting by detecting suspected GPU/hardware instability (via a base-engine benchmark check) and suppressing result reporting for affected game pairs.

Changes:

  • Extends Cutechess result parsing to run a “GPU crash?” check on disconnect/stall adjudications and mark affected games as hw_crash.
  • Suppresses pentanomial/trinomial updates for pairs containing hw_crash by deleting the pair from the in-memory game map.
  • Threads base_name / base_network through the cutechess worker pipeline so the benchmark check has the required inputs.
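The flow described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `classify_game`, `drop_hw_crash_pairs`, and the `gpu_check` callable are hypothetical names, and the assumption that games are paired as (1, 2), (3, 4), ... is mine. Only the `'hw_crash'` marker and the `disconnect` / `stalls` substrings come from the diff.

```python
# Hypothetical sketch: every name except the 'hw_crash' marker is illustrative.

def classify_game(reason, result, gpu_check):
    """Return 'hw_crash' when a crash-like adjudication coincides with a failing GPU check.

    `gpu_check` stands in for the base-engine benchmark check and returns True
    when the hardware looks unstable.
    """
    crash_like = 'disconnect' in reason or 'stalls' in reason
    if crash_like and gpu_check():
        return 'hw_crash'
    return result


def drop_hw_crash_pairs(games):
    """Delete every game pair in which either game is marked 'hw_crash'.

    Assumption: `games` maps a 1-based game index to its result, and pairs
    are (1, 2), (3, 4), and so on.
    """
    for first in sorted(k for k in games if k % 2 == 1):
        second = first + 1
        if second not in games:
            continue  # incomplete pair, nothing to suppress yet
        if games[first] == 'hw_crash' or games[second] == 'hw_crash':
            del games[first]
            del games[second]
    return games
```

Injecting the check as a callable keeps the sketch testable without a GPU; the benchmark only runs for crash-like adjudications, so ordinary results pay no extra cost.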

Comment thread Client/worker.py
Comment on lines 530 to 533
results['crashes' ] += 'disconnect' in reason or 'stalls' in reason
results['timelosses'] += 'on time' in reason
results['illegals' ] += 'illegal' in reason

Author

I intentionally want to increment the crash count for HW errors too. That makes sure we notice them and check the Errors page.
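A small illustration of why the counter still goes up for hardware crashes: the increment depends only on the adjudication reason, not on how the game is classified afterwards. The `tally` wrapper and the sample reason strings are hypothetical; the three counter lines mirror the quoted snippet.

```python
def tally(reasons):
    """Count crash-like, time-loss, and illegal-move adjudications."""
    results = {'crashes': 0, 'timelosses': 0, 'illegals': 0}
    for reason in reasons:
        # Each right-hand side is a bool, which adds as 0 or 1.
        results['crashes'] += 'disconnect' in reason or 'stalls' in reason
        results['timelosses'] += 'on time' in reason
        results['illegals'] += 'illegal' in reason
    return results
```

Because the crash counter is updated before any pair is dropped, a suppressed `hw_crash` pair would still show up in the crash total.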

Comment thread Client/worker.py
Comment on lines +517 to +524
def is_gpu_crashed(config, engine, network):
    print('[WARNING] Checking if crash was caused by a GPU problem...')
    try:
        safe_run_benchmarks(config, 'base', engine, network)
        return False
    except utils.OpenBenchBadBenchException:
        print('[ERROR] GPU crash detected!')
        return True

Author

I want the server error report (and blacklisting) to happen after the benchmark fails. This change is only about not reporting game results.
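One way to read this: catching the benchmark exception in the check does not have to hide the failure from the server, provided the failure is reported as part of the failing benchmark run itself. The sketch below only illustrates that ordering. `BadBenchError`, `run_benchmark_stub`, and the `reports` list are stand-ins I made up, and whether `safe_run_benchmarks` actually reports before raising is an assumption, not something this thread confirms.

```python
class BadBenchError(Exception):
    """Stand-in for utils.OpenBenchBadBenchException."""


def run_benchmark_stub(healthy, reports):
    """Hypothetical benchmark runner that reports a failure before raising."""
    if not healthy:
        # Assumption: the server report happens here, inside the runner.
        reports.append('bad bench reported to server')
        raise BadBenchError('benchmark failed')


def is_gpu_crashed_sketch(healthy, reports):
    """Return True when the benchmark fails; the report has already been recorded."""
    try:
        run_benchmark_stub(healthy, reports)
        return False
    except BadBenchError:
        return True
```

Under that assumption the caller can skip the game results while the error report and any blacklisting still go through.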

Comment thread Client/worker.py
Comment on lines +517 to +523
def is_gpu_crashed(config, engine, network):
    print('[WARNING] Checking if crash was caused by a GPU problem...')
    try:
        safe_run_benchmarks(config, 'base', engine, network)
        return False
    except utils.OpenBenchBadBenchException:
        print('[ERROR] GPU crash detected!')

Author

I don't think crash-handling performance matters. Either the client is going to stop working or the test is about to fail.

Comment thread Client/worker.py
return

# Don't report results when we detect a GPU issue.
if results['games'][first] == 'hw_crash' or results['games'][second] == 'hw_crash':