Skip to content

Replace Hangfire build job runner with a local implementation#947

Open
ddaspit wants to merge 8 commits into
mainfrom
remove-hangfire-machine
Open

Replace Hangfire build job runner with a local implementation#947
ddaspit wants to merge 8 commits into
mainfrom
remove-hangfire-machine

Conversation

@ddaspit
Copy link
Copy Markdown
Contributor

@ddaspit ddaspit commented May 15, 2026

This change is Reviewable

@ddaspit ddaspit force-pushed the remove-hangfire-machine branch from 043389d to a490550 Compare May 15, 2026 13:54
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 15, 2026

Codecov Report

❌ Patch coverage is 60.69652% with 158 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.67%. Comparing base (22f023f) to head (e8c58e6).

Files with missing lines Patch % Lines
...val.Machine.Shared/Services/LocalBuildJobRunner.cs 73.46% 63 Missing and 2 partials ⚠️
...Shared/Services/StatisticalLocalBuildJobFactory.cs 2.77% 35 Missing ⚠️
...Shared/Services/SmtTransferLocalBuildJobFactory.cs 2.85% 34 Missing ⚠️
...rc/Echo/src/EchoEngine/TranslationEngineService.cs 0.00% 10 Missing ⚠️
...Machine.Shared/Services/NmtLocalBuildJobFactory.cs 85.29% 2 Missing and 3 partials ⚠️
...src/Serval.Translation/Services/PlatformService.cs 0.00% 3 Missing ⚠️
...ine/src/Serval.Machine.Shared/Services/BuildJob.cs 66.66% 2 Missing ⚠️
...l.Machine.Shared/Services/ClearMLBuildJobRunner.cs 66.66% 1 Missing ⚠️
...achine.Shared/Services/SmtTransferEngineService.cs 0.00% 1 Missing ⚠️
...achine.Shared/Services/StatisticalEngineService.cs 0.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #947      +/-   ##
==========================================
- Coverage   66.74%   64.67%   -2.08%     
==========================================
  Files         381      379       -2     
  Lines       21736    21634     -102     
  Branches     2773     2762      -11     
==========================================
- Hits        14507    13991     -516     
- Misses       6191     6619     +428     
+ Partials     1038     1024      -14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Enkidu93 Enkidu93 force-pushed the remove-hangfire-machine branch from 5cd22da to 0c064cc Compare May 20, 2026 15:46
@Enkidu93
Copy link
Copy Markdown
Collaborator

I ended up only making a few updates:

  • The wrong runner was being used for SMT training
  • Fixed a few small bugs in the local SMT training job
  • Removed unused data parameter (this was something in my PR that wasn't in yours - we could also strongly type the build data but since it's ephemeral, I don't know that it matters. Happy to do so if you prefer)
  • Explicitly set the machine.py version in the docker-compose. Is there a reason we haven't done this before? I think it's easy enough to set the version in a branch if you want to test a testing-only version of machine.py. This also makes it easy when pushing to QA because I can just use whatever version of machine.py is in the docker-compose in that commit and feel confident that it is the correct one that is compatible with the code currently on main.

@Enkidu93 Enkidu93 marked this pull request as ready for review May 20, 2026 15:53
@Enkidu93 Enkidu93 requested a review from pmachapman May 20, 2026 15:53
Copy link
Copy Markdown
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit partially reviewed 43 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).


src/Machine/src/Serval.Machine.Shared/Models/Build.cs line 13 at r3 (raw file):

public enum BuildJobRunnerType
{
    Hangfire,

I think we can safely remove this type. We will need to make sure that there aren't any currently running builds before we deploy.

Copy link
Copy Markdown
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I don't think I can approve this PR, since I created it.

@ddaspit reviewed 1 file and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on pmachapman).

@Enkidu93
Copy link
Copy Markdown
Collaborator

Fixes #950 and fixes #939

Copy link
Copy Markdown
Collaborator

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmachapman partially reviewed 43 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ddaspit).


docker-compose.yml line 27 at r3 (raw file):

      - BuildJob__ClearML__1__DockerImage=${MACHINE_PY_CPU_IMAGE:-ghcr.io/sillsdev/machine.py:1.12.0.cpu_only}
      - BuildJob__ClearML__2__Queue=${CLEARML_CPU_QUEUE:-lambert_24gb.cpu_only}
      - BuildJob__ClearML__2__DockerImage=${MACHINE_PY_CPU_IMAGE:-ghcr.io/sillsdev/machine.py:1.12.0.cpu_only}

Do we want to specify the version of machine.py from now on (if so, we should update appsettings.json and docker-compose.atlas.yml), or was this just a workaround?

Code quote:

      - BuildJob__ClearML__0__DockerImage=${MACHINE_PY_IMAGE:-ghcr.io/sillsdev/machine.py:1.12.0}
      - BuildJob__ClearML__1__Queue=${CLEARML_CPU_QUEUE:-lambert_24gb.cpu_only}
      - BuildJob__ClearML__1__DockerImage=${MACHINE_PY_CPU_IMAGE:-ghcr.io/sillsdev/machine.py:1.12.0.cpu_only}
      - BuildJob__ClearML__2__Queue=${CLEARML_CPU_QUEUE:-lambert_24gb.cpu_only}
      - BuildJob__ClearML__2__DockerImage=${MACHINE_PY_CPU_IMAGE:-ghcr.io/sillsdev/machine.py:1.12.0.cpu_only}

src/Machine/src/Serval.Machine.Shared/Services/SmtTransferLocalBuildJobFactory.cs line 59 at r3 (raw file):

    }

    private record PostprocessData(int TrainCount, double Confidence);

NIT: Should this record be public? I notice the all three LocalBuildJobFactories declare this exact record.

Code quote:

private record PostprocessData(int TrainCount, double Confidence);

Copy link
Copy Markdown
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 5 files and all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ddaspit).

Copy link
Copy Markdown
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enkidu93 partially reviewed 36 files, made 3 comments, and resolved 1 discussion.
Reviewable status: 46 of 49 files reviewed, 1 unresolved discussion (waiting on pmachapman).


docker-compose.yml line 27 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

Do we want to specify the version of machine.py from now on (if so, we should update appsettings.json and docker-compose.atlas.yml), or was this just a workaround?

I was thinking we could switch to explicitly setting it. I mention my rationale in a top-level comment. I went ahead and updated it in the appsettings as well - good call! Do you think this is a good change? I can't remember why we've left it as latest in the past.


src/Machine/src/Serval.Machine.Shared/Services/SmtTransferLocalBuildJobFactory.cs line 59 at r3 (raw file):

Previously, pmachapman (Peter Chapman) wrote…

NIT: Should this record be public? I notice the all three LocalBuildJobFactories declare this exact record.

Done. I went ahead and made it public. Feel free to push back if there's a reason for having it defined separately, @ddaspit.


src/Echo/src/EchoEngine/TranslationEngineService.cs line 183 at r6 (raw file):

                try
                {
                    //Wait for build to exist in the database before starting the build.

We will want to remove this once we move Echo to Serval.Machine.

Copy link
Copy Markdown
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 6 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on Enkidu93 and pmachapman).


src/Serval/src/Serval.Translation/Services/PlatformService.cs line 236 at r6 (raw file):

    public async Task<bool> BuildExistsAsync(string buildId, CancellationToken cancellationToken = default)
    {
        Build? build = await _builds.GetAsync(b => b.Id == buildId, cancellationToken);

I think there is an ExistsAsync method on IRepository.

@Enkidu93
Copy link
Copy Markdown
Collaborator

src/Serval/src/Serval.Translation/Services/PlatformService.cs line 236 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think there is an ExistsAsync method on IRepository.

Done. Yep 🤦‍♂️

Copy link
Copy Markdown
Contributor Author

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddaspit reviewed 1 file and all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on pmachapman).

@Enkidu93 Enkidu93 force-pushed the remove-hangfire-machine branch from a6f17fc to e8c58e6 Compare May 21, 2026 15:33
Copy link
Copy Markdown
Collaborator

@pmachapman pmachapman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@pmachapman reviewed 11 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93).


docker-compose.yml line 27 at r3 (raw file):

Do you think this is a good change?

I think so. If a PR requires a special point release build of machine.py, it should probably define it in its YAML and appSettings.json.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants