Skip to content

Plain MD restart support#1884

Open
jthorton wants to merge 28 commits intomainfrom
md_restarts
Open

Plain MD restart support#1884
jthorton wants to merge 28 commits intomainfrom
md_restarts

Conversation

@jthorton
Copy link
Copy Markdown
Collaborator

@jthorton jthorton commented Mar 23, 2026

Fixes #1719, #1720, #1721
This PR splits up the plain MD protocol unit into a setup and simulation unit and adds the ability to restart failed simulations.

Notes

  • The simulation must reach the production stage to generate a restart file, failures at the equilibration stages will not trigger a restart and will start again.
  • Restarts using the state checkpoint of openmm and are portable between hardware
  • The versions of openfe, gufe and openmm are required to be the same inorder for restarts to be safe

Checklist

  • All new code is appropriately documented (user-facing code must have complete docstrings).
  • Added a news entry, or the changes are not user-facing.
  • Ran pre-commit: you can run pre-commit locally or comment on this PR with pre-commit.ci autofix.

Manual Tests: these are slow so don't need to be run every commit, only before merging and when relevant changes are made (generally at reviewer-discretion).

Developers certificate of origin

@jthorton
Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

@IAlibay IAlibay self-requested a review March 23, 2026 13:21
Copy link
Copy Markdown
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. There's a few areas where it might be good to enhance things, especially when it comes maybe just sticking on the checkpoint reporter the whole way through the simulation, but they wouldn't affect the overall organization of the code.

return pdbs


class PlainMDProtocol(gufe.Protocol):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] It may be good to take the opportunity to move Protocols, results, settings and units to their own files.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played around with this and ended up having a cyclic import issue as the PlainMDProtocol is a type hint in the PlainMDSetupUnit maybe we can look at this later?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:/ that probably needs fixing - can you open an issue? We usually use gufe.Protocol as the type hint for units so that you can create multiple protocols using the same units.

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
# g. Save the system and positions to file
system_outfile = shared_basepath / "system.xml.bz2"
serialization.serialize(stateA_system, system_outfile)
# not really need if we save out the pre-minimized file
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I somewhat agree, however PDBs are low precision, savin the npy is good in my opinion.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed that comment.

traj.save_pdb(shared_basepath / output_settings.minimized_structure)
# equilibrate
# NVT equilibration
if equil_steps_nvt:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope] Thinking about this - is there any reason we can't just add the checkpoint reporter at this stage? It's more likely that someone running the PlainMD protocol would be running longer equilibrations / productions, so it might be nice to be able to restart at any point during it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah thats a good idea, and we could use the total step count to work out which stage the simulation was at?

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
stateA_topology, stateA_positions, file=f, keepIds=True
)

# 10. Get platform
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] The comments will need to be updated once the structure is settled.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if x.getName() == "MonteCarloBarostat":
x.setFrequency(0)

simulation.context.setVelocitiesToTemperature(to_openmm(temperature))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[scope / nit] it would definitely be good to split this up into method calls or something like that. There's a lot of code in an "else" section.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jthorton
Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

@jthorton
Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 96.15385% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.59%. Comparing base (7959cc6) to head (2f5ee38).

Files with missing lines Patch % Lines
src/openfe/protocols/openmm_md/plain_md_methods.py 95.50% 8 Missing ⚠️
...fe/tests/protocols/openmm_md/test_plain_md_slow.py 0.00% 2 Missing ⚠️
src/openfe/protocols/openmm_afe/base_afe_units.py 50.00% 1 Missing ⚠️
src/openfe/protocols/openmm_septop/base_units.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1884      +/-   ##
==========================================
- Coverage   94.78%   90.59%   -4.20%     
==========================================
  Files         210      211       +1     
  Lines       18841    19028     +187     
==========================================
- Hits        17859    17238     -621     
- Misses        982     1790     +808     
Flag Coverage Δ
fast-tests 90.59% <96.15%> (?)
slow-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jthorton
Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

@jthorton jthorton requested a review from IAlibay April 22, 2026 14:16
Copy link
Copy Markdown
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an initial half of the review (only got to the docstring of _run_MD.

Comment thread docs/reference/api/openmm_md.rst
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do this as a separate file? There's a few places where we want to keep uncharged inputs and I'd prefer untangle that from this PR (mostly so we don't have to think about two things at the same time).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert this for now, it was just very slow while doing local testing and I'll come back to it in a separate PR.

# returns a dict of repeat_id: sorted list of ProtocolUnitResult
return repeats

def _validate(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm being incredibly nit picky on this one - so please push back. Any chance you could move this method above _create.

This is purely an asthetic thing of "that's the order we've been doing for the other Protocols".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem done!

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py Outdated
def _validate(
self,
stateA: ChemicalSystem,
stateB: ChemicalSystem,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a check that stateA is stateB?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, done and added a test.

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py
# This technically should be NotImplementedError
# but gufe.Protocol.validate calls `_validate` wrapped around an
# except for NotImplementedError, so we can't raise it here
raise NotImplementedError("Can't extend simulations yet")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅 this doesn't match the comment - we've been using ValueError here because NotImplementedError gets squashed in validate.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't want to change the error that was raised in the protocol but I guess we are breaking a lot of stuff here, updated!

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py Outdated
Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py Outdated
Copy link
Copy Markdown
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing, otherwise it looks good to me!

Worker method to set the temperature, barostat and run dynamics and save final structure output.
"""
# set the velocities to temperature
simulation.context.setVelocitiesToTemperature(to_openmm(temperature))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing this you are reassigning velocities even if you're in the middle of a restart.

During a restart, would loadState not assign velocities that we want to continue from?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch! I have added another flag to run dynamics to control if we should reassign the velocities, this introduces another parameter to track the state, similar to production started but I think this is clear?

Comment thread src/openfe/protocols/openmm_md/plain_md_methods.py Outdated
jthorton and others added 6 commits April 24, 2026 09:45
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
@jthorton
Copy link
Copy Markdown
Collaborator Author

pre-commit.ci autofix

Copy link
Copy Markdown
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks amazing - just one thing and one question and it's good to go!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a news entry please?

output_settings: MDOutputSettings,
verbose: bool = True,
output_path: None | pathlib.Path = None,
reinitialize_velocities: bool = True,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

)

# add the checkpoint reporter so we can recover during the equilibration / production phases
if output_settings.checkpoint_storage_filename:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure it doesn't but just in case - could you confirm that on restart the CheckpointReporter doesn't reset the global step count?

output_path=output_path,
reinitialize_velocities=reinitialize_velocities,
)
# if we have run this stage we then need to reinitialize velocities in the next stages
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, but frankly I don't think there's a way around this.

@github-actions
Copy link
Copy Markdown

No API break detected ✅

@atravitz atravitz added this to the 1.11.0 milestone Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Switch PlainMDProtocol checkpointing to a serialized State for better inter-machine compatibility

3 participants