vine: return recovery tasks #4428

Open
JinZhou5042 wants to merge 5 commits into cooperative-computing-lab:master from JinZhou5042:return-recovery-tasks

@JinZhou5042

Member

Proposed Changes

Goals:

  • The executor wants the manager to return recovery tasks on completion, so that it knows previously pruned files are present again and can prune the recovered files once their consumers finish.
  • When TaskVine returns a recovery task to the executor, the executor needs to know which original task it stands in for, so that it can identify the graph node the recovery task belongs to. Each original task corresponds to one node in the graph, whereas recovery tasks are created dynamically.
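
A minimal sketch of the second goal from the executor's side, assuming the executor keeps one record per graph node keyed by the original task id. The names `graph_node`, `returned_task`, `node_for_task`, and `on_task_returned` are hypothetical and are not part of the TaskVine API; only the idea of an original producer task id comes from this PR.

```c
#include <stddef.h>

/* Hypothetical executor-side record: one graph node per original task. */
struct graph_node {
	int original_task_id;  /* id of the task first submitted for this node */
	int outputs_available; /* 1 if the node's output files are believed to exist */
};

/* Hypothetical view of a task handed back by the manager. */
struct returned_task {
	int task_id;                   /* id of this (possibly recovery) task */
	int original_producer_task_id; /* assumed > 0 only for recovery tasks */
};

/* Find the node a returned task belongs to, or NULL if it is unknown. */
struct graph_node *node_for_task(struct graph_node *nodes, size_t n, const struct returned_task *t)
{
	/* A recovery task is looked up by the original task it stands in for. */
	int key = t->original_producer_task_id > 0 ? t->original_producer_task_id : t->task_id;
	for (size_t i = 0; i < n; i++) {
		if (nodes[i].original_task_id == key)
			return &nodes[i];
	}
	return NULL;
}

/* On completion, mark the node's outputs as present again. Returns 1 if a node matched. */
int on_task_returned(struct graph_node *nodes, size_t n, const struct returned_task *t)
{
	struct graph_node *node = node_for_task(nodes, n, t);
	if (!node)
		return 0;
	node->outputs_available = 1;
	return 1;
}
```

The linear scan keeps the sketch short; an executor with many nodes would more likely use a hash table keyed by the original task id.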

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test: Run local tests prior to pushing.
  • make format: Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python).
  • make lint: Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@JinZhou5042 JinZhou5042 self-assigned this Jun 23, 2026
@dthain

dthain commented Jun 23, 2026

Member

Let's talk about this a bit. What does the DAG scheduler actually need to know?

1 - That a file was removed after creation?
2 - That a file was re-created after it was lost?
3 - That a recovery task has been created following a loss?
4 - That a recovery task has completed?

@dthain

dthain commented Jun 23, 2026

Member

And does the answer change if the DAG scheduler (and not the task scheduler) has responsibility for recovering from failures?

@JinZhou5042

Member Author

Let's talk about this a bit. What does the DAG scheduler actually need to know?

1 - That a file was removed after creation?
2 - That a file was re-created after it was lost?
3 - That a recovery task has been created following a loss?
4 - That a recovery task has completed?

  4. The executor needs to know that a recovery task has completed, so that it can mark the corresponding output file(s) as available and then proceed with re-pruning.

@JinZhou5042

Member Author

And does the answer change if the DAG scheduler (and not the task scheduler) has responsibility for recovering from failures?

The answer changes to option 1 if the DAG scheduler is responsible for recovering from failures.

@dthain

dthain commented Jun 23, 2026

Member

Sorry for mixing my questions.

Back to the first questions:

Are you saying that the DAG scheduler actually cares about the files and not the task per se?

@JinZhou5042

Member Author

Yes. The DAG scheduler cares about the completion of a recovery task only because it tells the DAG scheduler that a file that was previously lost or pruned is available again.
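
The file-centric view described here can be sketched as per-file bookkeeping on the executor side. This is an illustration under assumptions, not code from the PR: `tracked_file`, `file_recovered`, and `consumer_finished` are hypothetical names, and the sketch assumes the executor counts the consumers that still need each temporary file.

```c
#include <stdbool.h>

/* Hypothetical executor-side bookkeeping for one temporary file. */
struct tracked_file {
	int pending_consumers; /* consumers that have not finished yet */
	bool available;        /* true while a replica is believed to exist */
};

/* Called when a recovery task that regenerates the file completes. */
void file_recovered(struct tracked_file *f)
{
	f->available = true;
}

/* Called when one consumer finishes.
   Returns true when the file has no consumers left and should be pruned now. */
bool consumer_finished(struct tracked_file *f)
{
	if (f->pending_consumers > 0)
		f->pending_consumers--;
	if (f->pending_consumers == 0 && f->available) {
		f->available = false; /* the caller is expected to issue the prune */
		return true;
	}
	return false;
}
```

Without the completion notification, `available` would stay false after a recovery and the recovered replica would never be pruned again, which is the gap the PR aims to close.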

Comment thread taskvine/src/manager/taskvine.h Outdated

/** Enable recovery tasks to be returned by vine_wait.
By default, recovery tasks are handled internally by the manager. **/
int vine_enable_return_recovery_tasks(struct vine_manager *m);

Member

I'd suggest using vine_enable_recovery_tasks instead; there is less to explain about what "return" means. The behaviour is then to return a would-be recovery task that would produce the lost files.

Member Author

Good point!

Member

Hmm, I think that name would be confusing too. Recovery tasks are always enabled and used internally whenever temporary files are lost. This API is changing the behavior so that recovery tasks are given back to the caller rather than consumed internally. Which is (currently) a fringe use case. I think it's better to have a forbidding name for this feature...

Member Author

What about vine_enable_external_recovery_handling()?
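
Whatever name is chosen, the behaviour under discussion is an opt-in switch: recovery tasks are consumed internally by default and are surfaced to the caller only after the switch is turned on. The sketch below models that with a mock manager. All `mock_` names are hypothetical stand-ins and do not reflect how the TaskVine manager is actually implemented.

```c
#include <stdbool.h>

enum mock_task_kind { MOCK_TASK_STANDARD, MOCK_TASK_RECOVERY };

struct mock_task {
	int id;
	enum mock_task_kind kind;
};

/* Stand-in for the manager; the flag is assumed to be off by default. */
struct mock_manager {
	bool return_recovery_tasks;
};

/* Stand-in for the opt-in call whose name is being debated in this thread. */
void mock_enable_return_recovery_tasks(struct mock_manager *m)
{
	m->return_recovery_tasks = true;
}

/* Decide whether a completed task is handed back to the caller of wait. */
bool mock_should_return(const struct mock_manager *m, const struct mock_task *t)
{
	if (t->kind == MOCK_TASK_RECOVERY)
		return m->return_recovery_tasks; /* otherwise consumed internally */
	return true; /* standard tasks are always returned */
}
```

Keeping the default off means existing applications never see tasks they did not submit.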

Comment thread taskvine/src/manager/vine_manager.h Outdated
LIST_ITERATE(t->output_mounts, m)
{
	if (m && m->file && m->file->original_producer_task_id > 0) {
		return m->file->original_producer_task_id;

Member

Should the original producer task id be an attribute of the struct vine_task instead?

Member Author

original_producer_task_id should be an attribute of vine_file, and the original discussion is in #4168. This variable is mainly used for tracking which original task produced a TEMP file, so it fits better in vine_file because recovery is driven by the lost file.
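
To illustrate the file-attribute design, here is a self-contained sketch of the lookup shown in the diff hunk above, with simplified stand-in structs. The `sketch_` types and the plain linked list are assumptions made so the example compiles on its own; the real code iterates with `LIST_ITERATE` over `t->output_mounts`. Returning 0 when no output file carries an id is also an assumption, since the hunk is cut off before that case.

```c
#include <stddef.h>

/* Simplified stand-ins for the TaskVine structs named in the hunk. */
struct sketch_file {
	int original_producer_task_id; /* > 0 once a producer has been recorded */
};

struct sketch_mount {
	struct sketch_file *file;
	struct sketch_mount *next;
};

struct sketch_task {
	int task_id;
	struct sketch_mount *output_mounts; /* head of the output mount list */
};

/* Return the original producer id recorded on the first output file that
   has one, or 0 (assumed sentinel) if no output file carries an id. */
int sketch_original_producer_id(const struct sketch_task *t)
{
	for (const struct sketch_mount *m = t->output_mounts; m; m = m->next) {
		if (m->file && m->file->original_producer_task_id > 0)
			return m->file->original_producer_task_id;
	}
	return 0;
}
```

Because the id lives on the file, any recovery task that regenerates that file can be traced back to the same original task without storing extra state on each task.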

@JinZhou5042 JinZhou5042 mentioned this pull request Jun 25, 2026
7 tasks
3 participants