vine: return recovery tasks#4428
Conversation
|
Let's talk about this a bit. What does the DAG scheduler actually need to know? 1 - That a file was removed after creation? |
|
And does the answer change if the DAG scheduler (and not the task scheduler) have responsibility for recovering from failures? |
|
The answer changes to 1 if the DAG scheduler is responsible for recovery from failures. |
|
Sorry for mixing my questions. Back to the first questions: Are you saying that the DAG scheduler actually cares about the files and not the task per se? |
|
Yes. The DAG scheduler cares about the completion of a recovery task only because it tells the DAG scheduler that a file was lost/pruned before but now it is available again. |
|
|
||
| /** Enable recovery tasks to be returned by vine_wait. | ||
| By default, recovery tasks are handled internally by the manager. **/ | ||
| int vine_enable_return_recovery_tasks(struct vine_manager *m); |
There was a problem hiding this comment.
I'd suggest using: vine_enable_recovery_tasks instead. Less to explain what the "return" means. The behaviour is then to return a would be recovery task that would produced the files lost.
There was a problem hiding this comment.
Hmm, I think that name would be confusing too. Recovery tasks are always enabled and used internally whenever temporary files are lost. This API is changing the behavior so that recovery tasks are given back to the caller rather than consumed internally. Which is (currently) a fringe use case. I think it's better to have a forbidding name for this feature...
There was a problem hiding this comment.
What about vine_enable_external_recovery_handling()?
| LIST_ITERATE(t->output_mounts, m) | ||
| { | ||
| if (m && m->file && m->file->original_producer_task_id > 0) { | ||
| return m->file->original_producer_task_id; |
There was a problem hiding this comment.
Should the original producer task id be an attribute of the struct vine_task instead?
There was a problem hiding this comment.
original_producer_task_id should be an attribute of vine_file and the original dicussion is in #4168. This variable is mainly used for tracking which original task produced a TEMP file, so it fits better in vine_file because recovery is driven by the lost file.
Proposed Changes
Goals:
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
make testRun local tests prior to pushing.make formatFormat source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)make lintRun lint on source code prior to pushing.