Extract and validate path artifacts during ingestion#167
Extract and validate path artifacts during ingestion#167tomvothecoder wants to merge 1 commit intoE3SM-Project:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for ingesting path-based artifacts by extracting path fields from
CaseDocs XML and mapping validated host paths into SimulationCreate.artifacts
during ingestion.
Changes:
- Extend CaseDocs parsing to extract
CASEROOT,RUNDIR,DOUT_S_ROOT, and
POSTRUN_SCRIPT. - Thread extracted path metadata through
ParsedSimulationand the main parser. - Attach validated artifact paths (output/archive/run script/postprocess script)
toSimulationCreate, with tests covering mapping and warn-and-omit behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| backend/app/features/ingestion/parsers/case_docs.py | Extract additional CaseDocs path values (CASEROOT/RUNDIR/DOUT_S_ROOT/POSTRUN_SCRIPT). |
| backend/app/features/ingestion/parsers/parser.py | Thread new path metadata fields into ParsedSimulation. |
| backend/app/features/ingestion/parsers/types.py | Add path-related fields to ParsedSimulation. |
| backend/app/features/ingestion/ingest.py | Validate/derive candidate paths and attach them as ArtifactCreate records. |
| backend/tests/features/ingestion/parsers/test_case_docs.py | Add parser tests for newly extracted path fields. |
| backend/tests/features/ingestion/test_ingest.py | Add ingestion tests for artifact mapping and warn-and-omit behavior. |
| def _attach_path_artifacts( | ||
| simulation: SimulationCreate, | ||
| parsed_simulation: ParsedSimulation, | ||
| ) -> SimulationCreate: | ||
| path_artifacts = _build_path_artifacts(parsed_simulation) | ||
| if not path_artifacts: | ||
| return simulation | ||
|
|
||
| return simulation.model_copy(update={"artifacts": path_artifacts}) | ||
|
|
||
|
|
||
| def _build_path_artifacts(parsed_simulation: ParsedSimulation) -> list[ArtifactCreate]: | ||
| execution_dir = parsed_simulation.execution_dir | ||
| path_artifacts: list[ArtifactCreate] = [] | ||
|
|
||
| output_path = _validate_existing_path( | ||
| parsed_simulation.output_path, | ||
| source_name="RUNDIR", | ||
| execution_dir=execution_dir, | ||
| ) | ||
| archive_path = _validate_existing_path( | ||
| parsed_simulation.archive_path, | ||
| source_name="DOUT_S_ROOT", | ||
| execution_dir=execution_dir, | ||
| ) | ||
| run_script_path = _derive_case_run_script_path(parsed_simulation.case_root) | ||
| run_script_path = _validate_existing_path( | ||
| run_script_path, | ||
| source_name="CASEROOT/.case.run", | ||
| execution_dir=execution_dir, | ||
| ) | ||
| postprocessing_path = _extract_postprocessing_script_path( | ||
| parsed_simulation.postprocessing_script, | ||
| execution_dir=execution_dir, | ||
| ) | ||
| postprocessing_path = _validate_existing_path( | ||
| postprocessing_path, | ||
| source_name="POSTRUN_SCRIPT", | ||
| execution_dir=execution_dir, | ||
| ) | ||
|
|
||
| _append_path_artifact(path_artifacts, ArtifactKind.OUTPUT, output_path) | ||
| _append_path_artifact(path_artifacts, ArtifactKind.ARCHIVE, archive_path) | ||
| _append_path_artifact(path_artifacts, ArtifactKind.RUN_SCRIPT, run_script_path) | ||
| _append_path_artifact( | ||
| path_artifacts, | ||
| ArtifactKind.POSTPROCESS_SCRIPT, | ||
| postprocessing_path, | ||
| ) | ||
|
|
||
| return path_artifacts | ||
|
|
||
|
|
||
| def _append_path_artifact( | ||
| artifacts: list[ArtifactCreate], kind: ArtifactKind, uri: str | None | ||
| ) -> None: | ||
| if uri is None: | ||
| return | ||
|
|
||
| artifacts.append(ArtifactCreate(kind=kind, uri=uri)) | ||
|
|
||
|
|
||
| def _derive_case_run_script_path(case_root: str | None) -> str | None: | ||
| normalized_case_root = _normalize_path_candidate(case_root) | ||
| if normalized_case_root is None: | ||
| return None | ||
|
|
||
| return str(Path(normalized_case_root) / ".case.run") | ||
|
|
||
|
|
||
| def _extract_postprocessing_script_path( | ||
| postprocessing_script: str | None, | ||
| execution_dir: str, | ||
| ) -> str | None: | ||
| normalized_script = _normalize_path_candidate(postprocessing_script) | ||
| if normalized_script is None: | ||
| return None | ||
|
|
||
| try: | ||
| tokens = shlex.split(normalized_script) | ||
| except ValueError: | ||
| logger.warning( | ||
| "Skipping POSTRUN_SCRIPT artifact for '%s': could not parse value '%s'.", | ||
| execution_dir, | ||
| normalized_script, | ||
| ) | ||
| return None | ||
|
|
||
| if not tokens: | ||
| return None | ||
|
|
||
| return tokens[0] | ||
|
|
||
|
|
||
| def _validate_existing_path( | ||
| path_value: str | None, | ||
| *, | ||
| source_name: str, | ||
| execution_dir: str, | ||
| ) -> str | None: | ||
| normalized_path = _normalize_path_candidate(path_value) | ||
| if normalized_path is None: | ||
| return None | ||
|
|
||
| candidate_path = Path(normalized_path).expanduser() | ||
| if not candidate_path.exists(): | ||
| logger.warning( | ||
| "Skipping %s artifact for '%s': path does not exist on ingest host: %s", | ||
| source_name, | ||
| execution_dir, | ||
| normalized_path, | ||
| ) | ||
| return None | ||
|
|
||
| return str(candidate_path) | ||
|
|
There was a problem hiding this comment.
_validate_existing_path() currently accepts any path that exists on the API host and then persists it as an artifact. Because ingest_archive() is used by /ingestions/from-upload (not admin-restricted), a malicious upload could cause the service to store (and later expose) arbitrary host filesystem paths that happen to exist (e.g. /etc/passwd), effectively turning ingestion into a filesystem-existence oracle. Consider gating path-artifact attachment behind an explicit flag passed from the API layer (enabled only for trusted HPC path ingests), or enforcing an allowlist (e.g., restrict to configured machine storage roots / derived CASEROOT subtree) before checking existence.
| def _validate_existing_path( | ||
| path_value: str | None, | ||
| *, | ||
| source_name: str, | ||
| execution_dir: str, | ||
| ) -> str | None: | ||
| normalized_path = _normalize_path_candidate(path_value) | ||
| if normalized_path is None: | ||
| return None | ||
|
|
||
| candidate_path = Path(normalized_path).expanduser() | ||
| if not candidate_path.exists(): | ||
| logger.warning( | ||
| "Skipping %s artifact for '%s': path does not exist on ingest host: %s", | ||
| source_name, | ||
| execution_dir, | ||
| normalized_path, | ||
| ) | ||
| return None | ||
|
|
||
| return str(candidate_path) |
There was a problem hiding this comment.
_validate_existing_path() only checks Path.exists(). For these artifacts, the expected path type is known (e.g., RUNDIR/DOUT_S_ROOT should be directories; .case.run and POSTRUN_SCRIPT should be files). Without validating is_dir()/is_file() (and ideally requiring absolute paths), ingestion can attach incorrect artifacts when a wrong-type or relative path happens to exist on the ingest host.
| @@ -50,6 +51,7 @@ def parse_env_case(env_case_path: str | Path) -> dict[str, str | None]: | |||
| "campaign": campaign, | |||
| "experiment_type": experiment_type, | |||
| "compset_alias": compset_alias, | |||
| "case_root": case_root, | |||
| } | |||
There was a problem hiding this comment.
parse_env_case() now returns case_root, but the function docstring’s “Dictionary with case metadata … including:” list doesn’t mention this key. Please update the docstring to reflect the new return shape so callers/tests have an accurate contract.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c2ddacd0e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| candidate_path = Path(normalized_path).expanduser() | ||
| if not candidate_path.exists(): |
There was a problem hiding this comment.
Resolve relative artifact paths before validation
_validate_existing_path checks Path(normalized_path).expanduser() directly, so any relative value from POSTRUN_SCRIPT/CASEROOT is resolved against the API process working directory instead of the simulation context. In archives where these entries are relative (a common shell-script pattern), valid artifacts in the case/execution tree are treated as missing and silently omitted, so ingestion loses artifact links even though the files exist.
Useful? React with 👍 / 👎.
Description
Adds issue #163 support for path-based artifacts by parsing values from CaseDocs XML, threading them through ingestion metadata, and mapping valid host paths to simulation artifacts during ingest.
Parse
CASEROOTfromenv_case.xmlParse
RUNDIR,DOUT_S_ROOT, andPOSTRUN_SCRIPTfromenv_run.xmlDerive run script artifact path as
${CASEROOT}/.case.runParse
POSTRUN_SCRIPTcommand with first-token extraction for script pathValidate candidate paths on ingest host; warn and omit artifacts for missing/invalid paths
Add parser and ingest tests for extraction, mapping, and warn-and-omit behavior
Closes [Enhancement]: Update parser to extract path artifacts #163
Checklist
Deployment Notes (if any)
No schema migration required.
No special deployment steps required.