Skip to content

Extract and validate path artifacts during ingestion#167

Open
tomvothecoder wants to merge 1 commit intoE3SM-Project:mainfrom
tomvothecoder:feature/163-path-artifacts
Open

Extract and validate path artifacts during ingestion#167
tomvothecoder wants to merge 1 commit intoE3SM-Project:mainfrom
tomvothecoder:feature/163-path-artifacts

Conversation

@tomvothecoder
Copy link
Copy Markdown
Collaborator

Description

Adds issue #163 support for path-based artifacts by parsing values from CaseDocs XML, threading them through ingestion metadata, and mapping valid host paths to simulation artifacts during ingest.

  • Parse CASEROOT from env_case.xml

  • Parse RUNDIR, DOUT_S_ROOT, and POSTRUN_SCRIPT from env_run.xml

  • Derive run script artifact path as ${CASEROOT}/.case.run

  • Parse POSTRUN_SCRIPT command with first-token extraction for script path

  • Validate candidate paths on ingest host; warn and omit artifacts for missing/invalid paths

  • Add parser and ingest tests for extraction, mapping, and warn-and-omit behavior

  • Closes [Enhancement]: Update parser to extract path artifacts #163

Checklist

  • Code follows project style guidelines
  • Self-reviewed code
  • No new warnings
  • Tests added or updated (if needed)
  • All tests pass (locally and CI/CD)
  • Documentation/comments updated (if needed)
  • Breaking change noted (if applicable)

Deployment Notes (if any)

No schema migration required.
No special deployment steps required.

@tomvothecoder tomvothecoder requested a review from Copilot April 14, 2026 21:27
@tomvothecoder tomvothecoder marked this pull request as ready for review April 14, 2026 21:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for ingesting path-based artifacts by extracting path fields from
CaseDocs XML and mapping validated host paths into SimulationCreate.artifacts
during ingestion.

Changes:

  • Extend CaseDocs parsing to extract CASEROOT, RUNDIR, DOUT_S_ROOT, and
    POSTRUN_SCRIPT.
  • Thread extracted path metadata through ParsedSimulation and the main parser.
  • Attach validated artifact paths (output/archive/run script/postprocess script)
    to SimulationCreate, with tests covering mapping and warn-and-omit behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backend/app/features/ingestion/parsers/case_docs.py Extract additional CaseDocs path values (CASEROOT/RUNDIR/DOUT_S_ROOT/POSTRUN_SCRIPT).
backend/app/features/ingestion/parsers/parser.py Thread new path metadata fields into ParsedSimulation.
backend/app/features/ingestion/parsers/types.py Add path-related fields to ParsedSimulation.
backend/app/features/ingestion/ingest.py Validate/derive candidate paths and attach them as ArtifactCreate records.
backend/tests/features/ingestion/parsers/test_case_docs.py Add parser tests for newly extracted path fields.
backend/tests/features/ingestion/test_ingest.py Add ingestion tests for artifact mapping and warn-and-omit behavior.

Comment on lines +394 to +509
def _attach_path_artifacts(
simulation: SimulationCreate,
parsed_simulation: ParsedSimulation,
) -> SimulationCreate:
path_artifacts = _build_path_artifacts(parsed_simulation)
if not path_artifacts:
return simulation

return simulation.model_copy(update={"artifacts": path_artifacts})


def _build_path_artifacts(parsed_simulation: ParsedSimulation) -> list[ArtifactCreate]:
execution_dir = parsed_simulation.execution_dir
path_artifacts: list[ArtifactCreate] = []

output_path = _validate_existing_path(
parsed_simulation.output_path,
source_name="RUNDIR",
execution_dir=execution_dir,
)
archive_path = _validate_existing_path(
parsed_simulation.archive_path,
source_name="DOUT_S_ROOT",
execution_dir=execution_dir,
)
run_script_path = _derive_case_run_script_path(parsed_simulation.case_root)
run_script_path = _validate_existing_path(
run_script_path,
source_name="CASEROOT/.case.run",
execution_dir=execution_dir,
)
postprocessing_path = _extract_postprocessing_script_path(
parsed_simulation.postprocessing_script,
execution_dir=execution_dir,
)
postprocessing_path = _validate_existing_path(
postprocessing_path,
source_name="POSTRUN_SCRIPT",
execution_dir=execution_dir,
)

_append_path_artifact(path_artifacts, ArtifactKind.OUTPUT, output_path)
_append_path_artifact(path_artifacts, ArtifactKind.ARCHIVE, archive_path)
_append_path_artifact(path_artifacts, ArtifactKind.RUN_SCRIPT, run_script_path)
_append_path_artifact(
path_artifacts,
ArtifactKind.POSTPROCESS_SCRIPT,
postprocessing_path,
)

return path_artifacts


def _append_path_artifact(
artifacts: list[ArtifactCreate], kind: ArtifactKind, uri: str | None
) -> None:
if uri is None:
return

artifacts.append(ArtifactCreate(kind=kind, uri=uri))


def _derive_case_run_script_path(case_root: str | None) -> str | None:
normalized_case_root = _normalize_path_candidate(case_root)
if normalized_case_root is None:
return None

return str(Path(normalized_case_root) / ".case.run")


def _extract_postprocessing_script_path(
postprocessing_script: str | None,
execution_dir: str,
) -> str | None:
normalized_script = _normalize_path_candidate(postprocessing_script)
if normalized_script is None:
return None

try:
tokens = shlex.split(normalized_script)
except ValueError:
logger.warning(
"Skipping POSTRUN_SCRIPT artifact for '%s': could not parse value '%s'.",
execution_dir,
normalized_script,
)
return None

if not tokens:
return None

return tokens[0]


def _validate_existing_path(
path_value: str | None,
*,
source_name: str,
execution_dir: str,
) -> str | None:
normalized_path = _normalize_path_candidate(path_value)
if normalized_path is None:
return None

candidate_path = Path(normalized_path).expanduser()
if not candidate_path.exists():
logger.warning(
"Skipping %s artifact for '%s': path does not exist on ingest host: %s",
source_name,
execution_dir,
normalized_path,
)
return None

return str(candidate_path)

Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_validate_existing_path() currently accepts any path that exists on the API host and then persists it as an artifact. Because ingest_archive() is used by /ingestions/from-upload (not admin-restricted), a malicious upload could cause the service to store (and later expose) arbitrary host filesystem paths that happen to exist (e.g. /etc/passwd), effectively turning ingestion into a filesystem-existence oracle. Consider gating path-artifact attachment behind an explicit flag passed from the API layer (enabled only for trusted HPC path ingests), or enforcing an allowlist (e.g., restrict to configured machine storage roots / derived CASEROOT subtree) before checking existence.

Copilot uses AI. Check for mistakes.
Comment on lines +488 to +508
def _validate_existing_path(
path_value: str | None,
*,
source_name: str,
execution_dir: str,
) -> str | None:
normalized_path = _normalize_path_candidate(path_value)
if normalized_path is None:
return None

candidate_path = Path(normalized_path).expanduser()
if not candidate_path.exists():
logger.warning(
"Skipping %s artifact for '%s': path does not exist on ingest host: %s",
source_name,
execution_dir,
normalized_path,
)
return None

return str(candidate_path)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_validate_existing_path() only checks Path.exists(). For these artifacts, the expected path type is known (e.g., RUNDIR/DOUT_S_ROOT should be directories; .case.run and POSTRUN_SCRIPT should be files). Without validating is_dir()/is_file() (and ideally requiring absolute paths), ingestion can attach incorrect artifacts when a wrong-type or relative path happens to exist on the ingest host.

Copilot uses AI. Check for mistakes.
Comment on lines 41 to 55
@@ -50,6 +51,7 @@ def parse_env_case(env_case_path: str | Path) -> dict[str, str | None]:
"campaign": campaign,
"experiment_type": experiment_type,
"compset_alias": compset_alias,
"case_root": case_root,
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_env_case() now returns case_root, but the function docstring’s “Dictionary with case metadata … including:” list doesn’t mention this key. Please update the docstring to reflect the new return shape so callers/tests have an accurate contract.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c2ddacd0e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +498 to +499
candidate_path = Path(normalized_path).expanduser()
if not candidate_path.exists():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Resolve relative artifact paths before validation

_validate_existing_path checks Path(normalized_path).expanduser() directly, so any relative value from POSTRUN_SCRIPT/CASEROOT is resolved against the API process working directory instead of the simulation context. In archives where these entries are relative (a common shell-script pattern), valid artifacts in the case/execution tree are treated as missing and silently omitted, so ingestion loses artifact links even though the files exist.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement]: Update parser to extract path artifacts

2 participants