Skip to content

feat(config): add --config YAML configuration system with Hydra compose API#121

Open
coketaste wants to merge 24 commits into
developfrom
coketaste/v2-config-driven
Open

feat(config): add --config YAML configuration system with Hydra compose API#121
coketaste wants to merge 24 commits into
developfrom
coketaste/v2-config-driven

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

@coketaste coketaste commented May 3, 2026

Summary

Adds --config as a composable, Hydra-based YAML alternative to --additional-context JSON strings, available on both run and build commands.

  • New src/madengine/config/ module — HydraConfigLoader (Compose API), ConfigTranslator (YAML → internal context key mapping), ConfigValidator (cross-field checks), with a shared
    deep_merge utility
  • Hydra config groups under src/madengine/configs/ — default groups (platform, scheduler, hardware, launcher) and append-only groups (+profile, +env, +tools, +data, +build) with
    pre-built YAML fragments for all supported launchers, hardware profiles, and profiling tools
  • Mutual exclusion — --config cannot be combined with --additional-context / --additional-context-file; validated at CLI entry point on both run and build
  • Empty-value filtering — empty lists and dicts are excluded from the translated context so they don't shadow existing defaults
  • examples/configs/ — three comprehensive annotated templates (local.yaml, slurm.yaml, k8s.yaml) covering every supported field, plus 25 ready-to-run demo files organised by target
    (demo/local/, demo/slurm/, demo/k8s/)
  • Fixture fix — dummy_torchrun and dummy_torchrun_multi n_gpus corrected from "1" to "4" so local multi-GPU demo runs show accurate 1N×4G topology

Behaviour

  • Config group overrides
    madengine run --tags model --config scheduler=slurm --config launcher=torchrun

  • Append-only groups
    madengine run --tags model --config +profile=mi300x_8gpu --config +env=nccl_debug

  • User YAML file (highest priority after inline overrides)
    madengine run --config my_job.yaml

  • Mix file + inline overrides
    madengine run --config my_job.yaml --config distributed.nnodes=8

▎ When a user YAML sets a distributed launcher, distributed.enabled: true must also be set explicitly — the default launcher: none group sets enabled: false and setting a launcher key
▎ alone does not override it.

Test plan

  • pytest tests/unit/test_config_loader.py tests/unit/test_config_schema.py tests/unit/test_config_translator.py tests/unit/test_config_integration.py
    tests/unit/test_hydra_config_loader.py
  • madengine run --config examples/configs/demo/local/single-gpu.yaml completes and writes perf.csv
  • madengine run --config examples/configs/demo/local/multi-gpu-torchrun.yaml shows 1N×4G topology and torchrun in the launcher column
  • madengine run --config my_job.yaml --additional-context '{}' exits with INVALID_ARGS (exit code 4)
  • pre-commit run --all-files passes

coketaste and others added 24 commits May 2, 2026 22:19
Proposes a Hydra-based --config CLI argument with composable YAML
config groups (platform, scheduler, hardware, launcher, profile,
env, tools, data, build) to replace the limited --additional-context
JSON string approach. Coexists with existing args for backward compat.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
11-task TDD implementation plan covering Hydra Compose API integration,
config groups, ConfigTranslator, ConfigValidator, and CLI integration.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…a, build)

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add madengine.config package with HydraConfigLoader that uses Hydra's
Compose API to load and merge YAML configs from overrides and/or a user
YAML file. Includes stub ConfigTranslator and ConfigValidator for later
tasks, plus 15 passing unit tests covering arg parsing and config loading.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace stub to_additional_context() with full implementation that maps
docker/log_error subkeys, extracts model/build/platform/runtime to metadata,
passes through slurm/distributed/tools/env_vars, and promotes container_image
to MAD_CONTAINER_IMAGE. 16 TDD tests added and passing.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace stub validate() with full implementation covering slurm/k8s
conflict detection, distributed launcher requirements, nnodes validation,
unknown top-level key detection, and unsupported platform detection.
Use OmegaConf.to_container() to handle nested DictConfig values correctly.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add --config parameter to the run command accepting YAML files and Hydra
overrides. Config values provide defaults that CLI args override; additional
context from --config is deep-merged with --additional-context (CLI wins).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add --config parameter to the build command with the same YAML/Hydra
override support as the run command. Config values serve as defaults
that can be overridden by explicit CLI args (tags, registry, target_archs)
and --additional-context.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace merging behavior with strict mutual exclusion: --config cannot
be used alongside --additional-context or --additional-context-file.
Remove now-unused ast import and deep_merge import from run.py.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…sion

Replace the previous merge-based --config handling in the build command
with the same mutual exclusion pattern used in run.py: --config now
errors if combined with --additional-context or --additional-context-file.
Remove unused imports (ast, deep_merge) that were only needed for merging.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Brings in the large CLI/core refactor from develop (validators module,
restructured commands, etc.) as the base for the --config feature work.
Empty log_error_patterns and log_error_benign_patterns were being passed
as [] to additional_context, but the downstream validator rejects empty
arrays. Same for docker build_args/env_vars/mounts as empty dicts.
Now only non-empty collections are included in the translated context.
Document the new Hydra-based --config flag across all user-facing docs:
README, configuration guide, CLI reference, and usage guide.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…topology

dummy_torchrun and dummy_torchrun_multi had n_gpus="1", causing local
Docker runs with distributed.nproc_per_node=4 to show 1N×1G instead of
1N×4G. The model's n_gpus drives GPU device exposure to the container,
not nproc_per_node.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Add examples/configs/ directory with annotated templates (local, slurm,
  k8s) and ready-to-run demo files for all deployment targets
- Add pointers to examples/configs/ from README, usage.md, and
  configuration.md SLURM/K8s sections
- Remove nonexistent slurm fields (mail_user, mail_type, mem) from
  configuration.md; replace with accurate fields (exclude, constraint,
  exclusive, modules, network_interface, shared_workspace)
- Add callout in distributed training section explaining that
  distributed.enabled: true must be set explicitly when using --config

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
@coketaste coketaste self-assigned this May 3, 2026
Copilot AI review requested due to automatic review settings May 3, 2026 15:24
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants