feat(config): add --config YAML configuration system with Hydra compose API#121
Open
coketaste wants to merge 24 commits into
Open
feat(config): add --config YAML configuration system with Hydra compose API#121coketaste wants to merge 24 commits into
coketaste wants to merge 24 commits into
Conversation
Proposes a Hydra-based --config CLI argument with composable YAML config groups (platform, scheduler, hardware, launcher, profile, env, tools, data, build) to replace the limited --additional-context JSON string approach. Coexists with existing args for backward compat. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
11-task TDD implementation plan covering Hydra Compose API integration, config groups, ConfigTranslator, ConfigValidator, and CLI integration. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…a, build) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add madengine.config package with HydraConfigLoader that uses Hydra's Compose API to load and merge YAML configs from overrides and/or a user YAML file. Includes stub ConfigTranslator and ConfigValidator for later tasks, plus 15 passing unit tests covering arg parsing and config loading. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace stub to_additional_context() with full implementation that maps docker/log_error subkeys, extracts model/build/platform/runtime to metadata, passes through slurm/distributed/tools/env_vars, and promotes container_image to MAD_CONTAINER_IMAGE. 16 TDD tests added and passing. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace stub validate() with full implementation covering slurm/k8s conflict detection, distributed launcher requirements, nnodes validation, unknown top-level key detection, and unsupported platform detection. Use OmegaConf.to_container() to handle nested DictConfig values correctly. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add --config parameter to the run command accepting YAML files and Hydra overrides. Config values provide defaults that CLI args override; additional context from --config is deep-merged with --additional-context (CLI wins). Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Add --config parameter to the build command with the same YAML/Hydra override support as the run command. Config values serve as defaults that can be overridden by explicit CLI args (tags, registry, target_archs) and --additional-context. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Replace merging behavior with strict mutual exclusion: --config cannot be used alongside --additional-context or --additional-context-file. Remove now-unused ast import and deep_merge import from run.py. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…sion Replace the previous merge-based --config handling in the build command with the same mutual exclusion pattern used in run.py: --config now errors if combined with --additional-context or --additional-context-file. Remove unused imports (ast, deep_merge) that were only needed for merging. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Brings in the large CLI/core refactor from develop (validators module, restructured commands, etc.) as the base for the --config feature work.
Empty log_error_patterns and log_error_benign_patterns were being passed as [] to additional_context, but the downstream validator rejects empty arrays. Same for docker build_args/env_vars/mounts as empty dicts. Now only non-empty collections are included in the translated context.
Document the new Hydra-based --config flag across all user-facing docs: README, configuration guide, CLI reference, and usage guide. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…topology dummy_torchrun and dummy_torchrun_multi had n_gpus="1", causing local Docker runs with distributed.nproc_per_node=4 to show 1N×1G instead of 1N×4G. The model's n_gpus drives GPU device exposure to the container, not nproc_per_node. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Add examples/configs/ directory with annotated templates (local, slurm, k8s) and ready-to-run demo files for all deployment targets - Add pointers to examples/configs/ from README, usage.md, and configuration.md SLURM/K8s sections - Remove nonexistent slurm fields (mail_user, mail_type, mem) from configuration.md; replace with accurate fields (exclude, constraint, exclusive, modules, network_interface, shared_workspace) - Add callout in distributed training section explaining that distributed.enabled: true must be set explicitly when using --config Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds --config as a composable, Hydra-based YAML alternative to --additional-context JSON strings, available on both run and build commands.
deep_merge utility
pre-built YAML fragments for all supported launchers, hardware profiles, and profiling tools
(demo/local/, demo/slurm/, demo/k8s/)
Behaviour
Config group overrides
madengine run --tags model --config scheduler=slurm --config launcher=torchrun
Append-only groups
madengine run --tags model --config +profile=mi300x_8gpu --config +env=nccl_debug
User YAML file (highest priority after inline overrides)
madengine run --config my_job.yaml
Mix file + inline overrides
madengine run --config my_job.yaml --config distributed.nnodes=8
▎ When a user YAML sets a distributed launcher, distributed.enabled: true must also be set explicitly — the default launcher: none group sets enabled: false and setting a launcher key
▎ alone does not override it.
Test plan
tests/unit/test_hydra_config_loader.py