Add Colocated Python support for Pathways#1353
Conversation
2faf540 to
6eac867
Compare
muyangyuapple
left a comment
There was a problem hiding this comment.
Also can you please add it to the lws pathways job? https://github.pie.apple.com/foundation-models/axlearn/blob/3103ca9e3967e73347d89325ab5887f0a0efb90a/axlearn/cloud/gcp/runners/__init__.py#L63
This is used in long running inference service
| ] | ||
| env: | ||
| - "DOCKER_BUILDKIT=1" | ||
| - "DOCKER_BUILDKIT=1\"""" |
There was a problem hiding this comment.
Do we really need to branching the logic in the bundler.py file?
I think we just add some extra dependencies for colocated python. We can use a env_var in the Dockerfile to control that like this: https://github.com/apple/axlearn/blob/main/Dockerfile#L104
Which can be passed in via something like --bundler_spec=INSTALL_PATHWAYS_JAXLIB=true \
There was a problem hiding this comment.
To enable colocated python, an extra image for the colocated python sidecar running on each worker pod needs to be built (so 2 images get built). This implementation works the same for the user - to enable it you just need to pass in --bundler_spec=enable_colocated_python=True.
There was a problem hiding this comment.
Can head pod and sidecar on worker share the same image?
If there are two images to build, we should define another bundler called "sidecar_bundler" here. Adding logics to bundler that is used only by a subset of jobs will make bundler hard to maintain in the long run.
There was a problem hiding this comment.
Refactored enable_colocated_python field to generic sidecars field. Now the flag to enable colocated python is:
--bundler_spec=sidecars=colocated-python
|
|
||
| def __init__(self, cfg: Config, *, bundler: Bundler): | ||
| super().__init__(cfg) | ||
| self._enable_colocated_python = getattr(bundler.config, "enable_colocated_python", False) |
There was a problem hiding this comment.
enable_colocated_python should be an attribute in PathwaysReplicatedJob
Also logically, when you use PathwaysColocatedPythonPlugin, you implies that colocated python is already enabled? So this class should not have a attribute call self._enable_colocated_python?
There was a problem hiding this comment.
PathwaysColocatedPythonPlugin is for job types which support colocated python but doesn't require it to be enabled; the properties pathways_server_image and pathways_proxy_image return different values based on whether colocated python is enabled or disabled. Later on, it could be worth moving more pathways-specific logic to PathwaysColocatedPythonPlugin and making it just PathwaysPlugin - currently PathwaysReplicatedJob and PathwaysLeaderWorkerTemplate have a lot of duplicated logic.
| if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS: | ||
| raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}") | ||
|
|
||
| self._colocated_python = cfg.colocated_python.instantiate(bundler=bundler) |
There was a problem hiding this comment.
We should instantiate it only when colocated python is enabled?
There was a problem hiding this comment.
It's still used when colocated python is disabled to get pathways_server_image and pathways_proxy_image.
It's already implemented in |
f674e10 to
276cd87
Compare
8a84231 to
463c966
Compare
Co-authored-by: lkolluru05 <lkolluru@google.com>
463c966 to
e3ac888
Compare
Disabled by default. Can be enabled by passing
--bundler_spec=sidecars=colocated-pythonwhen launching a Pathways job or LWS job. Both artifact registry and CloudBuild bundlers are supported.This PR is a refactored version of #1350