feat(IS-667): Add automatic cleanup of outdated ExApp Docker images #837
aliz71 wants to merge 2 commits into nextcloud:main from
Conversation
Signed-off-by: Ali <ali.zahedi@fotograf.de>
Force-pushed from 7642402 to f756fe8
oleksandr-nc left a comment
Your "Key Design Decisions" section describes what you did, which is useful.
What we would like to see in a PR of this shape is what you considered and rejected, and why:
- `dangling=true` vs. per-ExApp ref matching: AppAPI owns the `exApps` table - we have ground truth for every installed ExApp's current image ref. Why not enumerate images on the daemon whose repo matches a known ExApp, exclude the current tag, and remove the rest older than N days? That is how we would actually solve #667 without relying on the cleanup-on-update toggle. What made you land on `dangling=true`?
- AppAPI-only vs. extending HaRP: You note "upstream HaRP does not support `/images/prune` yet." But upstream here is us - nextcloud/HaRP. We could add `POST /docker/exapp/images_cleanup` alongside the existing `docker_exapp_create`/`_start`/`_stop`/`_remove` endpoints; the pattern is sitting right there in `haproxy_agent.py`. What did you weigh when deciding to skip HaRP vs. doing a HaRP PR first?
- Hook in the CLI `Update` command vs. in `DockerActions::deployExAppHarp`/`deployExApp`: The current hook fires only when the flow passes through `occ app_api:app:update`. Was that intentional, or implementation-cost-driven?
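For concreteness, the per-ExApp ref-matching pass described in the first bullet might look roughly like this. This is a sketch only: the image listing and the current-ref list are stubbed inputs, and the `min_age_days` threshold is a hypothetical parameter, not anything in AppAPI today.

```python
from datetime import datetime, timedelta, timezone

def select_removable(daemon_images, current_refs, min_age_days=14):
    """Pick images whose repo matches a known ExApp but whose tag is not
    the currently deployed one, and which are older than min_age_days."""
    current = {}  # repo -> currently deployed tag (ground truth: exApps table)
    for ref in current_refs:
        repo, _, tag = ref.rpartition(":")
        current[repo] = tag

    cutoff = datetime.now(timezone.utc) - timedelta(days=min_age_days)
    removable = []
    for img in daemon_images:  # e.g. parsed from GET /images/json on the daemon
        repo, _, tag = img["ref"].rpartition(":")
        if repo not in current:
            continue  # not an ExApp image; leave other workloads alone
        if tag == current[repo]:
            continue  # currently deployed tag; keep
        if img["created"] > cutoff:
            continue  # too recent; keep as a rollback candidate
        removable.append(img["ref"])
    return removable
```

Unlike `dangling=true`, this catches old-but-still-tagged ExApp images while never touching images that belong to unrelated workloads on the same host.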
```php
try {
    $this->dockerActions->initGuzzleClient($daemon);
    $dockerUrl = $this->dockerActions->buildDockerUrl($daemon);
    $result = $this->dockerActions->pruneImages($dockerUrl, ['dangling' => ['true']]);
```
Let me trace a concrete scenario through this and make sure I understand it:
Admin installs my ExApp. Three weeks later they run `occ app_api:app:update some_exapp`, which pulls a new image. The "cleanup on update" toggle is OFF (it is off by default).
Another week passes, then this background job fires.
In that scenario, which images does this prune call actually remove on the daemon, and how much disk space does it reclaim for this ExApp?
With `dangling=true`, the old image (`some_exapp:1.0`) would not be removed; because it is still tagged, it is merely "unused." Only untagged images (`<none>:<none>`) are pruned under that filter.
I initially used `dangling=true` as a conservative choice, to avoid removing images that other workloads on the same Docker host might need. But since old images remain tagged, they are never caught by the dangling filter. The better approach may be to drop the filter and match `docker image prune -a`, which removes all images not referenced by a container; that would ensure the old 1.0 image is cleaned up.
Disk space reclaimed: in this scenario, effectively none - the still-tagged 1.0 image survives the filter.
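The difference between the two filter choices can be modeled in a few lines. This is an illustration of the semantics only - the real behavior is defined by the Docker Engine, not this code:

```python
def prunable(images, dangling_only=True):
    """Model which images an image-prune call would remove.

    dangling_only=True  ~ the PR's dangling=true filter: untagged images only.
    dangling_only=False ~ `docker image prune -a`: tagged-but-unused images too.
    """
    removed = []
    for img in images:
        if img["in_use"]:
            continue  # referenced by a container: always kept
        if dangling_only and img["tags"]:
            continue  # tagged but unused: kept under dangling=true
        removed.append(img["id"])
    return removed

# State after the update scenario above: 1.0 is tagged but unused.
after_update = [
    {"id": "old",   "tags": ["some_exapp:1.0"], "in_use": False},
    {"id": "new",   "tags": ["some_exapp:1.1"], "in_use": True},
    {"id": "layer", "tags": [],                 "in_use": False},
]
```

Here `prunable(after_update)` keeps the old tagged image and reclaims only the untagged layer, while `prunable(after_update, dangling_only=False)` removes both.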
```php
    return;
}
// Image cleanup only applies to Docker daemons
if ($daemonConfig->getAcceptsDeployId() !== $this->dockerActions->getAcceptsDeployId()) {
```
What concretely happens on a HaRP-fronted daemon when cleanup-on-update is enabled and this hook fires?
Is that the behavior you intended? If yes, what's the reasoning behind the asymmetry with the background job? If no, what should the code do?
This was unintentional. While the background job explicitly skips HaRP daemons via supportsPrune(), I missed adding the same check to the update path.
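The symmetric guard both code paths should share can be sketched as follows. The name follows the PR's `supportsPrune()`, but this Python model is illustrative only, and the `"docker-install"` deploy id and `"harp"` flag are assumed stand-ins for the daemon config fields:

```python
def supports_prune(daemon):
    # HaRP-fronted daemons expose no /images/prune passthrough yet, so both
    # the background job and the update hook must skip them the same way.
    return daemon["deploy_id"] == "docker-install" and not daemon["harp"]

def cleanup_on_update(daemon, prune):
    if not supports_prune(daemon):
        return False  # skip, exactly as the background job already does
    prune(daemon)
    return True
```

With this in place, enabling cleanup-on-update on a HaRP-fronted daemon is a clean no-op instead of an unintended code path.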
```php
$k8sRoles = [];
$oldImageName = '';
if ($daemonConfig->getAcceptsDeployId() === $this->dockerActions->getAcceptsDeployId()) {
    // Capture old image name before deploying the new version
```
I want to make sure I'm reading this right:
```php
$oldAppInfo = array_merge($appInfo, ['version' => $exApp->getVersion()]);
$oldDeployParams = $this->dockerActions->buildDeployParams($daemonConfig, $oldAppInfo);
$oldImageName = $this->dockerActions->buildBaseImageName($oldDeployParams['image_params'], $daemonConfig);
```

`$appInfo` here comes from the NEW app manifest (fetched a few lines up). You override `version` to the currently-installed one. But `buildDeployParams` and `buildBaseImageName` also consume `$appInfo['image_src']`, `$appInfo['image_name']`, and the daemon's `getDeployConfig()['registries']` mapping.
Can you describe, in terms of what $oldImageName ends up being, the outcome in these two scenarios?
(a) ExApp author shipped v1.0 as ghcr.io/me/app and later retagged v1.1 to ghcr.io/org/app in the new manifest. Admin upgrades.
(b) Admin added a new registries entry rewriting ghcr.io/me/* to a local mirror between installing v1.0 and updating to v1.1.
In each case: what image ref is passed into DELETE /images/<name>, what does Docker actually do with it, and what ends up on disk afterwards? I would like to understand this before we ship because both are realistic.
Thanks for catching this - it is an implementation error.
(a) `$oldImageName` resolves to the new image rather than the old one, because I reconstruct it from the new manifest instead of from the data recorded at the original install.
(b) The same problem: both the registry mapping and the image tag are derived from the current configuration and the new manifest, rather than from the originally deployed metadata.
Summary: in both cases there is no damage - Docker returns a 409 (image in use by the new container) and nothing is deleted - but no cleanup occurs either.
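One way to avoid reconstructing the old ref from the new manifest is to persist the actually-deployed image ref at install time and read it back on update. A sketch of that idea - the storage dict and helper names here are hypothetical, not AppAPI API:

```python
# deployed_refs stands in for per-ExApp metadata persisted at deploy
# time (e.g. alongside the exApps table); the key name is hypothetical.
deployed_refs = {}

def record_deploy(app_id, image_ref):
    # Store the ref that was actually pulled, registry rewrites included.
    deployed_refs[app_id] = image_ref

def old_image_for_update(app_id):
    # Read the recorded ref instead of rebuilding it from the NEW manifest,
    # so a registry move (ghcr.io/me -> ghcr.io/org) or a registries-mapping
    # change made after install cannot skew the result.
    return deployed_refs.get(app_id)

record_deploy("some_exapp", "ghcr.io/me/app:1.0")
```

Under this scheme, scenario (a) and (b) both hand `DELETE /images/<name>` the ref that was truly deployed, so the old image is actually removed.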
```php
    private readonly LoggerInterface $logger,
) {
    parent::__construct($time);
    $intervalDays = $this->appConfig->getValueInt(
```
Two questions on scheduling semantics - we want to document these for admins:

1. Admin sets `image_cleanup_interval_days = 0`. Your code does `max($intervalDays, 1) * SECONDS_PER_DAY` --> `setInterval(86400)`, and `run()` then no-ops via `isCleanupDisabled()`. End to end: does the job (a) not execute at all, or (b) execute every day and exit immediately inside `run()`? What does `occ background-job:list` show in each case?
2. Admin starts with `interval_days = 1`; the job runs at T, T+1d, T+2d, T+3d. At T+3.5d the admin changes it to `7`. When does the next execution fire - T+3.5d+7d, T+3d+7d, T+4d with an immediate no-op, or something else?
Q1: It’s (b). The job runs daily and exits immediately via run(). Due to the max(0, 1) logic, the constructor sets a 24-hour interval even if interval_days = 0. Because Nextcloud TimedJobs are registered in info.xml, they cannot be dynamically unregistered to achieve (a). The overhead is minimal, but the job remains active in occ background-job:list.
Q2: The constructor re-evaluates the interval every cron cycle (every 5 mins). If changed to 7 days at T+3.5d:
Next cron: Updates to setInterval(7 days).
Next run: Scheduled for T+10d (Last run at T+3d + 7d interval).
The change applies nearly instantly, and the countdown resets relative to the last execution, not the setting change.
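The arithmetic above can be checked with a small model, assuming standard TimedJob semantics (a job is due once now >= last_run + interval, with the interval re-read on each cron tick):

```python
SECONDS_PER_DAY = 86400

def interval_seconds(interval_days):
    # max(..., 1): a setting of 0 still yields a daily wake-up;
    # run() then exits immediately (case (b) above).
    return max(interval_days, 1) * SECONDS_PER_DAY

def next_run(last_run, interval_days):
    # Due once now >= last_run + interval; the countdown is therefore
    # relative to the last execution, not to the setting change.
    return last_run + interval_seconds(interval_days)
```

For the scenario in Q2: last run at T+3d, setting changed to 7 days at T+3.5d, so the next execution is due at T+10d.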
To provide full transparency regarding my process: I implemented this solution based on my analysis of the codebase and the Docker API documentation, using AI assistance to navigate the project structure and Nextcloud framework conventions. Given the constraints of the coding challenge, I did not have the opportunity to dive deeply into the Nextcloud developer documentation or to consult the team on architectural preferences, so I focused on a functional, idiomatic implementation that aligns with the visible patterns in the existing repository.
🚀 Summary
Implements automatic cleanup of outdated ExApp Docker images to manage disk space effectively.
Fixes: #667
🛠️ What was implemented
The solution introduces two complementary cleanup mechanisms:

1. **Scheduled background job** (`DockerImageCleanupJob`): periodically removes all dangling Docker images, with a configurable interval (defaulting to 7 days; set to 0 to disable).
2. **Cleanup on update**: an opt-in toggle (off by default) that removes the previous image when an ExApp is updated.

**Admin UI**: Both settings are exposed under a new "Docker image cleanup" section in the admin settings page.
🏗️ Approach & Refinement
This PR followed a Spec-Driven Development (SDD) workflow. The technical specification and implementation plan are included in `specs/docker-image-cleanup/`.

Development Highlights:

- Used `Http::STATUS_*` constants and `Util::humanFileSize()`.
- Used `getValueInt()` and `getValueBool()` for config retrieval.
- Registered the job in `Application.php` and refactored logic into descriptive private methods (e.g., `supportsPrune()`, `pruneImagesOnAllDaemons()`).
- Used the `TimedJob` framework for scheduling rather than manual timestamp tracking.

🧠 Key Design Decisions
- Uses the Docker `/images/prune` endpoint.
- Applies the `dangling=true` filter to ensure only unused layers/images are removed.

🧪 Test Plan
Automated Tests
- Unit tests cover `DockerImageCleanupJob` (8), `DockerActions` (11), and the update commands (6).

Manual Verification
- Setting the interval to `0` successfully disables the background job.
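As a footnote to the prune call discussed in this thread: `pruneImages($dockerUrl, ['dangling' => ['true']])` maps onto a plain Docker Engine API request, with the filters passed as a JSON-encoded map of lists in the query string. A Python sketch of how that request URL is assembled (URL-building only; no daemon is contacted):

```python
import json
from urllib.parse import urlencode

def build_prune_url(docker_base, filters):
    # The Docker Engine API expects filters as a JSON-encoded
    # map[string][]string in the query string of POST /images/prune.
    return f"{docker_base}/images/prune?{urlencode({'filters': json.dumps(filters)})}"

url = build_prune_url("http://localhost:2375", {"dangling": ["true"]})
```

Dropping the `dangling` key from `filters` (rather than setting it to `false`) is what would make the call behave like `docker image prune -a`, as proposed above.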