
feat(IS-667): Add automatic cleanup of outdated ExApp Docker images#837

Closed
aliz71 wants to merge 2 commits into nextcloud:main from aliz71:IS-667_docker_image_cleanup

Conversation


@aliz71 aliz71 commented Apr 18, 2026

🚀 Summary

Implements automatic cleanup of outdated ExApp Docker images to manage disk space effectively.

Fixes: #667

🛠️ What was implemented

The solution introduces two complementary cleanup mechanisms:

  1. Background Prune Job (DockerImageCleanupJob): Periodically removes all dangling Docker images. Features a configurable interval (defaulting to 7 days; set to 0 to disable).
  2. Cleanup on Update: An optional mechanism that removes the specific old Docker image immediately following a successful ExApp update. This is toggled via an admin setting (disabled by default).

Admin UI: Both settings are exposed under a new "Docker image cleanup" section in the admin settings page.
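
To illustrate how the background job could be wired, here is a minimal sketch; the class name and the TimedJob/IAppConfig usage follow what this PR describes, but the config key name, default, and namespace are assumptions, and the per-daemon iteration is elided:

```php
<?php

declare(strict_types=1);

namespace OCA\AppAPI\BackgroundJob;

use OCP\AppFramework\Utility\ITimeFactory;
use OCP\BackgroundJob\TimedJob;
use OCP\IAppConfig;
use Psr\Log\LoggerInterface;

// Sketch of the background prune job described above. Assumes a config key
// image_cleanup_interval_days (default 7; 0 disables cleanup).
class DockerImageCleanupJob extends TimedJob {
	private const SECONDS_PER_DAY = 86400;

	public function __construct(
		ITimeFactory $time,
		private readonly IAppConfig $appConfig,
		private readonly LoggerInterface $logger,
	) {
		parent::__construct($time);
		$intervalDays = $this->appConfig->getValueInt('app_api', 'image_cleanup_interval_days', 7);
		// Clamp to at least one day so the job stays schedulable even when disabled.
		$this->setInterval(max($intervalDays, 1) * self::SECONDS_PER_DAY);
	}

	protected function run($argument): void {
		if ($this->appConfig->getValueInt('app_api', 'image_cleanup_interval_days', 7) === 0) {
			return; // cleanup disabled by the admin; the job no-ops
		}
		// ... iterate configured Docker daemons, skip those without prune
		// support (e.g. HaRP), and POST /images/prune on each.
	}
}
```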

🏗️ Approach & Refinement

This PR followed a Spec-Driven Development (SDD) workflow. The technical specification and implementation plan are included in specs/docker-image-cleanup/.

Development Highlights:

  • AI-Assisted & Manually Verified: Initial implementation via Claude Code, followed by a line-by-line AI logic review and manual audits of critical paths (error handling, Docker API integration).
  • Iterative Improvements:
    • Standards: Replaced magic numbers with Http::STATUS_* constants and utilized Util::humanFileSize().
    • Type Safety: Switched to getValueInt() and getValueBool() for config retrieval.
    • Clean Code: Centralized config keys in Application.php and refactored logic into descriptive private methods (e.g., supportsPrune(), pruneImagesOnAllDaemons()).
    • Architecture: Leveraged the native TimedJob framework for scheduling rather than manual timestamp tracking.

🧠 Key Design Decisions

  • Non-Blocking: Image cleanup failures are caught and logged; they never block the primary ExApp update flow.
  • Fault Tolerance: A failure on one Docker daemon does not prevent cleanup attempts on other configured daemons.
  • Compatibility: HaRP daemons are currently skipped as the upstream does not yet support the /images/prune endpoint.
  • Safety: Uses the standard dangling=true filter to ensure only unused layers/images are removed.
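
For reference, the dangling-only prune in the last point maps onto the Docker Engine API roughly like this (a sketch only; $guzzle and $dockerUrl stand in for the client and endpoint the PR already builds):

```php
// POST /images/prune with a JSON-encoded filter map, per the Docker Engine API.
// dangling=true restricts the prune to untagged (<none>:<none>) images only.
$response = $guzzle->post($dockerUrl . '/images/prune', [
	'query' => ['filters' => json_encode(['dangling' => ['true']])],
]);
$result = json_decode((string)$response->getBody(), true);
// The response reports 'ImagesDeleted' and 'SpaceReclaimed' (in bytes),
// which the job can log in readable form via Util::humanFileSize().
```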

🧪 Test Plan

Automated Tests

  • Unit Tests: 25 new tests covering DockerImageCleanupJob (8), DockerActions (11), and update commands (6).
  • Static Analysis:
    • Psalm: 0 errors
    • PHP CS Fixer: 0 issues

Manual Verification

  • Update Flow: Verified that enabling the "Cleanup on update" checkbox removes the old image after an ExApp update.
  • Cron Job: Verified that triggering cron with a set interval executes the prune command on the Docker daemon.
  • Disabled State: Verified that setting the interval to 0 successfully disables the background job.
  • Admin UI: Confirmed settings persist correctly in the database and display accurately in the UI.

Signed-off-by: Ali <ali.zahedi@fotograf.de>
@aliz71 aliz71 force-pushed the IS-667_docker_image_cleanup branch from 7642402 to f756fe8 on April 18, 2026 at 09:59
Contributor

@oleksandr-nc oleksandr-nc left a comment


Your "Key Design Decisions" section describes what you did, which is useful.

What we would like to see in a PR of this shape is what you considered and rejected, and why:

  1. dangling=true vs. per-ExApp ref matching: AppAPI owns the exApps table - we have ground truth for every installed ExApp's current image ref. Why not enumerate images on the daemon whose repo matches a known ExApp, exclude the current tag, and remove the rest older than N days? That is how we would actually solve #667 without relying on the cleanup-on-update toggle. What made you land on dangling=true?
  2. AppAPI-only vs. extending HaRP: You note "upstream HaRP does not support /images/prune yet." But upstream here is us - nextcloud/HaRP. We could add POST /docker/exapp/images_cleanup alongside the existing docker_exapp_create / _start / _stop / _remove endpoints; the pattern is sitting right there in haproxy_agent.py. What did you weigh when deciding to skip HaRP vs. doing a HaRP PR first?
  3. Hook in the CLI Update command vs. in DockerActions::deployExAppHarp / deployExApp: The current hook fires only when the flow passes through occ app_api:app:update. Was that intentional, or implementation-cost-driven?
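
To make alternative 1 concrete, per-ExApp ref matching could be sketched like this (illustrative only; $currentTagsByRepo, built from the exApps table, and the Guzzle client are assumed, and the "older than N days" check is elided):

```php
// GET /images/json lists every image on the daemon with its RepoTags.
$images = json_decode((string)$guzzle->get($dockerUrl . '/images/json')->getBody(), true);
foreach ($images as $image) {
	foreach ($image['RepoTags'] ?? [] as $repoTag) {
		[$repo, $tag] = explode(':', $repoTag, 2);
		// Only touch repos that belong to a known installed ExApp, and never
		// the tag currently recorded as deployed.
		if (!isset($currentTagsByRepo[$repo]) || $tag === $currentTagsByRepo[$repo]) {
			continue;
		}
		// An age cutoff could be applied here via $image['Created'].
		// DELETE /images/{name}; Docker answers 409 if a container still uses it.
		$guzzle->delete($dockerUrl . '/images/' . rawurlencode($repoTag));
	}
}
```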

try {
    $this->dockerActions->initGuzzleClient($daemon);
    $dockerUrl = $this->dockerActions->buildDockerUrl($daemon);
    $result = $this->dockerActions->pruneImages($dockerUrl, ['dangling' => ['true']]);
Contributor


Let me trace a concrete scenario through this and make sure I understand it:

Admin installs my ExApp. Three weeks later they run occ app_api:app:update some_exapp, which pulls a new image. The "cleanup on update" toggle is OFF (it is off by default).
Another week passes, then this background job fires.

In that scenario, which images does this prune call actually remove on the daemon, and how much disk space does it reclaim for this ExApp?

Author


With dangling=true, the old image (some_exapp:1.0) would not be removed; because it is still tagged, it is merely "unused." Only untagged images (<none>:<none>) are pruned under that filter.

I initially used dangling=true as a conservative approach to avoid removing images that other workloads on the same Docker host might need. However, if the old images remain tagged, they aren't caught by the dangling filter. Maybe the better approach is to remove the filter to match docker image prune -a, which removes all images not referenced by a running container. This ensures the old 1.0 image is cleaned up.

Disk space reclaimed: for this ExApp in that scenario, effectively none; the still-tagged old image survives the dangling filter, so only genuinely untagged layers on the daemon are freed.
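
If the filter is relaxed as suggested, the Engine API equivalent of docker image prune -a is setting the dangling filter to false rather than omitting it (a sketch, with the same assumed $guzzle/$dockerUrl as the surrounding code):

```php
// dangling=false prunes ALL images not referenced by a container, tagged or
// not, matching `docker image prune -a`. This would catch the old
// some_exapp:1.0 image, provided no container still references it.
$response = $guzzle->post($dockerUrl . '/images/prune', [
	'query' => ['filters' => json_encode(['dangling' => ['false']])],
]);
```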

return;
}
// Image cleanup only applies to Docker daemons
if ($daemonConfig->getAcceptsDeployId() !== $this->dockerActions->getAcceptsDeployId()) {
Contributor


What concretely happens on a HaRP-fronted daemon when cleanup-on-update is enabled and this hook fires?

Is that the behavior you intended? If yes, what's the reasoning behind the asymmetry with the background job? If no, what should the code do?

Author


This was unintentional. While the background job explicitly skips HaRP daemons via supportsPrune(), I missed adding the same check to the update path.

$k8sRoles = [];
$oldImageName = '';
if ($daemonConfig->getAcceptsDeployId() === $this->dockerActions->getAcceptsDeployId()) {
// Capture old image name before deploying the new version
Contributor


I want to make sure I'm reading this right:

$oldAppInfo = array_merge($appInfo, ['version' => $exApp->getVersion()]);
$oldDeployParams = $this->dockerActions->buildDeployParams($daemonConfig, $oldAppInfo);
$oldImageName = $this->dockerActions->buildBaseImageName($oldDeployParams['image_params'], $daemonConfig);

$appInfo here comes from the NEW app manifest (fetched a few lines up). You override version to the currently-installed one. But buildDeployParams and buildBaseImageName also consume $appInfo['image_src'], $appInfo['image_name'], and the daemon's getDeployConfig()['registries'] mapping.

Can you describe, in terms of what $oldImageName ends up being, the outcome in these two scenarios?

(a) ExApp author shipped v1.0 as ghcr.io/me/app and later retagged v1.1 to ghcr.io/org/app in the new manifest. Admin upgrades.

(b) Admin added a new registries entry rewriting ghcr.io/me/* to a local mirror between installing v1.0 and updating to v1.1.

In each case: what image ref is passed into DELETE /images/<name>, what does Docker actually do with it, and what ends up on disk afterwards? I would like to understand this before we ship because both are realistic.

Author


Thanks for catching this; I would call it an implementation error.

(a) $oldImageName resolves to the new image rather than the old one because I am reconstructing it from the new manifest instead of using the data from the original install.

(b) The same problem exists here: both the registry mapping and the image tag are derived from the current configuration/new manifest, rather than the originally deployed metadata.

Summary: in both cases there is no damage, because Docker returns a 409 error (image in use by the new container) and nothing is deleted. However, no cleanup happens either.
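
One way to fix this would be to persist the ref that was actually deployed and use it verbatim at cleanup time, rather than rebuilding it from the new manifest (a sketch with hypothetical helper names; deployed_image_ref and the setAppValue/getAppValue helpers are assumptions, not existing APIs):

```php
// At deploy time, record the exact image ref that was pulled for this ExApp:
$this->exAppService->setAppValue($exApp->getAppid(), 'deployed_image_ref', $imageName);

// At update time, read it back BEFORE deploying the new version, so registry
// rewrites or a repo change in the new manifest cannot alter what gets deleted:
$oldImageName = $this->exAppService->getAppValue($exApp->getAppid(), 'deployed_image_ref');
```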

private readonly LoggerInterface $logger,
) {
parent::__construct($time);
$intervalDays = $this->appConfig->getValueInt(
Contributor


Two questions on scheduling semantics - as we want to document these for the admin:

  1. Admin sets image_cleanup_interval_days = 0. Your code does max($intervalDays, 1) * SECONDS_PER_DAY --> setInterval(86400), and run() then no-ops via isCleanupDisabled(). End to end: does the job (a) not execute at all, or (b) execute every day and exit immediately inside run()? What does occ background-job:list show in each case?
  2. Admin starts with interval_days = 1, job runs at T, T+1d, T+2d, T+3d. At T+3.5d admin changes it to 7. When does the next execution fire - T+3.5d+7d, T+3d+7d, T+4d with an immediate no-op, or something else?

Author


Q1: It’s (b). The job executes daily and exits immediately inside run(). Because of the max($intervalDays, 1) clamp, the constructor sets a 24-hour interval even when interval_days = 0. Since Nextcloud TimedJobs are registered in info.xml, they cannot be dynamically unregistered to achieve (a). The overhead is minimal, but the job remains listed as active in occ background-job:list.

Q2: The constructor re-evaluates the interval every cron cycle (every 5 mins). If changed to 7 days at T+3.5d:

Next cron: Updates to setInterval(7 days).

Next run: Scheduled for T+10d (Last run at T+3d + 7d interval).

The change applies nearly instantly, and the countdown resets relative to the last execution, not the setting change.

Author

aliz71 commented Apr 19, 2026

To provide full transparency regarding my process: I implemented this solution based on my analysis of the codebase and the Docker API documentation, utilizing AI assistance to help navigate the specific project structure and Nextcloud framework conventions.

Given the constraints of the coding challenge, I did not have the opportunity to perform a deep dive into the exhaustive Nextcloud developer documentation or consult with the team on specific architectural preferences. Consequently, I focused on creating a functional, idiomatic implementation that aligns with the visible patterns in the existing repository.

@aliz71 aliz71 closed this Apr 20, 2026
@aliz71 aliz71 deleted the IS-667_docker_image_cleanup branch April 20, 2026 09:48


Development

Successfully merging this pull request may close these issues.

Automatic clean-up of the outdated ExApp Docker images
