
Perf optimization #1

Merged
wrignj08 merged 18 commits into main from perf-optimization on May 4, 2026

Conversation

Contributor

@wrignj08 commented on May 4, 2026

No description provided.

wrignj08 added 18 commits on May 1, 2026 at 20:18

Performance: NLUM benchmark 85.6s -> 40.1s (53% less wall time, a
~2.1x speedup). Public API unchanged; outputs bit-identical to baseline
on all bench scenarios.

Algorithmic changes (multiclean/utils.py, multiclean/multiclean.py):
* Replace float32 smoothed_labels (4 bytes/px) with uint8/uint16 code
  array (1-2 bytes/px). Per-class equality scan in build_invalid_mask
  is now 2-4x cheaper in memory bandwidth.
* Fold find_small_islands into build_invalid_mask, OR-reducing each
  per-class mask in flight (see the sketch after this list). Replaces a
  Dict[int, ndarray] of K bool masks (~43 GB peak on NLUM) with a single
  shared invalid_mask.
* In-place fill in fill_invalids (no smoothed_labels.copy()).
* distance_transform_edt(return_distances=False) skips scipy's
  internal float64 distance allocation.
* (array == c).view(np.uint8) instead of .astype(np.uint8) -- no
  bool->uint8 conversion copy.
* Drop redundant np.isin scan in fill_invalids (every non-NaN pixel
  is already a valid class label by construction).
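
A sketch of the fused scan (the connectivity choice and exact
small-island criterion here are assumptions, not the repo's code):

    import cv2
    import numpy as np

    def build_invalid_mask(codes, classes, min_island_size):
        # One shared bool mask, OR-reduced in flight; no Dict[int, ndarray]
        # of K per-class masks is ever materialised.
        invalid = np.zeros(codes.shape, dtype=bool)
        for c in classes:
            # .view(np.uint8) reinterprets the bool buffer in place where
            # .astype(np.uint8) would allocate and copy.
            class_mask = (codes == c).view(np.uint8)
            n, labels, stats, _ = cv2.connectedComponentsWithStats(
                class_mask, connectivity=8
            )
            # Label 0 is the background; components smaller than
            # min_island_size are OR-ed straight into the shared mask.
            small = np.flatnonzero(
                stats[1:, cv2.CC_STAT_AREA] < min_island_size
            ) + 1
            if small.size:
                invalid |= np.isin(labels, small)
        return invalid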

Edge-case fixes (also exercised by new tests):
* float64, int64, and int32 (values > 2^24) inputs are now preserved
  exactly; see the check after this list. The previous implementation
  routed everything through float32 and silently lost precision (and
  emitted a RuntimeWarning) on large ints.
* All-NaN float input with fill_nan=True now deterministically returns
  NaN instead of relying on np.empty's uninitialised memory.
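
A minimal check of the dtype contract (assuming the array is
clean_array's first positional argument):

    import numpy as np
    from multiclean import clean_array

    # 2**24 + 1 is the first integer float32 cannot represent exactly, so
    # any float32 round-trip would corrupt these values.
    arr = np.arange(2**24, 2**24 + 16, dtype=np.int64).reshape(4, 4)
    out = clean_array(arr)
    assert out.dtype == np.int64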

Tests (tests/test_multiclean.py):
* test_all_nan_fill_nan_true_is_deterministically_nan
* test_dtype_preserved_for_float64_and_large_ints
* test_many_classes_exercises_uint16_code_path
* test_subset_targets_leave_multiple_background_classes_untouched

Tooling:
* .pre-commit-config.yaml -- ruff-check, ruff-format, pytest pre-push.
* .github/workflows/ci.yml -- ruff + mypy + pytest on push/PR.
* pyproject.toml -- add mypy + pre-commit dev deps, [tool.mypy].

Locks in that clean_array(smooth_edge_size=3, min_island_size=5) on the
full Landsat cloud-and-cloud-shadow mask used by notebooks/Cloud
example.ipynb stays bit-identical to the output produced by the
original main-branch implementation.

The expected output (1.7 MB compressed, ~63 MB raw uint8) was generated
on main and lives at tests/data/landsat_expected.npz. The test loads
it back and asserts np.array_equal against the current implementation's
output. Skipped automatically if the fixture or the input tif is missing.

Test runs in ~3 s on the full array (8011x7901, 4 classes).
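
The shape of that test, roughly (the input filename and the npz key are
assumptions):

    from pathlib import Path

    import cv2
    import numpy as np
    import pytest

    from multiclean import clean_array

    DATA = Path(__file__).parent / "data"
    EXPECTED = DATA / "landsat_expected.npz"
    INPUT = DATA / "cloud_shadow_mask.tif"  # hypothetical filename

    @pytest.mark.skipif(
        not (EXPECTED.exists() and INPUT.exists()),
        reason="Landsat fixture or input tif not present",
    )
    def test_landsat_cloud_shadow_matches_main_branch_output():
        mask = cv2.imread(str(INPUT), cv2.IMREAD_UNCHANGED)
        expected = np.load(EXPECTED)["expected"]  # key name assumed
        out = clean_array(mask, smooth_edge_size=3, min_island_size=5)
        assert np.array_equal(out, expected)
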
Tests should own their fixtures. The cloud/shadow mask is the input
for both the regression test and the notebooks/Cloud example.ipynb
demo, so move it to tests/data/ as the canonical location and have
the notebook reach into it via Path().cwd().parent / "tests" / "data".

No code change; only file relocation and notebook path adjustment.

Reverts the notebook half of the prior move: the cloud/shadow tif now
lives in BOTH notebooks/data/ (so notebooks/Cloud example.ipynb stays
self-contained for users following the example) and tests/data/ (so
the regression test owns its own fixture and does not reach across
the repo). 3.2 MB duplicated, but each side stands alone.

Drops the scipy dependency entirely (the EDT call was its only
consumer) and makes the nearest-neighbour fill ~3.4x faster on the
Landsat cloud/shadow notebook example (fill stage 2.05s -> 0.60s; whole
clean_array 2.48s -> 1.07s, a ~2.3x end-to-end speedup).

Mathematically equivalent: cv2.distanceTransformWithLabels under
DIST_MASK_PRECISE returns an exact L2 nearest-source assignment. The
only difference vs scipy.ndimage.distance_transform_edt is the choice of
which equidistant source pixel wins a tie; both pick a source at the
identical minimum distance. On the Landsat fixture this affects ~90k of
the 354k invalid pixels, every one of them a tied-distance case.

The approach is uniformly faster across image sizes >=512x512 (2.5-4.2x
on the fill stage; see edt_bench in the perf-optimization scratch space).
For sizes >=1024x1024 with synthetic blocky data the outputs are even
byte-identical because the layout has no tied-distance configurations.
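
For reference, the cv2-based fill reduces to something like this sketch
(``nearest_valid_fill`` and the label-to-index recovery are
illustrative, not the repo's exact code):

    import cv2
    import numpy as np

    def nearest_valid_fill(codes, invalid_mask):
        # cv2 measures distance to the nearest *zero* pixel, so valid
        # pixels must be 0 in the source image and invalid pixels 1.
        src = invalid_mask.view(np.uint8)
        # DIST_LABEL_PIXEL gives every valid (zero) pixel its own label id;
        # DIST_MASK_PRECISE makes the L2 assignment exact, not chamfered.
        _, labels = cv2.distanceTransformWithLabels(
            src, cv2.DIST_L2, cv2.DIST_MASK_PRECISE,
            labelType=cv2.DIST_LABEL_PIXEL,
        )
        # Each valid pixel carries its own label, so reading the label map
        # at the valid pixels recovers label -> flat index with no
        # assumption about cv2's label ordering.
        flat_labels = labels.ravel()
        valid_idx = np.flatnonzero(~invalid_mask.ravel())
        lut = np.zeros(int(flat_labels.max()) + 1, dtype=np.intp)
        lut[flat_labels[valid_idx]] = valid_idx
        nearest = codes.ravel()[lut[flat_labels]].reshape(codes.shape)
        return np.where(invalid_mask, nearest, codes)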

Updated tests/data/landsat_expected.npz to lock in the cv2 tie-break
choices so the regression test stays a meaningful baseline going
forward.
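
Regenerating the fixture after a deliberate behaviour change stays a
short script (filename and npz key assumed as above):

    import cv2
    import numpy as np
    from multiclean import clean_array

    mask = cv2.imread("tests/data/cloud_shadow_mask.tif", cv2.IMREAD_UNCHANGED)
    out = clean_array(mask, smooth_edge_size=3, min_island_size=5)
    np.savez_compressed("tests/data/landsat_expected.npz", expected=out)
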
Three additions targeting gaps the cv2 EDT swap exposed:

* test_fill_works_on_tiny_image_with_one_invalid_pixel -- cv2 has
  minimum-size requirements for some operations; pin a 4x4 array with a
  single invalid pixel through the full fill path.
* test_single_valid_pixel_propagates_to_all_invalid -- one valid source
  surrounded by an all-NaN grid (with fill_nan=True); also exercises
  the ``int(valid_labels.max()) + 1`` lookup-size calculation when the
  label table has exactly one real entry. Sketched after this list.
* Renamed test_landsat_cloud_shadow_matches_main_branch_output to
  test_landsat_cloud_shadow_regression -- the fixture was regenerated
  from this branch when we swapped in cv2, so "matches main branch
  output" is no longer accurate. New docstring documents how to
  regenerate the fixture if the implementation is ever deliberately
  changed again.
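
A sketch of the second test (the explicit kwargs are assumptions,
chosen so the lone pixel isn't culled as a small island):

    import numpy as np

    from multiclean import clean_array

    def test_single_valid_pixel_propagates_to_all_invalid():
        arr = np.full((8, 8), np.nan, dtype=np.float32)
        arr[3, 4] = 2.0  # the lone valid source
        out = clean_array(arr, fill_nan=True, min_island_size=1)
        assert np.array_equal(out, np.full((8, 8), 2.0, dtype=np.float32))
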
The Changelog follows Keep a Changelog format with the in-progress
optimisation work captured under [Unreleased]. Older entries (0.1.0
initial release, 0.2.0 fill_nan addition) are reconstructed from git
history.

Linked from the README via an absolute GitHub URL so it renders as a
working link on PyPI as well as GitHub. Also exposed via
``[project.urls] Changelog`` in pyproject.toml so package metadata
(and tools like ``uv tree``, GitHub repository sidebar, etc.) point at
the same canonical location, mirroring the iiq2img layout.

The ~43 GB figure was naive math (147 bool masks * 292 MB each), not
measured RSS. macOS memory compression (and the run-length structure of
bool masks) made the dict-of-K-masks approach much cheaper in resident
memory than that arithmetic suggested. Replace with the actually
measured peak RSS savings: 5.4 GB -> 2.8 GB on the 4-class Landsat
example, with smaller absolute savings on the 147-class raster.
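
For context, peak RSS can be sampled in-process like this (a sketch;
not necessarily how the numbers above were taken):

    import resource
    import sys

    def peak_rss_gb():
        # ru_maxrss is reported in bytes on macOS but kilobytes on Linux.
        rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return rss * (1 if sys.platform == "darwin" else 1024) / 1e9
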
Closes the [Unreleased] section in CHANGELOG with a 0.3.0 release dated
today. Headline changes: ~2x faster clean_array on Landsat-scale inputs,
strict input-dtype preservation (no more silent float32 round-trip),
and the scipy runtime dependency dropped (cv2 now handles the
nearest-valid fill).

* Drop SciPy from the "Fast: NumPy + OpenCV + SciPy" feature line --
  it's no longer a runtime dependency since the cv2-fill swap.
* Correct "elliptical kernel" -> "circular kernel" in the How It Works
  section. ``create_circle_kernel`` thresholds on Euclidean distance
  from centre, so the structuring element is circular, not elliptical.
* Tighten the API-Reference return-value note: the output now matches
  the input shape AND dtype (the prior wording was carried over from
  when float64 inputs got silently downcast to float32 internally).
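
A plausible shape for ``create_circle_kernel`` given that description
(an illustrative sketch, not the repo's code):

    import numpy as np

    def create_circle_kernel(radius):
        # Threshold squared Euclidean distance from the centre: the result
        # is a disk (circular) structuring element, not an ellipse.
        y, x = np.ogrid[-radius : radius + 1, -radius : radius + 1]
        return (x * x + y * y <= radius * radius).view(np.uint8)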

Verified the Quick Start code block runs as written; all referenced
files (assets/land_use_before_after.png, the three notebooks) exist
in the tree; clean_array's signature matches the documented one
exactly (default values and parameter names).

Mirrors iiq2img's publish.yml. Triggers on tag pushes matching ``v*``,
builds sdist+wheel via ``uv build``, uploads them as a workflow artifact,
and uses ``pypa/gh-action-pypi-publish`` to push to PyPI under a GitHub
``pypi`` Environment.

No PyPI API token is stored; authentication is via an OIDC
trusted-publisher config that has to be set up once on PyPI's side --
see the release notes added in this commit's accompanying README change.

Verified ``uv build`` produces a clean 10 KB wheel + 15 KB sdist with
correct 0.3.0 version, single ``opencv-python>=4.0`` runtime dep, and
the README + Project-URLs (Homepage / Repository / Changelog) in the
package metadata.

Adds an empty multiclean/py.typed marker file and tells setuptools to
ship it via [tool.setuptools.package-data]. With this, downstream
consumers' type-checkers (mypy, pyright, pylance, etc.) pick up the
inline type hints already present in clean_array and the utils module
instead of treating multiclean as untyped.

Verified the marker is present in the built wheel:
    multiclean/py.typed   (0 bytes)

Without this file, type-checkers default to ignoring annotations from
installed third-party packages even when the source is annotated.
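
That check is scriptable (the dist/ path assumes a default ``uv build``
output directory):

    import zipfile
    from pathlib import Path

    wheel = next(Path("dist").glob("multiclean-*.whl"))
    assert "multiclean/py.typed" in zipfile.ZipFile(wheel).namelist()
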
Mirrors iiq2img: setuptools-scm reads the version from the most recent
``v*`` tag, so cutting a release becomes a single ``git tag`` step and
the ``multiclean/__version__.py`` file no longer needs hand-bumping.

Changes:
* pyproject.toml: ``setuptools-scm>=8`` added to build-system requires;
  the ``[tool.setuptools.dynamic]`` block is replaced by an empty
  ``[tool.setuptools_scm]`` block that enables the plugin.
* multiclean/__init__.py: ``__version__`` is now read at runtime via
  ``importlib.metadata.version("multiclean")`` so the in-package symbol
  always tracks what's actually installed (sketched after this list).
* multiclean/__version__.py: deleted; no longer hand-maintained.
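
A minimal sketch of that runtime lookup (the fallback branch is an
assumption, not necessarily in the real file):

    from importlib.metadata import PackageNotFoundError, version

    try:
        __version__ = version("multiclean")
    except PackageNotFoundError:
        # Uninstalled source tree: no dist-info to read from.
        __version__ = "0.0.0"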

When a build runs from a clean tag (e.g. ``v0.3.0``), the wheel filename
and ``multiclean.__version__`` both resolve to ``0.3.0`` exactly. From
an untagged checkout setuptools-scm produces ``0.1.devN+g<sha>`` so it's
visually obvious the build wasn't from a release tag.

Captures the full cut-a-release sequence (update CHANGELOG, merge,
tag, approve deployment, verify on PyPI) plus the one-time PyPI/GH
setup so future-me / contributors don't have to reverse-engineer
publish.yml + the trusted-publisher config.

Linked from README under "Contributing" with an absolute GitHub URL
so it works on PyPI as well as GitHub.
wrignj08 merged commit 42238d3 into main on May 4, 2026
2 checks passed