
[GH-2700] Add 05-geopandas-on-spark notebook #2889

Open

jiayuasu wants to merge 3 commits into apache:master from jiayuasu:geopandas-on-spark-notebook

Conversation

@jiayuasu
Member

@jiayuasu jiayuasu commented May 2, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1) and stacks three infrastructure fixes that this series exposed.

New shipped notebook — docs/usecases/05-geopandas-on-spark.ipynb

The sedona.spark.geopandas package mirrors the public GeoPandas API and runs on top of pyspark.pandas / Spark; the notebook walks the "scale up your GeoPandas script with Sedona" path end-to-end. The workflow runs on the Natural Earth countries shapefile already shipped with the docker image — no new data, no network:

  1. SedonaContext setup, including spark.sql.ansi.enabled=false (pyspark.pandas, the backend for sedona.spark.geopandas, refuses to start under Spark 4.x ANSI mode — see pyspark/pandas/utils.py:480-500, [PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE]).
  2. read_file(..., format="shapefile") — drop-in for geopandas.read_file.
  3. Vanilla GeoPandas idioms: boolean filtering by CONTINENT, .geometry, .centroid, .convex_hull, .area, .total_bounds.
  4. Voronoi catchments via SQL aggregator: ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))). Calls out that GeoSeries.voronoi_polygons() runs per row, which is wrong shape for "one diagram from many points".
  5. clip_by_rect(xmin, ymin, xmax, ymax) (new in 1.9) to crop the Voronoi result to a continental bbox.
  6. to_geopandas() round-trip + matplotlib for the final plot.
  7. <gdf>.spark.frame() to drop into SQL on the same dataframe — uses ST_DistanceSpheroid for "African countries' centroids closest to (0°N, 0°E)".
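Step 5's `clip_by_rect(xmin, ymin, xmax, ymax)` crops geometries to an axis-aligned rectangle. As a rough illustration of the semantics only — pure Python on bounding boxes, with a stand-in `clip_bbox` helper, not Sedona's implementation, which clips full polygon geometry on Spark:

```python
def clip_bbox(bbox, rect):
    """Intersect two (xmin, ymin, xmax, ymax) boxes; None if disjoint."""
    xmin = max(bbox[0], rect[0])
    ymin = max(bbox[1], rect[1])
    xmax = min(bbox[2], rect[2])
    ymax = min(bbox[3], rect[3])
    if xmin >= xmax or ymin >= ymax:
        return None  # no overlap: the geometry is dropped entirely
    return (xmin, ymin, xmax, ymax)

# The notebook's continental crop window for Africa
africa_bbox = (-20.0, -36.0, 52.0, 38.0)

# A cell that spills past the window is trimmed to it
print(clip_bbox((40.0, 30.0, 60.0, 45.0), africa_bbox))  # → (40.0, 30.0, 52.0, 38.0)

# A cell entirely outside the window is dropped
print(clip_bbox((60.0, 40.0, 70.0, 50.0), africa_bbox))  # → None
```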

The notebook is structured as numbered markdown sections (## 1. through ## 7.), matching the convention from 01-mobility-pulse. The intro flags **Requires Sedona ≥ 1.9.0.** explicitly because clip_by_rect and the auto-populated GeoParquet 1.1 covering metadata are 1.9-only.

Stacked CI / dockerfile fixes that this series exposed

  • .github/workflows/docker-build.yml — add docs/usecases/** to the path filter for both push and pull_request. The dockerfile bakes docs/usecases/*.ipynb, docs/usecases/*.py, and docs/usecases/data/ into the image, so notebook-only PRs ([GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale #2879, [GH-2700] Add 05-geopandas-on-spark notebook #2889 prior to this commit) silently bypassed the docker build + the docker/test-notebooks.sh harness in CI. With this change every notebook-affecting PR exercises both.
  • .github/workflows/docker-build.yml — drop the sedona: 1.8.0 matrix leg. The new notebooks use 1.9-only APIs (ST_BingTileAt, clip_by_rect, GeoParquet 1.1 covering metadata). The matrix legs build local images via --load — they never push to a registry — so dropping 1.8.0 has no effect on published artifacts. The remaining sedona: 'latest' leg covers what's current.
  • docker/sedona-docker.dockerfile — bump default ARG sedona_version 1.8.0 → 1.9.0 and ARG geotools_wrapper_version 1.8.1-33.1 → 1.9.0-33.5 so a plain docker build -f docker/sedona-docker.dockerfile . produces an image that runs the new notebooks. Matches the Maven coordinates already updated in the docs by [DOCS] Update Maven coordinates to Sedona 1.9.0 / geotools-wrapper 1.9.0-33.5 #2860.
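The effect of the path-filter fix can be sketched with stdlib `fnmatch` (this only approximates GitHub Actions' glob rules and the `triggers_build` helper is illustrative, but it shows which files the new `docs/usecases/**` entry catches):

```python
from fnmatch import fnmatch

# Trigger paths after this PR: the pre-existing docker/** and workflow-file
# entries (per the commit message) plus the new docs/usecases/** entry.
trigger_paths = [
    "docker/**",
    ".github/workflows/docker-build.yml",
    "docs/usecases/**",
]

def triggers_build(changed_file):
    """True if a changed file path matches any workflow trigger pattern."""
    return any(fnmatch(changed_file, p) for p in trigger_paths)

print(triggers_build("docs/usecases/05-geopandas-on-spark.ipynb"))  # → True
print(triggers_build("docs/tutorial/sql.md"))                       # → False
```

Before the fix, the first path above matched nothing in the filter, which is exactly how notebook-only PRs bypassed the docker build.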

How was this patch tested?

Ran a local mirror of docker/test-notebooks.sh before every commit on this branch. The stack matched the docker image's runtime (Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coordinates).

PASS  05-geopandas-on-spark  21s elapsed

Output sanity-checked: 54 African countries; Voronoi gives 54 cells totaling 43,464 deg²; clip_by_rect preserves all 54; the closest country to (0°N, 0°E) is São Tomé and Príncipe at 750.1 km — geographically correct; the matplotlib figure renders Africa with the Voronoi overlay.
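`ST_DistanceSpheroid` computes distance on the WGS84 ellipsoid; a spherical haversine approximation (pure Python, illustrative only — the ellipsoidal result differs by up to roughly 0.5%) shows the shape of the "closest centroid to (0°N, 0°E)" check:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance between two lon/lat points on a sphere of radius r (km)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(h))

# Sanity check: one degree of longitude along the equator is ~111.2 km
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))  # → 111.2
```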

CI — with the path-filter fix in this PR, the Docker build workflow now triggers for this PR (run 25245483051 queued at push time), so docker build and the full test-notebooks.sh harness — 00-quickstart, 01-mobility-pulse, 05-geopandas-on-spark — run in the apache/sedona CI for the first time on a notebook-only change. This PR is what proves that wiring works.

Did this PR include necessary documentation updates?

  • The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0.** so users on older docker images see the constraint.
  • docs/usecases/data/README.md already enumerates the Natural Earth provenance for the data this notebook reads (added in [GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale #2879). No additional updates required since this notebook ships no new data.
  • No public API changes.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new docker-shipped example notebook demonstrating how to run GeoPandas-style workflows on Spark via sedona.spark.geopandas, using the Natural Earth countries shapefile already bundled with the image (offline / no new data).

Changes:

  • Introduces 05-geopandas-on-spark.ipynb with a numbered, end-to-end workflow: load shapefile via read_file, do GeoPandas idioms, compute Voronoi via SQL aggregation, clip, round-trip to GeoPandas for plotting, and drop into SQL for extra functions.


> **What does it look like to take a typical GeoPandas script and run it on Sedona?**

Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, `concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, `to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't have what you need. Data is the Natural Earth countries shapefile already shipped with the docker image; no network required.
Comment on lines +171 to +173

```python
africa_bbox = (-20.0, -36.0, 52.0, 38.0)
clipped = voronoi_cells.clip_by_rect(*africa_bbox)
print(f"{len(clipped)} Voronoi cells after clip_by_rect")
```
## 7. Drop into SQL whenever you need a function the API doesn't expose

`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask which African capitals are closest to (0°N, 0°E) using `ST_DistanceSpheroid` (great-circle distance in metres), without leaving the data we already loaded with the geopandas API.
```python
from shapely.wkt import loads as wkt_loads
from shapely.geometry import shape
```
Comment on lines +54 to +58

```python
    .master("spark://localhost:7077")
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
@jiayuasu
Member Author

jiayuasu commented May 2, 2026

Pushed 856d17a081.

| Review point | Action |
| --- | --- |
| Intro lists `concave_hull` / `to_crs` / `voronoi_polygons`, which are not used | Trimmed to the methods actually called: `convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas`. Voronoi is now described as the SQL-fallback path it actually is. |
| `clip_by_rect` is 1.9-only — make the version requirement explicit | Added a **Requires Sedona ≥ 1.9.0.** line to the intro. |
| Section 7 says "African capitals" but uses `ST_Centroid(geometry)` of country polygons | Reworded to "African countries' centroids closest to (0°N, 0°E)". The data only contains country polygons, so centroid-of-country is the correct framing. |
| Unused `from shapely.geometry import shape` | Removed. |
| Suspected IndentationError on the SedonaContext line | The `sedona = SedonaContext.create(config)` line is at column 0 in the cell source (not indented inside the `(...)` block). The notebook executes cleanly end-to-end — verified by re-running it through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark 4.0.1, sedona 1.9.0, JDK 17, local[*], DRIVER_MEM=4g). Result: PASS 05-geopandas-on-spark, 21s elapsed. No code change. |
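The claim that a column-0 assignment after a closing paren is valid Python is easy to check mechanically with `compile()`. This standalone snippet reproduces the cell's shape with stand-in names (not the actual SedonaContext cell):

```python
# A parenthesized builder chain followed by an assignment back at
# column 0. If this were an IndentationError, compile() would raise
# at this step; it parses cleanly. Names are stand-ins.
cell_source = (
    "config = (\n"
    "    builder\n"
    "    .master('local[*]')\n"
    "    .getOrCreate()\n"
    ")\n"
    "sedona = create(config)\n"
)
compile(cell_source, "<cell>", "exec")  # no exception: the cell is syntactically valid
print("ok")
```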

Re-verified end-to-end after every edit; output unchanged (54 African countries, 54 Voronoi cells before and after clip_by_rect, São Tomé closest to (0,0) at 750.1 km).

…t Sedona to 1.9.0

Three related fixes that this PR series exposed:

1. .github/workflows/docker-build.yml only triggered on changes to
   docker/** or the workflow file itself. But the dockerfile bakes
   docs/usecases/*.ipynb / *.py / data into the image, so notebook-only
   PRs (apache#2879, apache#2889) silently bypassed the docker build + the
   test-notebooks.sh harness in CI. Adds 'docs/usecases/**' to the
   trigger paths so any change that affects what ships in the image
   also runs the build.

2. Drop the 'sedona: 1.8.0' matrix leg. The new notebooks (00, 01, 05)
   use 1.9-only APIs (ST_BingTileAt, clip_by_rect, GeoParquet 1.1
   covering metadata). The 'latest' leg already covers what's current.
   The matrix legs build local images via `--load`, never push to a
   registry, so dropping 1.8.0 has no effect on published artifacts.

3. Bump dockerfile default ARGs sedona_version 1.8.0 -> 1.9.0 and
   geotools_wrapper_version 1.8.1-33.1 -> 1.9.0-33.5 so a plain
   `docker build -f docker/sedona-docker.dockerfile .` produces an
   image that runs the new notebooks. Matches the Maven coordinates
   already updated in the docs by apache#2860.
