
[GH-2700] Add 05-geopandas-on-spark notebook #2889

Open

jiayuasu wants to merge 3 commits into apache:master from jiayuasu:geopandas-on-spark-notebook

Conversation

@jiayuasu
Member

@jiayuasu jiayuasu commented May 2, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1) and stacks three infrastructure fixes that this series exposed.

New shipped notebook — docs/usecases/05-geopandas-on-spark.ipynb

The sedona.spark.geopandas package mirrors the public GeoPandas API and runs on top of pyspark.pandas / Spark; the notebook walks the "scale up your GeoPandas script with Sedona" path end-to-end. The workflow runs on the Natural Earth countries shapefile already shipped with the docker image — no new data, no network:

  1. SedonaContext setup, including spark.sql.ansi.enabled=false (pyspark.pandas, the backend for sedona.spark.geopandas, refuses to start under Spark 4.x ANSI mode — see pyspark/pandas/utils.py:480-500, [PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE]).
  2. read_file(..., format="shapefile") — drop-in for geopandas.read_file.
  3. Vanilla GeoPandas idioms: boolean filtering by CONTINENT, .geometry, .centroid, .convex_hull, .area, .total_bounds.
  4. Voronoi catchments via SQL aggregator: ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry))). Calls out that GeoSeries.voronoi_polygons() runs per row, which is wrong shape for "one diagram from many points".
  5. clip_by_rect(xmin, ymin, xmax, ymax) (new in 1.9) to crop the Voronoi result to a continental bbox.
  6. to_geopandas() round-trip + matplotlib for the final plot.
  7. <gdf>.spark.frame() to drop into SQL on the same dataframe — uses ST_DistanceSpheroid for "African countries' centroids closest to (0°N, 0°E)".
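Step 5's `clip_by_rect(xmin, ymin, xmax, ymax)` crops geometries to an axis-aligned rectangle. As a rough illustration of the semantics only — pure Python on bounding boxes, with a stand-in `clip_bbox` helper, not Sedona's implementation, which clips full polygon geometry on Spark:

```python
def clip_bbox(bbox, rect):
    """Intersect two (xmin, ymin, xmax, ymax) boxes; None if disjoint."""
    xmin = max(bbox[0], rect[0])
    ymin = max(bbox[1], rect[1])
    xmax = min(bbox[2], rect[2])
    ymax = min(bbox[3], rect[3])
    if xmin >= xmax or ymin >= ymax:
        return None  # no overlap: the geometry is dropped entirely
    return (xmin, ymin, xmax, ymax)

# The notebook's continental crop window for Africa
africa_bbox = (-20.0, -36.0, 52.0, 38.0)

# A cell that spills past the window is trimmed to it
print(clip_bbox((40.0, 30.0, 60.0, 45.0), africa_bbox))  # → (40.0, 30.0, 52.0, 38.0)

# A cell entirely outside the window is dropped
print(clip_bbox((60.0, 40.0, 70.0, 50.0), africa_bbox))  # → None
```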

The notebook is structured as numbered markdown sections (## 1. through ## 7.), matching the convention from 01-mobility-pulse. The intro flags **Requires Sedona ≥ 1.9.0.** explicitly because clip_by_rect and the auto-populated GeoParquet 1.1 covering metadata are 1.9-only.

Stacked CI / dockerfile fixes that this series exposed

  • .github/workflows/docker-build.yml — add docs/usecases/** to the path filter for both push and pull_request. The dockerfile bakes docs/usecases/*.ipynb, docs/usecases/*.py, and docs/usecases/data/ into the image, so notebook-only PRs ([GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale #2879, [GH-2700] Add 05-geopandas-on-spark notebook #2889 prior to this commit) silently bypassed the docker build + the docker/test-notebooks.sh harness in CI. With this change every notebook-affecting PR exercises both.
  • .github/workflows/docker-build.yml — drop the sedona: 1.8.0 matrix leg. The new notebooks use 1.9-only APIs (ST_BingTileAt, clip_by_rect, GeoParquet 1.1 covering metadata). The matrix legs build local images via --load — they never push to a registry — so dropping 1.8.0 has no effect on published artifacts. The remaining sedona: 'latest' leg covers what's current.
  • docker/sedona-docker.dockerfile — bump default ARG sedona_version 1.8.0 → 1.9.0 and ARG geotools_wrapper_version 1.8.1-33.1 → 1.9.0-33.5 so a plain docker build -f docker/sedona-docker.dockerfile . produces an image that runs the new notebooks. Matches the Maven coordinates already updated in the docs by [DOCS] Update Maven coordinates to Sedona 1.9.0 / geotools-wrapper 1.9.0-33.5 #2860.
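The effect of the path-filter fix can be sketched with stdlib `fnmatch` (this only approximates GitHub Actions' glob rules and the `triggers_build` helper is illustrative, but it shows which files the new `docs/usecases/**` entry catches):

```python
from fnmatch import fnmatch

# Trigger paths after this PR: the pre-existing docker/** and workflow-file
# entries (per the commit message) plus the new docs/usecases/** entry.
trigger_paths = [
    "docker/**",
    ".github/workflows/docker-build.yml",
    "docs/usecases/**",
]

def triggers_build(changed_file):
    """True if a changed file path matches any workflow trigger pattern."""
    return any(fnmatch(changed_file, p) for p in trigger_paths)

print(triggers_build("docs/usecases/05-geopandas-on-spark.ipynb"))  # → True
print(triggers_build("docs/tutorial/sql.md"))                       # → False
```

Before the fix, the first path above matched nothing in the filter, which is exactly how notebook-only PRs bypassed the docker build.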

How was this patch tested?

Ran a local mirror of docker/test-notebooks.sh before every commit on this branch. The stack matched the docker image's runtime (Python 3.10, pyspark==4.0.1, apache-sedona==1.9.0, JDK 17, local[*], DRIVER_MEM=4g, Sedona JAR via PYSPARK_SUBMIT_ARGS Maven coordinates).

PASS  05-geopandas-on-spark  21s elapsed

Output sanity-checked: 54 African countries; Voronoi gives 54 cells totaling 43,464 deg²; clip_by_rect preserves all 54; the closest country to (0°N, 0°E) is São Tomé and Príncipe at 750.1 km — geographically correct; the matplotlib figure renders Africa with the Voronoi overlay.
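`ST_DistanceSpheroid` computes distance on the WGS84 ellipsoid; a spherical haversine approximation (pure Python, illustrative only — the ellipsoidal result differs by up to roughly 0.5%) shows the shape of the "closest centroid to (0°N, 0°E)" check:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance between two lon/lat points on a sphere of radius r (km)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(h))

# Sanity check: one degree of longitude along the equator is ~111.2 km
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 1))  # → 111.2
```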

CI — with the path-filter fix in this PR, the Docker build workflow now triggers for this PR (run 25245483051 queued at push time), so docker build and the full test-notebooks.sh harness — 00-quickstart, 01-mobility-pulse, 05-geopandas-on-spark — run in the apache/sedona CI for the first time on a notebook-only change. This PR is what proves that wiring works.

Did this PR include necessary documentation updates?

  • The notebook is itself the documentation; intro markdown calls out **Requires Sedona ≥ 1.9.0.** so users on older docker images see the constraint.
  • docs/usecases/data/README.md already enumerates the Natural Earth provenance for the data this notebook reads (added in [GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale #2879). No additional updates required since this notebook ships no new data.
  • No public API changes.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new docker-shipped example notebook demonstrating how to run GeoPandas-style workflows on Spark via sedona.spark.geopandas, using the Natural Earth countries shapefile already bundled with the image (offline / no new data).

Changes:

  • Introduces 05-geopandas-on-spark.ipynb with a numbered, end-to-end workflow: load shapefile via read_file, do GeoPandas idioms, compute Voronoi via SQL aggregation, clip, round-trip to GeoPandas for plotting, and drop into SQL for extra functions.


> **What does it look like to take a typical GeoPandas script and run it on Sedona?**

Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, `concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, `to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't have what you need. Data is the Natural Earth countries shapefile already shipped with the docker image; no network required.
Comment on lines +171 to +173

```python
africa_bbox = (-20.0, -36.0, 52.0, 38.0)
clipped = voronoi_cells.clip_by_rect(*africa_bbox)
print(f"{len(clipped)} Voronoi cells after clip_by_rect")
```
## 7. Drop into SQL whenever you need a function the API doesn't expose

`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask which African capitals are closest to (0°N, 0°E) using `ST_DistanceSpheroid` (great-circle distance in metres), without leaving the data we already loaded with the geopandas API.
```python
from shapely.wkt import loads as wkt_loads
from shapely.geometry import shape
```
Comment on lines +54 to +58

```python
    .master("spark://localhost:7077")
    .config("spark.sql.ansi.enabled", "false")
    .getOrCreate()
)
sedona = SedonaContext.create(config)
```
@jiayuasu
Member Author

jiayuasu commented May 2, 2026

Pushed 856d17a081.

| Review point | Action |
| --- | --- |
| Intro lists `concave_hull` / `to_crs` / `voronoi_polygons`, which are not used | Trimmed to the methods actually called: `convex_hull`, `clip_by_rect`, `total_bounds`, `to_geopandas`. Voronoi is now described as the SQL-fallback path it actually is. |
| `clip_by_rect` is 1.9-only — make the version requirement explicit | Added a **Requires Sedona ≥ 1.9.0.** line to the intro. |
| Section 7 says "African capitals" but uses `ST_Centroid(geometry)` of country polygons | Reworded to "African countries' centroids closest to (0°N, 0°E)". The data only contains country polygons, so centroid-of-country is the correct framing. |
| Unused `from shapely.geometry import shape` | Removed. |
| Suspected IndentationError on the SedonaContext line | The `sedona = SedonaContext.create(config)` line is at column 0 in the cell source (not indented inside the `(...)` block). The notebook executes cleanly end-to-end — verified by re-running it through the local mirror of docker/test-notebooks.sh (matched docker stack: Python 3.10, pyspark 4.0.1, sedona 1.9.0, JDK 17, local[*], DRIVER_MEM=4g). Result: PASS 05-geopandas-on-spark, 21s elapsed. No code change. |
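The claim that a column-0 assignment after a closing paren is valid Python is easy to check mechanically with `compile()`. This standalone snippet reproduces the cell's shape with stand-in names (not the actual SedonaContext cell):

```python
# A parenthesized builder chain followed by an assignment back at
# column 0. If this were an IndentationError, compile() would raise
# at this step; it parses cleanly. Names are stand-ins.
cell_source = (
    "config = (\n"
    "    builder\n"
    "    .master('local[*]')\n"
    "    .getOrCreate()\n"
    ")\n"
    "sedona = create(config)\n"
)
compile(cell_source, "<cell>", "exec")  # no exception: the cell is syntactically valid
print("ok")
```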

Re-verified end-to-end after every edit; output unchanged (54 African countries, 54 Voronoi cells before and after clip_by_rect, São Tomé closest to (0,0) at 750.1 km).

…t Sedona to 1.9.0

Three related fixes that this PR series exposed:

1. .github/workflows/docker-build.yml only triggered on changes to
   docker/** or the workflow file itself. But the dockerfile bakes
   docs/usecases/*.ipynb / *.py / data into the image, so notebook-only
   PRs (apache#2879, apache#2889) silently bypassed the docker build + the
   test-notebooks.sh harness in CI. Adds 'docs/usecases/**' to the
   trigger paths so any change that affects what ships in the image
   also runs the build.

2. Drop the 'sedona: 1.8.0' matrix leg. The new notebooks (00, 01, 05)
   use 1.9-only APIs (ST_BingTileAt, clip_by_rect, GeoParquet 1.1
   covering metadata). The 'latest' leg already covers what's current.
   The matrix legs build local images via `--load`, never push to a
   registry, so dropping 1.8.0 has no effect on published artifacts.

3. Bump dockerfile default ARGs sedona_version 1.8.0 -> 1.9.0 and
   geotools_wrapper_version 1.8.1-33.1 -> 1.9.0-33.5 so a plain
   `docker build -f docker/sedona-docker.dockerfile .` produces an
   image that runs the new notebooks. Matches the Maven coordinates
   already updated in the docs by apache#2860.
