[GH-2700] Add 05-geopandas-on-spark notebook#2889
Open
jiayuasu wants to merge 3 commits into apache:master
Pull request overview
Adds a new docker-shipped example notebook demonstrating how to run GeoPandas-style workflows on Spark via sedona.spark.geopandas, using the Natural Earth countries shapefile already bundled with the image (offline / no new data).
Changes:
- Introduces `05-geopandas-on-spark.ipynb` with a numbered, end-to-end workflow: load shapefile via `read_file`, do GeoPandas idioms, compute Voronoi via SQL aggregation, clip, round-trip to GeoPandas for plotting, and drop into SQL for extra functions.
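The "Voronoi via SQL aggregation" step can be sketched as a query builder. This is only an illustration: the nested `ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry)))` expression is the one quoted in the PR description, while the helper name `voronoi_sql`, the view name `africa`, and the output column `voronoi` are hypothetical and not taken from the notebook.

```python
def voronoi_sql(view: str, geom_col: str = "geometry") -> str:
    """Build a query that yields ONE Voronoi diagram for all rows of `view`.

    The centroids of every row are first collected into a single geometry
    by the aggregate, and the diagram is computed once over that collection.
    """
    return (
        f"SELECT ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid({geom_col}))) "
        f"AS voronoi FROM {view}"
    )


# `africa` is a hypothetical temp view name used for illustration.
query = voronoi_sql("africa")
print(query)
# With a live Sedona session the string would be passed to sedona.sql(query).
```

The aggregate matters because a per-row call sees only one point at a time, whereas a single diagram needs all of the points at once.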
| "\n", | ||
| "> **What does it look like to take a typical GeoPandas script and run it on Sedona?**\n", | ||
| "\n", | ||
| "Along the way we use methods that landed in 1.8 / 1.9 — `convex_hull`, `concave_hull`, `voronoi_polygons`, `clip_by_rect`, `to_crs`, `total_bounds`, `to_geopandas` — and show how to drop into SQL when GeoPandas's API doesn't have what you need. Data is the Natural Earth countries shapefile already shipped with the docker image; no network required." |
Comment on lines +171 to +173
| "africa_bbox = (-20.0, -36.0, 52.0, 38.0)\n", | ||
| "clipped = voronoi_cells.clip_by_rect(*africa_bbox)\n", | ||
| "print(f\"{len(clipped)} Voronoi cells after clip_by_rect\")\n", |
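To make the effect of clipping to a rectangle concrete, here is a minimal pure-Python sketch using Sutherland-Hodgman clipping. It is not Sedona's or GeoPandas's implementation of `clip_by_rect`; the function names are hypothetical, and the sketch assumes convex input polygons (which Voronoi cells are) given as open vertex lists.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def clip_polygon_by_rect(poly: List[Point], xmin: float, ymin: float,
                         xmax: float, ymax: float) -> List[Point]:
    """Clip a convex polygon to an axis-aligned rectangle (Sutherland-Hodgman)."""

    def clip_edge(points: List[Point], inside: Callable[[Point], bool],
                  intersect: Callable[[Point, Point], Point]) -> List[Point]:
        out: List[Point] = []
        for i, cur in enumerate(points):
            prev = points[i - 1]  # wraps around to the last vertex for i == 0
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))  # entering the half-plane
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))      # leaving the half-plane
        return out

    def x_cross(x: float) -> Callable[[Point, Point], Point]:
        # Intersection of segment p-q with the vertical line at x.
        def f(p: Point, q: Point) -> Point:
            t = (x - p[0]) / (q[0] - p[0])
            return (x, p[1] + t * (q[1] - p[1]))
        return f

    def y_cross(y: float) -> Callable[[Point, Point], Point]:
        # Intersection of segment p-q with the horizontal line at y.
        def f(p: Point, q: Point) -> Point:
            t = (y - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), y)
        return f

    result = list(poly)
    for inside, intersect in (
        (lambda p: p[0] >= xmin, x_cross(xmin)),
        (lambda p: p[0] <= xmax, x_cross(xmax)),
        (lambda p: p[1] >= ymin, y_cross(ymin)),
        (lambda p: p[1] <= ymax, y_cross(ymax)),
    ):
        if not result:
            break  # polygon lies entirely outside the rectangle
        result = clip_edge(result, inside, intersect)
    return result


def shoelace_area(poly: List[Point]) -> float:
    """Planar area of a simple polygon, in squared coordinate units."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2


africa_bbox = (-20.0, -36.0, 52.0, 38.0)  # same bbox as in the hunk above
inside_cell = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]  # made-up cell
print(clip_polygon_by_rect(inside_cell, *africa_bbox) == inside_cell)
```

A cell that lies fully inside the box comes back unchanged, and a cell that straddles the boundary is cut but not dropped, so the number of cells can stay the same while the total area shrinks.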
| "source": [ | ||
| "## 7. Drop into SQL whenever you need a function the API doesn't expose\n", | ||
| "\n", | ||
| "`<gdf>.spark.frame()` returns the underlying Spark DataFrame, so the entire `ST_*` SQL catalog is one `createOrReplaceTempView` away. Here we ask which African capitals are closest to (0°N, 0°E) using `ST_DistanceSpheroid` (great-circle distance in metres), without leaving the data we already loaded with the geopandas API." |
| "outputs": [], | ||
| "source": [ | ||
| "from shapely.wkt import loads as wkt_loads\n", | ||
| "from shapely.geometry import shape\n", |
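For a quick offline cross-check of the distances in section 7, a haversine sketch is enough. Note the assumption: haversine treats the Earth as a sphere, whereas `ST_DistanceSpheroid` measures on a spheroid, so the two should agree only approximately. The function name, the radius constant, and the sample point are illustrative and not taken from the notebook.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (spherical assumption)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical point near the Gulf of Guinea, NOT a centroid from the dataset.
sample_lat, sample_lon = 0.5, 6.5
print(f"{haversine_m(0.0, 0.0, sample_lat, sample_lon) / 1000:.1f} km from (0°N, 0°E)")
```

Sorting real centroids by this value should reproduce the same ranking as the SQL query unless two candidates are nearly equidistant.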
Comment on lines +54 to +58
| " .master(\"spark://localhost:7077\")\n", | ||
| " .config(\"spark.sql.ansi.enabled\", \"false\")\n", | ||
| " .getOrCreate()\n", | ||
| ")\n", | ||
| "sedona = SedonaContext.create(config)" |
…pitals→countries, unused import, version)
Member
Author
Pushed
Re-verified end-to-end after every edit; output unchanged (54 African countries, 54 Voronoi cells before and after |
…t Sedona to 1.9.0

Two related fixes that this PR series exposed:

1. .github/workflows/docker-build.yml only triggered on changes to docker/** or the workflow file itself. But the dockerfile bakes docs/usecases/*.ipynb / *.py / data into the image, so notebook-only PRs (apache#2879, apache#2889) silently bypassed the docker build + the test-notebooks.sh harness in CI. Adds 'docs/usecases/**' to the trigger paths so any change that affects what ships in the image also runs the build.
2. Drop the 'sedona: 1.8.0' matrix leg. The new notebooks (00, 01, 05) use 1.9-only APIs (ST_BingTileAt, clip_by_rect, GeoParquet 1.1 covering metadata). The 'latest' leg already covers what's current. The matrix legs build local images via `--load`, never push to a registry, so dropping 1.8.0 has no effect on published artifacts.
3. Bump dockerfile default ARGs sedona_version 1.8.0 -> 1.9.0 and geotools_wrapper_version 1.8.1-33.1 -> 1.9.0-33.5 so a plain `docker build -f docker/sedona-docker.dockerfile .` produces an image that runs the new notebooks. Matches the Maven coordinates already updated in the docs by apache#2860.
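The effect of the path-filter change in item 1 can be sketched in a few lines. This is an approximation only: it uses Python's `fnmatch`, whose glob semantics are not identical to GitHub Actions path filters, and the names `OLD_PATHS`, `NEW_PATHS`, and `triggers` are hypothetical. The patterns themselves are the ones named in the commit message.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# Trigger paths before and after the change, as described in the commit message.
OLD_PATHS: List[str] = ["docker/**", ".github/workflows/docker-build.yml"]
NEW_PATHS: List[str] = OLD_PATHS + ["docs/usecases/**"]


def triggers(changed_files: Iterable[str], patterns: List[str]) -> bool:
    """Return True if any changed file matches any trigger pattern."""
    files = list(changed_files)
    return any(fnmatch(f, p) for f in files for p in patterns)


notebook_only_pr = ["docs/usecases/05-geopandas-on-spark.ipynb"]
print("old filter triggers build:", triggers(notebook_only_pr, OLD_PATHS))
print("new filter triggers build:", triggers(notebook_only_pr, NEW_PATHS))
```

A notebook-only change matches nothing in the old list, which is why such PRs skipped the docker build until `docs/usecases/**` was added.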
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes part of "Sedona example notebooks in the docker image are very out of date" (#2700).
What changes were proposed in this PR?
Continues the docker-image notebook refresh series (issue #2700, milestone 1.9.1) and stacks two infrastructure fixes that this series exposed.
New shipped notebook: `docs/usecases/05-geopandas-on-spark.ipynb`

The `sedona.spark.geopandas` package mirrors the public GeoPandas API and runs on top of `pyspark.pandas` / Spark; the notebook walks the "scale up your geopandas script with Sedona" path end-to-end. Workflow on the Natural Earth countries shapefile already shipped with the docker image (no new data, no network):

- `spark.sql.ansi.enabled=false` (`pyspark.pandas`, the backend for `sedona.spark.geopandas`, refuses to start under Spark 4.x ANSI mode; see `pyspark/pandas/utils.py:480-500`, `[PANDAS_API_ON_SPARK_FAIL_ON_ANSI_MODE]`).
- `read_file(..., format="shapefile")`: drop-in for `geopandas.read_file`.
- `CONTINENT`, `.geometry`, `.centroid`, `.convex_hull`, `.area`, `.total_bounds`.
- `ST_VoronoiPolygons(ST_Collect_Agg(ST_Centroid(geometry)))`. Calls out that `GeoSeries.voronoi_polygons()` runs per row, which is the wrong shape for "one diagram from many points".
- `clip_by_rect(xmin, ymin, xmax, ymax)` (new in 1.9) to crop the Voronoi result to a continental bbox.
- `to_geopandas()` round-trip + `matplotlib` for the final plot.
- `<gdf>.spark.frame()` to drop into SQL on the same dataframe; uses `ST_DistanceSpheroid` for "African countries' centroids closest to (0°N, 0°E)".

Notebook is structured as numbered markdown sections (`## 1.` through `## 7.`), matching the convention from `01-mobility-pulse`. Notebook intro flags `**Requires Sedona ≥ 1.9.0.**` explicitly because `clip_by_rect` and the autopopulated GeoParquet 1.1 covering metadata are 1.9-only.

Stacked CI / dockerfile fixes that this series exposed
- `.github/workflows/docker-build.yml`: add `docs/usecases/**` to the path filter for both `push` and `pull_request`. The dockerfile bakes `docs/usecases/*.ipynb`, `docs/usecases/*.py`, and `docs/usecases/data/` into the image, so notebook-only PRs (#2879 "[GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale", #2889 "[GH-2700] Add 05-geopandas-on-spark notebook" prior to this commit) silently bypassed the docker build + the `docker/test-notebooks.sh` harness in CI. With this change every notebook-affecting PR exercises both.
- `.github/workflows/docker-build.yml`: drop the `sedona: 1.8.0` matrix leg. The new notebooks use 1.9-only APIs (`ST_BingTileAt`, `clip_by_rect`, GeoParquet 1.1 covering metadata). The matrix legs build local images via `--load` and never push to a registry, so dropping 1.8.0 has no effect on published artifacts. The remaining `sedona: 'latest'` leg covers what's current.
- `docker/sedona-docker.dockerfile`: bump default `ARG sedona_version` 1.8.0 → 1.9.0 and `ARG geotools_wrapper_version` 1.8.1-33.1 → 1.9.0-33.5 so a plain `docker build -f docker/sedona-docker.dockerfile .` produces an image that runs the new notebooks. Matches the Maven coordinates already updated in the docs by #2860 "[DOCS] Update Maven coordinates to Sedona 1.9.0 / geotools-wrapper 1.9.0-33.5".

How was this patch tested?
Local mirror of `docker/test-notebooks.sh` before every commit on this branch. Stack matched the docker image's runtime (Python 3.10, `pyspark==4.0.1`, `apache-sedona==1.9.0`, JDK 17, `local[*]`, `DRIVER_MEM=4g`, Sedona JAR via `PYSPARK_SUBMIT_ARGS` Maven coords).

Output sanity-checked: 54 African countries; Voronoi gives 54 cells totaling 43464 deg²; `clip_by_rect` preserves all 54; closest country to (0°N, 0°E) is São Tomé and Principe at 750.1 km, which is geographically correct; matplotlib figure renders Africa with the Voronoi overlay.

CI: with the path-filter fix in this PR, the Docker build workflow now triggers for this PR (run `25245483051` queued at push time), so `docker build` and the full `test-notebooks.sh` harness (`00-quickstart`, `01-mobility-pulse`, `05-geopandas-on-spark`) run in the apache/sedona CI for the first time on a notebook-only change. This PR is what proves that wiring works.

Did this PR include necessary documentation updates?
- `**Requires Sedona ≥ 1.9.0.**` so users on older docker images see the constraint.
- `docs/usecases/data/README.md` already enumerates the Natural Earth provenance for the data this notebook reads (added in #2879 "[GH-2700] Add 01-mobility-pulse notebook: vector analytics at TLC scale"). No additional updates required since this notebook ships no new data.
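The "Requires Sedona ≥ 1.9.0" constraint could also be checked programmatically. The sketch below is one possible shape for such a guard and is not something this PR adds; the helper names are hypothetical, and the version string is passed in as a plain argument so nothing is assumed about how Sedona exposes its version.

```python
import re
from typing import Tuple

MINIMUM_VERSION = (1, 9, 0)  # from the "Requires Sedona ≥ 1.9.0" note


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '1.9.0-SNAPSHOT' -> (1, 9, 0)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...] = MINIMUM_VERSION) -> bool:
    """Compare numerically so that 1.10.0 sorts after 1.9.0."""
    return parse_version(version) >= minimum


for candidate in ("1.8.0", "1.9.0", "1.10.0"):
    print(candidate, meets_minimum(candidate))
```

Comparing integer tuples instead of raw strings avoids ordering a future 1.10.x before 1.9.x.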