[GH-2877] Add Box2D type and Box2DUDT #2878
Introduces a planar bounding-box value type backed by a struct UDT (struct<xmin,ymin,xmax,ymax>, all double, non-nullable) so values round-trip natively to Parquet and align with GeoParquet 1.1 bbox covering columns. Empty boxes are encoded as xmin > xmax (JTS Envelope convention), making union/expand a no-op against empty. This change adds only the type and its registration. Functions (ST_Box2D, ST_MakeBox2D, ST_Extent, accessor overloads, casts) follow in subsequent commits per the plan in apache#2877.
Pull request overview
Adds a new JVM/Spark-native planar bounding-box value type (Box2D) and a Spark UDT (Box2DUDT) as groundwork for bbox-related SQL functions and GeoParquet bbox covering-column interoperability (per GH-2877).
Changes:
- Introduce `Box2D` (Java) with empty-box semantics and basic conversions (Envelope/Polygon).
- Add struct-backed `Box2DUDT` (`struct<xmin,ymin,xmax,ymax>` doubles) for Spark SQL serialization/deserialization.
- Register `Box2D` ↔ `Box2DUDT` in `UdtRegistratorWrapper.registerAll()`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| common/src/main/java/org/apache/sedona/common/geometryObjects/Box2D.java | New planar bbox value type with empty/union helpers and conversion utilities. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/Box2DUDT.scala | New struct-backed Spark UDT for Box2D, including JSON schema support. |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/UDT/UdtRegistratorWrapper.scala | Registers the new Box2D UDT mapping alongside existing Sedona UDTs. |
Mirrors the JVM Box2DUDT so a Box2D column materialized in PySpark (e.g. via a JVM-created DataFrame) resolves to the matching Python type. Round-trips through the struct sqlType cleanly, including the empty-box encoding (xmin > xmax || ymin > ymax).
Test coverage: UDT registration, JSON schema round-trip, Box2D serde round-trip (including empty), case-object equality, Parquet write/read of a Box2D column. Javadoc on Box2D updated to match isEmpty() (xmin > xmax || ymin > ymax), not just xmin > xmax.
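The in-band empty encoding used at this stage of the PR can be sketched in plain Python as follows. This is a minimal illustration, not the code in the PR: the method names follow the ones mentioned in the commit messages, the tuple stands in for the struct `sqlType`, and the sentinel values chosen for the empty box are an assumption of this sketch (any values satisfying `xmin > xmax || ymin > ymax` would behave the same).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box2D:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @staticmethod
    def empty() -> "Box2D":
        # Assumed sentinel for illustration: inverted bounds mark "empty".
        return Box2D(0.0, 0.0, -1.0, -1.0)

    def is_empty(self) -> bool:
        # Matches the updated Javadoc: either axis inverted means empty.
        return self.xmin > self.xmax or self.ymin > self.ymax

    def union(self, other: "Box2D") -> "Box2D":
        # Union against an empty box is a no-op in both directions.
        if other.is_empty():
            return self
        if self.is_empty():
            return other
        return Box2D(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )

    def serialize(self) -> Tuple[float, float, float, float]:
        # Stand-in for the struct<xmin,ymin,xmax,ymax> representation.
        return (self.xmin, self.ymin, self.xmax, self.ymax)

    @staticmethod
    def deserialize(row: Tuple[float, float, float, float]) -> "Box2D":
        return Box2D(*row)
```

Because the empty marker lives in the four doubles themselves, an empty box survives the serialize/deserialize round trip without needing a nullable field.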
Drops the in-band 'xmin > xmax' empty marker. A Box2D is now always a valid finite bbox; absence (bbox of empty geometry, extent over zero rows) is represented by SQL NULL at the column level. This matches PostGIS behavior (where Box2D(EMPTY) returns NULL) and leaves xmin > xmax free for a future antimeridian-wraparound semantics on geography bboxes (cf. sedona-db's WraparoundInterval, S2's S2LatLngRect). Drops Box2D.empty() / isEmpty() and the Python equivalents. The expandToInclude(null) no-op is preserved so aggregation buffers can fold over a stream of geometries that may produce null bboxes.
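A rough sketch of the null-based semantics described in this commit, in plain Python. It is an illustration under assumptions rather than the PR's implementation: `expand_to_include` returns a new box here (the JVM version may mutate in place), and `extent` is a hypothetical helper standing in for an aggregation buffer.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Box2D:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def expand_to_include(self, other: Optional["Box2D"]) -> "Box2D":
        # A null bbox (e.g. from an empty geometry) leaves the box unchanged.
        if other is None:
            return self
        return Box2D(
            min(self.xmin, other.xmin),
            min(self.ymin, other.ymin),
            max(self.xmax, other.xmax),
            max(self.ymax, other.ymax),
        )


def extent(boxes: Iterable[Optional[Box2D]]) -> Optional[Box2D]:
    # Hypothetical aggregation fold: None plays the role of SQL NULL.
    acc: Optional[Box2D] = None
    for box in boxes:
        if acc is None:
            acc = box  # may still be None if no real bbox has been seen
        else:
            acc = acc.expand_to_include(box)  # None input is a no-op
    return acc
```

With this shape, `extent([])` and `extent([None, None])` both yield `None`, which corresponds to returning SQL NULL for an extent over zero rows or over only empty geometries.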
@zhangfengcdt @paleolimbot what do you think of this?
paleolimbot left a comment
I can't speak to the Spark details but the definition looks good to me!
It also matches the GeoArrow naming and definition of the box type, which is what we'll match this with in SedonaDB: https://geoarrow.org/format.html#box
Would it be clearer to just name it BOXUDT, with the possibility of extending to z and m dimensions later on (if needed)? The Parquet bbox is not limited to the 2D case.
@zhangfengcdt Yes, there will be a BOX3D type. This is to maintain compatibility with PostGIS box2d and box3d.
Got it, makes sense to me.
fromEnvelope(Envelope) and toEnvelope() are not used by the Phase 1 SQL surface (ST_Box2D, ST_MakeBox2D, ST_Extent, accessors, CAST AS geometry, ST_AsText). Removing them in line with the PostGIS box function set we're targeting.
The polygon conversion is only needed by CAST(box2d AS geometry), which lands with the function PR. Dropping until then keeps Box2D as pure data plus the Geometry intake (fromGeometry) and the merge primitive (expandToInclude) that ST_Extent needs. Removes Polygon, Coordinate, GeometryFactory imports.
Did you read the Contributor Guide?
Is this PR related to a ticket?
What changes were proposed in this PR?
Adds the `Box2D` value type and its UDT, the foundation for the bbox work tracked in #2877. Functions (`ST_Box2D`, `ST_MakeBox2D`, `ST_Extent`, accessor overloads, casts) follow in subsequent PRs.
Field names (`xmin/ymin/xmax/ymax`) match the GeoParquet 1.1 spec and `apache/sedona-db`'s GeoParquet writer for direct cross-engine interop.
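To make the field-name alignment concrete, the snippet below builds the `covering` entry that GeoParquet 1.1 column metadata uses to point at a bbox struct column. It is only a sketch: the helper name `bbox_covering` is hypothetical, and the column name `bbox` is an assumed example rather than anything this PR fixes.

```python
import json

# Field names shared by the Box2D struct and the GeoParquet 1.1 bbox covering.
BBOX_FIELDS = ("xmin", "ymin", "xmax", "ymax")


def bbox_covering(column: str = "bbox") -> dict:
    # Each covering entry is a path: [struct column name, field name].
    return {"bbox": {name: [column, name] for name in BBOX_FIELDS}}


print(json.dumps({"covering": bbox_covering()}, indent=2))
```

Since the struct fields carry the same names as the covering keys, a writer can reference them directly without renaming.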
How was this patch tested?
`Box2DUDTSuite` (new) covers:
Python `Box2DType` was smoke-tested locally for `serialize` / `deserialize` round-trip and `scalaUDT` linkage. Function-level Python tests arrive with the function PRs that introduce constructors.
Did this PR include necessary documentation updates?