[AURON #2154] Implement native function of str_to_map#2190

Open
weimingdiit wants to merge 1 commit into apache:master from weimingdiit:feat/str_to_map_function

@weimingdiit weimingdiit commented Apr 10, 2026

Which issue does this PR close?

Closes #2154

Rationale for this change

This PR implements native support for Spark str_to_map(text[, pairDelim[, keyValueDelim]]) in Auron.

In addition to adding native execution for the function itself, this change aligns delimiter handling with Spark semantics by evaluating pairDelim and keyValueDelim with Java regex behavior instead of Rust regex behavior. This is important because Spark’s StringToMap semantics are based on Java regex splitting, and some valid Spark patterns, such as look-around expressions, are not supported by Rust’s regex engine.
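To make the look-around point concrete, here is a small Java sketch. The class and method names (`LookAroundSplit`, `splitPairs`) and the sample input are illustrative assumptions, not code from this PR. It splits only on commas preceded by a digit, a look-behind pattern that `java.util.regex.Pattern` accepts and that Rust's `regex` crate does not support.

```java
import java.util.regex.Pattern;

// Illustrative only: names and inputs are assumptions, not part of the PR.
final class LookAroundSplit {

    // Split text into pairs with Java regex semantics.
    // A limit of -1 keeps trailing empty strings, as described for pairDelim.
    static String[] splitPairs(String text, String pairDelim) {
        return Pattern.compile(pairDelim).split(text, -1);
    }

    public static void main(String[] args) {
        String text = "a:x,1,b:2";

        // Plain comma delimiter: every comma splits, so the value "x,1" is cut in two.
        System.out.println(String.join(" | ", splitPairs(text, ",")));

        // Look-behind delimiter: only the comma that follows a digit splits.
        System.out.println(String.join(" | ", splitPairs(text, "(?<=\\d),")));
    }
}
```

With the look-behind delimiter, the comma inside the value `x,1` is preserved, which is the kind of behavior a Rust-regex-only implementation could not reproduce.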

Motivation

str_to_map is a commonly used Spark function for constructing maps from delimited strings. Before this change, Auron did not support it natively.

A straightforward Rust implementation can handle simple regex delimiters, but it does not fully match Spark semantics because Spark uses Java regex behavior for both delimiters. That difference can lead to incorrect splitting or execution errors for regex patterns that are valid in Spark but unsupported in Rust regex.

This PR addresses both gaps:

  • it adds native support for str_to_map
  • it makes delimiter splitting follow Spark-compatible Java regex semantics

What changes are included in this PR?

This PR:

  • adds native conversion for Spark StringToMap expressions in NativeConverters
  • registers a new native function entry point for Spark_StrToMap
  • implements native str_to_map evaluation in datafusion-ext-functions
  • propagates nulls consistently with Spark semantics
  • applies pairDelim using Java regex split(..., -1)
  • applies keyValueDelim using Java regex split(..., 2)
  • preserves Spark behavior where a missing value becomes null
  • preserves Spark duplicate-key handling via spark.sql.mapKeyDedupPolicy
  • adds a JVM bridge helper so delimiter splitting uses Java Pattern semantics instead of Rust regex semantics
  • adds integration coverage in AuronFunctionSuite for standard cases, regex delimiters, Java-regex-specific delimiters, duplicate keys, and LAST_WIN dedup policy
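Read together, the bullets above describe roughly the following reference model. This is a sketch in Java (chosen because the splitting rules are Java's), not the PR's Rust implementation. `StrToMapSketch`, `strToMap`, and `DedupPolicy` are hypothetical names, and the sketch assumes that any null argument yields a null result and that the policy values are `EXCEPTION` and `LAST_WIN`, as in Spark's `spark.sql.mapKeyDedupPolicy`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reference model of the described semantics; not the PR's code.
final class StrToMapSketch {

    // Assumed to mirror the values of spark.sql.mapKeyDedupPolicy.
    enum DedupPolicy { EXCEPTION, LAST_WIN }

    static Map<String, String> strToMap(
            String text, String pairDelim, String keyValueDelim, DedupPolicy policy) {
        // Assumption: a null in any argument makes the whole result null.
        if (text == null || pairDelim == null || keyValueDelim == null) {
            return null;
        }
        // LinkedHashMap keeps first-insertion order; put() on an existing key
        // only replaces the value, which models LAST_WIN.
        Map<String, String> result = new LinkedHashMap<>();
        // Limit -1: keep trailing empty pairs.
        for (String pair : text.split(pairDelim, -1)) {
            // Limit 2: split at the first key/value delimiter only.
            String[] kv = pair.split(keyValueDelim, 2);
            String key = kv[0];
            // A pair without a delimiter has no value, so the value is null.
            String value = kv.length > 1 ? kv[1] : null;
            if (policy == DedupPolicy.EXCEPTION && result.containsKey(key)) {
                throw new IllegalStateException("Duplicate map key: " + key);
            }
            result.put(key, value);
        }
        return result;
    }
}
```

For example, `strToMap("a:1,b:2,c", ",", ":", DedupPolicy.EXCEPTION)` maps `a` to `1`, `b` to `2`, and `c` to null, while `strToMap("a:1,a:2", ",", ":", DedupPolicy.LAST_WIN)` keeps a single entry for `a` with value `2`.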

Why this design?

The main design choice in this PR is to use Java regex splitting through the existing JNI bridge instead of relying only on Rust’s regex crate.

This was chosen because Spark semantics for str_to_map are defined by Java regex behavior. Using Java Pattern.split avoids semantic mismatches for look-around and other Java-regex-specific constructs, so native str_to_map returns the same results as Spark for patterns that Rust’s regex crate would reject or split differently.

The native side still owns the rest of the function logic, including:

  • row-wise null propagation
  • duplicate-key handling
  • map construction in Arrow/native format

This keeps the Spark-specific regex semantics on the JVM side, where Java’s regex engine is available, while preserving the native execution path for the rest of the work.
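As a rough picture of what the JVM side of such a split could look like, here is a minimal Java sketch. The class name `JavaRegexSplitHelper`, the method signature, and the pattern cache are all assumptions for illustration; the helper actually added in this PR may be named and shaped differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical JVM-side helper; the name, signature, and cache are assumptions.
final class JavaRegexSplitHelper {

    // Assumption: delimiters are usually constant across rows, so caching the
    // compiled Pattern avoids recompiling the regex for every call.
    private static final ConcurrentHashMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    private JavaRegexSplitHelper() {}

    // Splits text with Java regex semantics; limit has the same meaning as in
    // Pattern.split (-1 for pairs, 2 for key/value in the description above).
    static String[] split(String text, String regex, int limit) {
        if (text == null || regex == null) {
            return null; // the caller decides how nulls propagate
        }
        Pattern pattern = CACHE.computeIfAbsent(regex, Pattern::compile);
        return pattern.split(text, limit);
    }
}
```

In this picture the native side would call the helper once per delimiter application and handle null propagation, duplicate keys, and Arrow map construction itself.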

How was this patch tested?

CI.

Copilot AI left a comment

Pull request overview

Adds native execution support for Spark SQL str_to_map(text[, pairDelim[, keyValueDelim]]) via Auron’s extension-function mechanism, so this common map-construction function runs in the native engine with Spark-aligned null propagation and duplicate-key (dedup-policy) behavior.

Changes:

  • Added Spark plan conversion for StringToMap to route through new native ext function Spark_StrToMap, including passing spark.sql.mapKeyDedupPolicy.
  • Implemented str_to_map in datafusion-ext-functions and registered it in the native extension-function registry.
  • Added Scala regression tests and Rust unit tests for core semantics (defaults, regex delimiters, null propagation, duplicate-key policy).

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Routes Spark StringToMap expression to native Spark_StrToMap and passes dedup policy. |
| spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala | Adds native-vs-Spark regression tests for str_to_map semantics and dedup behavior. |
| native-engine/datafusion-ext-functions/src/spark_map.rs | Implements native str_to_map and adds Rust unit tests. |
| native-engine/datafusion-ext-functions/src/lib.rs | Registers Spark_StrToMap in the extension-function factory. |
| native-engine/datafusion-ext-functions/Cargo.toml | Adds regex dependency for delimiter splitting. |
| Cargo.lock | Updates lockfile for the new regex dependency/version. |

@weimingdiit weimingdiit force-pushed the feat/str_to_map_function branch from e252e23 to 5759e52 Compare April 12, 2026 05:23
@github-actions github-actions bot added the core label Apr 12, 2026
@weimingdiit weimingdiit force-pushed the feat/str_to_map_function branch 2 times, most recently from 70a345a to a846079 Compare April 12, 2026 06:55
Signed-off-by: weimingdiit <weimingdiit@gmail.com>