[AURON #2154] Implement native function of str_to_map #2190
Open
weimingdiit wants to merge 1 commit into apache:master from
Pull request overview
Adds native execution support for Spark SQL str_to_map(text[, pairDelim[, keyValueDelim]]) via Auron’s extension-function mechanism, enabling this common map-construction function to run in the native engine with Spark-aligned null + dedup-policy behavior.
Changes:
- Added Spark plan conversion for `StringToMap` to route through the new native ext function `Spark_StrToMap`, including passing `spark.sql.mapKeyDedupPolicy`.
- Implemented `str_to_map` in `datafusion-ext-functions` and registered it in the native extension-function registry.
- Added Scala regression tests and Rust unit tests for core semantics (defaults, regex delimiters, null propagation, duplicate-key policy).
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Routes Spark StringToMap expression to native Spark_StrToMap and passes dedup policy. |
| spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala | Adds native-vs-Spark regression tests for str_to_map semantics and dedup behavior. |
| native-engine/datafusion-ext-functions/src/spark_map.rs | Implements native str_to_map and adds Rust unit tests. |
| native-engine/datafusion-ext-functions/src/lib.rs | Registers Spark_StrToMap in the extension-function factory. |
| native-engine/datafusion-ext-functions/Cargo.toml | Adds regex dependency for delimiter splitting. |
| Cargo.lock | Updates lockfile for the new regex dependency/version. |
Signed-off-by: weimingdiit <weimingdiit@gmail.com>
Which issue does this PR close?
Closes #2154
Rationale for this change
This PR implements native support for Spark `str_to_map(text[, pairDelim[, keyValueDelim]])` in Auron.
In addition to adding native execution for the function itself, this change aligns delimiter handling with Spark semantics by evaluating `pairDelim` and `keyValueDelim` with Java regex behavior instead of Rust regex behavior. This is important because Spark’s `StringToMap` semantics are based on Java regex splitting, and some valid Spark patterns, such as look-around expressions, are not supported by Rust’s regex engine.
Motivation
`str_to_map` is a commonly used Spark function for constructing maps from delimited strings. Before this change, Auron did not support it natively.
A straightforward Rust implementation can handle simple regex delimiters, but it does not fully match Spark semantics because Spark uses Java regex behavior for both delimiters. That difference can lead to incorrect splitting or execution errors for regex patterns that are valid in Spark but unsupported in Rust regex.
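To make the regex gap concrete, here is a minimal Java sketch of a delimiter that relies on look-around. The class and method names (`LookAroundSplitDemo`, `splitPairs`, `splitKeyValue`) and the sample patterns are made up for illustration and are not taken from this PR; only `java.util.regex.Pattern` and its `split(CharSequence, int)` method are real JDK APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Auron.
public class LookAroundSplitDemo {

    // Split on the zero-width boundary between a digit and the letter that follows it.
    // Look-behind and look-ahead are valid in Java regex.
    static List<String> splitPairs(String text) {
        Pattern pairDelim = Pattern.compile("(?<=\\d)(?=[a-z])");
        return Arrays.asList(pairDelim.split(text, -1));
    }

    // Split one pair at the boundary between a letter and the digit that follows it,
    // producing at most two parts (limit = 2).
    static List<String> splitKeyValue(String pair) {
        Pattern keyValueDelim = Pattern.compile("(?<=[a-z])(?=\\d)");
        return Arrays.asList(keyValueDelim.split(pair, 2));
    }

    public static void main(String[] args) {
        System.out.println(splitPairs("a1b2c3"));  // [a1, b2, c3]
        System.out.println(splitKeyValue("a1"));   // [a, 1]
    }
}
```

A pattern of this shape is the kind of delimiter that a pure Rust `regex` implementation would reject, since that crate does not support look-around.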
This PR addresses both gaps:
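- native execution of `str_to_map`, which Auron previously lacked
- delimiter splitting that follows Java regex semantics, as Spark does
What changes are included in this PR?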
This PR:
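- converts `StringToMap` expressions in `NativeConverters`
- registers the native extension function `Spark_StrToMap`
- implements `str_to_map` evaluation in `datafusion-ext-functions`
- splits the input on `pairDelim` using Java regex `split(..., -1)`
- splits each pair on `keyValueDelim` using Java regex `split(..., 2)`
- keeps `null` handling aligned with Spark
- honors `spark.sql.mapKeyDedupPolicy`
- uses Java `Pattern` semantics instead of Rust regex semantics
- adds tests to `AuronFunctionSuite` for standard cases, regex delimiters, Java-regex-specific delimiters, duplicate keys, and the `LAST_WIN` dedup policy
Why this design?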
The main design choice in this PR is to use Java regex splitting through the existing JNI bridge instead of relying only on Rust’s regex crate.
This was chosen because Spark semantics for `str_to_map` are defined by Java regex behavior. Using Java `Pattern.split` avoids semantic mismatches for constructs such as look-around and other Java-regex-specific behavior. That gives better correctness and makes native `str_to_map` behavior match Spark more closely.
The native side still owns the rest of the function logic.
This keeps the Spark-specific regex semantics where they belong while preserving the native execution path for the rest of the work.
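The splitting and dedup rules this PR targets can be sketched in plain Java as a reference model. This is not the PR's implementation (which runs natively and reaches Java regex through JNI); `StrToMapSketch`, `strToMap`, and the `lastWin` flag are hypothetical names, and the null and duplicate-key handling shown is an assumption based on the Spark behavior described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical reference model of str_to_map semantics, not Auron code.
public class StrToMapSketch {

    // lastWin = true models spark.sql.mapKeyDedupPolicy=LAST_WIN;
    // lastWin = false models the EXCEPTION policy (assumed to fail on duplicates).
    static Map<String, String> strToMap(String text, String pairDelim,
                                        String keyValueDelim, boolean lastWin) {
        // Assumption: any null argument yields a null result.
        if (text == null || pairDelim == null || keyValueDelim == null) {
            return null;
        }
        Pattern kvPattern = Pattern.compile(keyValueDelim);
        Map<String, String> result = new LinkedHashMap<>();
        // limit = -1 keeps trailing empty strings, so no pair is silently dropped.
        for (String pair : Pattern.compile(pairDelim).split(text, -1)) {
            // limit = 2 splits only at the first delimiter; the rest stays in the value.
            String[] parts = kvPattern.split(pair, 2);
            String key = parts[0];
            String value = parts.length > 1 ? parts[1] : null; // missing value maps to null
            if (!lastWin && result.containsKey(key)) {
                throw new IllegalStateException("Duplicate map key: " + key);
            }
            result.put(key, value); // under LAST_WIN the later value overwrites the earlier one
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(strToMap("a:1,b:2,c", ",", ":", false)); // {a=1, b=2, c=null}
        System.out.println(strToMap("a:1:x", ",", ":", false));     // {a=1:x}
        System.out.println(strToMap("a:1,a:2", ",", ":", true));    // {a=2}
    }
}
```

With `lastWin` set to `false`, the input `a:1,a:2` throws in this model, which is the kind of policy-dependent difference the `AuronFunctionSuite` tests are meant to cover.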
How was this patch tested?
CI.