Conversation
44a5522 to
bc0d4e4
Compare
|
what exactly the query that failed in spark? I checked DF corr and PGQL corr works the same. PGSQL |
Thanks @comphead Just to double check, you haven't enabled Comet for Spark? |
|
I have some feeling Comet is not using DF based corr and uses its own implementation More correct behavior is to delegate this to DF like for count |
comphead
left a comment
There was a problem hiding this comment.
Thanks @kazuyukitanimura I think we need to try to delegate corr to DF correlation::corr_udaf() and remove Comet old implementation
parthchandra
left a comment
There was a problem hiding this comment.
lgtm. Suggestions are non-blocking
| CREATE TABLE test_corr_nan(x double, y double, grp string) USING parquet | ||
|
|
||
| statement | ||
| INSERT INTO test_corr_nan VALUES (cast('NaN' as double), cast('NaN' as double), 'both_nan'), (cast('NaN' as double), 1.0, 'nan_val'), (1.0, cast('NaN' as double), 'val_nan'), (NULL, cast('NaN' as double), 'null_nan'), (cast('NaN' as double), NULL, 'nan_null'), (NULL, NULL, 'both_null'), (NULL, 1.0, 'null_val'), (1.0, NULL, 'val_null') |
There was a problem hiding this comment.
Maybe add a group with mixed nan and valid rows ( eg [(NaN, NaN), (1.0, 2.0), (3.0, 4.0)] )
|
|
||
| object CometCorr extends CometAggregateExpressionSerde[Corr] { | ||
|
|
||
| override def getSupportLevel(expr: Corr): SupportLevel = |
There was a problem hiding this comment.
Claude flagged some edge cases we can document -
▎ 1. Legacy mode: When spark.sql.legacy.statisticalAggregate=true, nullOnDivideByZero is false and Spark returns NaN for the n=1 case. With this workaround, Comet would return null instead (because the NaN row gets skipped → n=0). Should we add a getSupportLevel guard that returns Incompatible when
corr.nullOnDivideByZero is false? Or at least document this?
▎ 2. Mixed groups: For a group containing (NaN, NaN) alongside valid pairs like (1.0, 2.0), Spark returns NaN (NaN contaminates the accumulator), while this workaround would skip the NaN row and compute a valid correlation over the remaining rows. Is that a known limitation we're OK with?
Which issue does this PR close?
Closes #2646
Rationale for this change
This is a workaround for the behavior for #2646 (comment)
What changes are included in this PR?
When both inputs to
CorrareNaN, returnNullHow are these changes tested?
Added tests