Bug
VerificationResult.rowLevelResultsAsDataFrame() returns true for every row in the row-level check columns when the input DataFrame has multiple partitions (e.g., one read from a Delta table with spark.table()).
Reproduction
- Read a DataFrame from a persisted Delta table (~200K rows, multiple partitions).
- Run a VerificationSuite with an isUnique constraint. The aggregate result correctly reports uniqueness < 1.0.
- Call VerificationResult.rowLevelResultsAsDataFrame(spark, result, data).
- The Boolean column is true for all rows; no rows are flagged as failures.
A minimal in-memory DataFrame (e.g., 4 rows with spark.createDataFrame) works correctly.
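The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact script that produced the report: the helper names `run_unique_check` and `count_flagged`, the check description string, and the `SPARK_VERSION` default are assumptions added for illustration. The `rowLevelResultsAsDataFrame(spark, result, data)` call shape is copied from the report.

```python
import os

# Assumption: recent pydeequ releases read SPARK_VERSION when imported.
os.environ.setdefault("SPARK_VERSION", "3.5")


def run_unique_check(spark, data, column):
    """Run an isUnique check and return (aggregate results, row-level results)."""
    # Imported lazily so the helper can be defined without a Spark session.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    # "uniqueness check" is an arbitrary description chosen for this sketch.
    check = Check(spark, CheckLevel.Error, "uniqueness check").isUnique(column)
    result = VerificationSuite(spark).onData(data).addCheck(check).run()

    # Aggregate view: one row per constraint with its status.
    aggregate = VerificationResult.checkResultsAsDataFrame(spark, result)
    # Row-level view: call shape taken from the report above.
    row_level = VerificationResult.rowLevelResultsAsDataFrame(spark, result, data)
    return aggregate, row_level


def count_flagged(flags):
    """Count rows flagged as failures in a collected list of Booleans."""
    return sum(1 for flag in flags if flag is False)
```

With a DataFrame loaded via spark.table() on a table that contains duplicates, the aggregate result reports a failure while collecting the row-level Boolean column and passing it to `count_flagged` yields 0, which is the mismatch described here.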
Expected behavior
Duplicate rows should have false in the row-level Boolean column.
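To make the expected output concrete, the flags can be computed independently of Deequ and compared against the row-level column. The sketch below is an assumption about what "correct" looks like, not Deequ's implementation: `expected_unique_flags` is a pure-Python reference for small samples, and `with_expected_unique_flag` is a hypothetical PySpark equivalent using a window count. Null handling may differ from Deequ's.

```python
from collections import Counter


def expected_unique_flags(values):
    """Reference semantics: a row is True only if its value occurs exactly once."""
    counts = Counter(values)
    return [counts[value] == 1 for value in values]


def with_expected_unique_flag(data, column, flag_column="expected_unique"):
    """Add a Boolean column that is False for every row whose value is duplicated."""
    # Imported lazily so the reference function above works without PySpark.
    from pyspark.sql import Window, functions as F

    # Count how many rows share each value of `column`.
    occurrences = F.count(F.lit(1)).over(Window.partitionBy(column))
    return data.withColumn(flag_column, occurrences == 1)
```

For the values a, b, a, c the reference function marks both a rows as False and the b and c rows as True.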
Workaround
Calling df.repartition(1) on the DataFrame before passing it to the VerificationSuite produces correct row-level results. This suggests a row-ordering or partition-alignment issue in the underlying Deequ JVM method.
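A sketch of the workaround, under stated assumptions: the helper name `row_level_results_single_partition` is hypothetical, and passing the same repartitioned DataFrame to both the suite and rowLevelResultsAsDataFrame is an assumption, since the report only states that the repartition happens before the suite runs.

```python
def row_level_results_single_partition(spark, data, check):
    """Run `check` on a single-partition copy of `data` and return row-level results."""
    # Imported lazily so the helper can be defined without a Spark session.
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Collapse to one partition; this forces a full shuffle of the data.
    single = data.repartition(1)
    result = VerificationSuite(spark).onData(single).addCheck(check).run()
    # Assumption: reuse the same repartitioned DataFrame for the row-level call.
    return VerificationResult.rowLevelResultsAsDataFrame(spark, result, single)
```

Funnelling ~200K rows through one partition is likely acceptable, but it removes parallelism and is only a stopgap for larger tables.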
Related
Environment
- Spark 3.5 (Microsoft Fabric)
- pydeequ (latest)
- Delta Lake table source