rowLevelResultsAsDataFrame returns all True on multi-partition DataFrames #272

Description

@billpratt

Bug

VerificationResult.rowLevelResultsAsDataFrame() returns true for every row in the row-level check columns when the input DataFrame has multiple partitions (e.g., read from a Delta table with spark.table()), even when the aggregate check reports failures.

Reproduction

  1. Read a DataFrame from a persisted Delta table (~200K rows, multiple partitions)
  2. Run a VerificationSuite with an isUnique constraint. The aggregate result correctly reports uniqueness < 1.0
  3. Call VerificationResult.rowLevelResultsAsDataFrame(spark, result, data)
  4. The Boolean column is true for all rows, so no rows are flagged as failures

A minimal in-memory DataFrame (e.g., 4 rows with spark.createDataFrame) works correctly.
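
The steps above can be sketched roughly as follows. This is an illustrative helper, not code from the original report: the function name count_row_level_failures, the check description "id_is_unique", and the key column are hypothetical, and the rowLevelResultsAsDataFrame call is written exactly as it appears in step 3. The helper assumes the row-level call appends one Boolean column per check, which it locates by comparing column names rather than relying on a specific name.

```python
from functools import reduce


def count_row_level_failures(spark, df, key_column):
    """Run an isUnique check on df and count the rows flagged false."""
    # Imported lazily so this module loads without Spark installed; pydeequ
    # may also need the SPARK_VERSION environment variable to be set.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    # "id_is_unique" is a hypothetical check description.
    check = Check(spark, CheckLevel.Error, "id_is_unique").isUnique(key_column)
    result = VerificationSuite(spark).onData(df).addCheck(check).run()

    # Call as written in the report (step 3).
    row_level = VerificationResult.rowLevelResultsAsDataFrame(spark, result, df)

    # Assumption: the row-level result is the input plus one Boolean column
    # per check, so any column not in the input is a check outcome.
    added = [c for c in row_level.columns if c not in df.columns]
    if not added:
        raise ValueError("no row-level result columns were added")

    # A row counts as a failure if any added check column is false.
    any_false = reduce(lambda a, b: a | b, [~row_level[c] for c in added])
    return row_level.filter(any_false).count()
```

With a table that contains duplicate keys, count_row_level_failures(spark, spark.table("my_table"), "id") would be expected to return a nonzero count (the table and column names here are placeholders). Per this report it returns 0 on the multi-partition DataFrame.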

Expected behavior

Duplicate rows should have false in the row-level Boolean column.
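
To make the expected semantics concrete, here is a plain-Python sketch that does not use Deequ or Spark. It assumes that every occurrence of a duplicated value is flagged false, not only the later occurrences; the function name expected_unique_flags is hypothetical.

```python
from collections import Counter


def expected_unique_flags(values):
    """Return one Boolean per input value: True only if the value occurs once."""
    counts = Counter(values)
    return [counts[v] == 1 for v in values]


ids = [1, 2, 2, 3]
print(expected_unique_flags(ids))  # [True, False, False, True]
```

The flags stay attached to their rows by position here, which is the property that appears to be lost in the multi-partition case.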

Workaround

Calling df.repartition(1) on the DataFrame before passing it to the VerificationSuite produces correct row-level results. This suggests a row-ordering or partition-alignment issue in the underlying Deequ JVM method.

Environment

  • Spark 3.5 (Microsoft Fabric)
  • pydeequ (latest)
  • Delta Lake table source
