Hello,
I'm trying to set up a notebook for tests or data quality checks; what it's called is not important.
I basically read one table (the output of the ETL process, i.e. the actual data).
Then I read another table and do the calculation in the notebook (the expected data).
I'm stuck at the assertEqual(actual_df, expected_df) part. The assert never seems to work, no matter which library I use.
I tried Chispa (a PySpark testing library, very convenient because it avoids manual collects and helps show the exact rows where the differences are), but it didn't work. So I tried the unittest module, but I get the same problem.
It's as if the part where the collect happens is skipped and the assert is never triggered. (The collect works if I run it in any other cell.)
Here's some code to show you the logic:
# cell 1
expected_data_query = "select ***"
expected_data_df = spark.sql(expected_data_query)
# cell 2
actual_data_query = "select ***"
actual_data_df = spark.sql(actual_data_query)
# cell 3
# starts the PySpark job, then they all end up in the "skipped" state
from chispa import assert_df_equality
assert_df_equality(actual_data_df, expected_data_df)
# cell 4
# same as cell 3
# I can't find the code, but I subclassed unittest.TestCase, wrote a unit test
# method, and then ran it the way the documentation says:
import pytest
retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])
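Since the original unittest code for cell 4 is missing, here is a minimal sketch of what such a check might look like. It is not the original code. The names DataQualityTest and sorted_rows are hypothetical, and the two small lists of tuples are stand-ins for actual_data_df.collect() and expected_data_df.collect() so that the sketch runs without a Spark session.

```python
import unittest

# Hypothetical stand-ins for actual_data_df.collect() and expected_data_df.collect().
# PySpark Row objects are tuples, so plain tuples behave the same way here.
actual_rows = [(1, "a"), (2, "b")]
expected_rows = [(2, "b"), (1, "a")]


def sorted_rows(rows):
    """Return the rows as a list of tuples in a deterministic order,
    so the comparison does not depend on row order.
    Sorting by repr avoids errors when a column contains None."""
    return sorted((tuple(r) for r in rows), key=repr)


class DataQualityTest(unittest.TestCase):
    def test_actual_matches_expected(self):
        self.assertEqual(sorted_rows(actual_rows), sorted_rows(expected_rows))


# In a notebook, load and run the test case explicitly: a bare unittest.main()
# would parse the kernel's command-line arguments and then call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataQualityTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("tests run:", result.testsRun, "failures:", len(result.failures))
```

Printing the result object makes the outcome visible either way, because a passing assertion on its own produces no output in the cell.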