Dataframe Count before and after write command do not match

Riccardo96 — Mon, 25 Nov 2024 17:49:25 GMT

Hi,

I have noticed strange behaviour in a notebook I am developing. When I use the notebook to read a single file, it works correctly. But when I set it to read multiple files at once, using the recursive lookup option, the count I perform before writing to the final table does not match the count after the write (picture attached).

Thanks in advance to everyone able to help me!

Re: Dataframe Count before and after write command do not match

Alberto_Umana — Mon, 25 Nov 2024 18:18:58 GMT

Hello @Riccardo96,

This behavior suggests that rows might be getting dropped or overwritten during the writing process, particularly when using the replaceWhere option with clustering or partitioning.

The replaceWhere option replaces data based on the specified condition (year, month, and day). If multiple files have overlapping data for the same day, some rows might get overwritten.

You can debug this by running the following before and after the write:

df_adobe_nav_utente.groupBy("year", "month", "day").count().show()
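
Once the per-day counts from before and after the write are on the driver, a small helper can point at the days that differ. This is only a sketch in plain Python: `diff_counts` is a hypothetical name, and it assumes each result has already been collected into a dict keyed by (year, month, day).

```python
def diff_counts(before, after):
    """Return {key: (before_count, after_count)} for keys whose counts differ.

    Both arguments are plain dicts mapping (year, month, day) -> row count.
    Assumption: they were built from the collected groupBy result, e.g.
      {(r["year"], r["month"], r["day"]): r["count"]
       for r in df.groupBy("year", "month", "day").count().collect()}
    """
    diffs = {}
    for key in sorted(set(before) | set(after)):
        b = before.get(key, 0)  # a day missing on one side counts as 0 rows
        a = after.get(key, 0)
        if b != a:
            diffs[key] = (b, a)
    return diffs
```

If the differences are concentrated in a few days, overlapping data for those days is a likely suspect. If every day is off by a little, the dataframe itself is probably changing between evaluations.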

Re: Dataframe Count before and after write command do not match

Riccardo96 — Wed, 27 Nov 2024 09:11:14 GMT

I'm working with the Databricks 15.4 LTS runtime.

These are the steps I ran, in order:

Count(*) on dataframe: 99228246 rows
Group by on dataframe, grouping per year, month, day: 99486114 rows
Group by on output table, grouping per year, month, day: 0 rows (empty)
Another count(*) on previous dataframe: 100167165 rows
Group by on output table, grouping per year, month, day: 100031507 rows

Re: Dataframe Count before and after write command do not match

Riccardo96 — Wed, 27 Nov 2024 10:44:14 GMT

I just found out I was populating a column with random values, and those values are filtered on in a join, so the numbers change at each write and count 😅
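
For anyone hitting the same thing, here is a plain-Python analogy of why the counts drift. It is not the notebook's code, and the thread does not show how the random column was produced. `build_rows` and `filtered_count` are hypothetical stand-ins for the lazily evaluated dataframe and for the join that filters on the random column.

```python
import random


def build_rows(n, rng):
    # Stand-in for the lazy dataframe: every call regenerates the random
    # column, much like a plan being re-evaluated for each action.
    return [{"id": i, "bucket": rng.randint(0, 9)} for i in range(n)]


def filtered_count(rows):
    # Stand-in for the join that keeps only some of the random values.
    return sum(1 for row in rows if row["bucket"] < 5)


# Re-evaluating with fresh randomness: the two counts usually differ.
unseeded = random.Random()
count_a = filtered_count(build_rows(1000, unseeded))
count_b = filtered_count(build_rows(1000, unseeded))

# Materializing once and reusing the same rows: the counts always agree.
rows = build_rows(1000, random.Random())
stable_a = filtered_count(rows)
stable_b = filtered_count(rows)

# Using a fixed seed also makes separate evaluations reproducible here.
seeded_a = filtered_count(build_rows(1000, random.Random(42)))
seeded_b = filtered_count(build_rows(1000, random.Random(42)))
```

In Spark terms, the rough equivalents would be writing the intermediate result out once and reading it back before counting, or deriving the column deterministically instead of randomly. Which of these fits depends on what the random column is for.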