Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Dataframe Count before and after write command do not match

Riccardo96
Visitor

Hi,

I have noticed strange behaviour in a notebook I am developing. When I use the notebook to read a single file, it works correctly. When I set it to read multiple files at once, using the recursive lookup option, the row count I take before writing to the final table does not match the count I take after the write (picture attached).

Thanks in advance to everyone able to help me!

 
1 REPLY

Alberto_Umana
Databricks Employee

Hello @Riccardo96,

This behavior suggests that rows might be getting dropped or overwritten during the writing process, particularly when using the replaceWhere option with clustering or partitioning.

The replaceWhere option replaces the existing data that matches the specified condition (here year, month, and day). If multiple files contain data for the same day, rows written from one file might get overwritten by rows written from another.
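To make the overwrite effect concrete, here is a toy model in plain Python (not Spark or Delta code). It assumes, purely for illustration, that the notebook writes each file as a separate batch using the same day predicate; the function name, the date, and the row contents are all made up.

```python
def write_with_replace_where(table, new_rows, predicate):
    """Toy model of a replaceWhere-style overwrite.

    Every existing row that matches `predicate` is dropped,
    then `new_rows` are appended.
    """
    kept = [row for row in table if not predicate(row)]
    return kept + list(new_rows)


def is_target_day(row):
    # Hypothetical condition: year = 2024 AND month = 1 AND day = 15
    return (row["year"], row["month"], row["day"]) == (2024, 1, 15)


# Two hypothetical input files that both hold rows for the same day.
file_a = [{"year": 2024, "month": 1, "day": 15, "id": i} for i in range(3)]
file_b = [{"year": 2024, "month": 1, "day": 15, "id": i} for i in range(3, 5)]

table = []
table = write_with_replace_where(table, file_a, is_target_day)
table = write_with_replace_where(table, file_b, is_target_day)

print(len(file_a) + len(file_b))  # rows read from the files: 5
print(len(table))                 # rows left in the table: 2
```

In this model the second write removes the three rows from the first file because they match the same condition, so the count after writing is lower than the count before. Whether this is what happens in your notebook depends on how the write is actually issued.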

You can debug this by running the following both before and after the write, then comparing the per-day counts:

df_adobe_nav_utente.groupBy("year", "month", "day").count().show()
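If there are many days to compare by eye, a small helper along these lines might make the mismatch easier to spot. This is only a sketch: counts_by_day and diff_counts are hypothetical names, and it assumes the DataFrame has year, month, and day columns as in the line above.

```python
def counts_by_day(df):
    """Collect per-day row counts from a Spark DataFrame into a dict."""
    rows = df.groupBy("year", "month", "day").count().collect()
    # Use r["count"] rather than r.count, which is the tuple method on Row.
    return {(r["year"], r["month"], r["day"]): r["count"] for r in rows}


def diff_counts(before, after):
    """Return {day: (before_count, after_count)} for days whose counts differ."""
    days = set(before) | set(after)
    return {
        day: (before.get(day, 0), after.get(day, 0))
        for day in days
        if before.get(day, 0) != after.get(day, 0)
    }


# Possible usage in the notebook (the table name is a placeholder):
# before = counts_by_day(df_adobe_nav_utente)
# ... run the write ...
# after = counts_by_day(spark.table("<your_final_table>"))
# print(diff_counts(before, after))
```

An empty result means every day has the same count on both sides. Any day that does appear shows which condition value is losing or gaining rows. Note that if the final table also holds days that were not part of this load, those will show up as differences too.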
