Data Engineering

Dataframe Count before and after write command do not match

Riccardo96
New Contributor

Hi,

I have noticed strange behaviour in a notebook I am developing. When I use the notebook to read a single file, it works correctly. But when I set it to read multiple files at once, using the recursive lookup option, the count I perform before writing to the final table does not match the count I perform after the write (picture attached).

Thanks in advance to everyone able to help me!

 
3 REPLIES

Alberto_Umana
Databricks Employee

Hello @Riccardo96,

This behavior suggests that rows might be getting dropped or overwritten during the writing process, particularly when using the replaceWhere option with clustering or partitioning.

The replaceWhere option replaces data based on the specified condition (year, month, and day). If multiple files have overlapping data for the same day, some rows might get overwritten.

You can debug this by running the following before and after writing:

df_adobe_nav_utente.groupBy("year", "month", "day").count().show()
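
To make the before/after comparison easier to read than two `show()` outputs, the per-day counts can be collected and diffed. This is only a minimal sketch: the helper names `day_counts` and `diff_counts` are hypothetical, it assumes the `year`, `month` and `day` columns mentioned above contain no nulls, and the sample numbers are made up for illustration.

```python
def day_counts(df):
    """Collect per-day row counts from a Spark DataFrame into a dict.

    Assumes df has year, month and day columns, as in this thread.
    """
    rows = df.groupBy("year", "month", "day").count().collect()
    # Use r["count"] rather than r.count, which is the tuple method on Row
    return {(r["year"], r["month"], r["day"]): r["count"] for r in rows}


def diff_counts(before, after):
    """Return {day: (count_before, count_after)} for days whose counts differ."""
    days = sorted(set(before) | set(after))
    return {
        d: (before.get(d, 0), after.get(d, 0))
        for d in days
        if before.get(d, 0) != after.get(d, 0)
    }


# Illustrative numbers only, not taken from this thread
before = {(2024, 1, 1): 100, (2024, 1, 2): 250}
after = {(2024, 1, 1): 100, (2024, 1, 2): 240, (2024, 1, 3): 5}
print(diff_counts(before, after))
```

In a notebook this could be called as `diff_counts(day_counts(df_adobe_nav_utente), day_counts(output_df))`, where `output_df` is a placeholder for the final table read back as a DataFrame. An empty result means every day has the same count on both sides.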

Riccardo96
New Contributor

I'm working with the Databricks 15.4 LTS runtime.

These are the steps I ran, in order:

  1. Count(*) on dataframe: 99228246 rows
  2. Group by on dataframe, grouping per year, month, day: 99486114 rows
  3. Group by on output table, grouping per year, month, day: 0 rows (empty)
  4. Another count(*) on previous dataframe: 100167165 rows
  5. Group by on output table, grouping per year, month, day: 100031507 rows 

Riccardo96
New Contributor

I just found out I was populating a column with random values, and those values are then filtered in a join... so the numbers change at each write and each count 😅
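
The effect described here can be reproduced without Spark. The following is a toy model in plain Python, not Spark code: the `LazyFrame` class and its methods are hypothetical stand-ins for a lazily evaluated DataFrame whose plan redraws a random column and filters on it every time an action runs.

```python
import random


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not a Spark API)."""

    def __init__(self, n_rows, keep_probability):
        self.n_rows = n_rows
        self.keep_probability = keep_probability
        self.evaluations = 0

    def _evaluate(self):
        # Every action re-runs the whole plan: the random column is redrawn
        # and the filter (the join, in this thread) keeps a different subset.
        self.evaluations += 1
        return [
            i for i in range(self.n_rows)
            if random.random() < self.keep_probability
        ]

    def count(self):
        return len(self._evaluate())

    def materialize(self):
        # Evaluate once and hand back concrete rows that no longer change
        return self._evaluate()


lazy = LazyFrame(n_rows=100_000, keep_probability=0.5)

count_before_write = lazy.count()    # evaluation 1
written_rows = lazy.materialize()    # evaluation 2, standing in for the write
count_after_write = len(written_rows)

# Counting a materialized snapshot gives the same answer every time
snapshot = lazy.materialize()        # evaluation 3
stable_counts = [len(snapshot) for _ in range(3)]

print(count_before_write, count_after_write)  # these will usually differ
print(stable_counts)
```

In Spark itself, the usual remedies are along the same lines: write the DataFrame once and run every count against the written table, or derive the column deterministically (for example from a hash of a key column) instead of from a random function.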
