How to avoid empty/null keys in DataFrame groupby?

UmeshKacha
New Contributor II

Hi, I have a Spark job that does a group by, which I can't avoid because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. The job shuffles a huge amount of data, and the shuffling and group by slow things down. One reason I see is that my data is skewed: some of my group-by keys are empty. How do I avoid empty group-by keys in a DataFrame? Does DataFrame skip empty group-by keys on its own? I group by around 8 keys.

sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla");

3 REPLIES

silvio
New Contributor II

Hi Umesh,

If you want to completely ignore the null/empty values, you could simply filter before you do the groupBy. But do you want to keep those values?
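A minimal sketch of the filter-first option, assuming the skewed key is a string column named colE and that "empty" means null or blank. The object name, the function name, and the "total" alias are hypothetical, chosen only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, trim}

object DropEmptyKeys {
  // Keep only rows whose grouping key is neither null nor blank,
  // then aggregate. Column names are assumptions for this sketch.
  def sumWithoutEmptyKeys(df: DataFrame): DataFrame =
    df.filter(col("colE").isNotNull && (trim(col("colE")) =!= ""))
      .groupBy(col("colB"), col("colE"))
      .agg(sum(col("colA")).as("total"))
}
```

Note that this drops the null/blank rows from the result entirely, so it only fits when those rows are not needed downstream.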

If you want to keep the null values and avoid the skew, you could try splitting the DataFrame. See if you think this would meet your needs:

val noNulls = sourceFrame
  .filter(!isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD", $"colE")
  .agg(sum($"colA"))

val onlyNulls = sourceFrame
  .filter(isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD")
  .agg(sum($"colA"))

You can also use the null value replacement in DataFrameNaFunctions.
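As a rough illustration of the DataFrameNaFunctions route, the sketch below fills null keys with a placeholder before grouping. It assumes colE is a string column; the "UNKNOWN" sentinel, the object and function names, and the "total" alias are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

object FillNullKeys {
  // Replace null values in colE with a placeholder, then group as usual.
  // "UNKNOWN" is an assumed sentinel; pick one that cannot collide with real keys.
  def sumWithFilledKeys(df: DataFrame): DataFrame =
    df.na.fill("UNKNOWN", Seq("colE"))
      .groupBy(col("colB"), col("colC"), col("colD"), col("colE"))
      .agg(sum(col("colA")).as("total"))
}
```

Keep in mind that a single placeholder still puts all formerly null rows into one group, so on its own this may not relieve the skew.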

Thanks,

Silvio

UmeshKacha
New Contributor II

Hi Silvio, thanks very much for the answer. I don't want to ignore nulls/empty values in the group by, so the above solution will work and won't affect the end results. What do I do with the two DataFrames? Should I union them?

silvio
New Contributor II

@Umesh,

Yes, you could union them to reconstruct the single table, but you first have to add the missing column back, and the columns need to be in the same order (union simply concatenates the two tables by column position, without matching column names).
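One way this could look, as a sketch rather than a definitive implementation: split on the null key, aggregate each part, add colE back to the null-only result as a typed null, and recombine. It uses unionByName (available from Spark 2.3) to match columns by name, which sidesteps the ordering issue; with plain union you would need a select that puts the columns in the same order first. The object and function names and the "total" alias are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum}

object SplitAndUnion {
  def aggregateSplitByNullKey(sourceFrame: DataFrame): DataFrame = {
    // Rows with a real colE value: group by all four keys.
    val noNulls = sourceFrame
      .filter(col("colE").isNotNull)
      .groupBy(col("colB"), col("colC"), col("colD"), col("colE"))
      .agg(sum(col("colA")).as("total"))

    // Rows with a null colE: group by the remaining keys only,
    // then add colE back as a null of the original type so schemas match.
    val onlyNulls = sourceFrame
      .filter(col("colE").isNull)
      .groupBy(col("colB"), col("colC"), col("colD"))
      .agg(sum(col("colA")).as("total"))
      .withColumn("colE", lit(null).cast(sourceFrame.schema("colE").dataType))

    // Columns are resolved by name, so their differing order is not a problem.
    noNulls.unionByName(onlyNulls)
  }
}
```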

Thanks,

Silvio
