Data Engineering

How to avoid empty/null keys in DataFrame groupby?

UmeshKacha
New Contributor II

Hi, I have a Spark job that does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. My job shuffles a huge amount of data, and the shuffle and the group by slow things down. One reason I see is that my data is skewed: some of my group-by keys are empty. How do I avoid empty group-by keys in a DataFrame? Does DataFrame skip empty group-by keys? I group by around 8 keys.

sourceFrame.select("blabla").groupBy("col1", "col2", "col3", ..., "col8").agg("bla bla");

3 REPLIES

silvio
New Contributor II

Hi Umesh,

If you want to completely ignore the null/empty values, you could simply filter them out before you do the groupBy. Or do you want to keep those values?
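As a rough sketch of the filter-first option, something like the following might work. It assumes the key column is a string column named colE and the value column is colA (names borrowed from the example further down in this reply); the helper name `aggregateWithoutEmptyKeys` and the output column `total` are made up for illustration, and `=!=` assumes Spark 2.0 or later.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, trim}

// Hypothetical helper: drop rows whose key is null or blank, then aggregate.
// Assumes colE is a string column; column names are illustrative only.
def aggregateWithoutEmptyKeys(df: DataFrame): DataFrame =
  df.filter(col("colE").isNotNull && trim(col("colE")) =!= "")
    .groupBy("colB", "colC", "colD", "colE")
    .agg(sum("colA").as("total"))
```

Keep in mind that this discards the null/blank rows entirely, so it only fits if those rows are not needed in the result.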

If you want to keep the null values and avoid the skew, you could try splitting the DataFrame. See if you think this would meet your needs:

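import org.apache.spark.sql.functions.{isnull, sum}  // the $"..." syntax also needs the session's implicits._

// Rows where colE is present: group by all four keys
val noNulls = sourceFrame
  .filter(!isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD", $"colE")
  .agg(sum($"colA"))

// Rows where colE is null: group by the remaining keys only
val onlyNulls = sourceFrame
  .filter(isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD")
  .agg(sum($"colA"))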

You can also use the null value replacement in DataFrameNaFunctions (available through sourceFrame.na, for example na.fill).
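A minimal sketch of the null-replacement route, assuming colE is a string column. The sentinel "UNKNOWN" is an assumption and must not collide with a real key value; the helper name `aggregateWithFilledKeys` and the output column `total` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Hypothetical helper: replace null keys in colE with a placeholder, then aggregate.
// "UNKNOWN" is an assumed sentinel value, not something from the original data.
def aggregateWithFilledKeys(df: DataFrame): DataFrame =
  df.na.fill("UNKNOWN", Seq("colE"))
    .groupBy("colB", "colC", "colD", "colE")
    .agg(sum("colA").as("total"))
```

Note that this only relabels the null group: all former nulls still end up under a single key, so on its own it probably won't reduce the skew.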

Thanks,

Silvio

UmeshKacha
New Contributor II

Hi Silvio, thanks very much for the answer. I don't want to ignore nulls/empty values in the group by, so the solution above will work and won't affect the end results. What do I do with the two DataFrames? Should I union them?

silvio
New Contributor II

@Umesh,

Yes, you could union them to reconstruct the single table, but you first have to add the missing column back, and the columns need to be in the same order (union simply concatenates the two tables, matching columns by position rather than by name).
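Putting the pieces together, one possible sketch looks like this. It assumes Spark 2.0 or later (where `union` replaces the older `unionAll`) and the same illustrative column names as before; the helper name `aggregateSplitByNullKey` and the output column `total` are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum}

// Hypothetical helper: aggregate null-key and non-null-key rows separately,
// then stitch the two results back into one table.
def aggregateSplitByNullKey(df: DataFrame): DataFrame = {
  val noNulls = df
    .filter(col("colE").isNotNull)
    .groupBy("colB", "colC", "colD", "colE")
    .agg(sum("colA").as("total"))

  val onlyNulls = df
    .filter(col("colE").isNull)
    .groupBy("colB", "colC", "colD")
    .agg(sum("colA").as("total"))
    // Add the missing key back as a null of the same type as the original column
    .withColumn("colE", lit(null).cast(df.schema("colE").dataType))

  // union matches columns by position, so select the same order on both sides
  val ordered = Seq("colB", "colC", "colD", "colE", "total").map(col(_))
  noNulls.select(ordered: _*).union(onlyNulls.select(ordered: _*))
}
```

On Spark 2.3 or later, `unionByName` could be used instead of the explicit reordering, since it matches columns by name.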

Thanks,

Silvio