How to avoid empty/null keys in DataFrame groupby?
05-21-2016 01:37 PM
Hi, I have a Spark job that does a group by, which I can't avoid because of my use case. I have a large dataset of around 1 TB that I need to process/update in a DataFrame. The job shuffles a huge amount of data, and the shuffling and groupBy slow it down. One reason I see is that my data is skewed: some of my group-by keys are empty. How do I avoid empty group-by keys in a DataFrame? Does DataFrame skip empty group-by keys? I group by on around 8 keys:
sourceFrame.select("blabla").groupBy("col1","col2","col3",..."col8").agg("bla bla");
06-03-2016 09:39 AM
Hi Umesh,
If you want to completely ignore the null/empty values, you could simply filter them out before you do the groupBy. Or do you want to keep those values?
If you want to keep the null values and avoid the skew, you could try splitting the DataFrame. See if you think this would meet your needs:
val noNulls = sourceFrame
  .filter(!isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD", $"colE")
  .agg(sum($"colA"))

val onlyNulls = sourceFrame
  .filter(isnull($"colE"))
  .groupBy($"colB", $"colC", $"colD")
  .agg(sum($"colA"))
You can also use the null value replacement in DataFrameNaFunctions.
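As a rough illustration of the null replacement route, here is a minimal sketch, assuming the Spark 2.x+ SparkSession API and a string-typed colE. The object name NullKeyFill, the placeholder "UNKNOWN", the output column "total", and the sample rows are all made up for the example:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object NullKeyFill {
  // Replace null values of the grouping key colE with a placeholder, then aggregate.
  // "UNKNOWN" is a hypothetical sentinel; pick one that cannot collide with a real key.
  def fillAndAggregate(df: DataFrame): DataFrame =
    df.na.fill("UNKNOWN", Seq("colE"))
      .groupBy("colB", "colE")
      .agg(sum("colA").as("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-key-fill")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: colE is null for two of the three rows.
    val sourceFrame = Seq[(Long, String, Option[String])](
      (10L, "b1", Some("e1")),
      (20L, "b1", None),
      (30L, "b1", None)
    ).toDF("colA", "colB", "colE")

    fillAndAggregate(sourceFrame).show()
    spark.stop()
  }
}
```

Keep in mind that a single placeholder still sends all of those rows to one group, so this only normalizes the key. On its own it does not spread out a skewed key.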
Thanks,
Silvio
06-03-2016 01:53 PM
Hi Silvio, thanks very much for the answer. I don't want to ignore nulls/empty values in the group by, so the above solution will work and won't affect the end results. What do I do with the two DataFrames? Should I union them?
06-03-2016 03:04 PM
@Umesh,
Yes, you could union them to reconstruct the single table, but you first have to add the missing column back, and the columns need to be in the same order (union simply concatenates the two tables by column position, without regard to column names).
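Roughly, that could look like the sketch below. It assumes Spark 2.x+, where the method is union (on 1.6 it is unionAll), and it reuses the column names from the earlier split example. The object name SplitAndUnion, the output column "total", and the sample rows are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum}

object SplitAndUnion {
  def aggregateSplit(sourceFrame: DataFrame): DataFrame = {
    // Rows with a real colE value are grouped on all four keys.
    val noNulls = sourceFrame
      .filter(col("colE").isNotNull)
      .groupBy("colB", "colC", "colD", "colE")
      .agg(sum("colA").as("total"))

    // Rows with a null colE are grouped without that key.
    val onlyNulls = sourceFrame
      .filter(col("colE").isNull)
      .groupBy("colB", "colC", "colD")
      .agg(sum("colA").as("total"))
      // Add the dropped key back as a typed null so both schemas match.
      .withColumn("colE", lit(null).cast(sourceFrame.schema("colE").dataType))
      // union matches columns by position, so reorder to match noNulls.
      .select(noNulls.columns.map(col): _*)

    noNulls.union(onlyNulls)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-and-union")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val sourceFrame = Seq[(Long, String, String, String, Option[String])](
      (1L, "b", "c", "d", Some("e")),
      (2L, "b", "c", "d", None),
      (3L, "b", "c", "d", None)
    ).toDF("colA", "colB", "colC", "colD", "colE")

    aggregateSplit(sourceFrame).show()
    spark.stop()
  }
}
```

Selecting by noNulls.columns keeps the column order in one place, so the two halves stay aligned if the key list changes later.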
Thanks,
Silvio
