Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-23-2016 08:27 AM
I have to divide a dataframe into multiple smaller dataframes based on values in columns like - gender and state , the end goal is to pick up random samples from each dataframe
I am trying to implement a sample as explained below, I am quite new to this spark/scala, so need some inputs as to how this can be implemented in an efficient way. I have a sample data frame like this:
| id| name|gender|state|dept|
|1|Ram| M| KA| ECE|
|2|Rani| F| AP| CSE|
|3|Bharat| M| KA| EEE|
|4|Jaya| M| MH| MEC|
|5|Sita| F| MH| ECE|
|6|Warner| M| KA| CSE|
|7|Maya| F| UP| EEE|
|8|Chaya| F| UP| CSE|
+---+------+------+-----+----+
I would like to divide this data frame into sub data frames based on gender and state columns .
Firstly, I have divided it into two using filter as per gender:
df1:
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+
|1|Ram| M| KA| ECE|
|3|Bharat| M| KA| EEE|
|4|Jaya| M| MH| MEC|
|6|Warner| M| KA| CSE|
+---+------+------+-----+----+
df2:
+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|2|Rani| F| AP| CSE|
|5|Sita| F| MH| ECE|
|7|Maya| F| UP| EEE|
|8|Chaya| F| UP| CSE|
+---+-----+------+-----+----+
I have created list of genders using
val colName ="gender"val genderList = df.select(colName).distinct().collect()
And then use this in an iterative loop that produce a number of dataframes based on state , e.g.
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
1|Ram| M| KA| ECE|
|3|Bharat| M| KA| EEE|
|6|Warner| M| KA| CSE|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
4|Jaya| M| MH| MEC|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
2|Rani| F| AP| CSE|
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+
|5|Sita| F| MH| ECE|
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+
|7|Maya| F| UP| EEE|
|8|Chaya| F| UP| CSE|
+---+-----+------+-----+----+
but the actual data frame will have a large data which would make this code tedious. Is there a way of doing this in an efficient way?
I'm quite new to this and still learning, so if there is actually a different approach to this problem, let me know I'm open to suggestions.
Regards
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2016 06:54 AM
What's the purpose of creating those smaller dataframes? Are you trying to write them out to separate files?
You could just use a filter command and filter by gender, and then generate random samples for each resulting dataframe if you need to.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-27-2023 02:02 AM
@raela I also have similar usecase. I am writing data to different databricks tables based on colum value.
But I am getting insufficient disk space error and driver is getting killed. I am suspecting
df.select(colName).distinct().collect()
step is taking lot of memory in driver as dataframe is huge.
Is there any recommended way here?

