I have to divide a DataFrame into multiple smaller DataFrames based on the values in columns such as gender and state. The end goal is to pick random samples from each DataFrame.
I am trying to implement the sampling described below. I am quite new to Spark/Scala, so I need some input on how this can be implemented efficiently. I have a sample DataFrame like this:
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  2|  Rani|     F|   AP| CSE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  5|  Sita|     F|   MH| ECE|
|  6|Warner|     M|   KA| CSE|
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+
I would like to divide this DataFrame into sub-DataFrames based on the gender and state columns.
First, I divided it into two using filter on gender:
df1:
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+
df2:
+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|  2| Rani|     F|   AP| CSE|
|  5| Sita|     F|   MH| ECE|
|  7| Maya|     F|   UP| EEE|
|  8|Chaya|     F|   UP| CSE|
+---+-----+------+-----+----+
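The gender split described above could look roughly like the following sketch. It assumes a local SparkSession and rebuilds the sample data inline; the names `spark`, `df`, `df1` and `df2` are only illustrative, and the snippet is written in script style (for example for spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("split-by-gender").getOrCreate()
import spark.implicits._

// Sample data copied from the table in the question.
val df = Seq(
  (1, "Ram", "M", "KA", "ECE"),
  (2, "Rani", "F", "AP", "CSE"),
  (3, "Bharat", "M", "KA", "EEE"),
  (4, "Jaya", "M", "MH", "MEC"),
  (5, "Sita", "F", "MH", "ECE"),
  (6, "Warner", "M", "KA", "CSE"),
  (7, "Maya", "F", "UP", "EEE"),
  (8, "Chaya", "F", "UP", "CSE")
).toDF("id", "name", "gender", "state", "dept")

// One filter per gender value gives the two sub-DataFrames.
val df1 = df.filter(col("gender") === "M")
val df2 = df.filter(col("gender") === "F")

df1.show()
df2.show()
```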
I created a list of genders using:
val colName = "gender"
val genderList = df.select(colName).distinct().collect()
I then use this list in a loop that produces a number of DataFrames based on state, e.g.
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  4|  Jaya|     M|   MH| MEC|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  2|  Rani|     F|   AP| CSE|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  5|  Sita|     F|   MH| ECE|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+
However, the actual DataFrame will be large, which makes this loop-based code tedious. Is there a more efficient way of doing this?
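For reference, the loop-based split could be sketched as below: collect the distinct (gender, state) pairs, then build one filtered DataFrame per pair and sample from each. This is only an illustration under assumptions: the local session, the inline data, the names `keys`, `parts` and `samples`, and the 0.5 fraction and seed are all hypothetical choices, not values from the real job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("split-loop").getOrCreate()
import spark.implicits._

// Sample data copied from the table in the question.
val df = Seq(
  (1, "Ram", "M", "KA", "ECE"), (2, "Rani", "F", "AP", "CSE"),
  (3, "Bharat", "M", "KA", "EEE"), (4, "Jaya", "M", "MH", "MEC"),
  (5, "Sita", "F", "MH", "ECE"), (6, "Warner", "M", "KA", "CSE"),
  (7, "Maya", "F", "UP", "EEE"), (8, "Chaya", "F", "UP", "CSE")
).toDF("id", "name", "gender", "state", "dept")

// Distinct (gender, state) combinations, collected to the driver.
val keys: Array[(String, String)] =
  df.select("gender", "state").distinct().collect().map(r => (r.getString(0), r.getString(1)))

// One filtered DataFrame per combination.
val parts: Map[(String, String), DataFrame] = keys.map { case (g, s) =>
  (g, s) -> df.filter(col("gender") === g && col("state") === s)
}.toMap

// Random sample from each part; fraction and seed are arbitrary illustrative values.
val samples: Map[(String, String), DataFrame] = parts.map { case (k, part) =>
  k -> part.sample(withReplacement = false, fraction = 0.5, seed = 42L)
}

parts.foreach { case (k, part) => println(k); part.show() }
```

Every filter here scans `df` again, so the number of Spark jobs grows with the number of distinct combinations, which is the scaling concern raised above.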
I'm quite new to this and still learning, so if there is a different approach to this problem, please let me know. I'm open to suggestions.
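One different approach that may be worth considering, offered only as a sketch: since the end goal is a random sample per group, Spark's `DataFrameStatFunctions.sampleBy` can draw a stratified sample in a single pass without materialising each sub-DataFrame. The combined `stratum` column, the `_` separator, the uniform 0.5 fraction and the seed below are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("stratified-sample").getOrCreate()
import spark.implicits._

// Sample data copied from the table in the question.
val df = Seq(
  (1, "Ram", "M", "KA", "ECE"), (2, "Rani", "F", "AP", "CSE"),
  (3, "Bharat", "M", "KA", "EEE"), (4, "Jaya", "M", "MH", "MEC"),
  (5, "Sita", "F", "MH", "ECE"), (6, "Warner", "M", "KA", "CSE"),
  (7, "Maya", "F", "UP", "EEE"), (8, "Chaya", "F", "UP", "CSE")
).toDF("id", "name", "gender", "state", "dept")

// sampleBy stratifies on one column, so combine gender and state into a single key.
val withKey = df.withColumn("stratum", concat_ws("_", col("gender"), col("state")))

// Sampling fraction per stratum; 0.5 everywhere is an arbitrary illustrative choice.
val strata = withKey.select("stratum").distinct().collect().map(_.getString(0))
val fractions: Map[String, Double] = strata.map(_ -> 0.5).toMap

// Stratified sample without replacement, with a fixed seed.
val sampled = withKey.stat.sampleBy("stratum", fractions, 42L)
sampled.show()
```

Note that `sampleBy` keeps each row independently with the given probability, so the per-group sample sizes are approximate and a small group can end up with no rows at all.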
Regards