cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala

Rani
New Contributor

I have to divide a dataframe into multiple smaller dataframes based on values in columns like - gender and state , the end goal is to pick up random samples from each dataframe

I am trying to implement a sample as explained below, I am quite new to this spark/scala, so need some inputs as to how this can be implemented in an efficient way. I have a sample data frame like this:

| id|  name|gender|state|dept|
|1|Ram|     M|   KA| ECE|
|2|Rani|     F|   AP| CSE|
|3|Bharat|     M|   KA| EEE|
|4|Jaya|     M|   MH| MEC|
|5|Sita|     F|   MH| ECE|
|6|Warner|     M|   KA| CSE|
|7|Maya|     F|   UP| EEE|
|8|Chaya|     F|   UP| CSE|
+---+------+------+-----+----+

I would like to divide this data frame into sub data frames based on gender and state columns .

Firstly, I have divided it into two using filter as per gender:

df1:

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|1|Ram|     M|   KA| ECE|
|3|Bharat|     M|   KA| EEE|
|4|Jaya|     M|   MH| MEC|
|6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+

df2:

+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|2|Rani|     F|   AP| CSE|
|5|Sita|     F|   MH| ECE|
|7|Maya|     F|   UP| EEE|
|8|Chaya|     F|   UP| CSE|
+---+-----+------+-----+----+

I have created list of genders using

val colName ="gender"val genderList = df.select(colName).distinct().collect()

And then use this in an iterative loop that produce a number of dataframes based on state , e.g.

+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
1|Ram| M| KA| ECE|
|3|Bharat| M| KA| EEE|
|6|Warner| M| KA| CSE|
+---+------+------+-----+----+
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
4|Jaya| M| MH| MEC|
+---+------+------+-----+----+
 +---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+|
2|Rani| F| AP| CSE|
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+
|5|Sita| F| MH| ECE| 
+---+------+------+-----+----+
| id| name|gender|state|dept|
+---+------+------+-----+----+
|7|Maya| F| UP| EEE|
|8|Chaya| F| UP| CSE|
+---+-----+------+-----+----+

but the actual data frame will have a large data which would make this code tedious. Is there a way of doing this in an efficient way?

I'm quite new to this and still learning, so if there is actually a different approach to this problem, let me know I'm open to suggestions.

Regards

2 REPLIES 2

raela
New Contributor III
New Contributor III

What's the purpose of creating those smaller dataframes? Are you trying to write them out to separate files?

You could just use a filter command and filter by gender, and then generate random samples for each resulting dataframe if you need to.

subham0611
New Contributor II

@raela I also have similar usecase. I am writing data to different databricks tables based on colum value.
But I am getting insufficient disk space error and driver is getting killed. I am suspecting 

df.select(colName).distinct().collect()

step is taking lot of memory in driver as dataframe is huge.

Is there any recommended way here? 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.