Data Engineering

Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala

Rani
New Contributor

I have to divide a dataframe into multiple smaller dataframes based on values in columns like gender and state. The end goal is to pick random samples from each dataframe.

I am trying to implement a sample as explained below. I am quite new to Spark/Scala, so I need some input on how this can be implemented efficiently. I have a sample data frame like this:

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  2|  Rani|     F|   AP| CSE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  5|  Sita|     F|   MH| ECE|
|  6|Warner|     M|   KA| CSE|
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+

I would like to divide this data frame into sub data frames based on the gender and state columns.

Firstly, I divided it into two using a filter on gender:

df1:

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+

df2:

+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|  2| Rani|     F|   AP| CSE|
|  5| Sita|     F|   MH| ECE|
|  7| Maya|     F|   UP| EEE|
|  8|Chaya|     F|   UP| CSE|
+---+-----+------+-----+----+
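
(The filter step looks roughly like this; df is the sample data frame above, and this is only a sketch of what I did:)

import org.apache.spark.sql.functions.col

// split the source data frame by gender with two simple filters
val df1 = df.filter(col("gender") === "M")
val df2 = df.filter(col("gender") === "F")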

I have created a list of genders using

val colName = "gender"
val genderList = df.select(colName).distinct().collect()

And then I use this in an iterative loop that produces a number of dataframes based on state (a rough sketch of the loop follows the example outputs below), e.g.

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  4|  Jaya|     M|   MH| MEC|
+---+------+------+-----+----+

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  2|  Rani|     F|   AP| CSE|
+---+------+------+-----+----+

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  5|  Sita|     F|   MH| ECE|
+---+------+------+-----+----+

+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+
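
(A rough sketch of the kind of loop I mean; the names and details are placeholders, not my exact code:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// collect the distinct genders and states (this pulls the values back to the driver)
val genders = df.select("gender").distinct().collect().map(_.getString(0))
val states  = df.select("state").distinct().collect().map(_.getString(0))

// build one filtered data frame per (gender, state) combination
val subFrames: Seq[DataFrame] = for {
  g <- genders.toSeq
  s <- states.toSeq
} yield df.filter(col("gender") === g && col("state") === s)

// drop the empty combinations and show each group
subFrames.filter(_.head(1).nonEmpty).foreach(_.show())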

But the actual data frame will have a lot of data, which would make this code tedious. Is there a way of doing this more efficiently?

I'm quite new to this and still learning, so if there is actually a different approach to this problem, let me know; I'm open to suggestions.

Regards

2 REPLIES

raela
Databricks Employee

What's the purpose of creating those smaller dataframes? Are you trying to write them out to separate files?

You could just use a filter command and filter by gender, and then generate random samples for each resulting dataframe if you need to.
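
(For example, something along these lines; the fraction and seed values are just illustrative:)

import org.apache.spark.sql.functions.col

// filter by gender, then draw a random sample from each resulting dataframe
val males   = df.filter(col("gender") === "M")
val females = df.filter(col("gender") === "F")

val maleSample   = males.sample(withReplacement = false, fraction = 0.1, seed = 42)
val femaleSample = females.sample(withReplacement = false, fraction = 0.1, seed = 42)

// a stratified alternative in one call:
// df.stat.sampleBy("gender", Map("M" -> 0.1, "F" -> 0.1), 42L)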

subham0611
New Contributor II

@raela I also have a similar use case. I am writing data to different Databricks tables based on a column value.
But I am getting an insufficient disk space error and the driver is getting killed. I suspect the

df.select(colName).distinct().collect()

step is taking a lot of memory on the driver because the dataframe is huge.

Is there any recommended way here? 
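
(One common pattern, if the end goal is writing the groups out, is to let the writer split the data instead of collecting distinct values on the driver; a rough sketch, with a hypothetical destination table name:)

// write one partition per (gender, state) value without collecting anything to the driver
df.write
  .mode("overwrite")
  .partitionBy("gender", "state")
  .format("delta")
  .saveAsTable("some_schema.people_by_group")  // hypothetical table name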
