<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/29338#M21078</link>
    <description>&lt;P&gt;I need to divide a dataframe into multiple smaller dataframes based on the values in columns such as gender and state. The end goal is to pick random samples from each resulting dataframe.&lt;/P&gt;&lt;P&gt;I am trying to implement this as explained below. I am quite new to Spark and Scala, so I need some input on how this can be implemented efficiently. I have a sample dataframe like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  2|  Rani|     F|   AP| CSE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  5|  Sita|     F|   MH| ECE|
|  6|Warner|     M|   KA| CSE|
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I would like to divide this dataframe into sub-dataframes based on the gender and state columns.&lt;/P&gt;&lt;P&gt;First, I divided it into two using filter on gender:&lt;/P&gt;&lt;P&gt;df1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;df2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|  2| Rani|     F|   AP| CSE|
|  5| Sita|     F|   MH| ECE|
|  7| Maya|     F|   UP| EEE|
|  8|Chaya|     F|   UP| CSE|
+---+-----+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I have created a list of genders using&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val colName = "gender"
val genderList = df.select(colName).distinct().collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I then use this list in a loop that produces a number of dataframes based on state, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  4|  Jaya|     M|   MH| MEC|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  2|  Rani|     F|   AP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  5|  Sita|     F|   MH| ECE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, the actual dataframe will be large, which would make this approach tedious. Is there a more efficient way of doing this?&lt;/P&gt;&lt;P&gt;I'm quite new to this and still learning, so if there is a different approach to this problem, let me know. I'm open to suggestions.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
    <pubDate>Wed, 23 Nov 2016 16:27:33 GMT</pubDate>
    <dc:creator>Rani</dc:creator>
    <dc:date>2016-11-23T16:27:33Z</dc:date>
    <item>
      <title>Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala</title>
      <link>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/29338#M21078</link>
      <description>&lt;P&gt;I need to divide a dataframe into multiple smaller dataframes based on the values in columns such as gender and state. The end goal is to pick random samples from each resulting dataframe.&lt;/P&gt;&lt;P&gt;I am trying to implement this as explained below. I am quite new to Spark and Scala, so I need some input on how this can be implemented efficiently. I have a sample dataframe like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  2|  Rani|     F|   AP| CSE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  5|  Sita|     F|   MH| ECE|
|  6|Warner|     M|   KA| CSE|
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I would like to divide this dataframe into sub-dataframes based on the gender and state columns.&lt;/P&gt;&lt;P&gt;First, I divided it into two using filter on gender:&lt;/P&gt;&lt;P&gt;df1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  4|  Jaya|     M|   MH| MEC|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;df2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+-----+------+-----+----+
| id| name|gender|state|dept|
+---+-----+------+-----+----+
|  2| Rani|     F|   AP| CSE|
|  5| Sita|     F|   MH| ECE|
|  7| Maya|     F|   UP| EEE|
|  8|Chaya|     F|   UP| CSE|
+---+-----+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I have created a list of genders using&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val colName = "gender"
val genderList = df.select(colName).distinct().collect()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I then use this list in a loop that produces a number of dataframes based on state, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  1|   Ram|     M|   KA| ECE|
|  3|Bharat|     M|   KA| EEE|
|  6|Warner|     M|   KA| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  4|  Jaya|     M|   MH| MEC|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  2|  Rani|     F|   AP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  5|  Sita|     F|   MH| ECE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;+---+------+------+-----+----+
| id|  name|gender|state|dept|
+---+------+------+-----+----+
|  7|  Maya|     F|   UP| EEE|
|  8| Chaya|     F|   UP| CSE|
+---+------+------+-----+----+&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, the actual dataframe will be large, which would make this approach tedious. Is there a more efficient way of doing this?&lt;/P&gt;&lt;P&gt;I'm quite new to this and still learning, so if there is a different approach to this problem, let me know. I'm open to suggestions.&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;</description>
      <pubDate>Wed, 23 Nov 2016 16:27:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/29338#M21078</guid>
      <dc:creator>Rani</dc:creator>
      <dc:date>2016-11-23T16:27:33Z</dc:date>
    </item>
    <item>
      <title>Re: Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala</title>
      <link>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/29339#M21079</link>
      <description>
&lt;P&gt;What's the purpose of creating those smaller dataframes? Are you trying to write them out to separate files?&lt;/P&gt;
&lt;P&gt;You could just use filter to split the dataframe by gender, and then generate random samples from each resulting dataframe if you need to.&lt;/P&gt;
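&lt;P&gt;A minimal sketch of that approach, extended to split on both gender and state as the question asks. It assumes a local SparkSession and rebuilds the sample data from the question. The helper name sampleGroup, the 0.5 fraction and the seed are illustrative placeholders, not values from this thread:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; on Databricks "spark" already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-and-sample").getOrCreate()
import spark.implicits._

// The sample data from the question.
val df = Seq(
  (1, "Ram", "M", "KA", "ECE"),
  (2, "Rani", "F", "AP", "CSE"),
  (3, "Bharat", "M", "KA", "EEE"),
  (4, "Jaya", "M", "MH", "MEC"),
  (5, "Sita", "F", "MH", "ECE"),
  (6, "Warner", "M", "KA", "CSE"),
  (7, "Maya", "F", "UP", "EEE"),
  (8, "Chaya", "F", "UP", "CSE")
).toDF("id", "name", "gender", "state", "dept")

// Collect only the distinct (gender, state) pairs to the driver, not the rows themselves.
val groups: Array[Row] = df.select("gender", "state").distinct().collect()

// Hypothetical helper: filter one group and take a random sample from it.
// The fraction (0.5) and the seed (42) are placeholder values.
def sampleGroup(row: Row): ((String, String), DataFrame) = {
  val gender = row.getString(0)
  val state = row.getString(1)
  val sampled = df
    .filter(col("gender") === gender)
    .filter(col("state") === state)
    .sample(withReplacement = false, fraction = 0.5, seed = 42L)
  ((gender, state), sampled)
}

// One sampled dataframe per (gender, state) pair.
val samples: Map[(String, String), DataFrame] = groups.map(sampleGroup).toMap&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Only the distinct (gender, state) pairs are collected to the driver, so this should stay cheap as long as the number of groups is small. Note that sample draws each row with the given probability, so it does not guarantee an exact number of rows and can return an empty result for very small groups. If you only need the samples and not the separate dataframes, df.stat.sampleBy on a single combined key column may be worth a look instead of the loop.&lt;/P&gt;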
</description>
      <pubDate>Fri, 02 Dec 2016 14:54:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/29339#M21079</guid>
      <dc:creator>raela</dc:creator>
      <dc:date>2016-12-02T14:54:59Z</dc:date>
    </item>
    <item>
      <title>Re: Divide a dataframe into multiple smaller dataframes based on values in multiple columns in Scala</title>
      <link>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/49985#M28671</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/35979"&gt;@raela&lt;/a&gt; I have a similar use case. I am writing data to different Databricks tables based on a column value.&lt;BR /&gt;However, I am getting an insufficient disk space error and the driver is getting killed. I suspect that the&lt;/P&gt;&lt;PRE&gt;df.select(colName).distinct().collect()&lt;/PRE&gt;&lt;P&gt;step is taking a lot of memory on the driver, as the dataframe is huge.&lt;/P&gt;&lt;P&gt;Is there a recommended approach here?&lt;/P&gt;</description>
      <pubDate>Fri, 27 Oct 2023 09:02:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/divide-a-dataframe-into-multiple-smaller-dataframes-based-on/m-p/49985#M28671</guid>
      <dc:creator>subham0611</dc:creator>
      <dc:date>2023-10-27T09:02:07Z</dc:date>
    </item>
  </channel>
</rss>

