Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I split a Spark DataFrame into n equal DataFrames (by rows)? I tried to add a Row ID column to achieve this but was unsuccessful.

NithinTiruveedh
New Contributor II

I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.

This would be easy if I could create a column that contains Row ID. Is that possible?

12 REPLIES

SiddSingal
New Contributor II

Hi Nithin,

You can use the DataFrame's randomSplit function. For example:

val df = sc.parallelize(1 to 10000).toDF("value")  // in a Databricks notebook, spark.implicits are pre-imported, so toDF works on an RDD
val splitDF = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))  // weights must be Doubles; equal weights give roughly equal splits
val Array(df1, df2, df3, df4, df5) = splitDF

The problem with this is that it does not produce perfectly even splits. This might not be a big concern for you, especially since you have many records. Would that be okay?

Sidd

Thank you Sidd!

JosiahB
New Contributor II

I can verify that this answer works. @Nithin Tiruveedhi, could you accept this answer since you mentioned it worked for you? Accepting it will move it to the top and mark your question as solved, making it easier for others to see that this solution works.

I have the same kind of requirement: breaking 200 million rows into equal-sized batches of 10K rows, with the constraint that no batch may exceed 10K rows (fewer is fine). Will this approach work in my case?
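One possible sketch (not from this thread) for enforcing a hard 10K cap: assign a contiguous row number and derive a batch id by integer division, so every batch holds at most 10,000 rows (the last one may be smaller). The DataFrame df and the ordering column id below are hypothetical, and a window with a global order pulls all rows into a single partition, which can be slow at 200M rows.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val batchSize = 10000

// Number the rows, then map each row number to a batch id of at most `batchSize` rows.
val w = Window.orderBy("id")
val withBatch = df
  .withColumn("rn", row_number().over(w))
  .withColumn("batch_id", floor((col("rn") - 1) / batchSize))

// Materialize one batch, e.g. the first:
val batch0 = withBatch.filter(col("batch_id") === 0).drop("rn", "batch_id")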

girivaratharaja
New Contributor III

@Nithin Tiruveedhi Please try the approach below. The example uses word-count data.

val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk, word, count from wordcount")

tmpTable1.registerTempTable("wordcount_rownum")

sqlContext.cacheTable("wordcount_rownum")

val df1 = sqlContext.sql("select * from wordcount_rownum where rnk <= 1000000")
val df2 = sqlContext.sql("select * from wordcount_rownum where rnk between 1000001 and 2000000")
val df3 = sqlContext.sql("select * from wordcount_rownum where rnk between 2000001 and 3000000")
val df4 = sqlContext.sql("select * from wordcount_rownum where rnk between 3000001 and 4000000")
val df5 = sqlContext.sql("select * from wordcount_rownum where rnk between 4000001 and 5000000")

Hope this helps.

Thank you Giri!

@girivaratharajan

Thank you for your answer. I have the same kind of requirement, but I have a few queries about your approach. How does row_number work internally, especially with "over (order by count)"? Will it bring all rows from the executors to one place (which would not be efficient with huge data)? And in your answer above, will row_number produce strictly contiguous, auto-incrementing numbers?

Will it work properly with 200 million records? Please help.

OccasionalVisit
New Contributor II

How can I split it based on columns? Suppose I have 100 columns; I want to break the original DataFrame into 5 DataFrames, each with 20 columns.
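One possible sketch (not from this thread), assuming a DataFrame df and that you simply want consecutive groups of 20 columns:

import org.apache.spark.sql.functions.col

// Group the column names into chunks of 20 and select each chunk as its own DataFrame.
val columnGroups = df.columns.grouped(20).toSeq
val dfs = columnGroups.map(cols => df.select(cols.map(col): _*))

// dfs(0) holds the first 20 columns, dfs(1) the next 20, and so on.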

Don't ask questions as an Answer to a Question.

durgam_anil
New Contributor II

How can I split a dataframe according to size? For example, if I have a dataframe with 10 million records, I want to split it into smaller dataframes such that, after writing, each one does not exceed 100 MB. Is there any possibility like this?
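There is no direct rows-to-megabytes split, but one hedged sketch: estimate the average serialized row size on a sample, then cap the number of records per output file with the DataFrameWriter option maxRecordsPerFile. The bytes-per-row figure and output path below are assumptions you would replace with your own measurements.

// Cap output file size indirectly by capping records per file.
val targetFileBytes = 100L * 1024 * 1024   // ~100 MB per output file
val approxBytesPerRow = 100L               // assumption: measure this on a sample of your data
val rowsPerFile = targetFileBytes / approxBytesPerRow

df.write
  .option("maxRecordsPerFile", rowsPerFile)
  .parquet("/tmp/split_output")            // hypothetical output path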

 

Tharun-Kumar
Databricks Employee

Hi @NithinTiruveedh, an alternate solution would be to use the NTILE() window function.

For your use case, ntile(5) will split your dataset of 5M rows into 5 groups of roughly 1M rows each.
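A minimal sketch of the ntile approach, assuming a sortable column id (hypothetical); note that a window with a global order moves all rows into a single partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// ntile(5) assigns each row to one of 5 roughly equal-sized buckets.
val w = Window.orderBy("id")
val bucketed = df.withColumn("bucket", ntile(5).over(w))

// Materialize one of the five splits, e.g. the first bucket:
val df1 = bucketed.filter(col("bucket") === 1).drop("bucket")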

Anonymous
Not applicable

Hi @NithinTiruveedh 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
