06-20-2016 11:59 AM
I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.
This would be easy if I could create a column that contains a row ID. Is that possible?
06-20-2016 03:01 PM
Hi Nithin,
You can use the DataFrame's randomSplit function. For example:
val df = sc.parallelize(1 to 10000).toDF("value")
val splitDF = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))  // equal weights, so each split is ~20% of the rows
val (df1, df2, df3, df4, df5) = (splitDF(0), splitDF(1), splitDF(2), splitDF(3), splitDF(4))
The caveat is that it does not produce perfectly even splits, because the split is random and each piece is only approximately one fifth of the rows. This might not be a big concern to you, especially since you have many records. Would this be okay?
Sidd
06-21-2016 11:28 AM
Thank you Sidd!
12-20-2017 01:56 PM
I can verify that this answer works. @Nithin Tiruveedhi, since you mentioned it worked for you as well, could you accept it as the solution? That will move this answer to the top and make it obvious to others that it resolved your question.
03-03-2018 10:59 PM
I have a similar requirement: breaking 200 million rows into equal-sized batches of 10K, with the constraint that no batch may have more than 10K rows (fewer is fine). Will it work in my case?
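For what it's worth, randomSplit is random, so it gives only approximate sizes and cannot guarantee a hard 10K cap. A minimal sketch of a rank-based bucketing that does enforce the cap; the names df and id are hypothetical:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}
// Rank every row globally, then bucket the ranks into groups of 10,000,
// so no batch can exceed 10K rows. Note that an unpartitioned window collects
// all rows into one partition, which may be slow at 200M rows.
val w = Window.orderBy(col("id"))
val batched = df
  .withColumn("rnk", row_number().over(w))
  .withColumn("batch_id", floor((col("rnk") - 1) / 10000))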
06-21-2016 09:17 AM
@Nithin Tiruveedhi Please try the approach below; the example uses word-count data.
// Assign a global row number, then slice it into 1M-row ranges.
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk, word, count from wordcount")
tmpTable1.registerTempTable("wordcount_rownum")
sqlContext.cacheTable("wordcount_rownum")
val df1 = sqlContext.sql("select * from wordcount_rownum where rnk <= 1000000")
val df2 = sqlContext.sql("select * from wordcount_rownum where rnk between 1000001 and 2000000")
val df3 = sqlContext.sql("select * from wordcount_rownum where rnk between 2000001 and 3000000")
val df4 = sqlContext.sql("select * from wordcount_rownum where rnk between 3000001 and 4000000")
val df5 = sqlContext.sql("select * from wordcount_rownum where rnk between 4000001 and 5000000")
Hope this helps.
06-21-2016 11:28 AM
Thank you Giri!
03-04-2018 12:40 AM
@girivaratharajan
Thank you for your answer. I have a similar requirement, but a few queries about your approach. How does row_number() work internally, especially with "over (order by count)"? Will it bring all rows to the driver from the executors (which would not be efficient for huge data)? And will row_number in your answer produce a strictly contiguous, auto-incrementing number?
Will it work properly with 200 million records? Please help.
02-26-2017 10:55 PM
How can I split it based on columns? Suppose I have 100 columns; I want to break the original DF into 5 DFs with 20 columns each.
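For illustration, a minimal sketch of one way to slice a DataFrame by columns; the name df and the group size of 20 are assumptions based on the question:
import org.apache.spark.sql.DataFrame
// Slice the column list into groups of 20 and select each group,
// producing 5 narrower DataFrames.
val columnGroups = df.columns.grouped(20).toSeq
val dfs: Seq[DataFrame] = columnGroups.map(cols => df.select(cols.head, cols.tail: _*))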
12-20-2017 01:54 PM
Don't ask questions as an Answer to a Question.
07-07-2023 07:34 AM
How do I split a dataframe according to size? For example, if I have a dataframe with 10 million records, I want to split it into smaller dataframes such that, when written out, each one does not exceed 100 MB.
Is there any way to do this?
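Not from this thread, but one possible sketch: cap the number of rows per output file so each file stays under roughly 100 MB. The row estimate and output path are placeholders you would tune for your data:
// Estimate how many rows of your schema fit in ~100 MB (hypothetical value).
val rowsPer100Mb = 50000L
df.write
  .option("maxRecordsPerFile", rowsPer100Mb)  // supported in Spark 2.2+
  .parquet("/tmp/split_output")               // placeholder output path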
07-09-2023 10:14 PM
Hi @NithinTiruveedh, an alternative solution would be to use the NTILE() window function.
For your use case, ntile(5) will split your dataset of 5M rows into 5 groups of 1M rows each.
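For reference, a minimal Scala sketch of what this could look like; the DataFrame name df and the ordering column id are assumptions, not from the thread:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}
// Note: a window with no partitionBy pulls all rows into a single partition,
// so this can be slow for very large datasets.
val w = Window.orderBy(col("id"))
val bucketed = df.withColumn("bucket", ntile(5).over(w))
// Materialize the five roughly equal groups.
val dfs = (1 to 5).map(i => bucketed.filter(col("bucket") === i).drop("bucket"))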
07-12-2023 01:19 AM
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!