How can I split a Spark DataFrame into n equal DataFrames (by rows)? I tried to add a row ID column to achieve this but was unsuccessful.
06-20-2016 11:59 AM
I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.
This would be easy if I could create a column that contains a row ID. Is that possible?
Labels: Spark dataframe
06-20-2016 03:01 PM
Hi Nithin,
You can use the DataFrame's randomSplit function. For example:
// Build a sample DataFrame (toDF needs the SQL implicits, available by default in the shell/notebook)
val df = sc.parallelize(1 to 10000).toDF("value")
// Equal weights give five roughly equal parts; randomSplit expects Double weights
val splitDF = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))
val (df1, df2, df3, df4, df5) = (splitDF(0), splitDF(1), splitDF(2), splitDF(3), splitDF(4))
The problem with this is that it does not produce perfectly even splits. This might not be a big concern for you, especially because you have many records. Would this be okay?
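If you want to see how close the pieces actually come, a quick count of each split works (a minimal sketch using the splitDF array above):
splitDF.zipWithIndex.foreach { case (part, i) =>
  println(s"split $i: ${part.count()} rows")  // expect roughly 2000 per split, with some run-to-run variation
}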
Sidd
06-21-2016 11:28 AM
Thank you Sidd!
12-20-2017 01:56 PM
I can verify that this answer works. @Nithin Tiruveedhi, since you mentioned it worked for you as well, could you accept it as the answer? Accepting it moves it to the top and marks the question as solved, making it easier for others to find the working solution.
03-03-2018 10:59 PM
I have the same kind of requirement: breaking 200 million rows into equal-sized batches of 10K, with the constraint that no batch may have more than 10K rows (fewer is fine). Will this work in my case?
06-21-2016 09:17 AM
@Nithin Tiruveedhi Please try the approach below; here is an example using word-count logic.
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk, word, count from wordcount")
tmpTable1.registerTempTable("wordcount_rownum")
sqlContext.cacheTable("wordcount_rownum")
val df1 = sqlContext.sql("select * from wordcount_rownum where rnk <= 1000000")
val df2 = sqlContext.sql("select * from wordcount_rownum where rnk between 1000001 and 2000000")
val df3 = sqlContext.sql("select * from wordcount_rownum where rnk between 2000001 and 3000000")
val df4 = sqlContext.sql("select * from wordcount_rownum where rnk between 3000001 and 4000000")
val df5 = sqlContext.sql("select * from wordcount_rownum where rnk between 4000001 and 5000000")
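The same split can also be expressed with the DataFrame API and a window function; here is a rough sketch, assuming a DataFrame named wordcountDF with a "count" column (names are illustrative):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// A global orderBy with no partitionBy funnels every row through a single partition,
// which can become a bottleneck on very large data.
val w = Window.orderBy("count")
val numbered = wordcountDF.withColumn("rnk", row_number().over(w)).cache()

val slice1 = numbered.filter(col("rnk") <= 1000000)
val slice2 = numbered.filter(col("rnk") > 1000000 && col("rnk") <= 2000000)
// ...and likewise for the remaining slices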
Hope this helps.
06-21-2016 11:28 AM
Thank you Giri!
03-04-2018 12:40 AM
@girivaratharajan
Thank you for your answer. I have the same kind of requirement, but I have a few questions about your approach. How does row_number work internally, especially with "over (order by count)"? Will it bring all rows from the executors to the driver (which would not be efficient for huge data)? And in your answer above, will row_number produce strictly contiguous, auto-incrementing numbers?
Will it work properly with 200 million records? Please help.
02-26-2017 10:55 PM
How can I split it based on columns? Suppose I have 100 columns; I want to break the original DF into 5 DFs, each with 20 columns.
12-20-2017 01:54 PM
Please don't post a new question as an answer to an existing question.
07-07-2023 07:34 AM
How can I split a DataFrame according to size?
For example, if I have a DataFrame with 10 million records, I want to split it into smaller DataFrames such that, after writing, each one does not exceed 100 MB.
Is something like this possible?
07-09-2023 10:14 PM
Hi @NithinTiruveedh, an alternative solution would be to use the NTILE() window function.
For your use case, ntile(5) will split your dataset of 5M rows into 5 groups of 1M rows each, as in the sketch below.
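A minimal sketch of what that could look like with the DataFrame API (df and the "value" ordering column are placeholders for your own data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

// ntile(5) assigns each row a bucket number 1..5, producing near-equal groups
val byBucket = df.withColumn("bucket", ntile(5).over(Window.orderBy(col("value"))))
val pieces   = (1 to 5).map(i => byBucket.filter(col("bucket") === i).drop("bucket"))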

07-12-2023 01:19 AM
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!

