06-20-2016 11:59 AM
I have a dataframe that has 5M rows. I need to split it up into 5 dataframes of ~1M rows each.
This would be easy if I could create a column that contains a row ID. Is that possible?
06-20-2016 03:01 PM
Hi Nithin,
You can use the DataFrame's randomSplit function. For example:
val df = sc.parallelize(1 to 10000).toDF("value")
val splitDF = df.randomSplit(Array(1.0, 1.0, 1.0, 1.0, 1.0))  // equal weights, so each split is ~20% of the rows
val (df1, df2, df3, df4, df5) = (splitDF(0), splitDF(1), splitDF(2), splitDF(3), splitDF(4))
The caveat is that it does not produce perfectly even splits, because the split is random and each piece is only approximately one fifth of the rows. This might not be a big concern to you, especially since you have many records. Would this be okay?
Sidd
06-21-2016 11:28 AM
Thank you Sidd!
12-20-2017 01:56 PM
I can verify that this answer works. @Nithin Tiruveedhi, since you mentioned it worked for you as well, could you accept it as the solution? That will move this answer to the top and make it obvious to others that it resolved your question.
03-03-2018 10:59 PM
I have a similar requirement: breaking 200 million rows into equal-sized batches of 10K, with the constraint that no batch may have more than 10K rows (fewer is fine). Will it work in my case?
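For what it's worth, randomSplit is random, so it gives only approximate sizes and cannot guarantee a hard 10K cap. A minimal sketch of a rank-based bucketing that does enforce the cap; the names df and id are hypothetical:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}
// Rank every row globally, then bucket the ranks into groups of 10,000,
// so no batch can exceed 10K rows. Note that an unpartitioned window collects
// all rows into one partition, which may be slow at 200M rows.
val w = Window.orderBy(col("id"))
val batched = df
  .withColumn("rnk", row_number().over(w))
  .withColumn("batch_id", floor((col("rnk") - 1) / 10000))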
06-21-2016 09:17 AM
@Nithin Tiruveedhi Please try the approach below; the example uses word-count data.
// Assign a global row number, then slice it into 1M-row ranges.
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk, word, count from wordcount")
tmpTable1.registerTempTable("wordcount_rownum")
sqlContext.cacheTable("wordcount_rownum")
val df1 = sqlContext.sql("select * from wordcount_rownum where rnk <= 1000000")
val df2 = sqlContext.sql("select * from wordcount_rownum where rnk between 1000001 and 2000000")
val df3 = sqlContext.sql("select * from wordcount_rownum where rnk between 2000001 and 3000000")
val df4 = sqlContext.sql("select * from wordcount_rownum where rnk between 3000001 and 4000000")
val df5 = sqlContext.sql("select * from wordcount_rownum where rnk between 4000001 and 5000000")
Hope this helps.
06-21-2016 11:28 AM
Thank you Giri!
03-04-2018 12:40 AM
@girivaratharajan
Thank you for your answer. I have a similar requirement, but a few queries about your approach. How does row_number() work internally, especially with "over (order by count)"? Will it bring all rows to the driver from the executors (which would not be efficient for huge data)? And will row_number in your answer produce a strictly contiguous, auto-incrementing number?
Will it work properly with 200 million records? Please help.
02-26-2017 10:55 PM
How can I split it based on columns? Suppose I have 100 columns; I want to break the original DF into 5 DFs with 20 columns each.
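For illustration, a minimal sketch of one way to slice a DataFrame by columns; the name df and the group size of 20 are assumptions based on the question:
import org.apache.spark.sql.DataFrame
// Slice the column list into groups of 20 and select each group,
// producing 5 narrower DataFrames.
val columnGroups = df.columns.grouped(20).toSeq
val dfs: Seq[DataFrame] = columnGroups.map(cols => df.select(cols.head, cols.tail: _*))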
12-20-2017 01:54 PM
Don't ask questions as an Answer to a Question.
07-07-2023 07:34 AM
How do I split a dataframe according to size? For example, if I have a dataframe with 10 million records, I want to split it into smaller dataframes such that, when written out, each one does not exceed 100 MB.
Is there any way to do this?
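Not from this thread, but one possible sketch: cap the number of rows per output file so each file stays under roughly 100 MB. The row estimate and output path are placeholders you would tune for your data:
// Estimate how many rows of your schema fit in ~100 MB (hypothetical value).
val rowsPer100Mb = 50000L
df.write
  .option("maxRecordsPerFile", rowsPer100Mb)  // supported in Spark 2.2+
  .parquet("/tmp/split_output")               // placeholder output path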
07-09-2023 10:14 PM
Hi @NithinTiruveedh, an alternative solution would be to use the NTILE() window function.
For your use case, ntile(5) will split your dataset of 5M rows into 5 groups of 1M rows each.
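For reference, a minimal Scala sketch of what this could look like; the DataFrame name df and the ordering column id are assumptions, not from the thread:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}
// Note: a window with no partitionBy pulls all rows into a single partition,
// so this can be slow for very large datasets.
val w = Window.orderBy(col("id"))
val bucketed = df.withColumn("bucket", ntile(5).over(w))
// Materialize the five roughly equal groups.
val dfs = (1 to 5).map(i => bucketed.filter(col("bucket") === i).drop("bucket"))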
07-12-2023 01:19 AM
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!