- 7712 Views
- 4 replies
- 2 kudos
Within a loop I create a few dataframes. I can union them without an issue if they have the same schema, using `df_unioned = reduce(DataFrame.unionAll, df_list)`. Now my problem is how to union them if one of the dataframes in df_list has a different nu...
Latest Reply
Hi, I have come across the same scenario. Using reduce() and unionByName we can implement the solution as below:

val lstDF: List[DataFrame] = List(df1, df2, df3, df4, df5)
val combinedDF = lstDF.reduce((df1, df2) => df1.unionByName(df2, allowMissingColumns = tru...
3 More Replies
- 4897 Views
- 2 replies
- 2 kudos
What I am trying to do:

df = None
# For all of the IDs that are valid
for id in ids:
    # Get the parts of the data from different sources
    df_1 = spark.read.parquet(url_for_id)
    df_2 = spark.read.parquet(url_for_id)
    ...
    # Join together the pa...
Latest Reply
@Erik Louie: There are several strategies that you can use to handle large joins like this in Spark:

Use a broadcast join: If one of your dataframes is relatively small (less than 10-20 GB), you can use a broadcast join to avoid shuffling data. A bro...
1 More Replies
by
Geeya
• New Contributor II
- 1005 Views
- 1 replies
- 0 kudos
The process for me to build the model is:
1. Filter the dataset and split it into two datasets
2. Fit the model based on the two datasets
3. Union the two datasets
4. Repeat steps 1-3
The problem is that after several iterations, the model fitting time becomes dramatically longer, and the...
Latest Reply
I assume that you are using PySpark to train a model? It sounds like you are collecting data on the driver and likely need to increase its memory. Can you share any code?