Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

KKo, Contributor III
  • 15567 Views
  • 3 replies
  • 2 kudos

Resolved! Union multiple DataFrames in a loop with different schemas

Within a loop I create a few DataFrames. I can union them without issue if they have the same schema, using df_unioned = reduce(DataFrame.unionAll, df_list). Now my problem is how to union them if one of the DataFrames in df_list has a different nu...

Latest Reply
anoopunni
New Contributor II
  • 2 kudos

Hi, I have come across the same scenario. Using reduce() and unionByName we can implement the solution as below:

val lstDF: List[DataFrame] = List(df1, df2, df3, df4, df5)
val combinedDF = lstDF.reduce((df1, df2) => df1.unionByName(df2, allowMissingColumns = true))

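For readers working in PySpark rather than Scala, the same pattern can be written with functools.reduce; a minimal sketch, where df1-df3 are hypothetical DataFrames with differing schemas (unionByName's allowMissingColumns option requires Spark 3.1+):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames whose schemas do not fully overlap
df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
df2 = spark.createDataFrame([(2,)], ["id"])
df3 = spark.createDataFrame([(3, "c", True)], ["id", "name", "active"])

# unionByName matches columns by name instead of position;
# allowMissingColumns=True fills columns absent from one side with nulls
df_unioned = reduce(
    lambda a, b: a.unionByName(b, allowMissingColumns=True),
    [df1, df2, df3],
)
df_unioned.show()
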
Erik_L, Contributor II
  • 8912 Views
  • 2 replies
  • 2 kudos

Joining a large amount of data causes an "Out of disk space" error; how should I ingest?

What I am trying to do:

df = None
# For all of the IDs that are valid
for id in ids:
    # Get the parts of the data from different sources
    df_1 = spark.read.parquet(url_for_id)
    df_2 = spark.read.parquet(url_for_id)
    ...
    # Join together the pa...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Erik Louie: There are several strategies that you can use to handle large joins like this in Spark.

Use a broadcast join: if one of your DataFrames is small enough to fit in memory on every executor (Spark caps broadcast tables at 8 GB, and much smaller is advisable), you can use a broadcast join to avoid shuffling data. A bro...

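To illustrate the broadcast-join suggestion above, a minimal PySpark sketch; the paths, the join column "id", and the assumption that small_df fits in executor memory are all hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: one large table and one small lookup table
large_df = spark.read.parquet("/data/large_events")
small_df = spark.read.parquet("/data/small_lookup")

# broadcast() hints Spark to replicate small_df to every executor,
# so large_df is joined map-side without being shuffled
joined = large_df.join(broadcast(small_df), on="id", how="inner")
joined.write.mode("overwrite").parquet("/data/joined_output")
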
Geeya, New Contributor II
  • 1848 Views
  • 1 reply
  • 0 kudos

After several iterations of filter and union, the data is bigger than spark.driver.maxResultSize

My process for building the model is:
1. Filter the dataset and split it into two datasets
2. Fit a model based on the two datasets
3. Union the two datasets
4. Repeat steps 1-3
The problem is that after several iterations, the model fitting time grows dramatically, and the...

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

I assume that you are using PySpark to train a model? It sounds like you are collecting data on the driver and likely need to increase spark.driver.maxResultSize. Can you share any code?

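A minimal sketch of the suggested fix, raising spark.driver.maxResultSize at session creation, plus periodic checkpointing to truncate the query plan that grows with each filter/union iteration (checkpointing is a common mitigation for this pattern, not something suggested in the thread; the "8g" value, the toy data, and the loop body are hypothetical):

from pyspark.sql import SparkSession

# Raise the driver result-size limit; "8g" is an example value
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)

# Checkpointing materializes the DataFrame and cuts its lineage,
# so the plan does not grow without bound across iterations
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df = spark.range(10).toDF("id")  # stand-in for the real dataset
for i in range(5):
    part_a = df.filter(df.id % 2 == 0)
    part_b = df.filter(df.id % 2 != 0)
    # ... fit models on part_a / part_b here ...
    df = part_a.union(part_b).checkpoint()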