Data Engineering

by Callum • New Contributor II

12-01-2022 7:05:53 AM

13117 Views
3 replies
2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

I have some code for merging dataframes with pyspark pandas, and I want the index of the left dataframe to persist throughout the joins. Following suggestions from others who wanted to keep the index after merging, I set the index to a column bef...
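For readers unfamiliar with the approach described in this preview, here is a minimal sketch of keeping the left index across a merge by moving it into a column first. It uses plain pandas so it runs without a Spark session; the frames, the `key` column, and the `orig_index` name are all made up for illustration and are not taken from the original post. The same method names exist in `pyspark.pandas`, but this sketch has not been checked there.

```python
import pandas as pd

# Hypothetical example frames; not from the original post.
left = pd.DataFrame(
    {"key": [1, 2, 3], "a": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)
right = pd.DataFrame({"key": [3, 1, 2], "b": [30, 10, 20]})

# Move the index into an ordinary column so the merge carries it along.
# "orig_index" is an assumed helper name.
left_with_idx = left.rename_axis("orig_index").reset_index()

# A left join keeps every row of the left frame, in the left frame's order.
merged = left_with_idx.merge(right, on="key", how="left")

# Restore the saved column as the index and clear the helper name.
merged = merged.set_index("orig_index").rename_axis(None)

print(merged)
```

When doing this in a loop over several frames, the helper column has to be turned back into the index (or dropped) before the next iteration, otherwise it will still be present on the next merge.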

Latest Reply

Serlal
New Contributor III

01-31-2023 3:01:12 AM

2 kudos

Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function: and after executing that...
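The debug lines mentioned in this reply are cut off in the preview, so the following is not the reply author's code. It is only a rough sketch of one way to check the diagnosis (a column present on both sides of a merge): a hypothetical helper, `shared_columns`, that lists non-key column names appearing in both frames. It takes plain lists of names, so it could be called with `list(df.columns)` from either pandas or pandas-on-Spark.

```python
def shared_columns(left_cols, right_cols, keys=()):
    """Return non-key column names present on both sides, in left order.

    Hypothetical debugging helper; all names here are assumptions.
    """
    right = set(right_cols)
    skip = set(keys)
    return [c for c in left_cols if c in right and c not in skip]


# Example: "idx" exists on both sides and is not a join key,
# so it would collide when the frames are merged on "id".
overlap = shared_columns(["id", "idx", "a"], ["id", "idx", "b"], keys=["id"])
print(overlap)  # ['idx']
```

Printing this list at the top of each loop iteration shows which columns are carried over from a previous merge.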

by Mado • Valued Contributor II

10-17-2022 3:11:09 PM

9007 Views
4 replies
2 kudos

Resolved! Pandas API on Spark, Does it run on a multi-node cluster?

Hi, I have a few questions about "Pandas API on Spark". Thanks for taking the time to read them. 1) Is the input to these functions a pandas DataFrame or a PySpark DataFrame? 2) When I use any pandas function (like isna, size, apply, where, etc.), does it ru...

Latest Reply

Debayan
Databricks Employee

10-18-2022 5:46:39 AM

2 kudos

Hi @Mohammad Saber, a pandas dataset lives on a single machine and is naturally iterable locally within that machine. A pandas-on-Spark dataset, however, lives across multiple machines and is computed in a distributed manner. It is difficu...

Databricks Community
