pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame
05-29-2024 01:06 PM
I noticed that on some Databricks 14.3 clusters I get DataFrames of type pyspark.sql.connect.dataframe.DataFrame, while on other Databricks 14.3 clusters the exact same code produces DataFrames of type pyspark.sql.DataFrame.
pyspark.sql.connect.dataframe.DataFrame seems to be causing various issues. For example:
- Code that checks isinstance(df, DataFrame) does not recognize df as a DataFrame, even though pyspark.sql.connect.dataframe.DataFrame is supposed to be interchangeable with pyspark.sql.DataFrame
- With pyspark.sql.connect.dataframe.DataFrame, but not with pyspark.sql.DataFrame, a third-party library (Great Expectations) raises this error: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "<column name>". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
To help investigate, I would like to know:
- What is the difference between pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame?
- What determines whether I get one type of DataFrame or the other?
- Does pyspark.sql.connect.dataframe.DataFrame have limitations that would explain the issues I am seeing?