<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/128925#M48375</link>
    <description>&lt;P&gt;+1, I encountered this for the first time today, over a year after it was first posted. I have a piece of code that checks whether dataframes are instances of pyspark.sql.DataFrame, and it suddenly stopped working today because my dataframes are now &lt;SPAN&gt;pyspark.sql.connect.dataframe.DataFrame&lt;/SPAN&gt;.&lt;/P&gt;</description>
    <pubDate>Tue, 19 Aug 2025 23:50:04 GMT</pubDate>
    <dc:creator>kenmyers-8451</dc:creator>
    <dc:date>2025-08-19T23:50:04Z</dc:date>
    <item>
      <title>pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/71055#M34230</link>
      <description>&lt;P&gt;I noticed that on some Databricks 14.3 clusters, I get DataFrames of type pyspark.sql.connect.dataframe.DataFrame, while on other clusters, also with Databricks 14.3, the exact same code gets DataFrames of type pyspark.sql.DataFrame.&lt;/P&gt;&lt;P&gt;pyspark.sql.connect.dataframe.DataFrame seems to be causing various issues. For example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Code that checks isinstance(df, DataFrame) does not recognize df as a DataFrame, even though pyspark.sql.connect.dataframe.DataFrame mirrors the pyspark.sql.DataFrame API.&lt;/LI&gt;&lt;LI&gt;I get the following error with pyspark.sql.connect.dataframe.DataFrame and a third-party library (Great Expectations), but not with pyspark.sql.DataFrame: &lt;STRONG&gt;[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "&amp;lt;column name&amp;gt;". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To help investigate, I would like to know:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What is the difference between pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame?&lt;/LI&gt;&lt;LI&gt;What determines whether I get one type of DataFrame or the other?&lt;/LI&gt;&lt;LI&gt;Does pyspark.sql.connect.dataframe.DataFrame have limitations that would explain the issues I am seeing?&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 29 May 2024 20:06:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/71055#M34230</guid>
      <dc:creator>ckarrasexo</dc:creator>
      <dc:date>2024-05-29T20:06:08Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/71259#M34270</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;That's incorrect. I use exactly the same code and get either a pyspark.sql.dataframe.DataFrame or a pyspark.sql.connect.dataframe.DataFrame depending on the cluster. It doesn't matter whether I create the dataframe using spark.read.table, spark.sql, or even spark.createDataFrame with in-memory data; what determines the class I get is the cluster configuration.&lt;/P&gt;&lt;P&gt;This screenshot illustrates what I mean: I ran the same notebook on two different clusters and got a different DataFrame type on each. The only difference I can see between the two clusters is that one is a single-user cluster and the other is a shared (multi-user) cluster. Both clusters use Databricks 14.3.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ckarrasexo_0-1717164524724.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/7985i67283D5382F52D1A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="ckarrasexo_0-1717164524724.png" alt="ckarrasexo_0-1717164524724.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;So the choice of class is an internal implementation decision by Databricks. The question is what leads Databricks to pick one class or the other and, given that they do not appear to be 100% interchangeable, what the limitations are.&lt;/P&gt;&lt;P&gt;Also note that both classes have methods like select, filter, groupBy, cache, and persist that can be used the same way. Both can also run SQL queries or read a table directly without a query.&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 14:13:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/71259#M34270</guid>
      <dc:creator>ckarrasexo</dc:creator>
      <dc:date>2024-05-31T14:13:41Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/82123#M36528</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt; &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105923"&gt;@ckarrasexo&lt;/a&gt; Any updates on this? I'm facing the same issue.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Aug 2024 06:35:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/82123#M36528</guid>
      <dc:creator>mchugani</dc:creator>
      <dc:date>2024-08-07T06:35:53Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85116#M37230</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt; I am also running into this issue, also with Great Expectations as it happens. I have also tried using the parquet read like you suggested and am still getting the problematic type. Is it possible to direct Databricks to create one type, or to convert or cast between them?&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 22:37:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85116#M37230</guid>
      <dc:creator>JSherrill</dc:creator>
      <dc:date>2024-08-27T22:37:25Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85133#M37231</link>
      <description>&lt;P&gt;Additional info: in Databricks 13.3, the spark variable we're provided is of type pyspark.sql.SparkSession. In 15.4 it is created as pyspark.sql.connect.session.SparkSession (both on shared clusters; it may behave differently for single-node configurations).&lt;/P&gt;</description>
      <pubDate>Tue, 27 Aug 2024 22:59:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85133#M37231</guid>
      <dc:creator>JSherrill</dc:creator>
      <dc:date>2024-08-27T22:59:37Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85951#M37287</link>
      <description>&lt;P&gt;What makes the difference is whether the cluster uses &lt;A href="https://www.databricks.com/blog/2023/04/18/spark-connect-available-apache-spark.html" target="_self"&gt;Spark Connect&lt;/A&gt; or not.&lt;BR /&gt;Shared clusters use Spark Connect, so even the Spark session is of a different type:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1724874549003.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10748iFE0B09FE2D376EF4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1724874549003.png" alt="filipniziol_0-1724874549003.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;For comparison, on a single-user cluster:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_2-1724874812915.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10750i4498E466E53FE6A1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_2-1724874812915.png" alt="filipniziol_2-1724874812915.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;In my testing, you can disable Spark Connect on the cluster by setting &lt;SPAN&gt;spark.databricks.service.server.enabled to false, but in that case everything stops working:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1724874667480.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10749i0E9BD20BC500AA03/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1724874667480.png" alt="filipniziol_1-1724874667480.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 28 Aug 2024 19:54:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/85951#M37287</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-08-28T19:54:21Z</dc:date>
    </item>
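The Spark Connect distinction described above can be detected programmatically. A minimal sketch, assuming only the module paths reported in this thread (sessions under pyspark.sql.connect.session on Spark Connect clusters, under pyspark.sql.session otherwise); this is a heuristic, not an official API:

```python
def is_connect_session(spark) -> bool:
    """Heuristic: True if the given session object comes from Spark Connect.

    Relies only on the class's module path, assuming Spark Connect
    sessions are defined under pyspark.sql.connect (as reported in
    this thread), while classic sessions live under pyspark.sql.session.
    """
    return type(spark).__module__.startswith("pyspark.sql.connect")
```

Recent PySpark versions also ship an is_remote() helper in pyspark.sql.utils that answers essentially the same question for the current environment, which may be preferable where available.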
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/90693#M37977</link>
      <description>&lt;P&gt;Hitting the same problem trying to check the types of variables to pick out DataFrames.&lt;/P&gt;&lt;P&gt;I got around this (temporarily at least) by importing the following instead:&lt;/P&gt;&lt;DIV&gt;from pyspark.sql.connect.dataframe import DataFrame&lt;/DIV&gt;&lt;P&gt;isinstance(df, DataFrame) then works again for my dataframe variables of type &lt;SPAN&gt;pyspark.sql.connect.dataframe.DataFrame&lt;/SPAN&gt;.&lt;/P&gt;&lt;P&gt;(If you have already run 'from pyspark.sql import DataFrame', you probably need to 'del DataFrame' and then redo the import above.)&lt;/P&gt;&lt;P&gt;Note that this does produce a console message, so ymmv:&lt;BR /&gt;sc will be removed in future DBR versions&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2024 08:37:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/90693#M37977</guid>
      <dc:creator>Chris78</dc:creator>
      <dc:date>2024-09-17T08:37:02Z</dc:date>
    </item>
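An alternative to swapping which DataFrame class is imported, sketched here as a suggestion rather than an official recipe: compare fully qualified class names along the object's method resolution order, so the check accepts either implementation without requiring either class to be importable. The two class paths below are the ones reported in this thread.

```python
def is_spark_df(obj) -> bool:
    """Return True if obj is a Spark DataFrame, classic or Spark Connect.

    Compares fully qualified class names along the MRO, so neither
    pyspark.sql.DataFrame nor pyspark.sql.connect.dataframe.DataFrame
    needs to be imported for the check itself.
    """
    spark_df_names = {
        "pyspark.sql.dataframe.DataFrame",
        "pyspark.sql.connect.dataframe.DataFrame",
    }
    mro_names = {f"{c.__module__}.{c.__qualname__}" for c in type(obj).__mro__}
    return any(name in spark_df_names for name in mro_names)
```

Because the check is by name, it keeps working even if the runtime switches between the classic and Spark Connect implementations, at the cost of being coupled to these specific class paths.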
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/93058#M38616</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105923"&gt;@ckarrasexo&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;I noticed that on some Databricks 14.3 clusters, I get DataFrames of type pyspark.sql.connect.dataframe.DataFrame, while on other clusters, also with Databricks 14.3, the exact same code gets DataFrames of type pyspark.sql.DataFrame.&lt;/P&gt;&lt;P&gt;pyspark.sql.connect.dataframe.DataFrame seems to be causing various issues. For example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Code that checks isinstance(df, DataFrame) does not recognize df as a DataFrame, even though pyspark.sql.connect.dataframe.DataFrame mirrors the pyspark.sql.DataFrame API.&lt;/LI&gt;&lt;LI&gt;I get the following error with pyspark.sql.connect.dataframe.DataFrame and a third-party library (Great Expectations), but not with pyspark.sql.DataFrame: &lt;STRONG&gt;[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "&amp;lt;column name&amp;gt;". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To help investigate, I would like to know:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What is the difference between pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame?&lt;/LI&gt;&lt;LI&gt;What determines whether I get one type of DataFrame or the other?&lt;/LI&gt;&lt;LI&gt;Does pyspark.sql.connect.dataframe.DataFrame have limitations that would explain the issues I am seeing?&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;I'm seeing the same inconsistencies in Databricks 14.3: some clusters return DataFrames as pyspark.sql.connect.dataframe.DataFrame, while others return pyspark.sql.DataFrame. This breaks type checking with isinstance(df, DataFrame), and I'm hitting the "CANNOT_RESOLVE_DATAFRAME_COLUMN" error with Great Expectations. Has anyone else dealt with this issue, and what solutions did you find?&lt;/P&gt;</description>
      <pubDate>Tue, 08 Oct 2024 10:10:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/93058#M38616</guid>
      <dc:creator>Ariusuke</dc:creator>
      <dc:date>2024-10-08T10:10:23Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/128925#M48375</link>
      <description>&lt;P&gt;+1, I encountered this for the first time today, over a year after it was first posted. I have a piece of code that checks whether dataframes are instances of pyspark.sql.DataFrame, and it suddenly stopped working today because my dataframes are now &lt;SPAN&gt;pyspark.sql.connect.dataframe.DataFrame&lt;/SPAN&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Aug 2025 23:50:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/128925#M48375</guid>
      <dc:creator>kenmyers-8451</dc:creator>
      <dc:date>2025-08-19T23:50:04Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/129089#M48435</link>
      <description>&lt;P&gt;&lt;EM&gt;I have found a workaround for this issue. Basically, I create a dummy_df and then check whether the dataframe in question has the same type as the dummy_df.&lt;/EM&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import DataFrame, SparkSession


def get_dummy_df() -&amp;gt; DataFrame:
    """
    Generates a dummy DataFrame with a range of integers.

    This method creates a DataFrame containing integers starting from 0 up to (but not including) 2
    using the current Spark session.

    Returns:
        DataFrame: A Spark DataFrame containing a single column with the values [0, 1].
    """
    spark_session = SparkSession.builder.appName("dummy_df").getOrCreate()
    return spark_session.range(0, 2)


def is_spark_df(df_to_check: DataFrame) -&amp;gt; bool:
    """
    Checks if the provided object is a Spark DataFrame.

    This function compares the type of the provided DataFrame with a dummy DataFrame created
    using the `get_dummy_df()` function. This is necessary because in Databricks, depending
    on the cluster configuration, the DataFrame type can vary. If you import
    `pyspark.sql.dataframe`, your type check may fail because Databricks can provide
    `pyspark.sql.connect.dataframe`.

    Parameters:
    df_to_check (DataFrame): The DataFrame instance to check.

    Returns:
    bool: True if the object is a Spark DataFrame, False otherwise.

    For more information on this issue, please see:
    https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/td-p/71055
    """
    return type(df_to_check) is type(get_dummy_df())&lt;/LI-CODE&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Gleydson C.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Aug 2025 08:44:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-sql-connect-dataframe-dataframe-vs-pyspark-sql-dataframe/m-p/129089#M48435</guid>
      <dc:creator>Gleydson404</dc:creator>
      <dc:date>2025-08-21T08:44:11Z</dc:date>
    </item>
  </channel>
</rss>

