
pyspark.sql.connect.dataframe.DataFrame vs pyspark.sql.DataFrame

ckarrasexo
New Contributor II

I noticed that on some Databricks 14.3 clusters, I get DataFrames with type pyspark.sql.connect.dataframe.DataFrame, while on other clusters also with Databricks 14.3, the exact same code gets DataFrames of type pyspark.sql.DataFrame

pyspark.sql.connect.dataframe.DataFrame seems to be causing various issues.

for example:

  • Code that checks for isinstance(df, DataFrame) does not recognize df to be a DataFrame, even though pyspark.sql.connect.dataframe.DataFrame inherits from pyspark.sql.DataFrame
  • I get this error with pyspark.sql.connect.dataframe.DataFrame and a third-party library (Great Expectations), but not with pyspark.sql.DataFrame: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "<column name>". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704

To help investigate, I would like to know:

  • What is the difference between pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame?
  • What determines if I will get one type of DataFrame or the other?
  • Does pyspark.sql.connect.dataframe.DataFrame have limitations that would make the issues I am seeing expected?

 


Kaniz_Fatma
Community Manager

Hi @ckarrasexo, the distinction between pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame can be a bit confusing.

  1. DataFrame vs. SQL Queries:

    • Both pyspark.sql.connect.dataframe.DataFrame and pyspark.sql.DataFrame represent structured data in Spark, but they are used in slightly different ways.
    • pyspark.sql.DataFrame is the more commonly used API. It provides a higher-level abstraction over the data and is optimized for performance. You can perform various operations on DataFrames using methods like select(), filter(), and groupBy().
    • On the other hand, pyspark.sql.connect.dataframe.DataFrame is associated with SQL queries via the SQLContext. It allows you to execute SQL queries directly against your data. These queries can be more concise and familiar if you’re comfortable with SQL syntax.
    • In summary, DataFrames are more programmatic and provide better type safety, while SQL queries are concise and portable across different languages.
  2. Type Inference:

    • When you create a DataFrame, Spark infers its schema based on the data. The type of DataFrame you get (pyspark.sql.connect.dataframe.DataFrame or pyspark.sql.DataFrame) depends on how you create it.
    • If you create a DataFrame using SQL queries (e.g., spark.sql("SELECT * FROM my_table")), you’ll get a pyspark.sql.connect.dataframe.DataFrame.
    • If you create a DataFrame using DataFrame methods (e.g., df = spark.read.parquet("my_data.parquet")), you’ll get a pyspark.sql.DataFrame.
  3. Limitations and Considerations:

In your case, the issues you’re encountering with pyspark.sql.connect.dataframe.DataFrame might be related to type inference or specific usage patterns. I recommend reviewing your code and considering whether using DataFrames directly (rather than SQL queries) could help resolve the issues. Additionally, check if any specific optimizations are needed for your use case.

If you encounter further issues, feel free to ask for more assistance! 😊

 

Hi @Kaniz_Fatma,

 

That's incorrect. I use exactly the same code and get either a pyspark.sql.dataframe.DataFrame or a pyspark.sql.connect.dataframe.DataFrame, depending on the cluster. It doesn't matter whether I create the DataFrame using spark.read.table, spark.sql, or even spark.createDataFrame for in-memory data; what determines the class I get is the cluster configuration.

This screenshot illustrates what I mean. I ran the same notebook on two different clusters and got a different DataFrame type on each. The only difference I can see between the two clusters is that one is a single-user cluster and the other is a shared (multi-user) cluster. Both clusters use Databricks 14.3.

ckarrasexo_0-1717164524724.png

So the choice for the class to use is an internal implementation decision by Databricks, and the question is what leads Databricks to pick one or another class, and, considering that they don't appear to be 100% interchangeable, what are the limitations?

Also note that both classes have methods like select, filter, groupBy, cache, persist that can be used the same way with both classes. Both can also be used to run SQL queries or directly read a table without using a query.

What makes the difference is whether the cluster is using Spark Connect or not.
Shared clusters are using Spark Connect, so even the spark session is of different type:

filipniziol_0-1724874549003.png

To compare on single user cluster:

filipniziol_2-1724874812915.png

What I tested is that you can disable Spark Connect on the cluster by setting spark.databricks.service.server.enabled to false, but in this case everything stops working:

filipniziol_1-1724874667480.png
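For readers who cannot see the screenshots, here is a minimal sketch of one way to check whether a session or DataFrame comes from Spark Connect. It relies only on the module paths already quoted in this thread (pyspark.sql.connect.session.SparkSession, pyspark.sql.connect.dataframe.DataFrame); the helper name uses_spark_connect is made up for illustration and is not part of PySpark.

```python
def uses_spark_connect(obj):
    """Return True if obj's class is defined under the pyspark.sql.connect package.

    Hypothetical helper (not a PySpark API): it only inspects the module
    path of the object's class, so it works without importing pyspark.
    """
    module = type(obj).__module__
    return module == "pyspark.sql.connect" or module.startswith("pyspark.sql.connect.")


# In a Databricks notebook, where `spark` is predefined, one could run:
#   print(type(spark))                       # shows the session class
#   print(uses_spark_connect(spark))         # expected True on a shared cluster
#   print(uses_spark_connect(spark.range(1)))
```

Based on the observations above, this would presumably report True on a shared cluster and False on a single-user cluster, but that depends on the runtime version and access mode.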

 

 

mchugani
New Contributor II

@Kaniz_Fatma @ckarrasexo Any updates on this? I'm facing the same issue

JSherrill
New Contributor II

@Kaniz_Fatma I am also running into this issue, also with Great Expectations as it happens. I have also tried reading Parquet as you suggested and am still getting the problematic type. Is it possible to direct Databricks to create one type, or to convert or cast between them?

JSherrill
New Contributor II

Additional info.  In Databricks 13.3, the spark variable we're provided is of type pyspark.sql.SparkSession.  In 15.4 it is created as pyspark.sql.connect.session.SparkSession (both shared clusters; it may behave differently for single node configuration).  

Chris78
New Contributor II

Hitting the same problems trying to check the type of variables to pick out DataFrames.

Ended up getting around this (temporarily at least) by importing the following instead:

from pyspark.sql.connect.dataframe import DataFrame
With that import, 'isinstance(df, DataFrame)' works again for my DataFrame variables of type 'pyspark.sql.connect.dataframe.DataFrame'
(if you have already run 'from pyspark.sql import DataFrame', you probably need to 'del DataFrame' and then redo the import above).
Note that this does, however, produce a console message as follows, so ymmv:
sc will be removed in future DBR versions
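A variation on this workaround, sketched under assumptions: rather than swapping one import for the other, collect whichever of the two DataFrame classes can be imported and pass them to isinstance as a tuple, so the same check works on both cluster types. The function names below are made up for illustration; only the two module paths come from this thread, and the imports are guarded because pyspark.sql.connect may not be importable in every environment.

```python
import importlib

# Module paths taken from this thread; the class is named DataFrame in both.
_CANDIDATES = (
    ("pyspark.sql.dataframe", "DataFrame"),
    ("pyspark.sql.connect.dataframe", "DataFrame"),
)


def spark_dataframe_types():
    """Return a tuple of the DataFrame classes that can be imported here."""
    found = []
    for module_name, class_name in _CANDIDATES:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or one of its dependencies) is unavailable; skip it.
            continue
        cls = getattr(module, class_name, None)
        if isinstance(cls, type) and cls not in found:
            found.append(cls)
    return tuple(found)


def is_spark_dataframe(obj):
    """True if obj is an instance of any importable Spark DataFrame class."""
    # isinstance with an empty tuple is simply False, so this is safe
    # even when pyspark is not installed.
    return isinstance(obj, spark_dataframe_types())
```

Usage would be 'is_spark_dataframe(df)' in place of 'isinstance(df, DataFrame)'. This only addresses the type check; it does not help with the CANNOT_RESOLVE_DATAFRAME_COLUMN error reported with Great Expectations.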
