4 weeks ago
I noticed that on some Databricks 14.3 clusters, I get DataFrames with type pyspark.sql.connect.dataframe.DataFrame, while on other clusters also with Databricks 14.3, the exact same code gets DataFrames of type pyspark.sql.DataFrame
pyspark.sql.connect.dataframe.DataFrame seems to be causing various issues.
for example:
To help investigate, I would like to know:
4 weeks ago
Hi @ckarrasexo, The distinction between pyspark.sql.connect.dataframe.DataFrame
and pyspark.sql.DataFrame
can be a bit confusing.
DataFrame vs. SQL Queries:
pyspark.sql.connect.dataframe.DataFrame
and pyspark.sql.DataFrame
represent structured data in Spark, but they are used in slightly different ways.pyspark.sql.DataFrame
is the more commonly used API. It provides a higher-level abstraction over the data and is optimized for performance. You can perform various operations on DataFrames using methods like select()
, filter()
, and groupBy()
.pyspark.sql.connect.dataframe.DataFrame
is associated with SQL queries via the SQLContext
. It allows you to execute SQL queries directly against your data. These queries can be more concise and familiar if youโre comfortable with SQL syntax1.Type Inference:
pyspark.sql.connect.dataframe.DataFrame
or pyspark.sql.DataFrame
) depends on how you create it.spark.sql("SELECT * FROM my_table")
), youโll get a pyspark.sql.connect.dataframe.DataFrame
.df = spark.read.parquet("my_data.parquet")
), youโll get a pyspark.sql.DataFrame
.Limitations and Considerations:
In your case, the issues youโre encountering with pyspark.sql.connect.dataframe.DataFrame
might be related to type inference or specific usage patterns. I recommend reviewing your code and considering whether using DataFrames directly (rather than SQL queries) could help resolve the issues. Additionally, check if any specific optimizations are needed for your use case.
If you encounter further issues, feel free to ask for more assistance! ๐
4 weeks ago
Hi @Kaniz_Fatma,
That's incorrect. I use exactly the same code and either get a pyspark.sql.dataframe.DataFrame, or pyspark.sql.connect.dataframe.DataFrame depending on the cluster. It doesn't matter if I create the dataframe using spark.read.table, spark.sql, or even spark.createDataFrame for in-memory data, what changes the class I will get is the cluster configuration.
This screenshot illustrates what I mean. I ran the same notebook on two different clusters and will get a different DataFrame type depending on the cluster. The only difference I can see between the two clusters is that one is a single-user cluster, and the other one is a shared (multi-user) cluster. Both clusters use Databricks 14.3.
So the choice for the class to use is an internal implementation decision by Databricks, and the question is what leads Databricks to pick one or another class, and, considering that they don't appear to be 100% interchangeable, what are the limitations?
Also note that both classes have methods like select, filter, groupBy, cache, persist that can be used the same way with both classes. Both can also be used to run SQL queries or directly read a table without using a query.
Thursday
I am also facing the issue due that the spark session using databricks connect returns me pyspark.sql.connect.dataframe.DataFrame whereas the data processing library I am using expects pyspark.sql.DataFrame.
Is it possible to convert pyspark.sql.connect.dataframe.DataFrame to pyspark.sql.DataFrame?
Excited to expand your horizons with us? Click here to Register and begin your journey to success!
Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!