Having issues with the PySpark DataFrames returned by delta.DeltaTable.toDF(), in what I believe is specific to shared access clusters on DBR 14.3. I recently created a near-identical workflow; the only major difference is that one of the source tables in the process is a federated table, so we switched the job cluster to a shared access cluster in order to access it.
Part of the downstream process is a merge operation, so we used the delta library directly and called DeltaTable.toDF() to get the DataFrame. However, a fair number of commands seem to be broken when using that DataFrame afterwards: cells get marked as having completed successfully even though they spit out gRPC errors, and actions fail entirely with quite obscure errors.
Test data:
(
spark.createDataFrame([(1,)], "col: int")
.write.format("delta")
.mode("overwrite")
.saveAsTable("test")
)
Delta commands:
import delta

# Get the DataFrame via the delta library rather than spark.table()
df = delta.DeltaTable.forName(spark, "test").toDF()
df.select(df.col)
The cell above appears to succeed, but prints this underneath:
2024-05-16 12:20:19,869 1872 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1389, in _analyze
    resp = self._stub.AnalyzePlan(req, metadata=self._builder.metadata())
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "col". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704"
    debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column \"col\". It\'s probably because of illegal references like `df1.select(df2.col(\"a\"))`. SQLSTATE: 42704", grpc_status:13, created_time:"2024-05-16T12:20:19.869273292+00:00"}"
>
Calling an action on it, e.g. .display(), makes it error outright, saying it can't resolve the column.
Using spark.table() directly works as expected, and DeltaTable.merge() seems to work fine with a DataFrame from spark.table(), so it's quite straightforward to work around, but it's still a nuisance as the behaviour is unusual. The original workflow on a single user access cluster works fine, so my guess is that this is an incompatibility between Delta and Spark Connect more than anything else.
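For reference, the workaround we settled on looks roughly like the sketch below. The table names ("test", "target_table") and the merge condition are placeholders, and this assumes a Databricks notebook where `spark` is already defined:

```python
import delta

# Workaround: build the merge *source* with spark.table() instead of
# DeltaTable.toDF(); column references on it then resolve correctly
# under Spark Connect on the shared access cluster.
source_df = spark.table("test")

# The merge *target* can still come from the delta library directly.
# "target_table" is a hypothetical name for illustration.
target = delta.DeltaTable.forName(spark, "target_table")

(
    target.alias("t")
    .merge(source_df.alias("s"), "t.col = s.col")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Only the DataFrame returned by toDF() misbehaves; the DeltaTable handle itself works fine for the merge.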