Having issues with the PySpark DataFrames returned by delta.DeltaTable.toDF(), in what I believe is specific to shared access clusters on DBR 14.3. I recently created a near-identical workflow; the only major difference is that one of the source tables in the process is a federated table, so we switched the job cluster to a shared access cluster in order to access it.
Part of the downstream process is a merge operation, so we used the delta library directly and called DeltaTable.toDF() to get the DataFrame. However, a fair number of commands seem to be broken when using that DataFrame afterwards: cells get marked as having completed successfully even though they spit out gRPC errors, and actions fail entirely with quite obscure errors.
Test data:
(
spark.createDataFrame([(1,)], "col: int")
.write.format("delta")
.mode("overwrite")
.saveAsTable("test")
)
Delta commands:
import delta

# Get the DataFrame via the delta library rather than spark.table()
df = delta.DeltaTable.forName(spark, "test").toDF()
df.select(df.col)
The cell above appears to succeed, but prints this underneath:
2024-05-16 12:20:19,869 1872 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1389, in _analyze
    resp = self._stub.AnalyzePlan(req, metadata=self._builder.metadata())
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/databricks/python/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "col". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704"
    debug_error_string = "UNKNOWN:Error received from peer unix:/databricks/sparkconnect/grpc.sock {grpc_message:"[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column \"col\". It\'s probably because of illegal references like `df1.select(df2.col(\"a\"))`. SQLSTATE: 42704", grpc_status:13, created_time:"2024-05-16T12:20:19.869273292+00:00"}"
>
Calling an action on it, e.g. .display(), makes it error outright, saying it can't resolve the column.
Using spark.table() directly works as expected, and DeltaTable.merge() seems to work fine with a DataFrame from spark.table(), so it's quite straightforward to work around, but it's still a nuisance as the behaviour is unusual. The original workflow on a single user access cluster works fine, so my guess is that this is an incompatibility between Delta and Spark Connect more than anything else.
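For reference, the workaround we settled on looks roughly like the sketch below. The table names ("test", "target_table") and the merge condition are placeholders, and this assumes a Databricks notebook where `spark` is already defined:

```python
import delta

# Workaround: build the merge *source* with spark.table() instead of
# DeltaTable.toDF(); column references on it then resolve correctly
# under Spark Connect on the shared access cluster.
source_df = spark.table("test")

# The merge *target* can still come from the delta library directly.
# "target_table" is a hypothetical name for illustration.
target = delta.DeltaTable.forName(spark, "target_table")

(
    target.alias("t")
    .merge(source_df.alias("s"), "t.col = s.col")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Only the DataFrame returned by toDF() misbehaves; the DeltaTable handle itself works fine for the merge.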