2 weeks ago
Hello,
we use arbitrary stateful aggregations in our data processing streams on Azure Databricks and would like to migrate from applyInPandasWithState to transformWithStateInPandas. We use the Python API throughout our solution, and some of our workspaces do NOT yet have Unity Catalog enabled.
Trying to run the examples provided in the Azure Databricks documentation, e.g., the SCD Type 2 Example, on the workspaces without Unity Catalog enabled, I get the following error:
The cluster configuration is as follows:
To my understanding, this setup fulfills the requirements for using transformWithStateInPandas (DBR 16.2 or above, compute using "single user"/"dedicated" or "no isolation shared" access mode, and RocksDB as the state store provider).
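The three requirements listed above can be written down as a small pre-flight check. This is only a sketch of that checklist, not an official Databricks validation: the function names `parse_dbr` and `unmet_requirements` are hypothetical, DBR 16.2 itself is assumed to count as supported, the access-mode strings are taken from the wording above, and the provider is matched loosely on "RocksDB" in the value of `spark.sql.streaming.stateStore.providerClass`. The Spark configuration is passed in as a plain dict so the check runs without a cluster.

```python
import re

# Assumptions taken from the requirements listed in the post above.
MIN_DBR = (16, 2)
SUPPORTED_ACCESS_MODES = {"single user", "dedicated", "no isolation shared"}
PROVIDER_CONF_KEY = "spark.sql.streaming.stateStore.providerClass"


def parse_dbr(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '17.1.x-scala2.13'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse runtime version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def unmet_requirements(dbr_version: str, access_mode: str, spark_conf: dict) -> list[str]:
    """Return a human-readable list of requirements that are not met."""
    problems = []
    if parse_dbr(dbr_version) < MIN_DBR:
        problems.append(f"DBR {dbr_version} is older than 16.2")
    if access_mode.strip().lower() not in SUPPORTED_ACCESS_MODES:
        problems.append(f"access mode {access_mode!r} is not in the supported list")
    provider = spark_conf.get(PROVIDER_CONF_KEY, "")
    if "rocksdb" not in provider.lower():
        problems.append(f"{PROVIDER_CONF_KEY} is not set to a RocksDB provider")
    return problems


if __name__ == "__main__":
    conf = {PROVIDER_CONF_KEY: "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"}
    print(unmet_requirements("17.1.x-scala2.13", "Dedicated", conf))
```

An empty list means the checklist passes, which is the situation described here: the setup meets the stated requirements and the stream still fails.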
I also tested other examples; they all result in the same error when trying to start the stream.
The exact same example with identical cluster configuration works in our Unity-enabled workspaces.
What did I miss? Why is the Spark Connect directory not ready on the workspace that does not have Unity Catalog enabled?
Best and thanks!
Felix
2 weeks ago
Can you share your stream config (with the write location anonymized, etc.)?
2 weeks ago
Dear werners,
thank you for your swift response. I use the notebook provided in the example (with a different storage path, of course). The stream config is included.
Best!
2 weeks ago - last edited 2 weeks ago
Maybe it's more of a problem with Databricks Connect, which is not supported on non-UC-enabled clusters:
Compute configuration for Databricks Connect - Azure Databricks | Microsoft Learn
2 weeks ago
https://www.databricks.com/blog/introducing-transformwithstate-apache-sparktm-structured-streaming
Here they specifically mention Unity Catalog clusters (see Availability section), even though in the release notes this is not mentioned as a requirement. But it could very well be the case since UC is the way to go in the later Databricks releases.
Perhaps someone at Databricks can confirm/deny this?
2 weeks ago
Dear @szymon_dybczak and @-werners- ,
thank you very much for your responses and references!
@-werners- , thank you for the link to the announcement article. The availability section lists "No-Isolation and Unity Catalog Dedicated Clusters" as supported. No-isolation access mode is, to my understanding, not compatible with Unity Catalog. As transformWithStateInPandas supports this access mode, I would assume it can run without Unity Catalog.
This leads me back to the question why the examples are failing in the above-described setup.
I would also be curious to hear a Databricks response on this.
2 weeks ago
Hello @felix4572!
Could you please share the driver log, or even better, the executor log (without any sensitive details)?
a week ago
Update: This is working fine with earlier DBR versions, but the issue seems to occur specifically with DBR 17.1.
I’ve flagged this behaviour with the internal team for further investigation.
Monday
Thanks @Advika for the update. If you hear anything else from the internal team, please let us know 😉
Monday
Thanks a lot for working on this, @Advika. For now, the workaround of using DBR versions other than 17.1 works for me. In the medium term, it would of course be great to be able to use transformWithStateInPandas irrespective of the cluster DBR (as long as the minimum requirements are met).
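Until a fix lands, the workaround can be made explicit with a small guard at the top of the notebook. This is a minimal sketch under assumptions: `AFFECTED_DBR` and `is_affected` are hypothetical names, the only affected version listed is the 17.1 reported in this thread, and the runtime version is assumed to be readable from the `DATABRICKS_RUNTIME_VERSION` environment variable on the cluster.

```python
import os
import re

# Only DBR 17.1 has been reported as affected in this thread (assumption:
# extend this set if other versions turn out to show the same error).
AFFECTED_DBR = {(17, 1)}


def is_affected(version):
    """Return True if the runtime version string starts with an affected major.minor."""
    match = re.match(r"(\d+)\.(\d+)", version or "")
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) in AFFECTED_DBR


# Assumed to be set on Databricks clusters; absent when run elsewhere.
runtime = os.environ.get("DATABRICKS_RUNTIME_VERSION")
if is_affected(runtime):
    print(f"DBR {runtime}: transformWithStateInPandas may fail here, use another runtime.")
```

Failing early with a readable message is easier to diagnose than the stream start error itself, and the guard can simply be deleted once the issue is resolved.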