structured streaming hangs when writing or sometimes reading depends on SINGLE USER or shared mode

maoutir — Thu, 04 Jul 2024 15:13:57 GMT

Hi Guys,

I'm new to this community, I am beginning a new project with Azure Databricks and a Python script on my Mac that manipulates data (reading from delta share tables and inserting into local Postgres database ) coming from a remote databricks cluster with single-user data security mode I'm facing a strange error when writing a stream from a dataframe coming from Azure Databricks to a Postgres table:

First I use databricks-connect==14.2.1 to connect to create a session for our databricks cluster; below is the code snippet:

spark = (DatabricksSession.builder.sdkConfig(__config) .remote() .getOrCreate())

Second, I read from the databricks table using the deltaSharing protocol with the change data feed option

df = (spark.readStream.format("deltaSharing") .option("readChangeFeed", "true") .load(my_table_path))

Thirty, I use the above dataframe to create a writeStream job using the micro-batch feature with the foreachBatch method:

(df.writeStream .foreachBatch(process_df18) .outputMode("update") .trigger(processingTime="30 seconds") .option('checkpointLocation', f'{__checkpoint_location}') .start() .awaitTermination()) def process_df18(df, batch_id): # It's not important the method implementation here; the debug breakpoint is not reached here due to the exception that I will be specified after pass

When I run the script I always get this error:

No PYTHON_UID found for the session (a random uuid)

my full dependencies:

alembic==1.13.1 annotated-types==0.7.0 async-timeout==4.0.3 asyncpg==0.29.0 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 databricks-connect==14.2.1 databricks-sdk==0.28.0 et-xmlfile==1.1.0 google-auth==2.29.0 googleapis-common-protos==1.63.0 greenlet==3.0.3 grpcio==1.64.0 grpcio-status==1.62.2 idna==3.7 kink==0.8.0 loguru==0.7.2 Mako==1.3.5 MarkupSafe==2.1.5 numpy==1.26.4 openpyxl==3.1.3 pandas==2.2.2 protobuf==4.25.3 psycopg2-binary==2.9.9 py4j==0.10.9.7 pyarrow==16.1.0 pyasn1==0.6.0 pyasn1_modules==0.4.0 pydantic==2.5.3 pydantic-settings==2.1.0 pydantic_core==2.14.6 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 pytz==2024.1 requests==2.32.3 rsa==4.9 six==1.16.0 SQLAlchemy==2.0.25 tqdm==4.66.4 typing_extensions==4.12.1 tzdata==2024.1 urllib3==2.2.1 win32-setctime==1.1.0

But when I switch to shared mode I get another error:

[UNSUPPORTED_STREAMING_SOURCE_PERMISSION_ENFORCED] Data source deltaSharing is not supported as a streaming source on a shared cluster. SQLSTATE: 0A000

Versions:

Databricks Runtime: 14.2

Local python installed: 3.10.2

OS: MacOSX 14.0

Any help will be very appreciated and will save my journey.

Thanks.

Re: structured streaming hangs when writing or sometimes reading depends on SINGLE USER or shared mo

szymon_dybczak — Sun, 07 Jul 2024 08:55:24 GMT

Hi,

I reckon I've seen this problem earlier and if I remember correctly there was an issue related to Databricks Connect and single user mode.
I think your code will work if you use it directly in notebook, but will fail with Databricks Connect.
About the second error, it's pretty self explanatory - it looks like Delta Sharing is not supported as a streaming source in shared mode cluster.

And since you are using Databricks runtime >= 14.0, make sure to read about behavior changes for foreachBatch on compute configured with shared access mode.

Use foreachBatch to write to arbitrary data sinks | Databricks on AWS

PS. I managed to find the post with identical case, you can try to double check steps that @Retired_mod posted there.

Error in Spark Streaming with foreachBatch and Dat... - Databricks Community - 68843

topic structured streaming hangs when writing or sometimes reading depends on SINGLE USER or shared mode in Data Engineering

structured streaming hangs when writing or sometimes reading depends on SINGLE USER or shared mode

Re: structured streaming hangs when writing or sometimes reading depends on SINGLE USER or shared mo