Structured Streaming hangs when writing (or sometimes reading) depending on SINGLE USER or shared mode

maoutir
New Contributor

Hi guys,

I'm new to this community. I'm starting a new project with Azure Databricks and a Python script on my Mac that manipulates data coming from a remote Databricks cluster in single-user data security mode (reading from Delta Sharing tables and inserting into a local Postgres database). I'm facing a strange error when writing a stream from a DataFrame coming from Azure Databricks to a Postgres table.

First, I use databricks-connect==14.2.1 to create a session for our Databricks cluster; below is the code snippet:

from databricks.connect import DatabricksSession

spark = (DatabricksSession.builder.sdkConfig(__config)
         .remote()
         .getOrCreate())
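
(For reference, __config is the databricks-sdk Config object handed to sdkConfig(); a minimal sketch with placeholder values would look roughly like this:)

from databricks.sdk.core import Config

# Placeholder workspace details only, not the real values.
__config = Config(
    host="https://<workspace-url>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
)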

Second, I read from the Databricks table using the deltaSharing protocol with the change data feed option:

df = (spark.readStream.format("deltaSharing")
              .option("readChangeFeed", "true")
              .load(my_table_path))
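
(For context, my_table_path follows the Delta Sharing addressing scheme of profile file plus share, schema, and table; the names below are placeholders:)

# Delta Sharing tables are addressed as <profile-file>#<share>.<schema>.<table>.
my_table_path = "/path/to/config.share#my_share.my_schema.my_table"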

Third, I use the above DataFrame to create a writeStream job using the micro-batch feature with the foreachBatch method:

def process_df18(df, batch_id):
    # The implementation doesn't matter here; the debug breakpoint is never
    # reached because of the exception described below.
    pass

(df.writeStream
    .foreachBatch(process_df18)
    .outputMode("update")
    .trigger(processingTime="30 seconds")
    .option("checkpointLocation", f"{__checkpoint_location}")
    .start()
    .awaitTermination())
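
(For completeness, the eventual goal is for process_df18 to push each micro-batch into the Postgres table; a minimal sketch of that body, with placeholder connection details and assuming the PostgreSQL JDBC driver is available where the batch runs, would be:)

def process_df18(batch_df, batch_id):
    # Sketch only: append the micro-batch to Postgres over JDBC.
    # URL, table, and credentials are placeholders.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://<postgres-host>:5432/<database>")
        .option("dbtable", "<schema>.<table>")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())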

When I run the script I always get this error:

No PYTHON_UID found for the session (a random uuid)

My full dependencies:

alembic==1.13.1
annotated-types==0.7.0
async-timeout==4.0.3
asyncpg==0.29.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
databricks-connect==14.2.1
databricks-sdk==0.28.0
et-xmlfile==1.1.0
google-auth==2.29.0
googleapis-common-protos==1.63.0
greenlet==3.0.3
grpcio==1.64.0
grpcio-status==1.62.2
idna==3.7
kink==0.8.0
loguru==0.7.2
Mako==1.3.5
MarkupSafe==2.1.5
numpy==1.26.4
openpyxl==3.1.3
pandas==2.2.2
protobuf==4.25.3
psycopg2-binary==2.9.9
py4j==0.10.9.7
pyarrow==16.1.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.5.3
pydantic-settings==2.1.0
pydantic_core==2.14.6
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
requests==2.32.3
rsa==4.9
six==1.16.0
SQLAlchemy==2.0.25
tqdm==4.66.4
typing_extensions==4.12.1
tzdata==2024.1
urllib3==2.2.1
win32-setctime==1.1.0

But when I switch to shared mode I get another error:

[UNSUPPORTED_STREAMING_SOURCE_PERMISSION_ENFORCED] Data source deltaSharing is not supported as a streaming source on a shared cluster. SQLSTATE: 0A000

Versions:

Databricks Runtime: 14.2

Local Python installed: 3.10.2

OS: macOS 14.0

Any help would be greatly appreciated and would save my journey.

Thanks.



1 REPLY

szymon_dybczak
Contributor III

Hi,

I reckon I've seen this problem before, and if I remember correctly there was an issue related to Databricks Connect and single-user mode.
I think your code will work if you use it directly in a notebook, but it will fail with Databricks Connect.
About the second error, it's pretty self-explanatory: it looks like Delta Sharing is not supported as a streaming source on a shared-mode cluster.

And since you are using Databricks Runtime >= 14.0, make sure to read about the behavior changes for foreachBatch on compute configured with shared access mode:

Use foreachBatch to write to arbitrary data sinks | Databricks on AWS
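
From what I remember of those changes (please double-check the doc above), the practical takeaway is to keep the batch handler fully self-contained; roughly something like this, where the column name is just an illustration:

def process_df18(batch_df, batch_id):
    # Do imports and define helpers inside the handler instead of relying on
    # objects from the outer driver script (my reading of the shared access
    # mode changes; verify against the linked doc).
    from pyspark.sql import functions as F
    tagged = batch_df.withColumn("batch_id", F.lit(batch_id))
    # write `tagged` to the sink from inside the function as well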

PS. I managed to find a post with an identical case; you can try to double-check the steps that @Retired_mod posted there:

Error in Spark Streaming with foreachBatch and Dat... - Databricks Community - 68843
