Structured Streaming hangs when writing (and sometimes reading), depending on single-user or shared access mode
07-04-2024 07:32 AM - edited 07-04-2024 08:13 AM
Hi Guys,
I'm new to this community. I'm starting a new project with Azure Databricks and a Python script on my Mac that manipulates data (reading from Delta Sharing tables and inserting into a local Postgres database) coming from a remote Databricks cluster running in single-user data security mode. I'm facing a strange error when writing a stream from a DataFrame coming from Azure Databricks to a Postgres table.
First, I use databricks-connect==14.2.1 to create a session for our Databricks cluster; below is the code snippet:
from databricks.connect import DatabricksSession

spark = (DatabricksSession.builder
         .sdkConfig(__config)
         .remote()
         .getOrCreate())
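For reference, here is a quick sanity check I run to confirm the session is alive before starting any stream (a minimal sketch, nothing project-specific):

print(spark.version)  # Spark Connect reports the server-side Spark version
spark.sql("SELECT 1 AS ok").show()  # trivial round trip to the cluster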
Second, I read from the Databricks table using the deltaSharing protocol with the change data feed option enabled:
df = (spark.readStream.format("deltaSharing")
      .option("readChangeFeed", "true")
      .load(my_table_path))
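For completeness: as far as I understand, the delta-sharing connector also accepts a startingVersion option to pin where the change feed starts, though I haven't relied on it here:

df = (spark.readStream.format("deltaSharing")
      .option("readChangeFeed", "true")
      .option("startingVersion", 1)  # assumption: verify against the connector docs for your version
      .load(my_table_path))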
Third, I use the above DataFrame to create a writeStream job using the micro-batch feature with the foreachBatch method:
(df.writeStream
.foreachBatch(process_df18)
.outputMode("update")
.trigger(processingTime="30 seconds")
.option('checkpointLocation', f'{__checkpoint_location}')
.start()
.awaitTermination())
def process_df18(df, batch_id):
    # The method implementation is not important here; the debug breakpoint
    # is never reached because of the exception described below.
    pass
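For context, the handler is eventually meant to look something like this (a rough sketch; the JDBC URL, table, and credentials are placeholders, and it assumes a Postgres JDBC driver is available where the batch runs):

def process_df18(df, batch_id):
    # Append each micro-batch to Postgres over JDBC (placeholder settings).
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://localhost:5432/mydb")
       .option("dbtable", "public.my_target_table")
       .option("user", "postgres")
       .option("password", "postgres")
       .mode("append")
       .save())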
When I run the script, I always get this error:
No PYTHON_UID found for the session (a random uuid)
My full dependencies:
alembic==1.13.1
annotated-types==0.7.0
async-timeout==4.0.3
asyncpg==0.29.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
databricks-connect==14.2.1
databricks-sdk==0.28.0
et-xmlfile==1.1.0
google-auth==2.29.0
googleapis-common-protos==1.63.0
greenlet==3.0.3
grpcio==1.64.0
grpcio-status==1.62.2
idna==3.7
kink==0.8.0
loguru==0.7.2
Mako==1.3.5
MarkupSafe==2.1.5
numpy==1.26.4
openpyxl==3.1.3
pandas==2.2.2
protobuf==4.25.3
psycopg2-binary==2.9.9
py4j==0.10.9.7
pyarrow==16.1.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.5.3
pydantic-settings==2.1.0
pydantic_core==2.14.6
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
requests==2.32.3
rsa==4.9
six==1.16.0
SQLAlchemy==2.0.25
tqdm==4.66.4
typing_extensions==4.12.1
tzdata==2024.1
urllib3==2.2.1
win32-setctime==1.1.0
But when I switch to shared mode, I get another error:
[UNSUPPORTED_STREAMING_SOURCE_PERMISSION_ENFORCED] Data source deltaSharing is not supported as a streaming source on a shared cluster. SQLSTATE: 0A000
Versions:
Databricks Runtime: 14.2
Local Python installed: 3.10.2
OS: macOS 14.0
Any help would be greatly appreciated and will save my journey.
Thanks.
Labels:
- Spark
07-07-2024 01:42 AM - edited 07-07-2024 01:55 AM
Hi,
I reckon I've seen this problem before, and if I remember correctly there was an issue related to Databricks Connect and single-user mode.
I think your code will work if you run it directly in a notebook, but it will fail with Databricks Connect.
As for the second error, it's pretty self-explanatory: Delta Sharing is not supported as a streaming source on a shared-mode cluster.
And since you are using Databricks Runtime >= 14.0, make sure to read about the behavior changes for foreachBatch on compute configured with shared access mode:
Use foreachBatch to write to arbitrary data sinks | Databricks on AWS
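If I recall the docs correctly, on those runtimes with shared access mode the function runs in an isolated Python process, so anything it references has to be defined inside it (or at least be serializable). A rough sketch of what I mean, with placeholder names:

def self_contained_batch(df, batch_id):
    # Define imports, config, and connection details inside the function;
    # globals from the driver script are not available here.
    jdbc_url = "jdbc:postgresql://db-host:5432/mydb"  # placeholder
    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "public.my_target_table")  # placeholder
       .mode("append")
       .save())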
PS: I managed to find the post with an identical case; you can try to double-check the steps that @Retired_mod posted there.
Error in Spark Streaming with foreachBatch and Dat... - Databricks Community - 68843

