Hi everyone,
I'm new to this community. I'm starting a new project that uses Azure Databricks together with a Python script on my Mac: the script reads from Delta Sharing tables on a remote Databricks cluster (single-user data security mode) and inserts the data into a local Postgres database. I'm hitting a strange error when writing a stream from a dataframe coming from Azure Databricks to a Postgres table.
First, I use databricks-connect==14.2.1 to create a session against our Databricks cluster; here is the code snippet:
from databricks.connect import DatabricksSession

# __config is a databricks-sdk Config for our workspace and cluster
spark = (DatabricksSession.builder
         .sdkConfig(__config)
         .remote()
         .getOrCreate())
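For context, __config is a databricks-sdk Config object; roughly speaking it is built like this (the host, token, and cluster id below are placeholders, the real values come from my environment):

from databricks.sdk.core import Config

# Placeholder values; the actual ones are loaded from the environment
__config = Config(
    host="https://adb-1234567890123456.7.azuredatabricks.net",
    token="dapiXXXXXXXXXXXXXXXX",
    cluster_id="0101-123456-abcdefgh",
)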
Second, I read from the Databricks table using the deltaSharing format with the change data feed option enabled:
df = (spark.readStream.format("deltaSharing")
.option("readChangeFeed", "true")
.load(my_table_path))
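For reference, my_table_path follows the usual Delta Sharing convention of "<profile-file>#<share>.<schema>.<table>" (the names below are placeholders):

# Placeholder share/schema/table names
my_table_path = "/path/to/config.share#my_share.my_schema.my_table"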
Third, I use the above dataframe to start a writeStream job, processing micro-batches with the foreachBatch method:
(df.writeStream
 .foreachBatch(process_df18)  # defined below
 .outputMode("update")
 .trigger(processingTime="30 seconds")
 .option("checkpointLocation", __checkpoint_location)
 .start()
 .awaitTermination())
def process_df18(df, batch_id):
    # The implementation doesn't matter here; a breakpoint set in this
    # function is never reached because of the exception described below.
    pass
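To give an idea of what process_df18 is supposed to do eventually (it never gets that far), here is a rough sketch of the Postgres insert, with a placeholder connection string and table name:

from sqlalchemy import create_engine

# Placeholder connection string for the local Postgres instance
engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/mydb")

def process_df18(df, batch_id):
    # Collect the micro-batch on the driver and append it to Postgres.
    # Fine for small batches; larger ones would need a different approach.
    df.toPandas().to_sql("my_target_table", engine,
                         if_exists="append", index=False)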
When I run the script, I always get this error:
No PYTHON_UID found for the session (followed by a random UUID)
My full dependency list:
alembic==1.13.1
annotated-types==0.7.0
async-timeout==4.0.3
asyncpg==0.29.0
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
databricks-connect==14.2.1
databricks-sdk==0.28.0
et-xmlfile==1.1.0
google-auth==2.29.0
googleapis-common-protos==1.63.0
greenlet==3.0.3
grpcio==1.64.0
grpcio-status==1.62.2
idna==3.7
kink==0.8.0
loguru==0.7.2
Mako==1.3.5
MarkupSafe==2.1.5
numpy==1.26.4
openpyxl==3.1.3
pandas==2.2.2
protobuf==4.25.3
psycopg2-binary==2.9.9
py4j==0.10.9.7
pyarrow==16.1.0
pyasn1==0.6.0
pyasn1_modules==0.4.0
pydantic==2.5.3
pydantic-settings==2.1.0
pydantic_core==2.14.6
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.1
requests==2.32.3
rsa==4.9
six==1.16.0
SQLAlchemy==2.0.25
tqdm==4.66.4
typing_extensions==4.12.1
tzdata==2024.1
urllib3==2.2.1
win32-setctime==1.1.0
But when I switch the cluster to shared access mode, I get a different error:
[UNSUPPORTED_STREAMING_SOURCE_PERMISSION_ENFORCED] Data source deltaSharing is not supported as a streaming source on a shared cluster. SQLSTATE: 0A000
Versions:
Databricks Runtime: 14.2
Local Python: 3.10.2
OS: macOS 14.0
Any help would be greatly appreciated and would save my journey.
Thanks.