Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hey, I'm trying to perform time window aggregation in two different streams followed by a stream-stream window join, as described here. I'm running Databricks Runtime 13.1, exactly as advised. However, when I'm reproducing the following code: clicksWindow = c...
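For reference, a self-contained sketch of that pattern (a windowed aggregation on each stream, then a join on the window column). It uses the rate source purely so it runs anywhere; the original post uses real click/impression streams, so the names and intervals here are illustrative only:

from pyspark.sql.functions import window, count, col

# Two toy streams from the rate source, standing in for clicks and impressions.
clicks = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumnRenamed("timestamp", "clickTime")
          .withWatermark("clickTime", "10 minutes"))
impressions = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
               .withColumnRenamed("timestamp", "impressionTime")
               .withWatermark("impressionTime", "10 minutes"))

# Time window aggregation on each stream.
clicksWindow = clicks.groupBy(window(col("clickTime"), "1 hour")).agg(count("*").alias("numClicks"))
impressionsWindow = impressions.groupBy(window(col("impressionTime"), "1 hour")).agg(count("*").alias("numImpressions"))

# Stream-stream join on the window column (the pattern the post refers to, available on DBR 13.1).
joined = clicksWindow.join(impressionsWindow, "window", "inner")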
I am trying to read a stream from Azure:

(spark.readStream
.format("cloudFiles")
.option("cloudFiles.clientId", CLIENT_ID)
.option("cloudFiles.clientSecret", CLIENT_SECRET)
.option("cloudFiles.tenantId", TENANT_ID)
.option("header", "true")
.opti...
@Hanan Shteingart: It looks like you're using the Azure Blob Storage connector for Spark to read data from Azure. The error message suggests that the credentials you provided are not being used by the connector. To specify the credentials, you can se...
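A minimal sketch of one common way to pass a service principal's credentials to the ABFS driver via Spark configuration; the storage account name is a placeholder, and CLIENT_ID, CLIENT_SECRET, and TENANT_ID are the same variables as in the question:

# Sketch: configure OAuth for an ADLS Gen2 account so the underlying filesystem can authenticate.
# "<storage-account>" is a placeholder; CLIENT_ID, CLIENT_SECRET, TENANT_ID come from the post above.
account = "<storage-account>"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", CLIENT_ID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", CLIENT_SECRET)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/token")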
I'm trying to build a gold-level streaming live table based on two streaming silver live tables with a left join. This attempt fails with the following error: "Append mode error: Stream-stream LeftOuter join between two streaming DataFrame/Datasets is not suppo...
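Outside of DLT, the usual way to make a streaming left outer join work is to define watermarks on both inputs and add a time-range condition to the join so Spark can bound its state. A sketch under those assumptions; the table, column, and interval names below are illustrative, not from the post:

from pyspark.sql.functions import expr

# Sketch: stream-stream left outer join. Both sides carry a watermark and the join
# condition includes a time range, which is what append mode requires.
impressions = (spark.readStream.table("silver_impressions")
               .withWatermark("impressionTime", "2 hours"))
clicks = (spark.readStream.table("silver_clicks")
          .withWatermark("clickTime", "3 hours"))

joined = impressions.join(
    clicks,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """),
    "leftOuter")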
I am trying to stream Kafka events on Databricks, but the query keeps initializing for hours and doesn't give any output. Can someone help me understand what is actually happening and why the data is not being published? I couldn't find anything about this on the community.
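A minimal Kafka read for debugging; the bootstrap servers and topic name below are placeholders. One common reason for seeing no output is that the default startingOffsets is "latest", so a topic with no new messages produces nothing:

df = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
  .option("subscribe", "events")                        # placeholder topic
  .option("startingOffsets", "earliest")                # default is "latest"; a quiet topic then yields no rows
  .load())

query = (df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("console")            # console sink just to confirm data is flowing
  .option("truncate", "false")
  .start())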
Hi, I am a little confused about when I should use STREAM() when defining a table based on a DLT table. There is a pattern explained in the documentation:

CREATE OR REFRESH STREAMING LIVE TABLE streaming_bronze
AS SELECT * FROM cloud_files(
"s3://p...
Thanks @Landan George. Since "streaming_silver" is a streaming live table, I expected the last line of the code to be: AS SELECT count(*) FROM STREAM(LIVE.streaming_silver) GROUP BY user_id. But, as you can see, "live_gold" is defined by: AS SELECT c...
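For what it's worth, a sketch of the same distinction using the Python DLT API (table and column names follow the thread; the alias is illustrative). Reading with dlt.read_stream processes new rows incrementally, while dlt.read recomputes over the full table, which is what a grouped count needs:

import dlt
from pyspark.sql.functions import count

@dlt.table
def streaming_silver():
    # Incremental: only new rows from the bronze streaming table are processed.
    return dlt.read_stream("streaming_bronze")

@dlt.table
def live_gold():
    # Complete recomputation: no STREAM()/read_stream here, because an aggregate like
    # count(*) per user must be recalculated over the whole silver table on each update.
    return dlt.read("streaming_silver").groupBy("user_id").agg(count("*").alias("num_events"))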
I am having trouble efficiently reading and parsing a large number of stream files in PySpark!
Context
Here is the schema of the stream file that I am reading in JSON. Blank spaces are edits for confidentiality purposes.
root
|-- location_info: ar...
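One thing that usually helps with a large number of JSON stream files is supplying the schema explicitly instead of letting Spark infer it from every file, and bounding how many files each micro-batch picks up. A sketch with a made-up path and a schema reduced to the one field visible above:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

# Hypothetical, heavily reduced schema: only the location_info array is visible in the post.
schema = StructType([
    StructField("location_info", ArrayType(StructType([
        StructField("id", StringType()),   # illustrative nested field
    ]))),
])

df = (spark.readStream
  .format("json")
  .schema(schema)                          # explicit schema avoids per-file inference
  .option("maxFilesPerTrigger", 1000)      # bound the work done per micro-batch
  .load("/mnt/streams/events/"))           # placeholder path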
I'm interested in seeing what others have come up with. Currently I'm using json_normalize(), then taking any additional nested structures and using a loop to pull them out and re-combine them.
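For comparison, a sketch of doing the flattening natively in PySpark so it stays distributed rather than going through pandas; the function assumes nothing beyond a DataFrame with nested structs and arrays such as location_info:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df):
    # Repeatedly promote struct fields to top-level columns and explode arrays
    # until no complex columns remain. Nested names get an underscore-joined prefix.
    while True:
        complex_cols = [(f.name, f.dataType) for f in df.schema.fields
                        if isinstance(f.dataType, (StructType, ArrayType))]
        if not complex_cols:
            return df
        name, dtype = complex_cols[0]
        if isinstance(dtype, StructType):
            expanded = [F.col(f"{name}.{sub.name}").alias(f"{name}_{sub.name}")
                        for sub in dtype.fields]
            df = df.select([F.col(c) for c in df.columns if c != name] + expanded)
        else:
            # explode_outer keeps rows whose array is null or empty
            df = df.withColumn(name, F.explode_outer(name))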
Hi Team, I am setting up a Kafka stream on Databricks to ingest data into Delta, but the cluster has been running for the last 2 hours, the stream still hasn't started, and I am not seeing any failures either.
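When a stream sits initializing for hours without failing, the running query objects usually show whether it is stuck fetching Kafka offsets or just waiting for data. A small sketch that inspects whatever streaming queries are already active on the cluster:

# Inspect the queries that are already running instead of starting a new one.
for q in spark.streams.active:
    print(q.name, q.status)    # e.g. {"message": "Initializing sources", "isDataAvailable": False, ...}
    print(q.lastProgress)      # per-batch metrics: numInputRows, Kafka offsets, durations
    print(q.exception())       # None unless the query has actually failed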