Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...
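For readers hitting the same behaviour, here is a minimal sketch of the setup being described (column names, intervals, and the sink are illustrative, not from the original thread):
```python
from pyspark.sql.functions import window, col, count

# Sketch: 10-minute tumbling window with a 5-minute watermark in append mode.
# A window is only emitted once the watermark passes its end, and the watermark
# only advances when newer events arrive - hence the "last window never closes" effect.
agg = (events                                   # assumed: streaming DataFrame with an 'event_time' column
    .withWatermark("event_time", "5 minutes")
    .groupBy(window(col("event_time"), "10 minutes"))
    .agg(count("*").alias("cnt")))

query = (agg.writeStream
    .outputMode("append")
    .format("console")
    .start())
```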
I am trying to connect to my Kafka from Spark but am getting an error.
Kafka version: 2.4.1
Spark version: 3.3.0
I am using a Jupyter notebook to execute the PySpark code below:
```
from pyspark.sql.functions import *
from pyspark.sql.types import *
#import libr...
```
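A minimal working sketch of a Kafka read in Structured Streaming, assuming the spark-sql-kafka package matching the Spark version is on the classpath (broker address and topic name are placeholders):
```python
# Sketch: requires org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 on the classpath
# (e.g. via spark.jars.packages); a missing or mismatched package is a common cause
# of "Failed to find data source: kafka" style errors.
df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "my_topic")                    # placeholder
    .option("startingOffsets", "latest")
    .load())

# Kafka delivers key/value as binary; cast before parsing.
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS value")
```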
Issue: Spark Structured Streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit the 'Content-Type: application/json' http://host:port/...
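A hedged sketch of querying that kind of endpoint from Python (host, port, and app id are the placeholders from the post; note that the /streaming/* monitoring endpoints cover DStream applications, so a Structured Streaming query may not show up there):
```python
import requests

# Sketch only: hit the Spark monitoring REST API and inspect the raw response.
url = "http://host:port/api/v1/applications/app-id/streaming/statistics"  # placeholders
resp = requests.get(url, headers={"Content-Type": "application/json"})
print(resp.status_code)
print(resp.text)
```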
Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark
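For reference, a hedged sketch of the rate-limiting options that control how much a Delta streaming source reads per micro-batch (the path is a placeholder; the default noted in the comment applies to the Delta source specifically):
```python
# Sketch: if neither option is set, the Delta source defaults to maxFilesPerTrigger = 1000
# files per micro-batch; maxBytesPerTrigger instead soft-caps the batch by data volume.
df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 500)      # cap by file count
    .option("maxBytesPerTrigger", "1g")     # soft cap by bytes
    .load("/mnt/source/delta_table"))       # placeholder path
```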
I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark DLT. I have been very successful with this previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 buck...
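A hedged sketch of one way to keep the endpoints separate when they share a bucket, assuming each endpoint writes under its own prefix (the table name, bucket, and prefix are invented for illustration):
```python
import dlt

# Sketch: one DLT table per endpoint prefix, so differing JSON schemas are not
# mixed into a single Auto Loader stream.
@dlt.table(name="endpoint_a_raw")
def endpoint_a_raw():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/endpoint_a/"))   # placeholder prefix
```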
@Carter Mooring Thank you SO MUCH for coming back to provide a solution to your thread! Happy you were able to figure this out so quickly. And I am sure that this will help someone in the future with the same issue.
Hi, I want to keep track of the streaming lag from the source table, which is a Delta table. I see that in the query progress logs there is some information about the last version and the last file in the version for the end offset, but this doesn't give ...
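As a hedged illustration of what can be pulled out of those progress logs, one can compare the stream's end offset with the source table's latest version (the query variable and table path are placeholders, and this yields a version lag, not a row count):
```python
# Sketch: the Delta source reports its position as a reservoirVersion inside the end offset.
progress = query.lastProgress                        # assumed: a running StreamingQuery
end_offset = progress["sources"][0]["endOffset"]     # dict with reservoirVersion, index, ...
stream_version = end_offset["reservoirVersion"]

latest_version = (spark.sql("DESCRIBE HISTORY delta.`/mnt/source/delta_table` LIMIT 1")
                  .collect()[0]["version"])          # placeholder path
print("approximate version lag:", latest_version - stream_version)
```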
Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in to see whether you were able to resolve your issue or whether you need more help. We'd love to hear from you. Thanks!
Hi All, I'm facing an issue with my Spark Streaming job. It gets stuck in the "Stream Initializing" phase for more than 3 hours. I need your help here to understand what happens internally at the "Stream Initializing" phase of the Spark Streaming job tha...
I'm facing the same issue when I try to run this example: Create a monitor using the API | Databricks on AWS (Inference Lakehouse Monitor regression example notebook). Any idea?
Hi there, I read data from Azure Event Hubs and, after manipulating the data, I write the DataFrame back to Event Hubs (I use this connector for that):
# read data
df = (spark.readStream
.format("eventhubs")
.options(**ehConf)
...
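For anyone landing here, a minimal sketch of the write-back side with that connector might look like the following (ehWriteConf and the checkpoint path are assumptions, not from the original post):
```python
from pyspark.sql.functions import to_json, struct

# Sketch: the Event Hubs connector expects the payload in a column named "body".
out = df.select(to_json(struct("*")).alias("body"))

query = (out.writeStream
    .format("eventhubs")
    .options(**ehWriteConf)                                # assumed config for the target hub
    .option("checkpointLocation", "/tmp/eh_checkpoint")    # assumed path
    .start())
```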
I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in the number of files being equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
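As a hedged illustration of what this reply describes, lowering the shuffle partition count before starting the stream reduces the number of files written per micro-batch (the value 8 is arbitrary):
```python
# Sketch: fewer shuffle partitions -> fewer output/state partitions -> fewer small files.
# For stateful streaming queries, set this before the first start of the query,
# since the value is captured in the checkpoint.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```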
I have a continuously running streaming job that I would like to stop over the weekend and start again on Monday. Here is my streaming job code: (spark.readStream.format("delta").load(input_path).writeStream.option("checkpointLocation", input_checkpoint_p...
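A common pattern for this kind of schedule (a sketch, not the original poster's code; the checkpoint path and target table are assumed) is to run the same query with an availableNow trigger from a scheduled job, so it drains whatever is new and then stops on its own:
```python
# Sketch: process all available data, then stop; schedule the job for the days it should run.
query = (spark.readStream.format("delta").load(input_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)   # assumed variable name
    .trigger(availableNow=True)                      # Spark 3.3+; on older versions use once=True
    .toTable("target_table"))                        # assumed sink; the original is truncated
```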
I am new to real-time scenarios and I need to create Spark Structured Streaming jobs in Databricks. I am trying to apply some rule-based validations from backend configurations on each incoming JSON message. I need to do the following actions on th...
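A hedged sketch of one way to apply configurable rules to each incoming JSON message (the schema, rules, and the raw_stream DataFrame are invented for illustration):
```python
from pyspark.sql.functions import from_json, col, expr
from pyspark.sql.types import StructType, StringType, IntegerType

# Assumed message schema and rules loaded from a backend configuration store.
schema = StructType().add("id", StringType()).add("amount", IntegerType())
rules = {"amount_positive": "amount > 0", "id_not_null": "id IS NOT NULL"}

parsed = (raw_stream                       # assumed: streaming DataFrame with a string 'value' column
    .select(from_json(col("value"), schema).alias("msg"))
    .select("msg.*"))

# Flag each message as valid/invalid so both paths can be routed downstream.
valid_expr = " AND ".join(f"({r})" for r in rules.values())
validated = parsed.withColumn("is_valid", expr(valid_expr))
```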
Hi All, I recently published a streaming data comparison between Snowflake and Databricks. Hope you enjoy! Please let me know what you think! https://medium.com/@24chynoweth/data-streaming-at-scale-databricks-and-snowflake-ca65a2401649
I am reading a source table which gets updated every day. It is usually appended to or merged with updates, and is occasionally overwritten for other reasons. df = spark.readStream.schema(schema).format("delta").option("ignoreChanges", True).option('starting...
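For context, a hedged sketch of a Delta streaming read that tolerates updates and overwrites in the source (path and version are placeholders):
```python
# Sketch: ignoreChanges re-emits files rewritten by UPDATE/MERGE/DELETE/OVERWRITE instead of
# failing, so downstream consumers can see duplicate rows and should deduplicate or merge.
df = (spark.readStream
    .format("delta")
    .option("ignoreChanges", "true")
    .option("startingVersion", 10)       # placeholder version
    .load("/mnt/source/delta_table"))    # placeholder path
```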
Hi @Ravi Vishwakarma, Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answ...
I am trying to read a stream from Azure:
(spark.readStream
.format("cloudFiles")
.option('cloudFiles.clientId', CLIENT_ID)
.option('cloudFiles.clientSecret', CLIENT_SECRET)
.option('cloudFiles.tenantId', TENTANT_ID)
.option("header", "true")
.opti...
@Hanan Shteingart: It looks like you're using the Azure Blob Storage connector for Spark to read data from Azure. The error message suggests that the credentials you provided are not being used by the connector. To specify the credentials, you can se...
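A hedged sketch of the service-principal configuration this reply points toward, reusing the CLIENT_ID, CLIENT_SECRET, and TENTANT_ID variables from the question (the storage account name is a placeholder):
```python
# Sketch: OAuth credentials for ABFSS set on the Spark session, so the underlying
# filesystem access can authenticate (separate from the cloudFiles.* options).
storage_account = "mystorageaccount"   # placeholder
base = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{base}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", CLIENT_ID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}", CLIENT_SECRET)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{base}",
               f"https://login.microsoftonline.com/{TENTANT_ID}/oauth2/token")
```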