The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...
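For reference, a minimal sketch of the setup under discussion, a tumbling window with a watermark in append mode, using the built-in rate source so it runs standalone (the window and watermark durations are illustrative):

from pyspark.sql import functions as F

# the rate source emits `timestamp` and `value` columns; handy for testing
events = (spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load())

counts = (events
    .withWatermark("timestamp", "1 minute")        # bound on late data
    .groupBy(F.window("timestamp", "5 minutes"))   # tumbling window
    .count())

# In append mode a window is emitted only once the watermark passes its end,
# and the watermark advances only when new events arrive - hence the stuck
# final window when the input goes quiet.
query = (counts.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("windowed_counts")
    .start())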
Issue: Spark Structured Streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit, with 'Content-Type: application/json', http://host:port/api/v1/applications/app-id/streaming/st...
Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark
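For the Delta source, per-batch size is governed by the reader options maxFilesPerTrigger (default 1000 files per micro-batch, per the Delta streaming docs) and maxBytesPerTrigger (a soft cap, unset by default). A minimal sketch with a hypothetical path:

df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)   # at most 100 files per micro-batch
    .load("/mnt/bronze/events"))         # hypothetical path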
Hi, everyone. Currently I am trying to implement Spark Structured Streaming with PySpark, and I would like to merge multiple rows into a single row with an array, then sink it to a downstream message queue for another service to use. A related example follows:* Befor...
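A hedged sketch of one way to do this merge, assuming hypothetical columns user_id, event, and event_time on the input stream:

from pyspark.sql import functions as F

# input_df stands in for the streaming DataFrame from the question
merged = (input_df
    .withWatermark("event_time", "10 minutes")                # required for append mode
    .groupBy("user_id", F.window("event_time", "5 minutes"))
    .agg(F.collect_list("event").alias("events")))            # many rows -> one array column

# serialize to a single string column for a message-queue sink
out = merged.select(F.to_json(F.struct("user_id", "events")).alias("value"))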
Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that):
# read data
df = (spark.readStream
    .format("eventhubs")    # Azure Event Hubs Spark connector
    .options(**ehConf)      # connection string, consumer group, etc.
    .load())
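A hedged sketch of the write-back half the post describes, using the same connector; ehWriteConf and the checkpoint path are hypothetical, and the connector expects the payload in a column named `body`:

from pyspark.sql import functions as F

query = (df
    .select(F.to_json(F.struct("*")).alias("body"))   # serialize the payload
    .writeStream
    .format("eventhubs")
    .options(**ehWriteConf)                           # hypothetical write config
    .option("checkpointLocation", "/mnt/checkpoints/eh-out")
    .start())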
I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing low data volume but get taxed on the amount of write (a...
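A minimal sketch of the fix this answer points at: lower the shuffle partition count (or coalesce just before the sink) so a low-volume stream does not emit one small file per default partition. The values are illustrative:

spark.conf.set("spark.sql.shuffle.partitions", "8")   # default is 200

# or reduce partitions only at the sink:
# df.coalesce(4).writeStream ...

Note that for a stateful streaming query the shuffle partition count is baked into the checkpoint, so the new setting only takes effect with a fresh checkpoint location.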
I would like to ask how to implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute with Terraform, because we will upgrade the Spark application code version. Currently we find that every deployment cancels the original...
@Mars Su: Yes, you can implement zero downtime deployment of Spark Structured Streaming in Databricks job compute using Terraform. One way to achieve this is by using Databricks' "job clusters" feature, which allows you to create a cluster specifica...
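One ingredient of such a deployment, sketched under the common "stop marker" assumption (the marker path and helper are hypothetical): the running query polls for a flag written by the deployment pipeline and stops cleanly, so the replacement job can resume from the same checkpoint:

import os
import time

STOP_MARKER = "/dbfs/tmp/streaming/stop_marker"   # hypothetical flag path

def await_stop(query, marker_path, poll_seconds=30):
    # block until the deployment writes the marker, then stop the query;
    # the checkpoint guarantees the next run resumes consistently
    while query.isActive:
        if os.path.exists(marker_path):
            query.stop()
            break
        time.sleep(poll_seconds)

await_stop(query, STOP_MARKER)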
Hi, I have a scenario where a writeStream query writes the stream data to a bronze location, and I have to read from bronze, do some processing, and finally write it to silver. I use an S3 location for the Delta tables. But on the very first execution, readStream ...
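A hedged sketch of one guard for that first run, assuming bronze is a Delta location (the path is hypothetical): only start the silver stream once the bronze table exists:

from delta.tables import DeltaTable

bronze_path = "s3://my-bucket/bronze/events"   # hypothetical

if DeltaTable.isDeltaTable(spark, bronze_path):
    silver_input = (spark.readStream
        .format("delta")
        .load(bronze_path))
else:
    print("Bronze not created yet; start the silver job after the first bronze batch commits.")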
Hello everyone, I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that will read the DLT as a stream (dlt.read_stream("<tablename>")). I keep receiving thi...
In DLT read_stream, we can't use ignoreChanges / ignoreDeletes. These configs help avoid the failures, but they actually ignore the operations done on the upstream table. So you need to manually perform the deletes or updates in the downstrea...
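For contrast, a minimal sketch of the plain (non-DLT) Delta stream where these options do apply; the path is hypothetical. ignoreChanges re-emits whole rewritten files (so downstream may see duplicate rows), and neither option propagates the upstream change itself:

df = (spark.readStream
    .format("delta")
    .option("ignoreChanges", "true")   # tolerate upstream UPDATE/MERGE/DELETE
    .load("/mnt/silver/deduped"))      # hypothetical path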
Is it normal for KinesisSource to generate empty microbatches when there is no new data in Kinesis? Batch 1 finished as there were records in Kinesis, and BatchId 2 started. BatchId 2 was running, but then BatchId 3 started. Even though there was no m...
I want to fetch new data from the Kinesis source every minute. I'm using the "minFetchPeriod" option and specified 60s, but this doesn't seem to be working. Streaming query:
spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", kinesis_stream_...
Hello to everyone! I am trying to read a Delta table as a streaming source using Spark, but my microbatches are unbalanced: one is very small and the others are very large. How can I limit this? I used different configurations with maxBytesPerTrigger and m...
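A sketch of capping micro-batch size on a Delta streaming read; the table name and values are illustrative. maxBytesPerTrigger is a soft cap (the smallest unit admitted is a whole file), and note that these rate-limit options are ignored with Trigger.Once but respected with Trigger.AvailableNow:

df = (spark.readStream
    .format("delta")
    .option("maxBytesPerTrigger", "256m")   # soft upper bound per micro-batch
    .option("maxFilesPerTrigger", 50)       # cap on files per micro-batch
    .table("bronze.events"))                # hypothetical table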
According to the documentation, you can monitor a Spark Structured Streaming job using QueryExecutionListener; however, I cannot find it. https://docs.databricks.com/structured-streaming/stream-monitoring.html#language-python
ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 617, in _call_proxy
retu...
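The py4j error above typically means a Python-side listener callback raised an exception. For reference, a hedged sketch of the streaming-specific listener available in PySpark as of Spark 3.4 / recent DBR (QueryExecutionListener itself is a JVM-side Scala interface, which may be why it is hard to find from Python):

from pyspark.sql.streaming import StreamingQueryListener

class ProgressLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        print(f"progress: {event.progress.json}")

    def onQueryIdle(self, event):
        pass  # emitted when a trigger finds no new data (Spark 3.5+)

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

spark.streams.addListener(ProgressLogger())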
Hello, I think the only way of handling this is to specify the schema within the job through a schema file. The other way is to restart the job to infer the new schema automatically.
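A minimal sketch of the first approach, assuming hypothetical paths: load the schema from a JSON file and pass it to the stream reader instead of relying on inference:

import json
from pyspark.sql.types import StructType

with open("/dbfs/schemas/events.json") as f:   # hypothetical schema file
    schema = StructType.fromJson(json.load(f))

df = (spark.readStream
    .schema(schema)                  # fixed schema instead of inference
    .format("json")
    .load("/mnt/landing/events/"))   # hypothetical source path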