Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

RateVan
by New Contributor II
  • 2623 Views
  • 4 replies
  • 0 kudos

Spark last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...

Latest Reply
Dtank
New Contributor II
  • 0 kudos

Do you have any solution for this?

3 More Replies
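For context, a minimal sketch of the behaviour under discussion (the rate source and column names are illustrative, not from the original post): with a tumbling window in append mode, a window is emitted only after the watermark passes its end, and the watermark advances only when newer events arrive, so the last open window stays unflushed if the input stops.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative source; substitute your real stream (Kafka, Event Hubs, ...).
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
        .withColumnRenamed("timestamp", "event_time")
)

# Tumbling 1-minute window with a 30-second watermark.
windowed_counts = (
    events
        .withWatermark("event_time", "30 seconds")
        .groupBy(F.window("event_time", "1 minute"))
        .count()
)

# In append mode a window is emitted only once the watermark moves past its end,
# and the watermark advances only on newer input, hence the behaviour described above.
query = (
    windowed_counts.writeStream
        .outputMode("append")
        .format("console")
        .start()
)
```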
swetha
by New Contributor III
  • 3122 Views
  • 4 replies
  • 1 kudos

"Error: no streaming listener attached to the spark app" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: Spark Structured Streaming application. After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit the 'Content-Type: application/json' http://host:port/...

Latest Reply
INJUSTIC
New Contributor II
  • 1 kudos

Have you found the solution? Thanks

3 More Replies
swetha
by New Contributor III
  • 2720 Views
  • 3 replies
  • 1 kudos

I am unable to attach a streaming listener to a Spark streaming job. "Error: no streaming listener attached to the spark application" is the error we are observing after accessing the streaming statistics API. Please help us with this issue ASAP. Thanks.

Issue: After adding the listener jar file in the cluster init script, the listener is working (from what I see in the stdout/log4j logs). But when I try to hit the 'Content-Type: application/json' http://host:port/api/v1/applications/app-id/streaming/st...

Latest Reply
INJUSTIC
New Contributor II
  • 1 kudos

Have you found the solution? Thanks

2 More Replies
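Not a direct answer to the REST endpoint question above, but for Structured Streaming a listener can also be registered from Python without an init-script jar (StreamingQueryListener, available in PySpark 3.4+/recent DBR); a minimal sketch:

```python
from pyspark.sql.streaming import StreamingQueryListener

class ProgressLogger(StreamingQueryListener):
    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        # event.progress carries per-batch statistics (input rows, durations, ...).
        print(event.progress.batchId, event.progress.numInputRows)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

# Register on the session before starting any streaming query.
spark.streams.addListener(ProgressLogger())
```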
Data_Engineer3
by Contributor III
  • 2955 Views
  • 5 replies
  • 0 kudos

Default maximum spark streaming chunk size in delta files in each batch?

Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Doc: https://docs.databricks.com/en/structured-streaming/delta-lake.html Also, what is the challenge while using foreachBatch?

4 More Replies
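To make the linked page concrete: with Delta as the streaming source, the per-batch size is controlled by maxFilesPerTrigger (1000 files per micro-batch by default) and optionally maxBytesPerTrigger; a sketch with illustrative values and path:

```python
bronze_stream = (
    spark.readStream
        .format("delta")
        # Hard cap on files per micro-batch; the default is 1000 files.
        .option("maxFilesPerTrigger", 200)
        # Approximate soft cap on bytes per micro-batch.
        .option("maxBytesPerTrigger", "512m")
        .load("/mnt/bronze/events")   # illustrative path
)
```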
MarsSu
by New Contributor II
  • 8601 Views
  • 3 replies
  • 0 kudos

How to merge multiple rows into a single row with an array without running into OOM?

Hi, everyone. Currently I am trying to implement Spark Structured Streaming with PySpark, and I would like to merge multiple rows into a single row with an array and sink it to a downstream message queue for another service to use. A related example follows: * Befor...

Latest Reply
917074
New Contributor II
  • 0 kudos

Is there any solution to this, @MarsSu? Were you able to solve it? Kindly shed some light on this if you resolved it.

2 More Replies
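One common way to do this without keeping unbounded state is to collapse rows inside foreachBatch, so the grouping happens per micro-batch with ordinary batch APIs; a sketch, assuming df is the streaming DataFrame and with the key, columns, and sink as placeholders:

```python
from pyspark.sql import functions as F

def write_grouped(batch_df, batch_id):
    # Collapse all rows of a key in this micro-batch into one row holding an array.
    grouped = (
        batch_df
            .groupBy("order_id")   # placeholder key column
            .agg(F.collect_list(F.struct("item_id", "qty")).alias("items"))
    )
    # Replace with your message-queue writer (Kafka, Event Hubs, ...).
    grouped.write.format("delta").mode("append").save("/mnt/out/orders_grouped")

query = (
    df.writeStream
        .foreachBatch(write_grouped)
        .option("checkpointLocation", "/mnt/chk/orders_grouped")
        .start()
)
```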
lnights
by New Contributor II
  • 4886 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Attached chart: transactions in Azure storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...

4 More Replies
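A sketch of the mitigation PetePP describes: reduce the shuffle partition count (or coalesce before writing) so each micro-batch writes a handful of files rather than the default 200, which is what multiplies the storage transactions; values and paths are illustrative:

```python
# Fewer shuffle partitions => fewer output files (and storage transactions) per micro-batch.
spark.conf.set("spark.sql.shuffle.partitions", 8)

# Alternatively, reduce the partition count just before the write:
query = (
    df.coalesce(8)
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/chk/eventhub_sink")   # illustrative path
        .start("/mnt/silver/eventhub_data")                       # illustrative path
)
```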
MarsSu
by New Contributor II
  • 9117 Views
  • 5 replies
  • 1 kudos

Resolved! Zero-downtime deployment of a Spark Structured Streaming Databricks job with Terraform.

I would like to ask how to implement zero-downtime deployment of Spark Structured Streaming in Databricks job compute with Terraform, because we will upgrade the Spark application code version. But currently we find that every deployment will cancel the original...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Mars Su: Yes, you can implement zero-downtime deployment of Spark Structured Streaming in Databricks job compute using Terraform. One way to achieve this is by using Databricks' "job clusters" feature, which allows you to create a cluster specifica...

4 More Replies
pranathisg97
by New Contributor III
  • 3335 Views
  • 2 replies
  • 1 kudos

readStream query throws exception if there's no data in delta location.

Hi, I have a scenario where a writeStream query writes the stream data to a bronze location, and I have to read from bronze, do some processing, and finally write it to silver. I use an S3 location for the Delta tables. But for the very first execution, readStream ...

Latest Reply
Vartika
Databricks Employee
  • 1 kudos

Hi @Pranathi Girish, hope all is well! Checking in: if @Suteja Kanuri's answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information? We'd love to hear from you. Thanks!

1 More Replies
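A common workaround for that first execution is to make sure the bronze path is already a Delta table (even an empty one with the expected schema) before the readStream starts; a sketch with an illustrative schema and S3 path:

```python
from delta.tables import DeltaTable
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

bronze_path = "s3://my-bucket/bronze/events"   # illustrative path

# Create an empty Delta table with the expected schema if it does not exist yet,
# so the very first readStream has a table and schema to bind to.
if not DeltaTable.isDeltaTable(spark, bronze_path):
    schema = StructType([
        StructField("id", StringType()),
        StructField("event_time", TimestampType()),
    ])
    spark.createDataFrame([], schema).write.format("delta").save(bronze_path)

bronze_df = spark.readStream.format("delta").load(bronze_path)
```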
adrianlwn
by New Contributor III
  • 13330 Views
  • 14 replies
  • 16 kudos

How to activate ignoreChanges in a Delta Live Tables read_stream?

Hello everyone, I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that will read the DLT as a stream (dlt.read_stream("<tablename>")). I keep receiving thi...

Latest Reply
gopínath
New Contributor II
  • 16 kudos

In DLT read_stream, we can't use ignoreChanges / ignoreDeletes. These are the configs that help avoid the failures, but they actually ignore the operations done on the upstream. So you need to manually perform the deletes or updates in the downstrea...

13 More Replies
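For reference, ignoreChanges / ignoreDeletes are reader options of the plain Delta streaming source; as the reply notes they can't be applied through dlt.read_stream inside a pipeline, but outside DLT they look like this (sketch, path illustrative):

```python
# Outside DLT: tell the Delta streaming source not to fail on update/delete commits
# in the upstream table (the downstream must tolerate re-delivered rows).
changes_df = (
    spark.readStream
        .format("delta")
        .option("ignoreChanges", "true")    # or "ignoreDeletes" for delete-only upstreams
        .load("/mnt/silver/deduplicated")   # illustrative path
)
```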
pranathisg97
by New Contributor III
  • 1564 Views
  • 2 replies
  • 0 kudos

KinesisSource generates empty microbatches when there is no new data.

Is it normal for KinesisSource to generate empty microbatches when there is no new data in Kinesis? Batch 1 finished as there were records in Kinesis, and BatchId 2 started. BatchId 2 was running but then BatchId 3 started. Even though there was no m...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pranathi Girish, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
pranathisg97
by New Contributor III
  • 3534 Views
  • 7 replies
  • 0 kudos

Resolved! Fetch new data from kinesis for every minute.

I want to fetch new data from the Kinesis source every minute. I'm using the "minFetchPeriod" option and specified 60s, but this doesn't seem to be working. Streaming query: spark \ .readStream \ .format("kinesis") \ .option("streamName", kinesis_stream_...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pranathi Girish, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking "Select As Best" if it does. Your feedb...

6 More Replies
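Alongside minFetchPeriod, the micro-batch cadence itself is normally set with a processing-time trigger; a sketch combining the two (stream name, region, and paths are illustrative):

```python
query = (
    spark.readStream
        .format("kinesis")
        .option("streamName", "my-kinesis-stream")          # illustrative
        .option("region", "us-east-1")                      # illustrative
        .option("initialPosition", "latest")
        .option("minFetchPeriod", "60s")                    # throttle fetches from Kinesis
        .load()
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/chk/kinesis")   # illustrative path
        .trigger(processingTime="60 seconds")               # one micro-batch per minute
        .start("/mnt/bronze/kinesis_events")                # illustrative path
)
```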
Lulka
by New Contributor II
  • 4734 Views
  • 2 replies
  • 2 kudos

Resolved! How to limit the input rate when reading a Delta table as a stream?

Hello everyone! I am trying to read a Delta table as a streaming source using Spark, but my microbatches are unbalanced: one is very small and the others are very large. How can I limit this? I used different configurations with maxBytesPerTrigger and m...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Besides the parameters you mention, I don't know of any other that controls the batch size. Did you check whether the Delta table is horribly skewed?

1 More Replies
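To follow up on the skew question: the file layout of the source table can be checked directly, since a batch capped with maxFilesPerTrigger can still be huge if a few files are oversized; a sketch (table path illustrative):

```python
# Number of files and total size of the current snapshot.
spark.sql("DESCRIBE DETAIL delta.`/mnt/source/table`") \
     .select("numFiles", "sizeInBytes") \
     .show()

# If a handful of files are far larger than the rest, maxFilesPerTrigger alone
# still yields uneven micro-batches; compacting to evenly sized files helps:
spark.sql("OPTIMIZE delta.`/mnt/source/table`")
```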
chanansh
by Contributor
  • 1449 Views
  • 1 replies
  • 0 kudos

QueryExecutionListener cannot be found in pyspark

According to the documentation you can monitor a Spark Structured Streaming job using QueryExecutionListener. However, I cannot find it. https://docs.databricks.com/structured-streaming/stream-monitoring.html#language-python

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Which DBR version are you using? Also, can you share a code snippet showing how you are using the QueryExecutionListener?

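QueryExecutionListener itself is a JVM-side (Scala/Java) interface, which is why it cannot be imported from PySpark; the Python options on that monitoring page are the progress objects on the query (any DBR) or, on newer runtimes, a Python StreamingQueryListener. A sketch of the progress-polling route, with illustrative paths:

```python
query = (
    df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/chk/monitoring_demo")   # illustrative
        .start("/mnt/out/monitoring_demo")                          # illustrative
)

# Most recent micro-batch metrics (input rows, durations, state operators, ...).
print(query.lastProgress)

# Rolling window of recent progress reports.
for p in query.recentProgress:
    print(p["batchId"], p["numInputRows"])
```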
chanansh
by Contributor
  • 1334 Views
  • 1 replies
  • 0 kudos

Running the stateful Spark streaming example from https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html fails

ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side. Traceback (most recent call last): File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 617, in _call_proxy retu...

Latest Reply
Debayan
Databricks Employee
  • 0 kudos

Hi, it looks like the failure is coming from the Python (py4j) side. Could you please provide the whole error snippet?

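For anyone comparing against that blog post, a minimal runnable shape of the applyInPandasWithState API it exercises (Spark 3.4+/DBR 11.3+); the key column, schemas, and the events DataFrame are illustrative:

```python
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_per_key(key, pdf_iter, state: GroupState):
    # Running count of rows seen for this key across micro-batches.
    (running,) = state.get if state.exists else (0,)
    for pdf in pdf_iter:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"id": [key[0]], "count": [running]})

counted = (
    events.groupBy("id")   # `events` is an illustrative streaming DataFrame
        .applyInPandasWithState(
            count_per_key,
            outputStructType="id string, count long",
            stateStructType="count long",
            outputMode="update",
            timeoutConf=GroupStateTimeout.NoTimeout,
        )
)
```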
Leszek
by Contributor
  • 1128 Views
  • 1 replies
  • 3 kudos

How to handle schema changes in streaming Delta Tables?

I'm using Structured Streaming to move data from one Delta table to another. How do I handle schema changes in those tables (e.g., adding a new column)?

Latest Reply
Murthy1
Contributor II
  • 3 kudos

Hello, I think the only way to handle this is to specify the schema within the job through a schema file. The other way is to restart the job so it infers the new schema automatically.

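On the write side, newly added columns can be absorbed with Delta schema evolution (mergeSchema), while a schema change on the source Delta table still stops the stream and needs a restart, which matches the reply above; a sketch with illustrative paths:

```python
query = (
    spark.readStream.format("delta").load("/mnt/source/table")         # illustrative
        .writeStream
        .format("delta")
        # Let the target table pick up newly added columns instead of failing.
        .option("mergeSchema", "true")
        .option("checkpointLocation", "/mnt/chk/schema_evolution")     # illustrative
        .start("/mnt/target/table")                                    # illustrative
)
```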