Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Data_Engineer3
by Contributor III
  • 2534 Views
  • 5 replies
  • 0 kudos

Default maximum Spark streaming chunk size for Delta files in each batch?

When working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

See the doc: https://docs.databricks.com/en/structured-streaming/delta-lake.html. Also, what is the challenge you are hitting while using foreachBatch?
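
For reference, the per-batch size of a Delta streaming source is governed by source options rather than a cluster-wide Spark config; per the doc above, maxFilesPerTrigger defaults to 1000 files per micro-batch. A minimal sketch, with a hypothetical table path:

    # Cap how much of the Delta table each micro-batch reads.
    df = (spark.readStream
          .format("delta")
          .option("maxFilesPerTrigger", 100)    # default is 1000 files per batch
          .option("maxBytesPerTrigger", "1g")   # soft byte cap; no byte-based default otherwise
          .load("/mnt/bronze/events"))          # hypothetical path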

4 More Replies
MarsSu
by New Contributor II
  • 8145 Views
  • 3 replies
  • 0 kudos

How to merge multiple rows into a single row with an array without causing OOM?

Hi, everyone. Currently I am trying to implement Spark Structured Streaming with PySpark, and I would like to merge multiple rows into a single row with an array and sink it to a downstream message queue for another service to use. A related example follows: * Befor...
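
One common shape for this is a watermarked groupBy with collect_list, which bounds the aggregation state so it does not grow without limit. A minimal sketch, assuming a streaming DataFrame events with illustrative column names:

    from pyspark.sql import functions as F

    # Collapse many rows per key into one row carrying an array; the
    # watermark bounds how long aggregation state is kept in memory.
    merged = (events
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"), "user_id")
        .agg(F.collect_list(F.struct("event_time", "payload")).alias("items")))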

Latest Reply
917074
New Contributor II
  • 0 kudos

Is there any solution to this? @MarsSu, were you able to solve it? Kindly shed some light on this if you did.

2 More Replies
lnights
by New Contributor II
  • 4506 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub and, after manipulating the data, I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Attachment: transactions in azure storage]
Latest Reply
PetePP
New Contributor II
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...
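
Concretely, the fix the reply describes amounts to lowering the shuffle partition count (the default of 200 is often far too high for low-volume streams), or coalescing before the sink. A minimal sketch with hypothetical values and paths:

    # Fewer shuffle partitions -> fewer output files per micro-batch
    # -> fewer storage transactions.
    spark.conf.set("spark.sql.shuffle.partitions", "8")  # illustrative value

    # Alternatively, coalesce just before writing inside foreachBatch:
    def write_batch(batch_df, batch_id):
        batch_df.coalesce(1).write.format("delta").mode("append").save("/mnt/out")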

4 More Replies
MarsSu
by New Contributor II
  • 8833 Views
  • 5 replies
  • 1 kudos

Resolved! Zero-downtime deployment of a Spark Structured Streaming Databricks job with Terraform.

I would like to ask how to implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute with Terraform, because we will upgrade the Spark application code version. Currently we find that every deployment cancels the original...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Mars Su: Yes, you can implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute using Terraform. One way to achieve this is by using Databricks' "job clusters" feature, which allows you to create a cluster specifica...
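
Whatever the Terraform wiring looks like, the streaming side of a safe handoff usually reduces to a stable checkpoint: the new job run resumes exactly where the cancelled one left off. A minimal sketch with hypothetical names:

    # Both the old and new job revisions point at the same checkpoint,
    # so a redeploy resumes from the last committed offsets.
    query = (df.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/orders")  # stable per pipeline
             .outputMode("append")
             .toTable("silver.orders"))  # hypothetical target table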

4 More Replies
pranathisg97
by New Contributor III
  • 3011 Views
  • 2 replies
  • 1 kudos

readStream query throws exception if there's no data in delta location.

Hi, I have a scenario where a writeStream query writes the stream data to a bronze location, and I have to read from bronze, do some processing, and finally write it to silver. I use S3 locations for the Delta tables. But on the very first execution, readStream ...
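
One common workaround is to materialize an empty Delta table before the first readStream, so the source finds a valid transaction log. A minimal sketch, assuming hypothetical paths and schema:

    from delta.tables import DeltaTable
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    bronze_path = "s3://my-bucket/bronze/events"  # hypothetical location
    schema = StructType([
        StructField("id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Create an empty Delta table up front so the very first readStream
    # finds a _delta_log instead of throwing.
    if not DeltaTable.isDeltaTable(spark, bronze_path):
        spark.createDataFrame([], schema).write.format("delta").save(bronze_path)

    bronze = spark.readStream.format("delta").load(bronze_path)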

Latest Reply
Vartika
Databricks Employee
  • 1 kudos

Hi @Pranathi Girish, hope all is well! Checking in: if @Suteja Kanuri's answer helped, would you let us know and mark it as best? If not, would you be happy to give us more information? We'd love to hear from you. Thanks!

1 More Replies
adrianlwn
by New Contributor III
  • 12081 Views
  • 14 replies
  • 16 kudos

How to activate ignoreChanges in a Delta Live Tables read_stream?

Hello everyone, I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that will read the DLT as a stream (dlt.read_stream("<tablename>")). I keep receiving thi...

Latest Reply
gopínath
New Contributor II
  • 16 kudos

In a DLT read_stream we can't use ignoreChanges / ignoreDeletes. These configs help avoid the failures, but they actually ignore the operations done upstream. So you need to manually perform the deletes or updates in the downstrea...
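
On newer runtimes (DBR 12.1+), skipChangeCommits may be worth evaluating as an alternative: it skips update/delete commits entirely rather than replaying them. A hedged sketch with a hypothetical upstream table name; verify it fits your dedup semantics before relying on it:

    import dlt

    @dlt.table
    def downstream():
        # skipChangeCommits ignores commits that only rewrite existing
        # rows, instead of failing or reprocessing them.
        return (spark.readStream
                .option("skipChangeCommits", "true")
                .table("LIVE.upstream_table"))  # hypothetical upstream table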

13 More Replies
RateVan
by New Contributor II
  • 2338 Views
  • 3 replies
  • 0 kudos

Spark's last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop incoming streaming data, the last window will NEVER...
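
For reference, a minimal sketch of the setup that exhibits this, assuming a streaming DataFrame events with a ts timestamp column: in append mode a window is emitted only once the watermark passes its end, and the watermark only advances when new events arrive.

    from pyspark.sql import functions as F

    # The final 5-minute window below stays open forever if the stream
    # goes quiet, because nothing advances the watermark past its end.
    agg = (events
        .withWatermark("ts", "1 minute")
        .groupBy(F.window("ts", "5 minutes"))
        .count())

    query = agg.writeStream.outputMode("append").format("console").start()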

Latest Reply
RateVan
New Contributor II
  • 0 kudos

No, the problem remains the same. Increasing the timeout a little doesn't change anything: the window did not close, and it does not close until a new message arrives.

2 More Replies
pranathisg97
by New Contributor III
  • 1476 Views
  • 2 replies
  • 0 kudos

KinesisSource generates empty microbatches when there is no new data.

Is it normal for KinesisSource to generate empty micro-batches when there is no new data in Kinesis? Batch 1 finished as there were records in Kinesis, and BatchId 2 started. BatchId 2 was running, but then BatchId 3 started. Even though there was no m...
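
If the empty batches themselves are the concern (for example, because they trigger downstream work), one hedged option is to short-circuit them in foreachBatch; the sink path here is hypothetical:

    # Skip micro-batches that carry no records.
    def process_batch(batch_df, batch_id):
        if batch_df.isEmpty():  # Spark 3.3+; use batch_df.rdd.isEmpty() on older versions
            return
        batch_df.write.format("delta").mode("append").save("/mnt/out")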

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pranathi Girish, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
pranathisg97
by New Contributor III
  • 3227 Views
  • 7 replies
  • 0 kudos

Resolved! Fetch new data from Kinesis every minute.

I want to fetch new data from the Kinesis source every minute. I'm using the "minFetchPeriod" option and specified 60s, but this doesn't seem to be working. Streaming query: spark \ .readStream \ .format("kinesis") \ .option("streamName", kinesis_stream_...
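
Note that minFetchPeriod throttles how often the connector polls the shards, while pacing the micro-batches themselves is normally done with a processing-time trigger. A hedged sketch combining both, with hypothetical names and paths:

    query = (spark.readStream
             .format("kinesis")
             .option("streamName", "my-stream")       # hypothetical stream
             .option("minFetchPeriod", "60s")         # throttle shard polling
             .load()
             .writeStream
             .trigger(processingTime="60 seconds")    # one micro-batch per minute
             .format("delta")
             .option("checkpointLocation", "/chk/kinesis")  # hypothetical path
             .start("/mnt/out"))                      # hypothetical sink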

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Pranathi Girish, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking "Select As Best" if it does. Your feedb...

6 More Replies
Lulka
by New Contributor II
  • 4444 Views
  • 2 replies
  • 2 kudos

Resolved! How to limit the input rate when reading a Delta table as a stream?

Hello, everyone! I am trying to read a Delta table as a streaming source using Spark, but my micro-batches are unbalanced: some are very small and others are very large. How can I limit this? I used different configurations with maxBytesPerTrigger and m...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Besides the parameters you mention, I don't know of any other that controls the batch size. Did you check whether the Delta table is horribly skewed?
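
A quick way to check for that skew is to look at the table's file-size distribution, since the rate-limit options count files and bytes, not rows. A sketch with a hypothetical path:

    # Table-level summary: file count and total size.
    spark.sql("DESCRIBE DETAIL delta.`/mnt/tables/events`") \
        .select("numFiles", "sizeInBytes").show()

    # Per-file sizes from the transaction log, to spot a few huge files
    # among many tiny ones.
    log = spark.read.json("/mnt/tables/events/_delta_log/*.json")
    log.where("add is not null").selectExpr("add.path", "add.size").show(truncate=False)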

1 More Replies
chanansh
by Contributor
  • 1340 Views
  • 1 reply
  • 0 kudos

QueryExecutionListener cannot be found in PySpark

According to the documentation, you can monitor a Spark Structured Streaming job using QueryExecutionListener. However, I cannot find it: https://docs.databricks.com/structured-streaming/stream-monitoring.html#language-python

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Which DBR version are you using? Also, can you share a code snippet of how you are using the QueryExecutionListener?
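
For what it's worth, the Python stream-monitoring doc linked in the question describes StreamingQueryListener; QueryExecutionListener is a JVM-side API with no PySpark equivalent. A minimal sketch of the Python listener (exposed to Python in Spark 3.4+ / recent DBR):

    from pyspark.sql.streaming import StreamingQueryListener

    class MyListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            print(f"started: {event.id}")
        def onQueryProgress(self, event):
            print(f"rows/sec: {event.progress.processedRowsPerSecond}")
        def onQueryTerminated(self, event):
            print(f"terminated: {event.id}")

    spark.streams.addListener(MyListener())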

chanansh
by Contributor
  • 1189 Views
  • 1 reply
  • 0 kudos

Running the stateful Spark streaming example from https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html fails

ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side. Traceback (most recent call last): File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 617, in _call_proxy retu...

Latest Reply
Debayan
Databricks Employee
  • 0 kudos

Hi, the error indicates the failure happened while executing the Python proxy on the Python side. Could you please provide the whole error snippet?
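
For comparison while debugging, here is a minimal shape of applyInPandasWithState that runs on recent runtimes (names, schemas, and the events DataFrame are illustrative); tracebacks like the one above often come from a mismatch between the user function's signature and the declared output/state schemas:

    from typing import Iterator, Tuple
    import pandas as pd
    from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

    def count_per_key(key: Tuple, pdfs: Iterator[pd.DataFrame],
                      state: GroupState) -> Iterator[pd.DataFrame]:
        # Carry a running count per key across micro-batches.
        total = state.get[0] if state.exists else 0
        for pdf in pdfs:
            total += len(pdf)
        state.update((total,))
        yield pd.DataFrame({"id": [key[0]], "count": [total]})

    counts = events.groupBy("id").applyInPandasWithState(
        count_per_key,
        outputStructType="id STRING, count LONG",
        stateStructType="count LONG",
        outputMode="update",
        timeoutConf=GroupStateTimeout.NoTimeout)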

Leszek
by Contributor
  • 1034 Views
  • 1 reply
  • 3 kudos

How to handle schema changes in streaming Delta Tables?

I'm using Structured Streaming to move data from one Delta table to another. How do I handle schema changes in those tables (e.g. adding a new column)?

Latest Reply
Murthy1
Contributor II
  • 3 kudos

Hello, I think the only way of handling this is to specify the schema within the job through a schema file. The other way is to restart the job so it infers the new schema automatically.
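
For the additive case (a new column), the Delta sink can evolve the target schema on write, though the stream still needs a restart to pick up the new source schema. A hedged sketch with hypothetical paths:

    query = (bronze_stream.writeStream
             .format("delta")
             .option("mergeSchema", "true")                # additive schema evolution on the sink
             .option("checkpointLocation", "/chk/silver")  # hypothetical path
             .start("/mnt/silver/events"))                 # hypothetical path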

gauthamchettiar
by New Contributor II
  • 1702 Views
  • 0 replies
  • 1 kudos

Spark always performs broadcasts irrespective of spark.sql.autoBroadcastJoinThreshold during a streaming merge operation with DeltaTable.

I am trying to do a streaming merge between Delta tables using this guide: https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch. Our code sample (Java): Dataset<Row> sourceDf = sparkSession ...
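
For context, the documented foreachBatch upsert pattern looks like this in PySpark (the post uses Java). The broadcast-threshold override inside the batch is only a hedged experiment, since whether Delta's MERGE honors it can vary by runtime version, and all names and paths are illustrative:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/target")  # hypothetical path

    def upsert(batch_df, batch_id):
        # Experiment: disable auto-broadcast for the joins in this batch.
        spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
        (target.alias("t")
            .merge(batch_df.alias("s"), "t.id = s.id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    (source_stream.writeStream
        .foreachBatch(upsert)
        .option("checkpointLocation", "/chk/merge")  # hypothetical path
        .start())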

[Attachment: BroadCastJoin 1M]
mattjones
by New Contributor II
  • 534 Views
  • 0 replies
  • 0 kudos

DEC 13 MEETUP: Arbitrary Stateful Stream Processing in PySpark

For folks in the Bay Area: Dr. Karthik Ramasamy, Databricks' Head of Streaming, will be joined by engineering experts on the streaming and PySpark teams at Databricks for this in-person me...
