Data Engineering

Forum Posts

MarsSu (New Contributor II)
  • 5614 Views
  • 3 replies
  • 0 kudos

How to merge multiple rows into a single row with an array without causing OOM?

Hi, everyone. Currently I am trying to implement Spark Structured Streaming with PySpark, and I would like to merge multiple rows into a single row with an array and sink it to a downstream message queue for another service to use. A related example follows: * Befor...

Latest Reply: 917074 (New Contributor II)
  • 0 kudos

Is there any solution to this, @MarsSu? Were you able to solve it? If so, kindly shed some light on how you resolved it.

2 More Replies
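
The usual pattern for this question is a watermarked groupBy with collect_list, so aggregation state stays bounded and old groups are evicted instead of growing until OOM. A minimal PySpark sketch; the schema, paths, and Kafka settings are hypothetical:

    from pyspark.sql import functions as F

    # Hypothetical source: events with (order_id, item, event_time).
    events = (
        spark.readStream.format("delta").load("/bronze/events")
             .withWatermark("event_time", "10 minutes")  # bounds aggregation state
    )

    # Merge the rows of each key into a single row holding an array.
    merged = (
        events.groupBy(F.window("event_time", "5 minutes"), "order_id")
              .agg(F.collect_list("item").alias("items"))
    )

    # Sink each finalized row to a downstream queue (Kafka here).
    query = (
        merged.selectExpr("to_json(struct(*)) AS value")
              .writeStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
              .option("topic", "merged-orders")                  # hypothetical
              .option("checkpointLocation", "/chk/merge")        # hypothetical
              .outputMode("append")
              .start()
    )

The watermark is the key piece: without it, the streaming aggregation keeps every group in state forever, which is the typical cause of the OOM described.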
lnights (New Contributor II)
  • 2114 Views
  • 5 replies
  • 2 kudos

High cost of storage when using structured streaming

Hi there, I read data from Azure Event Hub, and after manipulating the data I write the dataframe back to Event Hub (I use this connector for that): #read data df = (spark.readStream .format("eventhubs") .options(**ehConf) ...

[Attachment: transactions in azure storage]
Latest Reply: PetePP (New Contributor II)
  • 2 kudos

I had the same problem when starting with Databricks. As outlined above, it is the shuffle partitions setting that results in a number of files equal to the number of partitions. Thus, you are writing a low data volume but get taxed on the amount of write (a...

4 More Replies
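
To illustrate the fix PetePP describes: lower the shuffle partition count, or compact inside foreachBatch, so each micro-batch writes a handful of files rather than the default 200. A sketch with hypothetical paths:

    # Each micro-batch otherwise writes one file per shuffle partition
    # (200 by default), which inflates storage-transaction costs.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    # Or compact explicitly just before the sink with foreachBatch:
    def write_batch(batch_df, batch_id):
        (batch_df.coalesce(1)                 # one output file per batch
                 .write.mode("append")
                 .format("delta")
                 .save("/mnt/silver/out"))    # hypothetical path

    query = (
        df.writeStream
          .foreachBatch(write_batch)
          .option("checkpointLocation", "/chk/out")  # hypothetical path
          .start()
    )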
MarsSu (New Contributor II)
  • 3218 Views
  • 5 replies
  • 1 kudos

Resolved! Zero-downtime deployment of a Spark Structured Streaming Databricks job with Terraform.

I would like to ask how to implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute with Terraform, because we will upgrade the Spark application code version. Currently we have found that every deployment will cancel the original...

Latest Reply: Anonymous (Not applicable)
  • 1 kudos

@Mars Su​: Yes, you can implement zero-downtime deployment of Spark Structured Streaming on Databricks job compute using Terraform. One way to achieve this is by using Databricks' "job clusters" feature, which allows you to create a cluster specifica...

4 More Replies
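
The Spark-side piece that makes any such deployment scheme safe is the checkpoint: a query restarted by the new job version against the same checkpoint location resumes from the committed offsets. A minimal sketch, with hypothetical paths; the Terraform wiring itself is out of scope here:

    # Old and new job versions point at the same checkpoint, so the
    # replacement resumes exactly where the cancelled query stopped.
    query = (
        df.writeStream
          .format("delta")
          .option("checkpointLocation", "/chk/app")  # stable across versions
          .start("/silver/out")                      # hypothetical path
    )

    # Stop the running query before the new version starts; the checkpoint
    # lets the restarted query pick up without data loss.
    query.stop()
    query.awaitTermination()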
pranathisg97 (New Contributor III)
  • 1546 Views
  • 2 replies
  • 1 kudos

readStream query throws an exception if there's no data in the Delta location.

Hi, I have a scenario where a writeStream query writes the stream data to a bronze location, and I have to read from bronze, do some processing, and finally write it to silver. I use S3 locations for the Delta tables. But for the very first execution, readStream ...

Latest Reply: Vartika (Moderator)
  • 1 kudos

Hi @Pranathi Girish​, hope all is well! Checking in: if @Suteja Kanuri​'s answer helped, would you let us know and mark it as best? If not, would you be happy to give us more information? We'd love to hear from you. Thanks!

1 More Replies
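
One common workaround for the question itself: create the bronze table with an explicit schema before the first readStream, so the stream finds a valid Delta log even when no data has landed yet. A sketch using the DeltaTable builder API; the path and columns are hypothetical:

    from delta.tables import DeltaTable

    bronze_path = "s3://my-bucket/bronze"  # hypothetical path

    # Ensure a valid (possibly empty) Delta table exists before streaming.
    (DeltaTable.createIfNotExists(spark)
        .location(bronze_path)
        .addColumn("id", "STRING")
        .addColumn("event_time", "TIMESTAMP")
        .execute())

    df = spark.readStream.format("delta").load(bronze_path)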
adrianlwn (New Contributor III)
  • 6660 Views
  • 18 replies
  • 17 kudos

How to activate ignoreChanges in a Delta Live Tables read_stream?

Hello everyone, I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that will read the DLT as a stream (dlt.read_stream("<tablename>")). I keep receiving thi...

Latest Reply: gopínath (New Contributor II)
  • 17 kudos

In DLT read_stream, we can't use ignoreChanges / ignoreDeletes. These configs help avoid the failures, but they actually ignore the operations done on the upstream table. So you need to manually perform the deletes or updates in the downstrea...

17 More Replies
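
For comparison, outside DLT a plain Delta streaming source does accept these options; the caveat is that ignoreChanges can re-emit unchanged rows from rewritten files, so the consumer must tolerate duplicates. A sketch with a hypothetical path:

    # Plain Structured Streaming read from a Delta table that receives
    # updates/deletes upstream; not available inside dlt.read_stream.
    df = (
        spark.readStream.format("delta")
             .option("ignoreChanges", "true")  # or "ignoreDeletes"
             .load("/tables/deduped")          # hypothetical path
    )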
RateVan (New Contributor II)
  • 1230 Views
  • 3 replies
  • 0 kudos

Spark's last window doesn't flush in append mode

The problem is very simple: when you use a TUMBLING window with append mode, the window is closed only when the next message arrives (plus watermark logic). In the current implementation, if you stop the incoming streaming data, the last window will NEVER...

Latest Reply: RateVan (New Contributor II)
  • 0 kudos

No, the problem remains the same; the meaning doesn't change just because you increased the timeout a little. The window did not close, and it will not close until a new message arrives.

2 More Replies
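
The behavior described follows from how append mode works: a window is emitted only once the watermark passes its end, and the watermark advances only when newer events arrive. A sketch that reproduces it; the column names are hypothetical:

    from pyspark.sql import functions as F

    windowed = (
        events.withWatermark("ts", "1 minute")       # event-time watermark
              .groupBy(F.window("ts", "5 minutes"))  # tumbling window
              .count()
    )

    # In append mode a window is emitted only after a newer event pushes
    # the watermark past the window's end; if input stops, the final
    # window is never emitted.
    query = (
        windowed.writeStream
                .outputMode("append")
                .format("console")
                .option("checkpointLocation", "/chk/win")  # hypothetical
                .start()
    )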
Data_Engineer3 (Contributor II)
  • 1511 Views
  • 2 replies
  • 0 kudos

What is the default maximum Spark streaming chunk size for Delta files in each batch?

Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL]​ #[Spark streaming]​ #[Spark structured streaming]​ #Spark​

Latest Reply: NandiniN (Valued Contributor II)
  • 0 kudos

Hello @KARTHICK N​, the default value for spark.sql.files.maxPartitionBytes is 128 MB. These defaults are in the Apache Spark documentation, https://spark.apache.org/docs/latest/sql-performance-tuning.html (unless there are overrides). To che...

1 More Replies
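
To check the value NandiniN mentions, and to see the source-side option that actually bounds a Delta streaming micro-batch, a short sketch; the path is hypothetical:

    # Effective setting on this cluster; Spark's documented default is 128 MB.
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    # For a Delta streaming source, per-batch size is governed by the
    # source options rather than by maxPartitionBytes:
    df = (
        spark.readStream.format("delta")
             .option("maxFilesPerTrigger", 1000)  # files per micro-batch
             .load("/tables/bronze")              # hypothetical path
    )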
pranathisg97 (New Contributor III)
  • 914 Views
  • 2 replies
  • 0 kudos

KinesisSource generates empty microbatches when there is no new data.

Is it normal for KinesisSource to generate empty microbatches when there is no new data in Kinesis? Batch 1 finished as there were records in Kinesis, and BatchId 2 started. BatchId 2 was running, but then BatchId 3 started, even though there was no m...

Latest Reply: Anonymous (Not applicable)
  • 0 kudos

Hi @Pranathi Girish​, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answe...

1 More Replies
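
Whatever the source does, a foreachBatch sink can simply skip the empty batches. A sketch, assuming PySpark 3.3+ for DataFrame.isEmpty; the paths are hypothetical:

    def process(batch_df, batch_id):
        # Skip the empty micro-batches the source emits when idle.
        if batch_df.isEmpty():
            return
        batch_df.write.mode("append").format("delta").save("/silver/out")  # hypothetical

    query = (
        df.writeStream
          .foreachBatch(process)
          .option("checkpointLocation", "/chk/kinesis")  # hypothetical
          .start()
    )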
pranathisg97 (New Contributor III)
  • 1417 Views
  • 7 replies
  • 0 kudos

Resolved! Fetch new data from Kinesis every minute.

I want to fetch new data from the Kinesis source every minute. I'm using the "minFetchPeriod" option and specified 60s, but this doesn't seem to be working. Streaming query: spark \ .readStream \ .format("kinesis") \ .option("streamName", kinesis_stream_...

Latest Reply: Anonymous (Not applicable)
  • 0 kudos

Hi @Pranathi Girish​, thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs. Please help us select the best solution by clicking "Select As Best" if it does. Your feedb...

6 More Replies
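
An alternative to connector fetch tuning is to control cadence on the query side with a processing-time trigger, which schedules one micro-batch per minute regardless of how often the connector polls. A sketch; the stream name, region, and paths are hypothetical:

    query = (
        spark.readStream.format("kinesis")
             .option("streamName", "my-stream")  # hypothetical
             .option("region", "us-east-1")      # hypothetical
             .load()
             .writeStream
             .trigger(processingTime="60 seconds")  # one batch per minute
             .format("delta")
             .option("checkpointLocation", "/chk/kinesis")  # hypothetical
             .start("/bronze/kinesis")                      # hypothetical
    )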
Lulka (New Contributor II)
  • 2307 Views
  • 4 replies
  • 2 kudos

Resolved! How to limit the input rate when reading a Delta table as a stream?

Hello everyone! I am trying to read a Delta table as a streaming source using Spark, but my micro-batches are unbalanced: some are very small and others are very large. How can I limit this? I used different configurations with maxBytesPerTrigger and m...

Latest Reply: Kaniz (Community Manager)
  • 2 kudos

Hi @Yuliya Valava​, if you are setting the maxBytesPerTrigger and maxFilesPerTrigger options when reading a Delta table as a stream but the batch size is not changing, there could be a few reasons for this: the input data rate is not exceeding the li...

3 More Replies
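
For reference, this is where those options go. Both are soft per-batch caps on the Delta source; a batch can still exceed the byte cap when a single file is larger than it. The path is hypothetical:

    df = (
        spark.readStream.format("delta")
             .option("maxFilesPerTrigger", 50)      # cap files per micro-batch
             .option("maxBytesPerTrigger", "256m")  # soft cap on bytes per batch
             .load("/tables/source")                # hypothetical path
    )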
chanansh (Contributor)
  • 806 Views
  • 1 reply
  • 0 kudos

QueryExecutionListener cannot be found in PySpark

According to the documentation, you can monitor a Spark Structured Streaming job using QueryExecutionListener; however, I cannot find it: https://docs.databricks.com/structured-streaming/stream-monitoring.html#language-python

Latest Reply: jose_gonzalez (Moderator)
  • 0 kudos

Which DBR version are you using? Also, can you share a code snippet showing how you are using the QueryExecutionListener?

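
QueryExecutionListener is a JVM (Scala/Java) API, which is why it cannot be imported from Python. The Python-facing equivalent for streaming is StreamingQueryListener, available in recent PySpark/DBR versions. A minimal sketch:

    from pyspark.sql.streaming import StreamingQueryListener

    class ProgressLogger(StreamingQueryListener):
        def onQueryStarted(self, event):
            print(f"query started: {event.id}")

        def onQueryProgress(self, event):
            # One progress event per completed micro-batch.
            print(f"rows/sec: {event.progress.processedRowsPerSecond}")

        def onQueryTerminated(self, event):
            print(f"query terminated: {event.id}")

    spark.streams.addListener(ProgressLogger())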
chanansh (Contributor)
  • 672 Views
  • 1 reply
  • 0 kudos

Running the stateful Spark streaming example from https://www.databricks.com/blog/2022/10/18/python-arbitrary-stateful-processing-structured-streaming.html fails

ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side. Traceback (most recent call last): File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 617, in _call_proxy retu...

Latest Reply: Debayan (Esteemed Contributor III)
  • 0 kudos

Hi, it looks like the failure came from the Python configuration. Could you please provide the whole error snippet?

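
For anyone reproducing the blog post, the API it demonstrates is applyInPandasWithState (Spark 3.4+ / recent DBR). A stripped-down sketch with hypothetical column names, which may help isolate whether the failure is inside the handler function:

    import pandas as pd
    from typing import Iterator, Tuple
    from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

    def count_events(
        key: Tuple[str],
        batches: Iterator[pd.DataFrame],
        state: GroupState,
    ) -> Iterator[pd.DataFrame]:
        # Carry a running per-key count across micro-batches.
        (count,) = state.getOption or (0,)
        for pdf in batches:
            count += len(pdf)
        state.update((count,))
        yield pd.DataFrame({"id": [key[0]], "count": [count]})

    result = df.groupBy("id").applyInPandasWithState(
        count_events,
        outputStructType="id STRING, count LONG",
        stateStructType="count LONG",
        outputMode="append",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )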
Leszek (Contributor)
  • 580 Views
  • 1 reply
  • 3 kudos

How to handle schema changes in streaming Delta Tables?

I'm using Structured Streaming to move data from one Delta table to another. How do I handle schema changes in those tables (e.g. adding a new column)?

Latest Reply: Murthy1 (Contributor II)
  • 3 kudos

Hello, I think one way of handling this is to specify the schema within the job through a schema file. The other way is to restart the job so it infers the new schema automatically.

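
Concretely, for the common case of an added column: restart the query and let the Delta sink merge the new column. A sketch with hypothetical paths; the source schema is fixed when the query starts, hence the restart:

    query = (
        df.writeStream
          .format("delta")
          .option("mergeSchema", "true")                # sink accepts the added column
          .option("checkpointLocation", "/chk/silver")  # hypothetical
          .start("/tables/silver")                      # hypothetical
    )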
gauthamchettiar (New Contributor II)
  • 824 Views
  • 0 replies
  • 1 kudos

Spark always performs broadcasts irrespective of spark.sql.autoBroadcastJoinThreshold during a streaming merge operation with DeltaTable.

I am trying to do a streaming merge between Delta tables using this guide: https://docs.delta.io/latest/delta-update.html#upsert-from-streaming-queries-using-foreachbatch. Our code sample (Java): Dataset<Row> sourceDf = sparkSession ...

[Attachment: BroadCastJoin 1M]
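
For context, the foreachBatch upsert pattern from the linked guide looks like this in PySpark, with auto-broadcast disabled while diagnosing; per the post, Delta's MERGE may still broadcast regardless of this threshold. The paths and the join key are hypothetical:

    from delta.tables import DeltaTable

    # Disable auto-broadcast while investigating; the post reports that
    # MERGE can still broadcast irrespective of this setting.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    target = DeltaTable.forPath(spark, "/tables/target")  # hypothetical path

    def upsert(batch_df, batch_id):
        (target.alias("t")
               .merge(batch_df.alias("s"), "t.id = s.id")  # hypothetical key
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    query = (
        source_df.writeStream
                 .foreachBatch(upsert)
                 .option("checkpointLocation", "/chk/merge")  # hypothetical
                 .start()
    )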
mattjones (New Contributor II)
  • 268 Views
  • 0 replies
  • 0 kudos

www.meetup.com

DEC 13 MEETUP: Arbitrary Stateful Stream Processing in PySpark. For folks in the Bay Area: Dr. Karthik Ramasamy, Databricks' Head of Streaming, will be joined by engineering experts on the streaming and PySpark teams at Databricks for this in-person me...
