Data Engineering

Forum Posts

Abdus
by New Contributor
  • 582 Views
  • 1 replies
  • 0 kudos

Apache Spark Streaming

When was the last commit made to Spark Streaming?

Latest Reply
Kaniz
Community Manager
  • 0 kudos

Hi @Abdus! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your questions first. Or else I will follow up shortly with a response.

  • 0 kudos
User16826994223
by Honored Contributor III
  • 712 Views
  • 1 replies
  • 0 kudos

Delta Table to Spark Streaming to Synapse Table in Azure Databricks

Is there a way to keep my Synapse database always in sync with the latest data from a Delta table? I believe my Synapse database doesn't support a stream as a sink. Is there a workaround?

Latest Reply
User16826994223
Honored Contributor III
  • 0 kudos

You could try to keep the data in sync by appending each new micro-batch DataFrame in a foreachBatch on your write stream. This method allows arbitrary ways of writing data, and you can connect to the data warehouse with JDBC if necessary: with your batch functi...
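A minimal sketch of the foreachBatch approach from the reply above, assuming PySpark and a warehouse reachable over JDBC. The helper name `make_jdbc_batch_writer`, the JDBC URL, and the table name are hypothetical placeholders; only `foreachBatch`, the `jdbc` format, and the `url`/`dbtable` options are standard Spark APIs.

```python
def make_jdbc_batch_writer(jdbc_url, table):
    """Return a function suitable for DataStreamWriter.foreachBatch.

    jdbc_url and table are placeholders (assumptions, not from the post).
    """
    def write_batch(batch_df, batch_id):
        # Each micro-batch arrives as a regular (non-streaming) DataFrame,
        # so the ordinary batch JDBC writer can be used to append it.
        (batch_df.write
            .format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .mode("append")
            .save())
    return write_batch


# Usage sketch (needs a running Spark session and a streaming DataFrame):
# writer = make_jdbc_batch_writer("jdbc:sqlserver://HOST;database=DB", "dbo.target")
# (stream_df.writeStream
#     .foreachBatch(writer)
#     .option("checkpointLocation", "path/to/checkpoint/dir")
#     .start())
```

Note that foreachBatch gives at-least-once delivery, so a plain append may write a batch twice after a failure; the `batch_id` argument can be used to deduplicate if that matters for the target table.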

  • 0 kudos
User16869510359
by Esteemed Contributor
  • 622 Views
  • 1 replies
  • 0 kudos

Resolved! Why is my streaming job not resuming even though I specified a checkpoint directory?

I have provided the checkpointLocation as below; however, I see the config is ignored for my streaming query: option("checkpointLocation", "path/to/checkpoint/dir")

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

This is a common question from many users. If the streaming checkpoint directory is specified correctly then this behavior is expected. Below is an example of specifying the checkpoint correctly: df.writeStream.format("parquet").option("checkpo...

  • 0 kudos
User16869510359
by Esteemed Contributor
  • 671 Views
  • 1 replies
  • 0 kudos

Resolved! Is there any way to control the autoOptimize interval?

I see my streaming jobs running optimize jobs frequently. Is there a property I can use to control the autoOptimize interval?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

autoOptimize is not performed on a time basis; it is an event-based trigger. Once the Delta table/partition has 50 files (the default value of spark.databricks.delta.autoCompact.minNumFiles), auto-compaction is triggered. To reduce the frequency, inc...
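To make the event-based behavior concrete, here is a toy model in plain Python. It is not Delta's implementation: the function names are hypothetical, and it assumes compaction merges all accumulated small files into a single file. The threshold of 50 mirrors the default quoted above for spark.databricks.delta.autoCompact.minNumFiles.

```python
def should_auto_compact(num_files, min_num_files=50):
    """Toy trigger: fire once the file count reaches the threshold."""
    return num_files >= min_num_files


def count_compactions(files_per_commit, num_commits, min_num_files=50):
    """Count how often the toy trigger fires over a series of commits."""
    num_files = 0
    compactions = 0
    for _ in range(num_commits):
        num_files += files_per_commit
        if should_auto_compact(num_files, min_num_files):
            compactions += 1
            num_files = 1  # assumption: everything is merged into one file
    return compactions


# Same write pattern, two thresholds: the trigger depends on file count, not time.
print(count_compactions(files_per_commit=10, num_commits=100))
print(count_compactions(files_per_commit=10, num_commits=100, min_num_files=200))
```

In this model the number of compactions depends only on how quickly files accumulate relative to the threshold, which is why a busy streaming job can appear to optimize "often" without any timer being involved.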

  • 0 kudos
User16869510359
by Esteemed Contributor
  • 1080 Views
  • 1 replies
  • 0 kudos

Resolved! Why do I see data loss with Structured Streaming jobs?

I have a Spark Structured Streaming job reading data from Kafka and loading it into a Delta table. I apply some transformations and aggregations to the streaming data before writing to the Delta table.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The typical reason for data loss in a Structured Streaming application is an incorrect watermark value. Watermarking ensures the application does not keep state for a long period. However, it should be ensured ...
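The effect of the watermark value can be illustrated with a small plain-Python simulation. This is a toy model, not Spark's implementation: `run_batches` is a hypothetical helper, and it assumes the watermark is the maximum event time seen in earlier batches minus the configured delay, with older events dropped as too late.

```python
from datetime import datetime, timedelta


def run_batches(batches, delay):
    """Toy model of event-time watermarking across micro-batches."""
    max_event_time = None
    kept, dropped = [], []
    for batch in batches:
        watermark = None if max_event_time is None else max_event_time - delay
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)  # too late: silently discarded
            else:
                kept.append(event_time)
        # The watermark only advances after the batch has been processed.
        if batch:
            batch_max = max(batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
    return kept, dropped


t0 = datetime(2021, 1, 1, 12, 0)
batches = [
    [t0, t0 + timedelta(minutes=10)],
    [t0 + timedelta(minutes=3)],  # arrives 7 minutes behind the max seen so far
]
_, dropped_short = run_batches(batches, delay=timedelta(minutes=5))
_, dropped_long = run_batches(batches, delay=timedelta(minutes=15))
print(len(dropped_short), len(dropped_long))
```

With the shorter delay the late event falls behind the watermark and is dropped; with the longer delay it is kept, at the cost of holding state for longer.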

  • 0 kudos
User16869510359
by Esteemed Contributor
  • 1683 Views
  • 1 replies
  • 0 kudos

Resolved! Is it mandatory to checkpoint my streaming query?

I have ad-hoc, one-time streaming queries where I believe checkpointing won't add any value. Should I still use checkpointing?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

It's not mandatory. But the strong recommendation is to use checkpointing for streaming irrespective of your use case. This is because the default checkpoint location can accumulate a lot of files over time, as there is no guaranteed graceful cleanup in pla...

  • 0 kudos
User16783855534
by New Contributor III
  • 872 Views
  • 2 replies
  • 0 kudos

Should/Can I use Spark streaming for batch workloads?

It's preferable to use Spark streaming (with Delta) for batch workloads rather than regular batch. With the Trigger.Once trigger, whenever the streaming job is started it will process whatever is available in the source (Kafka/Kinesis/file system) and ...

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The streaming checkpoint mechanism is independent of the trigger type. The way the checkpoint works is that it creates an offset file when processing the batch, and once the batch is completed it creates a commit file for that batch in the checkpoint director...
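The offset-then-commit sequence described above can be sketched as a toy model in plain Python. The `offsets` and `commits` directory names follow the layout of a Structured Streaming checkpoint; the helper names and the empty marker files are simplifying assumptions, not Spark's actual file contents.

```python
import os
import tempfile


def run_batch(checkpoint_dir, batch_id, process):
    """Write an offset marker, process the batch, then write a commit marker."""
    offsets = os.path.join(checkpoint_dir, "offsets")
    commits = os.path.join(checkpoint_dir, "commits")
    os.makedirs(offsets, exist_ok=True)
    os.makedirs(commits, exist_ok=True)
    # 1. Record what the batch is going to process before processing it.
    with open(os.path.join(offsets, str(batch_id)), "w"):
        pass
    process(batch_id)
    # 2. Mark the batch as done only after processing succeeded.
    with open(os.path.join(commits, str(batch_id)), "w"):
        pass


def next_batch_to_run(checkpoint_dir):
    """On restart: rerun an uncommitted batch, else continue with the next id."""
    def ids(sub):
        path = os.path.join(checkpoint_dir, sub)
        return {int(n) for n in os.listdir(path)} if os.path.isdir(path) else set()
    planned, committed = ids("offsets"), ids("commits")
    pending = planned - committed
    if pending:
        return min(pending)
    return max(committed) + 1 if committed else 0


def failing(batch_id):
    raise RuntimeError("simulated crash mid-batch")


checkpoint = tempfile.mkdtemp()
run_batch(checkpoint, 0, lambda batch_id: None)
try:
    run_batch(checkpoint, 1, failing)
except RuntimeError:
    pass
# Batch 1 has an offset file but no commit file, so it is the one to rerun.
print(next_batch_to_run(checkpoint))
```

Because the protocol only looks at which batches were planned and which were committed, it behaves the same whether batches are started by a one-time trigger or at fixed intervals.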

  • 0 kudos
1 More Replies
User16869510359
by Esteemed Contributor
  • 602 Views
  • 1 replies
  • 0 kudos

How to migrate to Auto Loader without downtime?

I have an S3-SQS workload. Is it possible to migrate the workload to Auto Loader without downtime? What are the migration guidelines?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

The SQS queue used by the existing application can be reused by Auto Loader, thereby ensuring minimal downtime.

  • 0 kudos
User16869510359
by Esteemed Contributor
  • 745 Views
  • 2 replies
  • 0 kudos

Why should I move to Auto Loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

That makes sense, @Anand Ladda! One major improvement that will directly impact performance is the architectural difference. S3-SQS uses an internal implementation of the Delta table to store the checkpoint details about the source files...

  • 0 kudos
1 More Replies
User16869510359
by Esteemed Contributor
  • 1049 Views
  • 2 replies
  • 0 kudos

Resolved! Auto Loader: How to identify the backlog in RocksDB

With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS but not yet consumed by the streaming job). How do I find the same with Auto Loader?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

For DBR 8.2 or later, the backlog details are captured in the streaming metrics. E.g.:

  • 0 kudos
1 More Replies
User16783853906
by Contributor III
  • 3741 Views
  • 2 replies
  • 0 kudos

Trigger.Once mode recommendation

When is it recommended to use Trigger.Once mode compared to fixed processing intervals with micro-batches?

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...

  • 0 kudos
1 More Replies
jose_gonzalez
by Moderator
  • 1322 Views
  • 3 replies
  • 0 kudos

How to check my streaming job's metrics?

I would like to know if there is a way to keep track of my running streaming job.

Latest Reply
User16869510359
Esteemed Contributor
  • 0 kudos

Streaming metrics are exposed mainly in three ways:
  • Streaming UI, which is available from Spark 3/DBR 7
  • Streaming listener/Observable metrics API
  • Spark driver logs. Search for the string "Streaming query made progress". The metrics are logged...
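As a rough illustration of reading these metrics programmatically, the sketch below summarizes a progress report. It assumes the report is a dict shaped like PySpark's `StreamingQuery.lastProgress`; the helper names, the sample values, and the "falling behind" heuristic are hypothetical and not part of any official API.

```python
def summarize_progress(progress):
    """Pick a few headline numbers out of a streaming progress report.

    Missing fields come back as None rather than raising.
    """
    fields = ("batchId", "numInputRows", "inputRowsPerSecond", "processedRowsPerSecond")
    return {name: progress.get(name) for name in fields}


def is_falling_behind(progress):
    """Heuristic (an assumption, not an official rule): input outpaces processing."""
    rate_in = progress.get("inputRowsPerSecond") or 0.0
    rate_out = progress.get("processedRowsPerSecond") or 0.0
    return rate_in > rate_out


# Hypothetical sample values, for illustration only.
sample = {
    "batchId": 42,
    "numInputRows": 1000,
    "inputRowsPerSecond": 200.0,
    "processedRowsPerSecond": 150.0,
}
print(summarize_progress(sample))
print(is_falling_behind(sample))
```

With a live query, the same helpers could be applied to `query.lastProgress`, keeping in mind that it is `None` until the first micro-batch has reported progress.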

  • 0 kudos
2 More Replies
Srikanth_Gupta_
by Valued Contributor
  • 655 Views
  • 2 replies
  • 1 kudos

What are best practices for Spark streaming in Databricks?

What are the best practices for Spark streaming in Databricks?
  • Is it a good idea to consume multiple topics in one streaming job?
  • Is autoscaling recommended for Spark streaming?
  • How many worker nodes should we choose for a streaming job?
  • When should we run OPTIMIZE...

Latest Reply
craig_ng
New Contributor III
  • 1 kudos

See our docs for other considerations when deploying a production streaming job.

  • 1 kudos
1 More Replies
sajith_appukutt
by Honored Contributor II
  • 743 Views
  • 1 replies
  • 0 kudos

Resolved! I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

By default, the state data (for a streaming aggregation query) is maintained in the JVM memory of the executors, and a large number of state objects can put memory pressure on the JVM, causing long GC pauses. If you have stateful operations in your streamin...

  • 0 kudos