by Abdus • New Contributor
- 582 Views
- 1 replies
- 0 kudos
When was the last commit done on Spark Streaming?
Latest Reply
Hi @Abdus! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer to your question first; otherwise, I will follow up shortly with a response.
- 712 Views
- 1 replies
- 0 kudos
Is there a way to keep my Synapse database always in sync with the latest data from a Delta table? I believe my Synapse database doesn't support streaming as a sink. Is there a workaround?
Latest Reply
You could try to keep the data in sync by appending the new data DataFrame in a foreachBatch on your write stream. This method allows for arbitrary ways to write data, and you can connect to the data warehouse with JDBC if necessary with your batch functi...
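A minimal sketch of that approach, assuming a hypothetical JDBC URL, credentials, table name, and paths (all placeholders would need to be filled in for a real job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details -- replace with your Synapse endpoint and target table.
jdbc_url = "jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db>"
target_table = "dbo.my_table"

def write_to_synapse(batch_df, batch_id):
    # Each micro-batch arrives as a regular DataFrame, so a plain JDBC append works here.
    (batch_df.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", target_table)
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())

query = (spark.readStream
    .format("delta")
    .load("/path/to/delta/table")                           # illustrative source path
    .writeStream
    .foreachBatch(write_to_synapse)
    .option("checkpointLocation", "/path/to/checkpoint")    # illustrative checkpoint path
    .start())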
- 622 Views
- 1 replies
- 0 kudos
I have provided the checkpointLocation as below; however, I see the config is ignored for my streaming query:
option("checkpointLocation", "path/to/checkpoint/dir")
Latest Reply
This is a common question from many users. If the streaming checkpoint directory is specified correctly, then this behavior is expected. Below is an example of specifying the checkpoint correctly:
df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  ...
- 671 Views
- 1 replies
- 0 kudos
I can see my streaming jobs running optimize jobs quite frequently. Is there a property I can use to control the autoOptimize frequency?
Latest Reply
autoOptimize is not performed on a time basis; it's an event-based trigger. Once the Delta table/partition has 50 files (the default value of spark.databricks.delta.autoCompact.minNumFiles), auto-compaction is triggered. To reduce the frequency, increase the value of spark.databricks.delta.autoCompact.minNumFiles.
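A hedged sketch of raising that threshold on the SparkSession (the value 200 is purely illustrative):

# Raise the file-count threshold so auto-compaction fires less often (illustrative value).
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "200")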
- 1080 Views
- 1 replies
- 0 kudos
I have a Spark Structured Streaming job reading data from Kafka and loading it into a Delta table. I have some transformations and aggregations on the streaming data before writing to the Delta table.
Latest Reply
The typical reason for data loss in a Structured Streaming application is having an incorrect value set for watermarking. The watermarking is done to ensure the application does not build up state for a long period. However, it should be ensured ...
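For reference, a minimal sketch of setting a watermark on a streaming aggregation; the column names, the 10-minute delay, and the 5-minute window are illustrative:

from pyspark.sql import functions as F

# events is a streaming DataFrame with an event-time column named "event_time" (illustrative).
windowed_counts = (events
    .withWatermark("event_time", "10 minutes")              # data later than 10 minutes is dropped from the aggregation
    .groupBy(F.window("event_time", "5 minutes"), "key")
    .count())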
- 1683 Views
- 1 replies
- 0 kudos
I have ad-hoc, one-time streaming queries where I believe checkpointing won't add any value. Should I still use checkpointing?
Latest Reply
It's not mandatory, but the strong recommendation is to use checkpointing for streaming irrespective of your use case. This is because the default checkpoint location can accumulate a lot of files over time, as there is no graceful, guaranteed cleaning in place...
- 872 Views
- 2 replies
- 0 kudos
It's preferable to use Spark streaming (with Delta) for batch workloads rather than regular batch. With the Trigger.Once trigger, whenever the streaming job is started it will process whatever is available in the source (Kafka/Kinesis/file system) and ...
Latest Reply
The streaming checkpoint mechanism is independent of the trigger type. The way checkpointing works is that it creates an offset file when processing a batch, and once the batch is completed it creates a commit file for that batch in the checkpoint directory...
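For illustration, a checkpoint directory typically ends up looking roughly like this (paths are illustrative):

/path/to/checkpoint/
    metadata            # query identity
    offsets/0, 1, ...   # one offset file written at the start of each micro-batch
    commits/0, 1, ...   # one commit file written after the micro-batch completes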
1 More Replies
- 602 Views
- 1 replies
- 0 kudos
I have an S3-SQS workload. Is it possible to migrate the workload to Auto Loader without downtime? What are the migration guidelines?
Latest Reply
The SQS queue used by the existing application can be utilized by Auto Loader, thereby ensuring minimal downtime.
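A minimal sketch of pointing Auto Loader at an existing queue, assuming file-notification mode; the queue URL, file format, and path are illustrative placeholders:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                        # source file format (illustrative)
    .option("cloudFiles.useNotifications", "true")               # file-notification mode
    .option("cloudFiles.queueUrl", "<existing-sqs-queue-url>")   # reuse the queue already in place
    .load("s3://<bucket>/<path>"))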
- 745 Views
- 2 replies
- 0 kudos
I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?
Latest Reply
That makes sense @Anand Ladda! One major improvement that will have a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of a Delta table to store the checkpoint details about the source files...
1 More Replies
- 1049 Views
- 2 replies
- 0 kudos
With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS and not yet consumed by the streaming job). How do I find the same with Auto Loader?
Latest Reply
For DBR 8.2 or later, the backlog details are captured in the streaming metrics. E.g.:
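A sketch of pulling those metrics from a running query; the query handle name is illustrative, and the backlog counters (e.g. numFilesOutstanding, numBytesOutstanding) are assumed to appear under each source's metrics map as Auto Loader reports them on DBR 8.2+:

# query is the handle returned by writeStream.start() (illustrative name).
progress = query.lastProgress
if progress:
    for source in progress["sources"]:
        # Auto Loader reports its backlog counters under the source's metrics map.
        print(source.get("metrics", {}))    # e.g. numFilesOutstanding, numBytesOutstanding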
1 More Replies
- 3741 Views
- 2 replies
- 0 kudos
When is it recommended to use Trigger.once mode compared to fixed processing intervals with micro batches?
Latest Reply
Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availability of new data...
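A minimal sketch of switching a query to Trigger.Once in PySpark (source, checkpoint, and target paths are illustrative):

query = (spark.readStream
    .format("delta")
    .load("/path/to/source")                                 # illustrative source
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint")     # illustrative checkpoint path
    .trigger(once=True)                                      # process everything available, then stop
    .start("/path/to/target"))                               # illustrative target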
1 More Replies
- 1322 Views
- 3 replies
- 0 kudos
I would like to know if there is a way to keep track of my running streaming job.
Latest Reply
Streaming metrics are available/exposed mainly through 3 ways:
- Streaming UI, which is available from Spark 3 / DBR 7
- Streaming listener / Observable metrics API
- Spark driver logs. Search for the string "Streaming query made progress". The metrics are logged...
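For the API route, a small sketch of inspecting progress directly from the query handle (the query variable name is illustrative):

# query is the StreamingQuery returned by writeStream.start() (illustrative name).
print(query.status)             # current state: is data available, is a trigger active, ...
print(query.lastProgress)       # metrics for the most recent micro-batch (input rows, durations, ...)
for p in query.recentProgress:  # history of recent micro-batches
    print(p["batchId"], p["numInputRows"])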
2 More Replies
- 655 Views
- 2 replies
- 1 kudos
What are best practices for Spark streaming in Databricks?
- Is it a good idea to consume multiple topics in one streaming job?
- Is autoscaling recommended for Spark streaming?
- How many worker nodes should we choose for a streaming job?
- When should we run OPTIMIZE...
Latest Reply
See our docs for other considerations when deploying a production streaming job.
1 More Replies
- 743 Views
- 1 replies
- 0 kudos
Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.
Latest Reply
By default, the state data (for a streaming aggregation query) is maintained in the JVM memory of the executors, and a large number of state objects can put memory pressure on the JVM, causing long GC pauses. If you have stateful operations in your streamin...
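One common mitigation on Databricks is to move the state to the RocksDB-based state store so it lives outside the JVM heap; a hedged sketch of enabling it (set before starting the streaming query, assuming a DBR version that supports the RocksDB state store):

# Keep streaming state in RocksDB instead of JVM memory to reduce GC pressure.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")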