Topics with Label: Spark streaming

Forum Posts

by Abdus • New Contributor

07-15-2021 10:55:02 AM

582 Views
1 reply
0 kudos

Apache Spark Streaming

When was the last commit made to Spark Streaming?

Data Engineering

Latest Reply

Kaniz
Community Manager

08-20-2021 4:03:39 AM

0 kudos

Hi @Abdus! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your questions first. Or else I will follow up shortly with a response.

by User16869510359 • Esteemed Contributor

06-25-2021 3:46:28 PM

730 Views
1 reply
0 kudos

What are the advantages of using the RocksDB state store compared to the HDFS-backed state store?

Data Engineering

Latest Reply

aladda
Honored Contributor II

06-25-2021 4:01:35 PM

0 kudos

Can you provide some additional details on this? What components are we comparing the states for?

by User16826994223 • Honored Contributor III

06-25-2021 9:15:10 AM

712 Views
1 reply
0 kudos

Delta table to Spark Streaming to Synapse table in Azure Databricks

Is there a way to keep my Synapse database always in sync with the latest data from a Delta table? I believe my Synapse database doesn't support a stream as a sink. Is there a workaround?

Data Engineering

Latest Reply

User16826994223
Honored Contributor III

06-25-2021 9:17:48 AM

0 kudos

You could try to keep the data in sync by appending each new batch DataFrame in a foreachBatch on your write stream. This method allows arbitrary ways to write data, and you can connect to the data warehouse with JDBC if necessary: with your batch functi...
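The reply preview is cut off, so here is only a minimal PySpark-style sketch of the foreachBatch idea it describes, not the poster's actual code. The function names, paths, JDBC URL, table name, and credentials are all hypothetical; `foreachBatch`, `checkpointLocation`, and the `jdbc` writer format are standard Spark APIs.

```python
def write_batch_to_synapse(batch_df, batch_id, jdbc_url, table, user, password):
    """Append one micro-batch to a JDBC table (hypothetical helper)."""
    (
        batch_df.write
        .format("jdbc")                 # generic Spark JDBC sink
        .option("url", jdbc_url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .mode("append")                 # add the new rows, keep existing ones
        .save()
    )


def start_delta_to_synapse_sync(spark, delta_path, checkpoint_path,
                                jdbc_url, table, user, password):
    """Stream new rows from a Delta table and push each micro-batch over JDBC."""
    def handle_batch(batch_df, batch_id):
        write_batch_to_synapse(batch_df, batch_id, jdbc_url, table, user, password)

    stream_df = spark.readStream.format("delta").load(delta_path)
    return (
        stream_df.writeStream
        .foreachBatch(handle_batch)     # called once per micro-batch
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Note that foreachBatch gives at-least-once delivery by default, so a batch may be written again after a failure; the `batch_id` argument is there if you want to deduplicate on the warehouse side.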

by User16869510359 • Esteemed Contributor

06-25-2021 6:33:55 AM

622 Views
1 reply
0 kudos

Resolved! Why is my streaming job not resuming even though I specified a checkpoint directory?

I have provided the checkpointLocation as below; however, I see the config is ignored for my streaming query: option("checkpointLocation", "path/to/checkpoint/dir")

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 6:34:46 AM

0 kudos

This is a common question from many users. If the streaming checkpoint directory is specified correctly then this behavior is expected. Below is an example of specifying the checkpoint correctly: df.writeStream.format("parquet").option("checkpo...
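The example in the reply is truncated, so the following is a separate minimal sketch rather than a reconstruction of it. It assumes a streaming DataFrame and hypothetical paths, and shows the option being set on `writeStream`, which is where Spark reads `checkpointLocation` from; setting it on the reader side is one thing worth ruling out when the option appears to be ignored.

```python
def start_parquet_sink(stream_df, output_path, checkpoint_path):
    """Start a parquet sink with an explicit checkpoint directory.

    stream_df is assumed to be a streaming DataFrame; both paths are
    hypothetical placeholders supplied by the caller.
    """
    return (
        stream_df.writeStream           # the option belongs on the writer
        .format("parquet")
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)
    )
```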

by User16869510359 • Esteemed Contributor

06-25-2021 6:19:20 AM

671 Views
1 reply
0 kudos

Resolved! Is there any way to control the autoOptimize interval?

I can see my streaming jobs running optimize jobs more frequently. Is there any property I can use to control the autoOptimize interval?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 6:23:21 AM

0 kudos

Auto-optimize is not performed on a time basis; it is an event-based trigger. Once the Delta table/partition has 50 files (the default value of spark.databricks.delta.autoCompact.minNumFiles), auto-compaction is triggered. To reduce the frequency, inc...
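As a rough illustration of the event-based trigger described above, here is a small sketch. The helper names are hypothetical, the `>=` comparison is a simplification of whatever Databricks does internally, and raising the threshold to reduce frequency is an assumption about where the truncated reply was heading. `spark.conf.set` is the standard way to set a session configuration.

```python
AUTO_COMPACT_MIN_FILES_KEY = "spark.databricks.delta.autoCompact.minNumFiles"


def should_auto_compact(file_count, min_num_files=50):
    """Simplified model: compaction fires once the file count reaches the threshold."""
    return file_count >= min_num_files


def set_auto_compact_min_files(spark, min_num_files):
    """Raise or lower the file-count threshold for the current Spark session."""
    if min_num_files < 1:
        raise ValueError("min_num_files must be a positive integer")
    spark.conf.set(AUTO_COMPACT_MIN_FILES_KEY, str(min_num_files))
```

With the default of 50, a partition holding 49 small files would not trigger compaction in this model, while one holding 50 would.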

by User16869510359 • Esteemed Contributor

06-25-2021 5:51:31 AM

1080 Views
1 reply
0 kudos

Resolved! Why do I see data loss with Structured Streaming jobs?

I have a Spark Structured Streaming job reading data from Kafka and loading it into a Delta table. I have some transformations and aggregations on the streaming data before writing to the Delta table.

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-25-2021 5:53:04 AM

0 kudos

The typical reason for data loss in a Structured Streaming application is an incorrect watermark value. Watermarking ensures the application does not keep accumulating state for a long period. However, it should be ensured ...
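To make the watermark effect concrete, here is a plain-Python toy model (not Spark code). It assumes the usual rule that the watermark is the maximum event time seen so far minus the allowed delay, updated after each micro-batch, and that events older than the watermark are discarded. In Spark itself the delay is set with `df.withWatermark("<event time column>", "<delay>")` and applies to stateful operators; the function name and data below are made up for illustration.

```python
from datetime import datetime, timedelta


def split_late_events(batches, delay):
    """Return (kept, dropped) event times under a simplified watermark model.

    batches: list of micro-batches, each a list of event-time datetimes.
    delay:   timedelta giving how late an event may arrive.
    """
    watermark = None
    max_event_time = None
    kept, dropped = [], []
    for batch in batches:
        for event_time in batch:
            # Events older than the current watermark are treated as too late.
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                kept.append(event_time)
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        # The watermark only advances between micro-batches.
        if max_event_time is not None:
            watermark = max_event_time - delay
    return kept, dropped
```

In this model, an event that arrives 8 minutes behind the newest event seen is dropped with a 5-minute delay but kept with a 10-minute delay, which is the kind of silent loss a too-small watermark can cause.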

by User16869510359 • Esteemed Contributor

06-24-2021 6:54:39 AM

1683 Views
1 reply
0 kudos

Resolved! Is it mandatory to checkpoint my streaming query?

I have ad-hoc, one-time streaming queries where I believe checkpointing won't add any value. Should I still use checkpointing?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-24-2021 6:57:33 AM

0 kudos

It's not mandatory, but the strong recommendation is to use checkpointing for streaming irrespective of your use case. This is because the default checkpoint location can accumulate a lot of files over time, as there is no guaranteed graceful cleanup in pla...

by User16783855534 • New Contributor III

06-23-2021 12:50:12 PM

872 Views
2 replies
0 kudos

Should/Can I use Spark Streaming for batch workloads?

It's preferable to use Spark Streaming (with Delta) for batch workloads rather than regular batch. With the Trigger.Once trigger, whenever the streaming job is started it will process whatever is available in the source (Kafka/Kinesis/file system) and ...

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-24-2021 6:52:51 AM

0 kudos

The streaming checkpoint mechanism is independent of the trigger type. The way checkpointing works is that it creates an offset file when processing a batch, and once the batch is completed it creates a commit file for that batch in the checkpoint director...
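The offset/commit bookkeeping described above can be sketched in plain Python. This is a toy inspection helper, not Spark's recovery code: it assumes the checkpoint directory contains `offsets/` and `commits/` subdirectories whose files are named by batch id, and the function names are hypothetical.

```python
from pathlib import Path


def _batch_ids(directory):
    """Collect integer batch ids from file names, ignoring anything else."""
    if not directory.is_dir():
        return set()
    return {int(p.name) for p in directory.iterdir() if p.name.isdigit()}


def next_batch_to_run(checkpoint_dir):
    """Work out which batch a restarted query would process first.

    If the newest offset file has no matching commit file, that batch was
    planned but never finished, so it is the one to run again.
    """
    root = Path(checkpoint_dir)
    offsets = _batch_ids(root / "offsets")
    commits = _batch_ids(root / "commits")
    if not offsets:
        return 0                      # nothing planned yet: start from batch 0
    latest = max(offsets)
    return latest if latest not in commits else latest + 1
```

The same logic holds whether the query runs continuously or with a one-shot trigger, which is why a restarted job picks up where the previous run left off.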

by User16869510359 • Esteemed Contributor

06-23-2021 3:56:43 PM

602 Views
1 reply
0 kudos

How to migrate to Auto Loader without downtime?

I have an S3-SQS workload. Is it possible to migrate the workload to Auto Loader without downtime? What are the migration guidelines?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-24-2021 6:32:32 AM

0 kudos

The SQS queue used by the existing application can be reused by Auto Loader, thereby ensuring minimal downtime.
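A minimal sketch of pointing Auto Loader at an existing queue might look like the following. It assumes the `cloudFiles.useNotifications` and `cloudFiles.queueUrl` options as described in the Databricks Auto Loader documentation; the function name, source path, queue URL, file format, and schema are placeholders, and the reply itself does not show any code.

```python
def read_with_auto_loader(spark, source_path, queue_url, file_format, schema):
    """Build an Auto Loader stream that consumes events from an existing SQS queue."""
    return (
        spark.readStream
        .format("cloudFiles")                           # Auto Loader source
        .option("cloudFiles.format", file_format)       # e.g. "json" or "parquet"
        .option("cloudFiles.useNotifications", "true")  # notification mode, not directory listing
        .option("cloudFiles.queueUrl", queue_url)       # reuse the queue the old job read from
        .schema(schema)
        .load(source_path)
    )
```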

by User16869510359 • Esteemed Contributor

06-23-2021 3:54:45 PM

745 Views
2 replies
0 kudos

Why should I move to Auto Loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 10:26:38 PM

0 kudos

That makes sense, @Anand Ladda! One major improvement that will have a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of the Delta table to store the checkpoint details about the source files...

by User16869510359 • Esteemed Contributor

06-23-2021 4:25:32 PM

1049 Views
2 replies
0 kudos

Resolved! Auto Loader: How to identify the backlog in RocksDB?

With S3-SQS it was easier to identify the backlog (the messages that are fetched from SQS but not yet consumed by the streaming job). How can I find the same with Auto Loader?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 4:29:42 PM

0 kudos

For DBR 8.2 or later, the backlog details are captured in the streaming metrics. E.g.:
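The example from the reply is not included in the preview. As a hedged sketch of how such metrics could be read, the helper below pulls backlog counters out of a streaming progress dictionary such as `query.lastProgress`. The metric names `numFilesOutstanding` and `numBytesOutstanding` are taken from the Databricks Auto Loader documentation as I understand it, and the function name is hypothetical.

```python
def auto_loader_backlog(progress):
    """Extract per-source backlog counters from a streaming progress dict."""
    backlog = []
    for source in progress.get("sources", []):
        metrics = source.get("metrics") or {}
        if "numFilesOutstanding" in metrics:
            backlog.append({
                "source": source.get("description"),
                # The values may be reported as strings, so convert them.
                "files": int(metrics["numFilesOutstanding"]),
                "bytes": int(metrics.get("numBytesOutstanding", 0)),
            })
    return backlog
```

Sources that do not report these metrics are simply skipped, so the helper returns an empty list for non-Auto Loader streams.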

by User16783853906 • Contributor III

06-23-2021 2:14:56 PM

3741 Views
2 replies
0 kudos

Trigger.Once mode recommendation

When is it recommended to use Trigger.Once mode compared to fixed processing intervals with micro-batches?

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 2:26:12 PM

0 kudos

Also note that configurations like maxFilesPerTrigger and maxBytesPerTrigger are ignored with Trigger.Once. Streaming queries with significantly lower throughput can switch to Trigger.Once to avoid the continuous execution of the job checking the availab...
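For comparison, here is a small sketch of how the two trigger styles are selected in PySpark. The function name, sink format, and paths are illustrative assumptions; `trigger(once=True)` and `trigger(processingTime=...)` are the standard `DataStreamWriter.trigger` keyword arguments.

```python
def start_sink(stream_df, output_path, checkpoint_path, interval=None):
    """Start a Delta sink with either a one-shot or a fixed-interval trigger.

    interval=None        -> Trigger.Once: process what is available, then stop.
    interval="1 minute"  -> micro-batches at a fixed processing interval.
    """
    writer = (
        stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
    )
    if interval is None:
        writer = writer.trigger(once=True)
    else:
        writer = writer.trigger(processingTime=interval)
    return writer.start(output_path)
```

A job using the one-shot form is typically launched by a scheduler at whatever cadence the workload needs, so the cluster does not stay up polling an idle source.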

by jose_gonzalez • Moderator

06-04-2021 11:41:56 AM

1322 Views
3 replies
0 kudos

How to check my streaming job's metrics?

I would like to know if there is a way to keep track of my running streaming job.

Data Engineering

Latest Reply

User16869510359
Esteemed Contributor

06-23-2021 6:20:01 AM

0 kudos

Streaming metrics are available/exposed mainly in three ways: (1) the Streaming UI, which is available from Spark 3/DBR 7; (2) the streaming listener/observable metrics API; (3) the Spark driver logs. Search for the string "Streaming query made progress". The metrics are logged...
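Alongside the three routes listed above, the same progress information can be read programmatically from a running query via `query.lastProgress` or `query.recentProgress`. The sketch below assumes the progress is available as a dictionary with the standard field names (`batchId`, `numInputRows`, `inputRowsPerSecond`, `processedRowsPerSecond`, `durationMs`); the helper names are hypothetical.

```python
def summarize_progress(progress):
    """Pick out a few headline numbers from one streaming progress dict."""
    durations = progress.get("durationMs") or {}
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows"),
        "input_rows_per_sec": progress.get("inputRowsPerSecond"),
        "processed_rows_per_sec": progress.get("processedRowsPerSecond"),
        "trigger_ms": durations.get("triggerExecution"),
    }


def is_falling_behind(progress):
    """True when rows arrive faster than the query processes them."""
    rate_in = progress.get("inputRowsPerSecond")
    rate_out = progress.get("processedRowsPerSecond")
    if rate_in is None or rate_out is None:
        return False
    return rate_in > rate_out
```

For example, `summarize_progress(query.lastProgress)` could be logged after each check, guarding against the case where no batch has completed yet and the progress is `None`.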

by Srikanth_Gupta_ • Valued Contributor

06-14-2021 3:15:21 PM

655 Views
2 replies
1 kudos

What are best practices for Spark Streaming in Databricks?

What are best practices for Spark Streaming in Databricks? Is it a good idea to consume multiple topics in one streaming job? Is autoscaling recommended for Spark Streaming? How many worker nodes should we choose for a streaming job? When should we run OPTIMIZE...

Data Engineering

Latest Reply

craig_ng
New Contributor III

06-18-2021 10:37:30 AM

1 kudos

See our docs for other considerations when deploying a production streaming job.

by sajith_appukutt • Honored Contributor II

06-09-2021 1:20:06 AM

743 Views
1 reply
0 kudos

Resolved! I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.

Data Engineering

Latest Reply

sajith_appukutt
Honored Contributor II

06-17-2021 4:14:58 PM

0 kudos

By default, the state data (for a streaming aggregation query) is maintained in the JVM memory of the executors, and a large number of state objects can put memory pressure on the JVM, causing long GC pauses. If you have stateful operations in your streamin...
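The reply is cut off, so what it goes on to recommend is not known here. One documented option for moving state out of JVM heap memory is the RocksDB state store provider, selected through the `spark.sql.streaming.stateStore.providerClass` setting. The sketch below is an assumption-laden illustration: the helper name is hypothetical, the default class name is the Databricks one, and open-source Spark 3.2+ uses `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider` instead.

```python
STATE_STORE_PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"
DATABRICKS_ROCKSDB_PROVIDER = (
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)


def use_rocksdb_state_store(spark, provider_class=DATABRICKS_ROCKSDB_PROVIDER):
    """Select a RocksDB-backed state store for queries started afterwards."""
    spark.conf.set(STATE_STORE_PROVIDER_KEY, provider_class)
```

The setting needs to be in place before the streaming query is started.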
