Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Constantine
by Contributor III
  • 1529 Views
  • 1 replies
  • 4 kudos

Resolved! What's the best architecture for Structured Streaming and why?

I am building an ETL pipeline which reads data from a Kafka topic (data is serialized in Thrift format) and writes it to a Delta table in Databricks. I want to have two layers: Bronze Layer -> which has raw Kafka data; Silver Layer -> which has deserializ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 4 kudos

@John Constantine, "Bronze Layer -> which has raw Kafka data": if you use confluent.io, you can also utilize a direct sink to Data Lake Storage as the bronze layer. "Silver Layer -> which has deserialized data": then use Delta Live Tables to process it to del...

Jreco
by Contributor
  • 4863 Views
  • 6 replies
  • 4 kudos

Resolved! Messages from Event Hub do not flow after a time

Hi Team, I'm trying to build a real-time solution using Databricks and Event Hubs. Something weird happens after the process has been running for a while. At the beginning the messages flow through the process as expected at this rate: please note that the last ...

Latest Reply
Jreco
Contributor
  • 4 kudos

Thanks for your answer @Hubert Dudek. It is already specified. What do you mean by this? This is the weird part: the data flows fine, but at some point it is as if the job stops reading, and if I restart the ...

5 More Replies
User16857281869
by New Contributor II
  • 2144 Views
  • 1 replies
  • 1 kudos

Resolved! Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my structured streaming job?

It's usually one or more of the following reasons: 1) If you are streaming into a table, you should be using the .trigger option to specify the frequency of checkpointing. Otherwise, the job will call the storage API every 10 ms to log the transaction data...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Please mount cheaper storage (LRS) to a custom mount point and set checkpoints there; please clear data regularly; if you are using foreach/foreachBatch in a stream, it will save every DataFrame on DBFS; please remember not to use display() in production; if on th...

Sandeep
by Contributor III
  • 593 Views
  • 0 replies
  • 4 kudos

spark.apache.org

Per the API docs on StreamingQuery.stop(), https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/StreamingQuery.html it says: this stops the execution of this query if it is running and waits until the termination of the query exec...

BorislavBlagoev
by Valued Contributor III
  • 2548 Views
  • 4 replies
  • 7 kudos

Resolved! Visualization of Structured Streaming in job.

Does Databricks have a feature or a good pattern to visualize data from Structured Streaming? Something like display() in the notebook.

Latest Reply
BorislavBlagoev
Valued Contributor III
  • 7 kudos

I didn't know about that. Thanks!

3 More Replies
RajaLakshmanan
by New Contributor
  • 3258 Views
  • 2 replies
  • 1 kudos

Resolved! Spark StreamingQuery not processing all data from source directory

Hi, I have set up a streaming process that consumes files from an HDFS staging directory and writes into a target location. The input directory continuously gets files from another process. Let's say the file producer produces 5 million records and sends them to the HDFS sta...

Latest Reply
User16763506586
Contributor
  • 1 kudos

If it helps, you can try running a left-anti join on source and sink to identify missing records, and then see whether each missing record matches the schema provided or not.

1 More Replies
Vu_QuangNguyen
by New Contributor
  • 2785 Views
  • 0 replies
  • 0 kudos

Structured streaming from an overwrite delta path

Hi experts, I need to ingest data from an existing Delta path into my own Delta lake. The dataflow is as shown in the diagram: the data team reads a full snapshot of a database table and overwrites it to a Delta path. This is done many times per day, but...

Charbel
by New Contributor II
  • 1395 Views
  • 0 replies
  • 1 kudos

Delta table is not writing data read from Kafka

Could you help me? I'm reading 5 Kafka topics through a list and saving the data in a Delta table. The execution will be once a day. Everything seems to be working, but I noticed that when I read a topic and it has no messages, it still gen...

brickster_2018
by Databricks Employee
  • 2419 Views
  • 1 replies
  • 0 kudos

Resolved! Is it mandatory to checkpoint my streaming query?

I have ad-hoc one-time streaming queries where I believe checkpointing won't add any value. Should I still use checkpointing?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

It's not mandatory, but the strong recommendation is to use checkpointing for streaming irrespective of your use case. This is because the default checkpoint location can accumulate a lot of files over time, as there is no guaranteed graceful cleanup in pla...

brickster_2018
by Databricks Employee
  • 1397 Views
  • 2 replies
  • 0 kudos

Why should I move to Auto-loader?

I have a streaming workload using the S3-SQS connector. The streaming job is running fine within the SLA. Should I migrate my job to use Auto Loader? If yes, what are the benefits? Who should migrate and who should not?

Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

That makes sense, @Anand Ladda! One major improvement that has a direct impact on performance is the architectural difference. S3-SQS uses an internal implementation of a Delta table to store the checkpoint details about the source files...

1 More Replies
HowardWong
by New Contributor II
  • 608 Views
  • 0 replies
  • 0 kudos

How do you handle Kafka offsets in a DR scenario?

If a structured streaming job with a checkpoint running in one region fails for whatever reason, DR kicks in to run the job in another region. What is the best way to pick up the offsets and continue where the failed job stopped?

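One common approach (an assumption here, since the thread has no replies) is to track consumed offsets externally, e.g. persisting them per micro-batch from foreachBatch, and on failover start the DR job with an explicit `startingOffsets` JSON. The topic name and offset numbers below are hypothetical; only the JSON construction runs without a Kafka cluster.

```python
# Sketch: building a `startingOffsets` JSON for the Kafka reader in the DR
# region. Topic and offsets are hypothetical; in practice they would come
# from offsets you persisted (e.g. per micro-batch via foreachBatch).
import json

# partition -> next offset to read, recovered from your own offset store
last_committed = {"orders": {"0": 1500, "1": 1498}}
starting_offsets = json.dumps(last_committed)

# In the DR job (not runnable without a Kafka cluster):
# df = (spark.readStream.format("kafka")
#           .option("kafka.bootstrap.servers", dr_brokers)
#           .option("subscribe", "orders")
#           .option("startingOffsets", starting_offsets)
#           .load())
print(starting_offsets)
```

Note the caveat in the Spark Kafka integration guide: `startingOffsets` applies only when a query starts without a checkpoint, which is exactly the DR situation where the original checkpoint is unavailable in the new region.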
User16826992666
by Valued Contributor
  • 2330 Views
  • 1 replies
  • 0 kudos

Resolved! What is the difference between a trigger once stream and a normal one time write?

It seems to me like both of these would accomplish the same thing in the end. Do they use different mechanisms to accomplish it though? Are there any hidden costs to streaming to consider?

Latest Reply
Ryan_Chynoweth
Esteemed Contributor
  • 0 kudos

The biggest reason to use the streaming API over the non-streaming API is that the checkpoint maintains a processing log. It is most common for people to use trigger once when they want to process only the changes between executions...
