Data Engineering

Forum Posts

by sanjay • Valued Contributor II

03-29-2023 11:59:29 PM

32630 Views
21 replies
18 kudos

Resolved! How to limit number of files in each batch in streaming batch processing

Hi, I am running a batch job that processes incoming files. I am trying to limit the number of files in each batch, so I added the maxFilesPerTrigger option, but it's not working: it processes all incoming files at once. (spark.readStream.format("delta").lo...

Latest Reply

mjedy7
New Contributor II

11-24-2024 10:50:17 PM

18 kudos

Hi @Sandeep, can we use spark.readStream.format("delta").option("maxBytesPerTrigger", "50G").load(silver_path).writeStream.option("checkpointLocation", gold_checkpoint_path).trigger(availableNow=True).foreachBatch(foreachBatchFunction).start()?

by Valentin1 • New Contributor III

04-02-2023 2:30:24 AM

12039 Views
6 replies
3 kudos

Delta Live Tables Incremental Batch Loads & Failure Recovery

Hello Databricks community, I'm working on a pipeline and would like to implement a common use case using Delta Live Tables. The pipeline should include the following steps: Incrementally load data from Table A as a batch. If the pipeline has previously...

Latest Reply

lprevost
Contributor II

09-21-2024 10:45:44 AM

3 kudos

I totally agree that this is a gap in the Databricks solution, one that sits between a static read and real-time streaming. My problem (and I suspect there are many such use cases) is that I have slowly changing data coming into structured folders via ...

by iptkrisna • New Contributor III

06-14-2023 3:26:53 AM

1752 Views
1 reply
2 kudos

Jobs Data Pipeline Runtime Increase Significantly

Hi, I am facing an issue where one of my jobs has started taking much longer than before. Previously it needed less than 1 hour to run a batch job that loads JSON data and does a truncate-and-load into a Delta table, but since June 2nd it has become so slow that...

Latest Reply

Anonymous
Not applicable

06-15-2023 11:49:47 PM

2 kudos

Hi @krisna math, great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question. Thanks.

by SRK • Contributor III

12-21-2022 8:29:40 AM

10023 Views
2 replies
0 kudos

How to get the count of dataframe rows when reading through spark.readstream using batch jobs?

I am trying to read messages from a Kafka topic using spark.readStream, with the following code. My code: df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "192.1xx.1.1xx:9xx").option("subscr...

Latest Reply

daniel_sahal
Databricks MVP

12-22-2022 5:13:54 AM

0 kudos

You can try this approach: https://stackoverflow.com/questions/57568038/how-to-see-the-dataframe-in-the-console-equivalent-of-show-for-structured-st/62161733#62161733 ReadStream runs a thread in the background, so there is no easy equivalent of df.show().

by huyd • New Contributor III

11-22-2022 2:47:12 PM

1895 Views
0 replies
4 kudos

Optimizing a batch load process, reading with the JDBC driver

I am doing a batch load from a database table using the JDBC driver. In the Spark UI I notice both memory and disk spill, but only on one executor. I also notice that when I try to use the JDBC parallel read, it seems to run sl...

Databricks Community