Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I have a simple job scheduled every 5 min. Basically it listens for cloudFiles on a storage account and writes them into a delta table; extremely simple. The code is something like this:
df = (spark
.readStream
.format("cloudFiles")
.option('cloudFil...
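For reference, a minimal sketch of what such a stream usually looks like end to end; the paths, file format, and table name below are placeholder assumptions, not the poster's actual values:

# Hypothetical locations and format; replace with your own.
source_path = "abfss://landing@<storage-account>.dfs.core.windows.net/events/"
checkpoint_path = "/mnt/bronze/events/_checkpoint"

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                    # format of the incoming files
      .option("cloudFiles.schemaLocation", checkpoint_path)   # where Auto Loader tracks the inferred schema
      .load(source_path))

(df.writeStream
   .option("checkpointLocation", checkpoint_path)
   .trigger(availableNow=True)                                # process whatever is new, then stop
   .toTable("bronze.events"))                                 # hypothetical target table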
Hi, I am running a batch job which processes incoming files. I am trying to limit the number of files in each batch, so I added the maxFilesPerTrigger option. But it's not working; it processes all incoming files at once. (spark.readStream.format("delta").lo...
Hi @Sandeep, Can we use spark.readStream.format("delta").option("maxBytesPerTrigger", "50G").load(silver_path).writeStream.option("checkpointLocation", gold_checkpoint_path).trigger(availableNow=True).foreachBatch(foreachBatchFunction).start()
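Laid out as a sketch, the pattern being suggested looks like this (silver_path, gold_checkpoint_path, and foreachBatchFunction are the names used in the thread; the 50G value is the suggestion above, not a verified setting):

# Sketch only: rate-limit a Delta source stream by bytes per micro-batch.
(spark.readStream
      .format("delta")
      .option("maxBytesPerTrigger", "50G")          # cap the amount of data read per batch
      .load(silver_path)
      .writeStream
      .option("checkpointLocation", gold_checkpoint_path)
      .trigger(availableNow=True)                   # availableNow honors rate limits, unlike trigger once
      .foreachBatch(foreachBatchFunction)
      .start())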
Hi, When reading a Delta Lake table (created by Auto Loader) with this code: df = ( spark.readStream .format('cloudFiles') .option("cloudFiles.format", "delta") .option("cloudFiles.schemaLocation", f"{silver_path}/_checkpoint") .load(bronz...
@Vladif1 The error occurs because the cloudFiles format in Auto Loader is meant for ingesting raw file formats such as CSV and JSON (see the Auto Loader format support docs). For Delta tables, you should use the Delta format directly. #Sample Example
bronze...
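The sample above is cut off; a minimal sketch of the suggested fix, where bronze_path and silver_path are assumptions based on the thread rather than confirmed values:

bronze_path = "/mnt/bronze/events"                   # assumed: the Delta table Auto Loader wrote
silver_checkpoint = f"{silver_path}/_checkpoint"     # silver_path as referenced in the question

df = (spark.readStream
      .format("delta")                               # read the Delta table as a stream; cloudFiles is not needed
      .load(bronze_path))

(df.writeStream
   .option("checkpointLocation", silver_checkpoint)
   .trigger(availableNow=True)
   .start(silver_path))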
You’ve gotten familiar with Delta Live Tables (DLT) via the quickstart and getting started guide. Now it’s time to tackle creating a DLT data pipeline for your cloud storage, with one line of code. Here’s how it’ll look when you’re starting: CREATE OR ...
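For comparison, a Python-flavored sketch of the same idea: a single DLT table definition that ingests from cloud storage with Auto Loader (the table name, path, and format below are placeholders, not taken from the article):

import dlt

@dlt.table
def raw_events():
    # Incrementally ingest new files from cloud storage with Auto Loader.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("abfss://landing@<storage-account>.dfs.core.windows.net/events/"))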
Hi MadelynM, How should we handle source file archival and data retention with DLT? Source file archival: once the data from a source file is loaded with the DLT Auto Loader, we want to move the source file from the source folder to an archival folder. How can we ...
I am attempting to use autoloader to add a number of csv files to a delta table. The underlying csv files have spaces in the attribute names though (i.e. 'Account Number' instead of 'AccountNumber'). When I run my autoloader, I get the following error ...
@Hubert Dudek thanks for your response! I was able to use what you proposed above to generate the schema. The issue is that the schema sets all attributes to STRING values and renames them numerically ('_c0', '_c1', etc.). Although this allows us to...
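One hedged workaround sketch, assuming the CSVs have a header row: read with the header option so the real column names are used instead of _c0, _c1, then strip the spaces before writing to Delta (Delta column mapping is another route, not shown here; schema_path and source_path are hypothetical):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")                             # take column names from the CSV header
      .option("cloudFiles.schemaLocation", schema_path)     # assumed schema tracking location
      .load(source_path))                                   # assumed source directory

# Delta (without column mapping) rejects spaces in names, so strip them before writing.
renamed = df.select([df[c].alias(c.replace(" ", "")) for c in df.columns])   # 'Account Number' -> 'AccountNumber'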
We have a streaming use case and we see a lot of time spent in listing from Azure. Is it possible to supply a partition to the autoloader dynamically, on the fly?
@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?
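The marked solution isn't visible in this digest, but one common sketch for narrowing the listing scope is to template the load path per run, assuming a date-partitioned layout (every name below is hypothetical):

from datetime import date

# Hypothetical: limit listing to one date partition computed when the job starts.
partition = date.today().strftime("dt=%Y-%m-%d")       # e.g. dt=2024-01-31
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load(f"abfss://data@<storage-account>.dfs.core.windows.net/events/{partition}/"))

Since the source path is part of the stream definition, each scoped path generally needs its own checkpoint, so this trades listing cost for checkpoint management.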
I'm using autoloader directory listing mode (without incremental file listing) and sometimes new files are not picked up and found in the cloud_files-listing. I have found that using the 'cloudFiles.backfillInterval' option can resolve the detection ...
If we set the backfill to 1 week, will it run only once a week, or will it look for old files not processed in every trigger? For example, if we set it to 1 day and the job runs every hour, will it look for files in the past 24 hours on a sliding ...
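For reference, a hedged sketch of where that option sits in the stream definition (the interval is illustrative, and schema_path and source_path are placeholders):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")              # assumed format
      .option("cloudFiles.backfillInterval", "1 day")   # periodically re-list the source to catch missed files
      .option("cloudFiles.schemaLocation", schema_path) # assumed
      .load(source_path))                               # assumed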
I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.
@Morten Stakkeland: Yes, it's possible to configure an autoloader to read from multiple locations. You can define multiple cloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since ...
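A hedged sketch of the two-streams pattern described above; the container names, checkpoints, and target table are assumptions:

def start_stream(container, checkpoint):
    # One Auto Loader stream per container; both land in the same Delta table.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")                  # assumed; the schemas are identical
            .option("cloudFiles.schemaLocation", checkpoint)
            .load(f"abfss://{container}@<storage-account>.dfs.core.windows.net/")
            .writeStream
            .option("checkpointLocation", checkpoint)            # separate checkpoint per source
            .toTable("bronze.events"))                           # hypothetical target table

stream_a = start_stream("container-a", "/mnt/checkpoints/container_a")
stream_b = start_stream("container-b", "/mnt/checkpoints/container_b")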
Hello, We can use Autoloader to track whether files have been loaded from an S3 bucket or not. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded? I can easily do this in AWS Glue j...
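No answer is shown in this digest, but one approach that is often suggested (hedged; it assumes a Databricks Runtime where the cloud_files_state table-valued function is available) is to query the stream's checkpoint directly:

# Assumed checkpoint path of the Auto Loader stream in question.
loaded_files = spark.sql(
    "SELECT * FROM cloud_files_state('/mnt/bronze/events/_checkpoint')"
)
loaded_files.show(truncate=False)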
Hi, I have several streaming jobs; however, one of them uses Trigger.AvailableNow. The issue is that it gets stuck when there are no events or after it finishes ingesting all events. The expected behavior would be for the job to shut down. I've already checked...
Hi, I am trying to apply a batch size in autoloader and the code is as below. But it's picking up all the changes in one go even though I have set maxFilesPerTrigger to 10. Appreciate any help. (spark.readStream.format("json").schema(streamSchema).option("cloudFiles.b...
Hi @Sanjay Jain, Since you have set the trigger to once, maxFilesPerTrigger will not take effect here. With trigger once, all the files will be read together. You need to change the trigger for this option to come into effect. Please refer ...
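A hedged sketch of the suggested change, assuming Auto Loader as the source: switching from trigger once to availableNow lets the file limit apply per micro-batch while still draining the backlog (streamSchema is the name from the question; the paths and table are placeholders):

(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.maxFilesPerTrigger", "10")    # cap each micro-batch at 10 files
      .schema(streamSchema)                             # streamSchema as in the question
      .load(source_path)                                # assumed source directory
      .writeStream
      .option("checkpointLocation", checkpoint_path)    # assumed
      .trigger(availableNow=True)                       # respects rate limits, unlike trigger(once=True)
      .toTable("bronze.events"))                        # hypothetical target table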
Hi, I am using autoloader; it picks up data from AWS S3 and stores it in a delta table. When there is a large number of messages, I would like to process the messages by priority. Is it possible to prioritize messages in autoloader? Regards, Sanjay
Hello everyone! I was wondering if there is any way to get the subdirectories in which the file resides while loading using Autoloader with DLT. For example: def customer(): return ( spark.readStream.format('cloudfiles') .option('clou...
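One hedged way to surface the source directory, assuming a Databricks Runtime where the _metadata column is available for Auto Loader sources (the format and path below are placeholders):

import dlt
from pyspark.sql import functions as F

@dlt.table
def customer():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")                       # assumed format
            .load(source_path)                                         # assumed path
            .withColumn("source_file", F.col("_metadata.file_path")))  # full path, including subdirectories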
Hi @Parsa Bahraminejad, We haven't heard from you since the last response from @Vigneshraja Palaniraj, and I was checking back to see if her suggestions helped you. Otherwise, if you have any solution, please share it with the community, as it can be...
We have a solution implemented for ingesting binary files (.ZIP) into delta lake. Currently we are using the below steps within our pipeline: unzip the file and extract the XML file; parse the XML using Python libraries; flatten the nested XML columns...
Databricks Auto Loader is an interesting feature that can be used to load data incrementally.
✳ It can process new data files as they arrive in cloud object stores.
✳ It can be used to ingest JSON, CSV, PARQUET, AVRO, ORC, TEXT and even Binary file ...
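For instance, a hedged sketch of ingesting binary files with Auto Loader (the path, checkpoint, and storage account are placeholders):

# Sketch: load raw files (e.g. images or archives) as binary content plus file metadata.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")                            # yields path, modificationTime, length, content
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/attachments")  # assumed
      .load("abfss://raw@<storage-account>.dfs.core.windows.net/attachments/"))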