
How does Auto Loader ingest data?

Kenny92
New Contributor III

I have recently completed the Data Engineering with Databricks v3 course on the Partner Academy. Some of the quiz questions have me mixed up.

Specifically, I am wondering about this question from the "Build Data Pipelines with Delta Live Tables and Spark SQL" module.

Which of the following correctly describes how Auto Loader ingests data? Select one response.

From submitting different answers, I have gathered that the answer marked as correct is "Auto Loader incrementally ingests new data files in batches."

However, I believe the accurate answer would be "Auto Loader automatically writes new data files continuously as they land." This is based on the doc page What is Auto Loader which says, "Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage." However, I'm struggling to find clear information on exactly how it works under the hood (i.e. is it ingesting landed files in batches or one-at-a-time) and am starting to think both answers are correct depending on how you configure it. Can someone provide clarity on this?

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Kenny Shaevel​ :

You are right that Auto Loader picks up new files as they land in cloud storage without you having to schedule or track anything yourself. Under the hood, though, Auto Loader is a Structured Streaming source (cloudFiles), and by default Structured Streaming runs as a series of micro-batches. Each micro-batch processes the files discovered since the previous one, so ingestion is incremental and batched rather than strictly one file at a time. Auto Loader also does not write or convert anything by itself: it reads the files, and your stream (or Delta Live Tables pipeline) writes the result to a target, typically a Delta table.

That is why the quiz marks "Auto Loader incrementally ingests new data files in batches" as correct. With a continuously running stream and a short trigger interval, the micro-batches can be small and frequent enough to feel continuous, and a micro-batch may contain a single file if only one has arrived, but the unit of work is still a batch. If you run the stream with an available-now trigger, it processes everything that has landed so far and then stops, which behaves like an incremental batch job. So both answers describe something real, but the batch wording matches the mechanism.

Performance depends on the size and frequency of the incoming files and on how the stream is configured. You can apply validation or transformations to the stream before writing it to Delta, and you can cap how much each micro-batch takes on with the cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger options.
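
To make the batch-size settings concrete, here is a minimal sketch, assuming a Databricks runtime where Auto Loader is available as the cloudFiles Structured Streaming source. The helper names, paths, and table name are hypothetical and only for illustration; the option keys and the available-now trigger are the parts that control how landed files are grouped into micro-batches.

```python
from typing import Dict, Optional


def auto_loader_options(
    file_format: str,
    schema_location: str,
    max_files_per_batch: Optional[int] = None,
) -> Dict[str, str]:
    """Build the option map for the cloudFiles (Auto Loader) source.

    Hypothetical helper, not part of any Databricks API.
    """
    options = {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }
    if max_files_per_batch is not None:
        # Upper bound on how many new files go into one micro-batch.
        options["cloudFiles.maxFilesPerTrigger"] = str(max_files_per_batch)
    return options


def ingest_available_files(spark, source_path, checkpoint_path, target_table, options):
    """Process every file that has landed so far in micro-batches, then stop.

    `spark` is the SparkSession that Databricks notebooks provide.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    return (
        stream.writeStream
        # The checkpoint records which files were already ingested.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Assumed example values; building the options does not need a Spark session.
opts = auto_loader_options("json", "/tmp/example/_schemas", max_files_per_batch=100)
```

In a notebook you would then call ingest_available_files(spark, ...) with your own paths. Replacing the available-now trigger with a processing-time trigger keeps the stream running and picking up new files in successive micro-batches.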


2 REPLIES 2

Anonymous
Not applicable

Hi @Kenny Shaevel​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!
