Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How does Auto Loader ingest data?

Kenny92
New Contributor III

I have recently completed the Data Engineering with Databricks v3 course on the Partner Academy. Some of the quiz questions have me mixed up.

Specifically, I am wondering about this question from the "Build Data Pipelines with Delta Live Tables and Spark SQL" module.

"Which of the following correctly describes how Auto Loader ingests data? Select one response."

From submitting different answers, I have gathered that the one it marks as correct is "Auto Loader incrementally ingests new data files in batches."

However, I believe the accurate answer would be "Auto Loader automatically writes new data files continuously as they land." This is based on the doc page "What is Auto Loader", which says, "Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage." However, I'm struggling to find clear information on exactly how it works under the hood (i.e., is it ingesting landed files in batches or one at a time?), and I'm starting to think both answers are correct depending on how you configure it. Can someone provide clarity on this?

1 ACCEPTED SOLUTION


Anonymous
Not applicable

@Kenny Shaevel:

You are correct that Auto Loader processes new data files continuously as they land in cloud storage. Under the hood, Auto Loader is a Spark Structured Streaming source (the cloudFiles source): it does not wait for a large batch of files to accumulate, but discovers each new file shortly after it arrives and makes it available to the stream, which you would typically write out to a Delta table so the data can be queried with Spark SQL almost immediately.
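In code, that continuous ingestion pattern looks roughly like this. This is a sketch, not from the course: the paths, table name, and the choice of JSON as the file format are hypothetical placeholders, and `start_ingest` assumes a live Spark session on Databricks.

```python
# Sketch of a continuous Auto Loader ingest. The paths, table name, and
# JSON format below are hypothetical placeholders, not from the thread.
def build_autoloader_options(schema_location):
    """Option map for an Auto Loader (cloudFiles) stream reading JSON files."""
    return {
        "cloudFiles.format": "json",                   # format of the landed files
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }

def start_ingest(spark, source_path, target_table, checkpoint_path):
    """Continuously pick up new files from source_path and append them to a Delta table."""
    options = build_autoloader_options(checkpoint_path + "/_schema")
    return (
        spark.readStream
             .format("cloudFiles")                        # the Auto Loader source
             .options(**options)
             .load(source_path)                           # streaming DataFrame over new files
             .writeStream
             .option("checkpointLocation", checkpoint_path)
             .toTable(target_table)                       # append into a Delta table
    )
```

With no explicit trigger, this stream keeps running and picks up each new file shortly after it lands.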

The statement "Auto Loader incrementally ingests new data files in batches" is not entirely accurate, but it is not entirely wrong either. Because Structured Streaming executes as a series of micro-batches, newly discovered files are processed in small incremental micro-batches rather than strictly one at a time; with the default trigger those micro-batches fire continuously as files land, so in practice you can query new data without waiting for a larger batch to accumulate.

It is important to note that Auto Loader's behavior and performance depend on the size and frequency of the incoming data files, as well as on how the stream is configured. For example, you can cap how much data each micro-batch picks up (cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger), or run the stream with Trigger.AvailableNow so that it processes everything that has already arrived and then stops, which makes Auto Loader behave like an incremental batch job. So both quiz answers describe real modes of operation, depending on configuration.
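To make the "both, depending on configuration" point concrete, here is a small helper. The helper name is mine (hypothetical); the trigger keyword arguments themselves are standard Structured Streaming options passed as `writeStream.trigger(**kwargs)`.

```python
# Choosing between continuous and batch-style ingestion for the same stream.
# The helper name is hypothetical; the kwargs are standard Structured
# Streaming trigger options, used as df.writeStream.trigger(**kwargs).
def trigger_kwargs(continuous, poll_interval="30 seconds"):
    """Trigger settings for an Auto Loader writeStream.

    continuous=True  -> micro-batches keep firing as new files land
    continuous=False -> availableNow: ingest the backlog once, then stop
    """
    if continuous:
        return {"processingTime": poll_interval}  # keep polling while running
    return {"availableNow": True}                 # incremental batch, then stop
```

For example, `df.writeStream.trigger(**trigger_kwargs(continuous=False))` runs Auto Loader as a scheduled incremental batch job, while `continuous=True` keeps it ingesting files as they arrive.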


REPLIES


Anonymous
Not applicable

Hi @Kenny Shaevel​ 

Hope everything is going great.

Just wanted to check in to see whether you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!
