Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Dynamically supplying partitions to autoloader

Soma
Valued Contributor

We have a streaming use case and we see a lot of time spent listing files from Azure storage.

Is it possible to supply partitions to Auto Loader dynamically, on the fly?

1 ACCEPTED SOLUTION

Accepted Solutions

Soma
Valued Contributor

Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing, but we have now solved it with an Event Grid based approach. (Initially we tried Auto Loader and it was not detecting events unless the files were flushed with close; we fixed the issue on the source side by setting the close parameter to true on the ADLS Gen2 side.)


7 REPLIES

Kaniz_Fatma
Community Manager

Hi @Soma! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise I will get back to you soon. Thanks.

Hubert-Dudek
Esteemed Contributor III

I know the pain of listings on the Azure bill 😉 In my case I solved it with a lower trigger frequency, but file notification mode can also be a good option. Additionally, you can set up your own queue and Event Grid subscription to have more control over it (although first experiments can be done with the automatically created ones):

File notification: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure permissions for the Azure Event Grid and Queue Storage services and specify

.option("cloudFiles.useNotifications","true")

File notifications are supported for ADLS Gen2 and Azure Blob Storage.

source: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2
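
As a minimal sketch of the file notification setup described above (not a tested production configuration), the options could be assembled like this in PySpark. The helper names `notification_options` and `read_with_notifications` are hypothetical, and the queue name and connection string are placeholders for the "own queue" variant. The sketch assumes the `cloudFiles.queueName` and `cloudFiles.connectionString` options from the Auto Loader documentation. If Auto Loader creates the Event Grid and queue resources itself, it also needs Azure service principal options that are not shown here.

```python
from typing import Dict, Optional


def notification_options(file_format: str,
                         queue_name: Optional[str] = None,
                         connection_string: Optional[str] = None) -> Dict[str, str]:
    """Build Auto Loader options for file notification mode.

    Hypothetical helper: only the option keys come from Auto Loader,
    the function itself is illustrative.
    """
    opts = {
        "cloudFiles.format": file_format,
        # Switch from directory listing to file notification mode.
        "cloudFiles.useNotifications": "true",
    }
    if queue_name is not None:
        # Assumption: an existing, self-managed queue is used instead of
        # letting Auto Loader create the Event Grid and queue resources.
        opts["cloudFiles.queueName"] = queue_name
        if connection_string is not None:
            opts["cloudFiles.connectionString"] = connection_string
    return opts


def read_with_notifications(spark, input_path: str, opts: Dict[str, str]):
    """Create the streaming reader; `spark` is an existing SparkSession on Databricks.

    A schema (or cloudFiles.schemaLocation for inference) is also needed
    in practice and is left out of this sketch.
    """
    return spark.readStream.format("cloudFiles").options(**opts).load(input_path)
```

With no queue arguments, the helper only turns on notification mode, which matches the "automated" first experiment mentioned in the reply.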

Soma
Valued Contributor

Hi, yes, it is taking a long time and I am planning to use trigger once at a high frequency. I will also check Event Grid, but I am curious why Spark can't have an option to take only the last 1 or 2 hours, for example based on a UTC timestamp, so that Spark would save a lot of time. Configuring Event Grid with a custom trigger needs considerable time and effort.

Hi @somanath Sankaran,

I recommend using Trigger.AvailableNow instead of Trigger.Once. Here is the link to the docs: https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader

Going back to your original question, you can use incremental listing. It applies when files arrive in lexical order: for example, date partitions can be considered lexically ordered if data is processed once a day, and file paths containing timestamps can be considered lexically ordered.

Docs here https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#incremental-list...
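
To make the two suggestions above concrete, here is a rough sketch. It first checks in plain Python that year-month-day folder names sort lexically in chronological order, which is the property incremental listing relies on, and then shows where the `cloudFiles.useIncrementalListing` option and the available-now trigger would go. The helper names, the example root path, the JSON format, and the Delta sink are all assumptions for illustration. `trigger(availableNow=True)` also assumes a PySpark version that supports it.

```python
from datetime import date, timedelta
from typing import List


def daily_partition_paths(root: str, start: date, days: int) -> List[str]:
    """Hypothetical layout: one folder per day, named yearmonthday."""
    return [f"{root}/{(start + timedelta(days=i)).strftime('%Y%m%d')}/"
            for i in range(days)]


def is_lexically_ordered(paths: List[str]) -> bool:
    """True when arrival order equals string sort order."""
    return paths == sorted(paths)


# Reader options for Auto Loader with incremental listing forced on.
INCREMENTAL_LISTING_OPTIONS = {
    "cloudFiles.format": "json",  # assumption: JSON input files
    "cloudFiles.useIncrementalListing": "true",
}


def run_available_now(df, checkpoint_path: str, output_path: str):
    """Process everything currently available, then stop.

    `df` is a streaming DataFrame, e.g. one read with the options above.
    """
    return (df.writeStream
              .format("delta")  # assumption: Delta sink
              .option("checkpointLocation", checkpoint_path)
              .trigger(availableNow=True)
              .start(output_path))
```

Zero-padded names such as `20211231` sort correctly across month and year boundaries, whereas unpadded or day-first names do not, so the folder naming scheme decides whether incremental listing is safe to use.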

Soma
Valued Contributor

Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing, but we have now solved it with an Event Grid based approach. (Initially we tried Auto Loader and it was not detecting events unless the files were flushed with close; we fixed the issue on the source side by setting the close parameter to true on the ADLS Gen2 side.)
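
The post does not say which client wrote the files, so the following is only a sketch of the source-side fix, assuming the writer uses the Python `azure-storage-file-datalake` SDK. `upload_and_close` is a hypothetical helper. `file_client` is expected to behave like a `DataLakeFileClient`, whose `flush_data` accepts a `close` keyword that marks the final flush so the storage event indicates the file is complete.

```python
def upload_and_close(file_client, data: bytes) -> None:
    """Write `data` as a new file and flush it with close=True.

    Assumption: `file_client` is an azure.storage.filedatalake
    DataLakeFileClient (or an object with the same methods).
    """
    file_client.create_file()
    # Stage the bytes at offset 0; nothing is committed yet.
    file_client.append_data(data, offset=0, length=len(data))
    # Commit the staged bytes and signal that the file is closed.
    file_client.flush_data(len(data), close=True)
```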

Changedata
New Contributor II

Hey, I am curious how you monitor the listing costs. In my case it will be listing a folder based on the table name, and inside each table-name folder I will have one folder per day named yearmonthday, so in a year there will be 365 folders. My checkpoint looks inside each table-name folder. Each day folder will include maybe 100 files per day. Do you think it is better to supply the day folder as the input path to reduce costs?

Anonymous
Not applicable

@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?
