<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Dynamically supplying partitions to autoloader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</link>
    <description>&lt;P&gt;We have a streaming use case, and we see a lot of time spent in directory listing from Azure.&lt;/P&gt;&lt;P&gt;Is it possible to supply partitions to Auto Loader dynamically, on the fly?&lt;/P&gt;</description>
    <pubDate>Thu, 16 Dec 2021 14:39:39 GMT</pubDate>
    <dc:creator>Soma</dc:creator>
    <dc:date>2021-12-16T14:39:39Z</dc:date>
    <item>
      <title>Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</link>
      <description>&lt;P&gt;We have a streaming use case, and we see a lot of time spent in directory listing from Azure.&lt;/P&gt;&lt;P&gt;Is it possible to supply partitions to Auto Loader dynamically, on the fly?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 14:39:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33015#M24106</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2021-12-16T14:39:39Z</dc:date>
    </item>
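A hedged sketch of the "dynamic partitions" idea from the question: compute the partition paths for a recent UTC window in plain Python and hand one to Auto Loader per run. The yyyy/MM/dd/HH directory layout and the abfss base URI are illustrative assumptions, not something stated in the thread.

```python
from datetime import datetime, timedelta, timezone

def recent_partition_paths(base, hours, now=None):
    """Return the yyyy/MM/dd/HH partition paths covering the last `hours`
    hours in UTC. The directory layout is an assumption for illustration."""
    now = now or datetime.now(timezone.utc)
    paths = {f"{base}/{now - timedelta(hours=h):%Y/%m/%d/%H}" for h in range(hours + 1)}
    return sorted(paths)

# Each scheduled Trigger.Once-style run could then point the stream at only
# the recent window instead of the whole input directory, e.g.:
#   spark.readStream.format("cloudFiles")...load(path) for each recent path
paths = recent_partition_paths("abfss://raw@acct.dfs.core.windows.net/events", 2,
                               now=datetime(2021, 12, 16, 14, 39, tzinfo=timezone.utc))
```

Note that restarting a stream with a different source path per run trades checkpoint continuity for cheaper listing, which is why the replies below steer toward file notifications instead.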
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33017#M24108</link>
      <description>&lt;P&gt;I know the pain of listing operations on the Azure bill &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;. In my case I solved it with a lower trigger frequency, but a good option can be &lt;B&gt;file notification mode&lt;/B&gt;. Additionally, you can set up your own queue and Event Grid to have more control over it (although first experiments can be done with the automatically created ones):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;&lt;I&gt;File notification&lt;/I&gt;&lt;/B&gt;&lt;I&gt;: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure &lt;/I&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2#permissions" alt="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2#permissions" target="_blank"&gt;&lt;I&gt;permissions&lt;/I&gt;&lt;/A&gt;&lt;I&gt; for the Azure Event Grid and Queue Storage services and specify&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;.option("cloudFiles.useNotifications","true")&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;. File notifications are supported for ADLS Gen2 and Azure Blob Storage.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;source: &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 15:23:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33017#M24108</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-12-16T15:23:29Z</dc:date>
    </item>
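A minimal configuration sketch for the file notification mode described in the reply above. The `cloudFiles.useNotifications` option name comes from the quoted docs; the `json` format and the abfss path in the comment are placeholders, and the actual stream wiring (which needs a Databricks cluster) is shown only as a comment so the snippet stays self-contained.

```python
# Options for Auto Loader file notification mode (Azure Event Grid + Queue
# Storage) instead of directory listing; "json" input format is an assumption.
notification_opts = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
}

# On Databricks this would be applied roughly as:
# df = (spark.readStream.format("cloudFiles")
#         .options(**notification_opts)
#         .load("abfss://<container>@<account>.dfs.core.windows.net/input"))
```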
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33018#M24109</link>
      <description>&lt;P&gt;Hi, yes, it is taking a long time, and we are planning to use trigger once at a high frequency. We will also check the Event Grid approach, but I am curious why Spark can't have an option to take only the last 1 or 2 hours, for example, based on a UTC timestamp; that would save Spark a lot of time, and configuring Event Grid with a custom trigger needs considerable time and effort.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Dec 2021 15:29:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33018#M24109</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2021-12-16T15:29:18Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33019#M24110</link>
      <description>&lt;P&gt;Hi @somanath Sankaran&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I recommend using Trigger.AvailableNow instead of Trigger.Once. Here is the link to the docs: &lt;A href="https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader" target="_blank"&gt;https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Going back to your original question, you can use incremental listing. Date partitions can be considered lexically ordered if data is processed once a day, and file paths containing timestamps can be considered lexically ordered as well.&lt;/P&gt;&lt;P&gt;Docs here: &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#incremental-listing" target="_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-gen2.html#incremental-listing&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 00:11:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33019#M24110</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-01-26T00:11:42Z</dc:date>
    </item>
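A sketch combining the two suggestions above, Trigger.AvailableNow plus incremental listing. The `cloudFiles.useIncrementalListing` option name comes from the linked docs; the `json` format and the paths in the comment are placeholders, and the Databricks-only stream wiring is kept as a comment.

```python
# Incremental listing relies on lexically ordered file paths (e.g. date or
# timestamp directories); "auto" lets Auto Loader detect whether that holds.
incremental_opts = {
    "cloudFiles.format": "json",                 # assumption: JSON input
    "cloudFiles.useIncrementalListing": "auto",  # "auto" | "true" | "false"
}

# Applied on Databricks roughly as:
# (spark.readStream.format("cloudFiles").options(**incremental_opts)
#    .load("abfss://<container>@<account>.dfs.core.windows.net/input")
#    .writeStream.trigger(availableNow=True)     # replaces Trigger.Once
#    .option("checkpointLocation", "/mnt/checkpoints/input")
#    .start("/mnt/delta/output"))
```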
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33020#M24111</link>
      <description>&lt;P&gt;Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing. We have now solved it with an Event Grid-based approach (initially we tried Auto Loader and it was not detecting events without flush-with-close; we fixed the issue on the source side by setting the close parameter to true in the ADLS Gen2 SDK).&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 01:38:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33020#M24111</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2022-01-26T01:38:21Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33021#M24112</link>
      <description>&lt;P&gt;@somanath Sankaran&amp;nbsp;- Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jan 2022 16:02:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/33021#M24112</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-26T16:02:13Z</dc:date>
    </item>
    <item>
      <title>Re: Dynamically supplying partitions to autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/88774#M37611</link>
      <description>&lt;P&gt;Hey, I am curious how you monitor the listing costs. In my case Auto Loader will be listing a folder based on the table name, and inside each table-name folder there will be one yyyyMMdd folder per day, so roughly 365 folders per year. My checkpoint looks inside each table-name folder, and each day folder will include maybe 100 files. Do you think it is better to supply the day folder as the source path to reduce costs?&lt;/P&gt;</description>
      <pubDate>Thu, 05 Sep 2024 17:18:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dynamically-supplying-partitions-to-autoloader/m-p/88774#M37611</guid>
      <dc:creator>Changedata</dc:creator>
      <dc:date>2024-09-05T17:18:13Z</dc:date>
    </item>
  </channel>
</rss>

