12-16-2021 06:39 AM
We have a streaming use case and we see a lot of time spent in listing from Azure.
Is it possible to supply a partition to Auto Loader dynamically, on the fly?
12-16-2021 07:23 AM
I know the pain of listings on the Azure bill 😉 In my case I solved it with a lower trigger frequency, but a good option can be file notification mode. Additionally, you can set up your own queue and Event Grid subscription to have more control over it (although first experiments can be done with the automatically created ones):
File notification: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure permissions for the Azure Event Grid and Queue Storage services and specify .option("cloudFiles.useNotifications", "true"). File notifications are supported for ADLS Gen2 and Azure Blob Storage.
source: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2
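For anyone trying this, a minimal PySpark sketch of a notification-mode stream follows. Everything in angle brackets, the paths, and input_schema are placeholders; the service-principal options are only needed so Auto Loader can create the Event Grid subscription and queue on your behalf:

# Hedged sketch: Auto Loader in file notification mode (placeholders throughout).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # switch from directory listing to Event Grid + Queue Storage notifications
    .option("cloudFiles.useNotifications", "true")
    # credentials that let Auto Loader create the Event Grid subscription and queue
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .schema(input_schema)  # assumed schema of the incoming files
    .load("abfss://container@account.dfs.core.windows.net/input/")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/my_stream")
    .start("/mnt/delta/my_table")
)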
12-16-2021 07:29 AM
Hi, yes, it is taking a long time. I am planning to use trigger once at a high frequency, and I will also check the Event Grid option. But I am curious why Spark can't have an option to take only the last 1 or 2 hours of files, for example, based on a UTC timestamp; that would save Spark a lot of listing time, since configuring Event Grid with a custom trigger takes considerable time and effort. A rough sketch of the trigger-once idea is below.
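This is not an existing Spark option, just a rough illustration of the trigger-once workaround: compute the partition path from the current UTC date at job start, so each run lists only that day's folder. The yyyy/MM/dd layout, paths, and input_schema are assumptions:

from datetime import datetime, timezone

today = datetime.now(timezone.utc).strftime("%Y/%m/%d")

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(input_schema)  # assumed schema of the incoming files
    # only today's folder gets listed instead of the whole input directory
    .load(f"abfss://container@account.dfs.core.windows.net/events/{today}/")
)

(
    df.writeStream
    .trigger(once=True)  # the job itself is scheduled externally at the desired frequency
    # the source path changes per day, so keep one checkpoint per day folder
    .option("checkpointLocation", "/mnt/checkpoints/events/" + today.replace("/", "-"))
    .start("/mnt/delta/events")
)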
01-25-2022 04:11 PM
Hi @somanath Sankaran,
I recommend using Trigger.AvailableNow instead of Trigger.Once. Here is the link to the docs: https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader
Going back to your original question, you can use incremental listing. Partitions can be considered lexically ordered if data is processed once a day, and file paths containing timestamps can be considered lexically ordered. A quick sketch is below.
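Here is a minimal PySpark sketch of both suggestions together. The paths, input_schema, and file format are placeholders, and it assumes a runtime recent enough to expose availableNow in PySpark (Trigger.AvailableNow landed in DBR 10.1):

# Hedged sketch: incremental listing + Trigger.AvailableNow (placeholder paths/schema).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # "auto" lets Auto Loader detect lexically ordered paths (e.g. .../2022/01/25/)
    # and list only new lexicographic suffixes instead of the whole directory
    .option("cloudFiles.useIncrementalListing", "auto")
    .schema(input_schema)
    .load("abfss://container@account.dfs.core.windows.net/events/")
)

(
    df.writeStream
    .trigger(availableNow=True)  # process everything available in batches, then stop
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/delta/events")
)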
01-25-2022 05:38 PM
Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing, but we have now solved it with an Event Grid based approach. (Initially we tried Auto Loader and it was not detecting events without flush-with-close; we fixed the issue on the source side by setting the close parameter to true on the ADLS Gen2 SDK side.)
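For anyone hitting the same detection issue: ADLS Gen2 only raises the FlushWithClose event that Event Grid (and hence Auto Loader's notification mode) reacts to when the writer flushes with close set to true. A hedged sketch of that source-side fix with the azure-storage-file-datalake Python SDK, placeholder names throughout:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_client = service.get_file_system_client("container").create_file("input/event-001.json")

data = b'{"id": 1}'
file_client.append_data(data, offset=0, length=len(data))
# close=True makes ADLS Gen2 raise the change notification that
# Event Grid subscriptions listen for
file_client.flush_data(len(data), close=True)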
09-05-2024 10:18 AM
Hey, I am curious how you monitor the listing costs. In my case Auto Loader will be listing a folder based on table name, and inside each table-name folder I will have one folder per day (year/month/day), so in a year there will be 365 folders. My checkpoint looks inside each table-name folder, and each day folder will contain maybe 100 files. Do you think it is better to supply the day folder as the source path to reduce costs?
01-26-2022 08:02 AM
@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?