12-16-2021 06:39 AM
We have a streaming use case and we see a lot of time spent in listing from Azure.
Is it possible to supply a partition to Auto Loader dynamically, on the fly?
12-16-2021 07:23 AM
I know the pain of listings on the Azure bill 😉 In my case I solved it with a lower trigger frequency, but a good option can be file notification mode. Additionally, you can set up your own queue and Event Grid subscription to have more control over it (although first experiments can be done with the automatically created ones):
File notification: Uses Azure Event Grid and Queue Storage services that subscribe to file events from the input directory. Auto Loader automatically sets up the Azure Event Grid and Queue Storage services. File notification mode is more performant and scalable for large input directories. To use this mode, you must configure permissions for the Azure Event Grid and Queue Storage services and specify .option("cloudFiles.useNotifications", "true"). File notifications are supported for ADLS Gen2 and Azure Blob Storage.
source: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader-gen2
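For anyone trying this, a minimal PySpark sketch of a notification-mode stream follows. Everything in angle brackets, the paths, and input_schema are placeholders; the service-principal options are only needed so Auto Loader can create the Event Grid subscription and queue on your behalf:

# Hedged sketch: Auto Loader in file notification mode (placeholders throughout).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # switch from directory listing to Event Grid + Queue Storage notifications
    .option("cloudFiles.useNotifications", "true")
    # credentials that let Auto Loader create the Event Grid subscription and queue
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .schema(input_schema)  # assumed schema of the incoming files
    .load("abfss://container@account.dfs.core.windows.net/input/")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/my_stream")
    .start("/mnt/delta/my_table")
)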
12-16-2021 07:29 AM
Hi, yes, it is taking a long time. I am planning to use trigger once at a high frequency, and I will also check the Event Grid option. But I am curious why Spark can't have an option to take only the last 1 or 2 hours of files, for example, based on a UTC timestamp; that would save Spark a lot of listing time, since configuring Event Grid with a custom trigger takes considerable time and effort. A rough sketch of the trigger-once idea is below.
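This is not an existing Spark option, just a rough illustration of the trigger-once workaround: compute the partition path from the current UTC date at job start, so each run lists only that day's folder. The yyyy/MM/dd layout, paths, and input_schema are assumptions:

from datetime import datetime, timezone

today = datetime.now(timezone.utc).strftime("%Y/%m/%d")

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(input_schema)  # assumed schema of the incoming files
    # only today's folder gets listed instead of the whole input directory
    .load(f"abfss://container@account.dfs.core.windows.net/events/{today}/")
)

(
    df.writeStream
    .trigger(once=True)  # the job itself is scheduled externally at the desired frequency
    # the source path changes per day, so keep one checkpoint per day folder
    .option("checkpointLocation", "/mnt/checkpoints/events/" + today.replace("/", "-"))
    .start("/mnt/delta/events")
)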
01-25-2022 04:11 PM
Hi @somanath Sankaran,
I recommend using Trigger.AvailableNow instead of Trigger.Once. Here is the link to the docs: https://docs.databricks.com/release-notes/runtime/10.1.html#triggeravailablenow-for-auto-loader
Going back to your original question, you can use incremental listing. Partitions can be considered lexically ordered if data is processed once a day, and file paths containing timestamps can be considered lexically ordered. A quick sketch is below.
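Here is a minimal PySpark sketch of both suggestions together. The paths, input_schema, and file format are placeholders, and it assumes a runtime recent enough to expose availableNow in PySpark (Trigger.AvailableNow landed in DBR 10.1):

# Hedged sketch: incremental listing + Trigger.AvailableNow (placeholder paths/schema).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    # "auto" lets Auto Loader detect lexically ordered paths (e.g. .../2022/01/25/)
    # and list only new lexicographic suffixes instead of the whole directory
    .option("cloudFiles.useIncrementalListing", "auto")
    .schema(input_schema)
    .load("abfss://container@account.dfs.core.windows.net/events/")
)

(
    df.writeStream
    .trigger(availableNow=True)  # process everything available in batches, then stop
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/delta/events")
)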
01-25-2022 05:38 PM
Hi @jose, despite using incremental listing I still see around 3 to 4 minutes consumed in listing, but we have now solved it with an Event Grid based approach. (Initially we tried Auto Loader and it was not detecting events without flush-with-close; we fixed the issue on the source side by setting the close parameter to true on the ADLS Gen2 SDK side.)
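For anyone hitting the same detection issue: ADLS Gen2 only raises the FlushWithClose event that Event Grid (and hence Auto Loader's notification mode) reacts to when the writer flushes with close set to true. A hedged sketch of that source-side fix with the azure-storage-file-datalake Python SDK, placeholder names throughout:

from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_client = service.get_file_system_client("container").create_file("input/event-001.json")

data = b'{"id": 1}'
file_client.append_data(data, offset=0, length=len(data))
# close=True makes ADLS Gen2 raise the change notification that
# Event Grid subscriptions listen for
file_client.flush_data(len(data), close=True)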
09-05-2024 10:18 AM
Hey, I am curious how you monitor the listing costs. In my case Auto Loader will be listing a folder based on table name, and inside each table-name folder I will have one folder per day (year/month/day), so in a year there will be 365 folders. My checkpoint looks inside each table-name folder, and each day folder will contain maybe 100 files. Do you think it is better to supply the day folder as the source path to reduce costs?
01-26-2022 08:02 AM
@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?