Best practices for working with external locations where many files arrive constantly

pernilak — Tue, 19 Mar 2024 09:34:45 GMT

I have an Azure Function that receives files (not volumes) and writes them to cloud storage. Roughly one to five files arrive per second. I want to create a partitioned table in Databricks on top of this data. How should I do this? For example: register the container as an external location, then create a bundle that creates a table, triggers continuously on the arrival of new files, and adds the data to Databricks? What would such code look like, or is there something else I should do? I need something that runs continuously. (Moving the logic from the Azure Function into Databricks is not an option.) Should the table be external or managed?
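To make the question concrete, here is a rough sketch of the kind of thing I imagine, assuming Auto Loader (the `cloudFiles` streaming source) is a reasonable fit. Everything specific in it is a placeholder of mine: the helper names `autoloader_options` and `start_ingest`, the JSON file format, the `ingest_date` partition column, and the one-minute trigger. I have not run this against a real workspace.

```python
def autoloader_options(schema_location, file_format="json", use_notifications=False):
    """Build the Auto Loader (cloudFiles) reader options as a plain dict.

    Assumption: files are JSON. Change file_format to match the real data.
    """
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # File notification mode avoids listing the whole directory on each
        # batch, but needs extra cloud permissions, so it is off by default here.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_ingest(spark, source_path, checkpoint_path, table_name,
                 partition_col="ingest_date"):
    """Start a stream that appends newly arrived files to a partitioned table.

    `spark` is the active SparkSession (available by default in a Databricks
    notebook or job). All paths and names are hypothetical.
    """
    # Imported lazily so the options helper can be used without pyspark.
    from pyspark.sql import functions as F

    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(checkpoint_path + "/schema"))
        .load(source_path)
        # Assumption: partition by the date the file was ingested.
        .withColumn(partition_col, F.current_date())
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .partitionBy(partition_col)
        .trigger(processingTime="1 minute")
        .toTable(table_name)
    )
```

In a job I would expect to call something like `start_ingest(spark, "<path under the external location>", "<checkpoint path>", "<catalog>.<schema>.<table>")`, with the job itself defined in the bundle. Whether this is the right pattern is exactly what I am unsure about.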

I also have a similar case with much less data, so partitioning is not required. Should I then create a managed table, an external table, or a view? What are the pros and cons of each in this case?

I would be very happy if someone could provide code - especially if that code works in a continuous job in Databricks (through bundles).
