Best practices for working with external locations where many files arrive constantly

I have an Azure Function that receives files (not volumes) and writes them to cloud storage. Roughly one to five files arrive per second. I want to create a partitioned table in Databricks on top of this data. How should I do this? For example, should I register the container as an external location and create a bundle that creates a table, triggers continuously on the arrival of new files, and loads their data into Databricks? What would such code look like, or is there something else I should do? I need something that runs continuously. (Moving the logic from the Azure Function into Databricks is not an option.) Should I create an external table or a managed table?

I also have a similar case with far less data, so partitioning is not required. Should I then create a managed table, an external table, or a view? What are the pros and cons of each in this case?

I would be very happy if someone could provide code - especially if that code works in a continuous job in Databricks (through bundles).
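
To make the question concrete, here is a rough sketch of the kind of code I have in mind. It assumes Auto Loader (the `cloudFiles` streaming source) reading from the container and writing to a partitioned table. The file format (`json`), the `ingest_date` partition column, the one-minute trigger, and the names `IngestConfig`, `autoloader_options`, and `start_stream` are all placeholders of mine, not a confirmed or recommended setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestConfig:
    """Hypothetical settings; every value passed in is a placeholder."""
    source_path: str       # e.g. an abfss:// path under the external location
    checkpoint_path: str   # where the stream keeps its progress
    target_table: str      # e.g. "catalog.schema.table"
    file_format: str = "json"  # assumption: the real file format is not stated


def autoloader_options(cfg: IngestConfig) -> dict:
    """Build the options passed to the cloudFiles (Auto Loader) source."""
    return {
        "cloudFiles.format": cfg.file_format,
        # Auto Loader stores the inferred schema at this location.
        "cloudFiles.schemaLocation": cfg.checkpoint_path + "/schema",
    }


def start_stream(spark, cfg: IngestConfig):
    """Start a streaming query that appends newly arrived files to the table.

    Only runs on Databricks, where `spark` and the cloudFiles source exist.
    """
    from pyspark.sql import functions as F  # imported here so the helpers above work without Spark

    df = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(cfg))
        .load(cfg.source_path)
        # Assumption: partition by ingestion date; a date inside the data may fit better.
        .withColumn("ingest_date", F.current_date())
    )
    return (
        df.writeStream
        .option("checkpointLocation", cfg.checkpoint_path)
        .partitionBy("ingest_date")
        .trigger(processingTime="1 minute")  # keeps the query running continuously
        .toTable(cfg.target_table)
    )
```

In a notebook this would be called as `start_stream(spark, IngestConfig(...))`. My assumption is that the bundle would then define a job with a `continuous` setting rather than a schedule to keep it running, but I am not sure whether that is the right pattern.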
