Hi Folks,
I'm evaluating Delta Lake for storing and version-controlling image data used to train models. I watched a session explaining how to do this, which also uses MLflow to manage training (https://databricks.com/session_na21/image-processing-on-delta-lake).
Note: it would be helpful to have a link to the source code used in the demo.
My scenario is slightly different, though: I'm testing on a local machine by following the quick start tutorial (https://docs.delta.io/latest/quick-start.html). In this setup, what is the best way (using out-of-the-box components as much as possible) to "grab" a local folder of images organized into subfolders (one per class), dump them into Delta Lake, and then train on a specific snapshot with TensorFlow?
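To make the question concrete, here is a rough sketch of what I have in mind. The function names are mine, the Spark/Delta session setup is copied from the quick start, and I'm assuming images would be stored as a binary column alongside path and label; I haven't verified this whole flow, so corrections are welcome:

```python
import os


def list_image_records(root):
    """Walk a folder of class subfolders and return (path, label) pairs."""
    records = []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for name in sorted(os.listdir(class_dir)):
            records.append((os.path.join(class_dir, name), label))
    return records


def write_images_to_delta(root, delta_path):
    """Load image bytes plus labels into a DataFrame, write a Delta table,
    then read back a pinned snapshot (hypothetical helper, untested end to end)."""
    # Local imports so the plain folder-scanning helper above works
    # even on a machine without Spark/delta-spark installed.
    import pyspark
    from delta import configure_spark_with_delta_pip

    # Session setup as shown in the Delta Lake quick start.
    builder = (
        pyspark.sql.SparkSession.builder.appName("image-ingest")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # One row per image: original path, class label, raw bytes.
    rows = []
    for path, label in list_image_records(root):
        with open(path, "rb") as f:
            rows.append((path, label, f.read()))
    df = spark.createDataFrame(
        rows, schema="path string, label string, content binary"
    )
    df.write.format("delta").mode("overwrite").save(delta_path)

    # Time travel: pin a specific table version for reproducible training.
    snapshot = (
        spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
    )
    return snapshot
```

On the TensorFlow side I was thinking of collecting the snapshot (e.g. via `toPandas()`) and building a `tf.data.Dataset` from the decoded bytes and labels, but I'm not sure that's the intended out-of-the-box path, which is really what I'm asking about.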
Thanks