What is best practice for organising simple desktop-style analytics workflows in Databricks?

jmill
New Contributor

Apologies in advance for the soft question, but I'm genuinely struggling with this.

We're a small data science unit just setting up in Databricks. While we do run some intensive ETL and analytics jobs, a non-trivial part of the team's BAU is exploratory desktop-style analytics. For example, this might involve being sent spreadsheets by other organisations, or downloading random bits of data from the web, to do small, ad hoc pieces of analysis in Python or R.

What is the recommended way of organising and persisting files for such workflows? Using the DBFS file system to read and write from object storage seems like the obvious solution, but the Databricks documentation seems to give mixed messages on this. For example, the following two articles from the docs (article1, article2) state pretty explicitly right up front that:

"Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Azure Databricks workspaces.

and

"Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog".

So, what's best practice for such workflows?

4 REPLIES

-werners-
Esteemed Contributor III

The articles you mention are specific about the use of Unity Catalog (a feature you CAN use in Databricks but don't have to). They are saying that if you use Unity Catalog, DBFS mounts will not work.

If you do not use Unity Catalog, you can perfectly well mount your cloud storage in DBFS.

Besides that, you can always access cloud storage without a mount. Instead of using a file path like '/mnt/datalake/...', you use 's3://...' or 'abfss://...'.
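
To make that concrete, here's a minimal sketch in a Databricks notebook, assuming `spark` is predefined (as it is in notebooks), the cluster is already authorised against the storage, and the bucket, storage account, and container names are placeholders:

```python
# Reading the same CSV three ways; every path below is a placeholder.

# 1. Via a DBFS mount (requires the mount to exist; not compatible with Unity Catalog):
df_mounted = spark.read.csv("/mnt/datalake/raw/sales.csv", header=True)

# 2. Directly against ADLS Gen2, no mount needed:
df_abfss = spark.read.csv(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales.csv",
    header=True,
)

# 3. Directly against S3, no mount needed:
df_s3 = spark.read.csv("s3://my-bucket/raw/sales.csv", header=True)
```

The direct URIs only work if the cluster (or your user, under Unity Catalog) already has credentials for that storage account or bucket.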

Whether you need Unity Catalog or not is another discussion, as it has advantages but also limitations.

Anonymous
Not applicable

You can also upload data in the UI.

I wouldn't worry too much about doing it the "best" way; just do it the way that gets the work done. The platform isn't set up in a way that lets you make giant mistakes, and you can always change things in the future.

Data Summarize and AutoML should help a great deal when starting projects.
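
For anyone new to the data summarize feature, here's a minimal sketch of what that call looks like in a notebook; the toy DataFrame is purely an illustration, and `spark` and `dbutils` are predefined in Databricks notebooks:

```python
# Build a tiny throwaway DataFrame purely for illustration; in practice you
# would pass whatever DataFrame you are exploring.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("c", 3)],
    ["category", "value"],
)

# Renders summary statistics and per-column profiles in the notebook output
# (available on recent Databricks runtimes).
dbutils.data.summarize(df)
```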

pvignesh92
Honored Contributor

Hi,

This is what I usually follow; see if it helps.

  1. When I have small sample data on my local disk, or data shared by upstream colleagues over email in CSV format, I simply use the 'Import and Export data' option in the Databricks UI to upload the file to a DBFS path of my choice, and then use that path to load it into a Spark DataFrame (see the sketch after this list).
  2. If my files are created by another upstream Databricks job, they will already be on a path accessible by the Databricks cluster, so I read them from there.
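
As flagged in point 1, here's a minimal sketch of loading an uploaded file, assuming the UI placed it under the usual /FileStore/tables location; the exact path is whatever the upload dialog reports:

```python
# Hypothetical DBFS path for a file uploaded through the UI; replace it with
# the path shown in the upload dialog.
upload_path = "dbfs:/FileStore/tables/sample_data.csv"

# `spark` is predefined in Databricks notebooks.
df = spark.read.csv(upload_path, header=True, inferSchema=True)
display(df)
```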

Our cluster is hosted on AWS, but I don't think it is any different on Azure.

Vartika
Moderator

Hi @Jason Millburn,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
