What is best practice for organising simple desktop-style analytics workflows in Databricks?

jmill
New Contributor

Apologies in advance for the soft question, but I'm genuinely struggling with this.

We're a small data science unit just setting up in Databricks. While we do run some intensive ETL and analytics jobs, a non-trivial part of the team's BAU is exploratory desktop-style analytics. For example, this might involve being sent spreadsheets by other organisations, or downloading random bits of data from the web, to do small, ad hoc pieces of analysis in Python or R.

What is the recommended way of organising and persisting files for such workflows? Using the DBFS file system to read and write from object storage seems like the obvious solution, but the Databricks documentation seems to give mixed messages on this. For example, the following two articles from the docs (article1, article2) state pretty explicitly right up front that:

"Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Azure Databricks workspaces.

and

"Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog".

So, what's best practice for such workflows?

4 REPLIES

-werners-
Esteemed Contributor III

The articles you mention are specific about the use of Unity Catalog (a feature you CAN use in Databricks but don't have to). They are saying that if you use Unity Catalog, DBFS mounts will not work.

If you do not use Unity Catalog, you can perfectly well mount your cloud storage in DBFS.

Besides that, you can always access cloud storage without a mount. Instead of using a file path like '/mnt/datalake/...', you use 's3://...' or 'abfss://...'.
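
To make that concrete, here's a minimal sketch in a Databricks notebook, assuming `spark` is predefined (as it is in notebooks), the cluster is already authorised against the storage, and the bucket, storage account, and container names are placeholders:

```python
# Reading the same CSV three ways; every path below is a placeholder.

# 1. Via a DBFS mount (requires the mount to exist; not compatible with Unity Catalog):
df_mounted = spark.read.csv("/mnt/datalake/raw/sales.csv", header=True)

# 2. Directly against ADLS Gen2, no mount needed:
df_abfss = spark.read.csv(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales.csv",
    header=True,
)

# 3. Directly against S3, no mount needed:
df_s3 = spark.read.csv("s3://my-bucket/raw/sales.csv", header=True)
```

The direct URIs only work if the cluster (or your user, under Unity Catalog) already has credentials for that storage account or bucket.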

Whether you need Unity Catalog or not is another discussion, as it has advantages but also limitations.

Anonymous
Not applicable

You can also upload data in the UI.

I wouldn't worry too much about doing it the "best" way; just do it the way that gets the work done. The platform isn't set up in a way that lets you make giant mistakes, and you can always change things in the future.

Data Summarize and AutoML should help a great deal when starting projects.
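
For anyone new to the data summarize feature, here's a minimal sketch of what that call looks like in a notebook; the toy DataFrame is purely an illustration, and `spark` and `dbutils` are predefined in Databricks notebooks:

```python
# Build a tiny throwaway DataFrame purely for illustration; in practice you
# would pass whatever DataFrame you are exploring.
df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("c", 3)],
    ["category", "value"],
)

# Renders summary statistics and per-column profiles in the notebook output
# (available on recent Databricks runtimes).
dbutils.data.summarize(df)
```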

pvignesh92
Honored Contributor

Hi,

This is what I usually follow; see if it helps.

  1. When I have small sample data on my local disk, or data shared by upstream colleagues over email in CSV format, I simply use the 'Import and Export data' option in the Databricks UI to upload the file to a DBFS path of my choice, and then use that path to load it into a Spark DataFrame (see the sketch after this list).
  2. If my files are created by another upstream Databricks job, they will already be on a path accessible by the Databricks cluster, so I read them from there.
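
As flagged in point 1, here's a minimal sketch of loading an uploaded file, assuming the UI placed it under the usual /FileStore/tables location; the exact path is whatever the upload dialog reports:

```python
# Hypothetical DBFS path for a file uploaded through the UI; replace it with
# the path shown in the upload dialog.
upload_path = "dbfs:/FileStore/tables/sample_data.csv"

# `spark` is predefined in Databricks notebooks.
df = spark.read.csv(upload_path, header=True, inferSchema=True)
display(df)
```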

Our cluster is hosted on AWS, but I don't think it is any different on Azure.

Vartika
Moderator

Hi @Jason Millburn,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
