03-09-2023 02:36 AM
Apologies in advance for the soft question, but I'm genuinely struggling with this.
We're a small data science unit just setting up in Databricks. While we do run some intensive ETL and analytics jobs, a non-trivial part of the team's BAU is exploratory desktop-style analytics. E.g. this might involve being sent spreadsheets by other organisations, or downloading random bits of data from the web, to do ad hoc, small pieces of analysis in Python or R.
What is the recommended way of organising and persisting files for such workflows? Using the DBFS file system to read and write from object storage seems like the obvious solution, but the Databricks documentation seems to give mixed messages on this. E.g. the following two articles from the docs (article1, article2) state pretty explicitly, right up front, that:
"Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Azure Databricks workspaces."
and
"Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and managing data governance with Unity Catalog."
So, what's best practice for such workflows?
03-09-2023 03:49 AM
The articles you mention are specifically about the use of Unity Catalog (a feature you CAN use in Databricks but don't have to). They are saying that if you use Unity Catalog, DBFS mounts will not work.
If you do not use Unity Catalog, you can perfectly well mount your cloud storage in DBFS.
Besides that, you can always access cloud storage without a mount: instead of a file path like '/mnt/datalake/...', you use 's3://...' or 'abfss://...'.
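To make the two addressing styles concrete, here is a minimal sketch of how the same file can be referenced either directly by its cloud URI or via a legacy mount path. The container, storage account, and file names below are placeholders, not anything from your workspace:

```python
# Placeholder names -- substitute your own container/account/path.
container = "mycontainer"
account = "mystorageacct"
path = "raw/survey.csv"

# Direct-access URI for Azure Data Lake Storage Gen2 (no mount needed):
direct_uri = f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

# Equivalent mount-based path (the legacy pattern the docs discourage
# in Unity Catalog-enabled workspaces):
mount_path = f"/mnt/datalake/{path}"

# In a notebook either form can be handed to Spark, e.g.:
# df = spark.read.csv(direct_uri, header=True)
```

The direct URI form works with Unity Catalog's governance model, whereas the mount path does not, which is why the docs steer new workspaces away from mounts.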
Whether you need Unity Catalog or not is another discussion, as it has advantages but also limitations.
03-09-2023 04:17 AM
You can also upload data in the UI.
I wouldn't worry about doing things the "best" way; just do them the way that gets the work done. The platform doesn't really let you make giant mistakes, and you can always change things in the future.
Data Summarize and AutoML should help a great deal in starting projects.
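On the UI upload mentioned above: files uploaded that way typically land under DBFS (e.g. in /FileStore), and on the driver node the dbfs:/ scheme is also exposed as a local /dbfs mount, so plain Python or R file I/O works. A minimal sketch of that path mapping, using a hypothetical uploaded file name:

```python
# Hypothetical file uploaded via the workspace UI:
dbfs_uri = "dbfs:/FileStore/uploads/survey.csv"

# On the cluster driver, dbfs:/ paths are mirrored under the local
# /dbfs mount, so the same file is visible to ordinary file APIs:
local_path = "/dbfs" + dbfs_uri[len("dbfs:"):]

# Single-node tools like pandas can then read it directly, e.g.:
# import pandas as pd
# df = pd.read_csv(local_path)
```

This is handy for the exploratory, spreadsheet-sized work described in the question, since you don't need Spark at all to read such files.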
03-09-2023 06:10 AM
Hi,
This is what I usually follow; see if it helps.
Our cluster is hosted on AWS, but I don't think it's much different from Azure.
03-31-2023 02:51 AM
Hi @Jason Millburn
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!