
Difference between DBFS and Delta Lake?

pjp94
Contributor

I'd like a deeper dive/explanation into the difference. When I write to a table with the following code:

spark_df.write.mode("overwrite").saveAsTable("db.table")

The table is created and can be viewed in the Data tab. It can also be found in some DBFS path. Now if I run:

dbutils.fs.rm(dbfs_path, recurse=True)

where dbfs_path is the path to the table's files in DBFS, the files are removed from DBFS, but the table still appears in the Data tab (even though querying the table from a notebook no longer works, since the data no longer exists).

If I run:

%sql
DROP TABLE IF EXISTS db.table

inside a cell, the table is dropped from both the Data tab and DBFS. Can someone explain, at a high level, how this infrastructure works? Much appreciated.

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

Tables in Spark, whether Delta Lake-backed or not, are basically just semantic views on top of the actual data.

On Databricks, the data itself is stored in DBFS, which is an abstraction layer on top of the actual storage (such as S3 or ADLS). The files can be Parquet, ORC, CSV, JSON, etc.

So with your rm command you did indeed delete the data from DBFS. However, the table definition still exists: it is stored in a metastore, which holds metadata about which databases and tables exist and where their data resides.

So now you have an empty table. To remove the table definition too, you have to drop it, exactly as you did.
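The separation described above can be sketched with a toy model (plain Python, not the real Databricks APIs): the metastore and the file storage are independent stores, so `dbutils.fs.rm` touches only one of them while `DROP TABLE` cleans up both.

```python
# Toy model (not Databricks internals): a metastore entry and the underlying
# files are tracked separately, so deleting one leaves the other behind.

class ToyCatalog:
    """A dict-backed stand-in for a metastore plus a file store."""

    def __init__(self):
        self.metastore = {}   # table name -> storage path
        self.storage = {}     # storage path -> file contents

    def save_as_table(self, name, path, data):
        self.metastore[name] = path       # register the table definition
        self.storage[path] = data         # write the actual data files

    def fs_rm(self, path):
        self.storage.pop(path, None)      # like dbutils.fs.rm: files only

    def drop_table(self, name):
        path = self.metastore.pop(name, None)  # like DROP TABLE: definition...
        self.storage.pop(path, None)           # ...and, for managed tables, data

cat = ToyCatalog()
cat.save_as_table("db.table", "/dbfs/db/table", ["row1", "row2"])
cat.fs_rm("/dbfs/db/table")
print("db.table" in cat.metastore)        # True: the definition survives the rm
print("/dbfs/db/table" in cat.storage)    # False: the data is gone
cat.drop_table("db.table")
print("db.table" in cat.metastore)        # False: definition removed too
```

This mirrors exactly what happened in the question: after the rm, the Data tab (the metastore) still lists the table, but there is nothing behind it until `DROP TABLE` removes the definition.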

For completeness: Delta Lake has nothing to do with this. Delta Lake is Parquet on steroids, giving you a lot more functionality, but the way of working stays identical.
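The "Parquet on steroids" point can be illustrated with a toy transaction log (a simplified sketch, not the real Delta implementation): Delta keeps an ordered commit log next to the data files, which is what enables features like time travel on top of plain Parquet.

```python
# Toy transaction log: each write appends a commit, and any past version of
# the table can be reconstructed by replaying commits, like Delta's
# `VERSION AS OF`. Real Delta stores these commits as JSON in _delta_log/.

class ToyDeltaTable:
    def __init__(self):
        self.log = []  # ordered commits; version N is self.log[N]

    def overwrite(self, rows):
        self.log.append({"op": "overwrite", "rows": list(rows)})

    def read(self, version=None):
        """Replay the log up to `version` (latest if None)."""
        if version is None:
            version = len(self.log) - 1
        state = []
        for commit in self.log[: version + 1]:
            if commit["op"] == "overwrite":
                state = commit["rows"]
        return state

t = ToyDeltaTable()
t.overwrite([1, 2, 3])    # version 0
t.overwrite([4, 5])       # version 1
print(t.read())           # [4, 5]
print(t.read(version=0))  # [1, 2, 3]
```

A plain Parquet table has no such log: an overwrite simply replaces the files, and the old version is unrecoverable.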


5 REPLIES

Kaniz
Community Manager

Hi @Paras Patel​ ! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.


Hi @Werner Stinckens​ , this is exactly what I was looking for. Thanks!

1) Follow-up question: do you need to set up an object-level storage connection on Databricks (i.e., to an S3 bucket or Azure Blob)?

2) Any folders in your /mnt path are external object stores (i.e., S3, Blob Storage, etc.), correct? Everything else is stored in the Databricks root? I ask because my organization has two folders in the /mnt folder: /mnt/aws & /mnt/delta... not sure if delta refers to Delta Lake?

3) So Delta Lake and DBFS are independent of each other, correct? DBFS is where the data is actually stored (i.e., if I wrote a table, the Parquet files). How does Delta Lake fit into this?

Thanks so much!

-werners-
Esteemed Contributor III

1) You don't have to, as a Databricks workspace has its own storage, but it is certainly a good idea.

2) Not all folders in /mnt are external; only the ones you mounted there yourself.

3) Correct. Delta Lake is just a file format like Parquet, but with more possibilities.
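Point 2 can be sketched with a toy mount-table lookup (the mount points and sources below are made up for illustration; on Databricks you would inspect the real list with `dbutils.fs.mounts()`): only paths under an explicitly mounted prefix resolve to external storage, and everything else stays in the workspace's root DBFS storage.

```python
# Hypothetical mount table, standing in for what dbutils.fs.mounts() returns.
MOUNTS = {
    "/mnt/aws": "s3a://some-bucket",
    "/mnt/delta": "abfss://container@account.dfs.core.windows.net/delta",
}

def resolve(path):
    """Map a DBFS path to where the bytes actually live."""
    for mount_point, source in MOUNTS.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            # Under a mount: redirect to the external object store.
            return source + path[len(mount_point):]
    # Not mounted: the path lives in the workspace's root storage.
    return "dbfs-root:" + path

print(resolve("/mnt/aws/data"))  # s3a://some-bucket/data
print(resolve("/tmp/scratch"))   # dbfs-root:/tmp/scratch
```

So whether /mnt/delta is external depends entirely on whether someone mounted it; the name "delta" by itself only suggests (but does not prove) that Delta-format files are stored there.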

Kaniz
Community Manager

Hi @Paras Patel​ , Were your questions answered?

If yes, would you like to mark @Werner Stinckens​ 's answer as the best answer?
