
Difference between DBFS and Delta Lake?

pjp94
Contributor

I'd like a deeper dive/explanation into the difference. When I write to a table with the following code:

spark_df.write.mode("overwrite").saveAsTable("db.table")

The table is created and can be viewed in the Data tab. It can also be found in some DBFS path. Now if I run:

dbutils.fs.rm(dbfs_path, recurse=True)

where dbfs_path is the path to the table's files in DBFS, the files are removed from DBFS, but the table still appears in the Data tab (even though querying the table from a notebook no longer works, since the data no longer exists).

If I run:

%sql
DROP TABLE IF EXISTS db.table

inside a cell, the table is dropped from both the Data tab and DBFS. Can someone explain, at a high level, how this infrastructure works? Much appreciated.

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

Tables in Spark, whether Delta Lake-backed or not, are basically just semantic views on top of the actual data.

On Databricks, the data itself is stored in DBFS, which is an abstraction layer on top of the actual storage (such as S3 or ADLS). The files can be Parquet, ORC, CSV, JSON, etc.

So with your rm command you did indeed delete the data from DBFS. However, the table definition still exists: it is stored in a metastore, which holds metadata about which databases and tables exist and where their data resides.

So now you have an empty table. To remove the table definition too, you have to drop it, exactly as you did.
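The separation described above can be sketched with a toy model (plain Python, not the real Databricks APIs): the metastore and the file storage are independent stores, so `dbutils.fs.rm` touches only one of them while `DROP TABLE` cleans up both.

```python
# Toy model (not Databricks internals): a metastore entry and the underlying
# files are tracked separately, so deleting one leaves the other behind.

class ToyCatalog:
    """A dict-backed stand-in for a metastore plus a file store."""

    def __init__(self):
        self.metastore = {}   # table name -> storage path
        self.storage = {}     # storage path -> file contents

    def save_as_table(self, name, path, data):
        self.metastore[name] = path       # register the table definition
        self.storage[path] = data         # write the actual data files

    def fs_rm(self, path):
        self.storage.pop(path, None)      # like dbutils.fs.rm: files only

    def drop_table(self, name):
        path = self.metastore.pop(name, None)  # like DROP TABLE: definition...
        self.storage.pop(path, None)           # ...and, for managed tables, data

cat = ToyCatalog()
cat.save_as_table("db.table", "/dbfs/db/table", ["row1", "row2"])
cat.fs_rm("/dbfs/db/table")
print("db.table" in cat.metastore)        # True: the definition survives the rm
print("/dbfs/db/table" in cat.storage)    # False: the data is gone
cat.drop_table("db.table")
print("db.table" in cat.metastore)        # False: definition removed too
```

This mirrors exactly what happened in the question: after the rm, the Data tab (the metastore) still lists the table, but there is nothing behind it until `DROP TABLE` removes the definition.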

For completeness: Delta Lake has nothing to do with this. Delta Lake is Parquet on steroids, giving you a lot more functionality, but the way of working stays identical.
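The "Parquet on steroids" point can be illustrated with a toy transaction log (a simplified sketch, not the real Delta implementation): Delta keeps an ordered commit log next to the data files, which is what enables features like time travel on top of plain Parquet.

```python
# Toy transaction log: each write appends a commit, and any past version of
# the table can be reconstructed by replaying commits, like Delta's
# `VERSION AS OF`. Real Delta stores these commits as JSON in _delta_log/.

class ToyDeltaTable:
    def __init__(self):
        self.log = []  # ordered commits; version N is self.log[N]

    def overwrite(self, rows):
        self.log.append({"op": "overwrite", "rows": list(rows)})

    def read(self, version=None):
        """Replay the log up to `version` (latest if None)."""
        if version is None:
            version = len(self.log) - 1
        state = []
        for commit in self.log[: version + 1]:
            if commit["op"] == "overwrite":
                state = commit["rows"]
        return state

t = ToyDeltaTable()
t.overwrite([1, 2, 3])    # version 0
t.overwrite([4, 5])       # version 1
print(t.read())           # [4, 5]
print(t.read(version=0))  # [1, 2, 3]
```

A plain Parquet table has no such log: an overwrite simply replaces the files, and the old version is unrecoverable.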


5 REPLIES

Kaniz
Community Manager

Hi @Paras Patel​ ! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.


Hi @Werner Stinckens​ , this is exactly what I was looking for. Thanks!

1) Follow-up question: do you need to set up an object-level storage connection on Databricks (i.e., to an S3 bucket or Azure Blob)?

2) Any folders in your /mnt path are external object stores (i.e., S3, Blob Storage, etc.), correct? Everything else is stored in the Databricks root? I ask because my organization has two folders in the /mnt folder: /mnt/aws & /mnt/delta... not sure if delta refers to Delta Lake?

3) So Delta Lake and DBFS are independent of each other, correct? DBFS is where the data is actually stored (i.e., if I wrote a table, the Parquet files). How does Delta Lake fit into this?

Thanks so much!

-werners-
Esteemed Contributor III

1) You don't have to, as a Databricks workspace has its own storage, but it is certainly a good idea.

2) Not all folders in /mnt are external; only the ones you mounted there yourself.

3) Correct. Delta Lake is just a file format like Parquet, but with more possibilities.
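Point 2 can be sketched with a toy mount-table lookup (the mount points and sources below are made up for illustration; on Databricks you would inspect the real list with `dbutils.fs.mounts()`): only paths under an explicitly mounted prefix resolve to external storage, and everything else stays in the workspace's root DBFS storage.

```python
# Hypothetical mount table, standing in for what dbutils.fs.mounts() returns.
MOUNTS = {
    "/mnt/aws": "s3a://some-bucket",
    "/mnt/delta": "abfss://container@account.dfs.core.windows.net/delta",
}

def resolve(path):
    """Map a DBFS path to where the bytes actually live."""
    for mount_point, source in MOUNTS.items():
        if path == mount_point or path.startswith(mount_point + "/"):
            # Under a mount: redirect to the external object store.
            return source + path[len(mount_point):]
    # Not mounted: the path lives in the workspace's root storage.
    return "dbfs-root:" + path

print(resolve("/mnt/aws/data"))  # s3a://some-bucket/data
print(resolve("/tmp/scratch"))   # dbfs-root:/tmp/scratch
```

So whether /mnt/delta is external depends entirely on whether someone mounted it; the name "delta" by itself only suggests (but does not prove) that Delta-format files are stored there.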

Kaniz
Community Manager

Hi @Paras Patel​ , Were your questions answered?

If yes, would you like to mark @Werner Stinckens​ 's answer as the best answer?
