01-28-2022 12:54 PM
I would like a deeper dive/explanation of the difference between what lives in DBFS and what shows up in the Data tab. When I write to a table with the following code:
spark_df.write.mode("overwrite").saveAsTable("db.table")
The table is created and can be viewed in the Data tab. It can also be found in some DBFS path. Now if I run:
dbutils.fs.rm("{}".format(dbfs_path), recurse=True)
where dbfs_path is the path to the table's files in DBFS, it removes those files from DBFS; however, the table still appears in the Data tab (even though I know the table can no longer be queried in the notebook, because technically it no longer exists).
If I run:
%sql
DROP TABLE IF EXISTS db.table
inside a cell, it drops the table from both the Data tab and DBFS. Can someone explain (at a high level) how this infrastructure works? Much appreciated.
Accepted Solutions
01-31-2022 12:47 AM
Tables in Spark, whether Delta Lake-backed or not, are basically just semantic views on top of the actual data.
On Databricks, the data itself is stored in DBFS, which is an abstraction layer on top of the actual storage (S3, ADLS, etc.). The files can be Parquet, ORC, CSV, JSON, etc.
So with your rm command you did indeed delete the data from DBFS. However, the table definition still exists: it is stored in a metastore, which holds metadata about which databases and tables exist and where their data resides.
So now you have an empty table. To remove the table definition too, you have to drop it, exactly as you did.
For completeness: Delta Lake has nothing to do with this. Delta Lake is Parquet on steroids, giving you a lot more functionality, but the way of working stays the same.
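To make this concrete, here is a minimal sketch of the sequence from the question (the warehouse path below is only the typical default location for a managed table, not necessarily yours; DESCRIBE TABLE EXTENDED shows the real location):

# Write a managed table; the files land under the metastore's default location.
spark_df.write.mode("overwrite").saveAsTable("db.table")

# The metastore knows where the data lives (see the "Location" row in the output).
spark.sql("DESCRIBE TABLE EXTENDED db.table").show(truncate=False)

# Delete only the files, as in the question (path assumed to be the default warehouse location).
dbutils.fs.rm("dbfs:/user/hive/warehouse/db.db/table", recurse=True)

# The table definition is still registered in the metastore ...
print([t.name for t in spark.catalog.listTables("db")])   # still lists 'table'

# ... but the data behind it is gone, so querying the table now fails or returns nothing.

# DROP TABLE removes the metastore entry (and, for a managed table, its files too).
spark.sql("DROP TABLE IF EXISTS db.table")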
02-01-2022 06:45 AM
Hi @Werner Stinckens, this is exactly what I was looking for. Thanks!
1) Follow-up question: do you need to set up an object-level storage connection on Databricks (i.e. to an S3 bucket or Azure Blob)?
2) Any folders in your /mnt path are external object stores (i.e. S3, Blob Storage, etc.), correct? Everything else is stored in the Databricks root? I ask because my organization has 2 folders in the /mnt folder: /mnt/aws & /mnt/delta... not sure if delta refers to Delta Lake?
3) So Delta Lake and DBFS are independent of each other, correct? DBFS is where the data is actually stored (i.e. if I write a table, then the Parquet files). How does Delta Lake fit into this?
Thanks so much!
02-01-2022 08:32 AM
1) You don't have to, as a Databricks workspace has its own storage, but it certainly is a good idea.
2) Not all folders in /mnt are external. Only the ones you mounted there yourself (see the sketch below).
3) Correct. Delta Lake is just a file format like Parquet, but with more possibilities.
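A small sketch to go with 2) and 3) (the paths and names below are made up for illustration): dbutils.fs.mounts() shows which /mnt folders actually point to external storage, and Delta is picked the same way as any other format when writing.

# List the mount points in this workspace; only these /mnt folders point to external storage.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# The same DataFrame written in two formats: Delta is chosen like any other file format,
# it just adds a transaction log on top of the Parquet files it writes.
spark_df.write.format("parquet").mode("overwrite").save("/mnt/aws/example_parquet")
spark_df.write.format("delta").mode("overwrite").save("/mnt/aws/example_delta")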

