<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Difference between DBFS and Delta Lake? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30373#M22011</link>
    <description>&lt;P&gt;Would like a deeper dive/explanation into the difference. When I write to a table with the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark_df.write.mode("overwrite").saveAsTable("db.table")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The table is created and can be viewed in the Data tab. It can also be found at some DBFS path. Now if I run:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.rm(dbfs_path, recurse=True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;where dbfs_path is the path to the table in DBFS, it will remove that table from DBFS; however, it still appears in the Data tab (even though I know you can't query the table anymore inside the notebook, because technically it no longer exists).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I run:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sql
DROP TABLE IF EXISTS db.table&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;inside a cell, it will drop the table from both the Data tab and DBFS. Can someone explain (high level) how the infrastructure works? Much appreciated.&lt;/P&gt;</description>
    <pubDate>Fri, 28 Jan 2022 20:54:18 GMT</pubDate>
    <dc:creator>pjp94</dc:creator>
    <dc:date>2022-01-28T20:54:18Z</dc:date>
    <item>
      <title>Difference between DBFS and Delta Lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30373#M22011</link>
      <description>&lt;P&gt;Would like a deeper dive/explanation into the difference. When I write to a table with the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark_df.write.mode("overwrite").saveAsTable("db.table")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The table is created and can be viewed in the Data tab. It can also be found at some DBFS path. Now if I run:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.rm(dbfs_path, recurse=True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;where dbfs_path is the path to the table in DBFS, it will remove that table from DBFS; however, it still appears in the Data tab (even though I know you can't query the table anymore inside the notebook, because technically it no longer exists).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I run:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sql
DROP TABLE IF EXISTS db.table&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;inside a cell, it will drop the table from both the Data tab and DBFS. Can someone explain (high level) how the infrastructure works? Much appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Jan 2022 20:54:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30373#M22011</guid>
      <dc:creator>pjp94</dc:creator>
      <dc:date>2022-01-28T20:54:18Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between DBFS and Delta Lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30375#M22013</link>
      <description>&lt;P&gt;Tables in Spark, whether Delta Lake-backed or not, are basically just semantic views on top of the actual data.&lt;/P&gt;&lt;P&gt;On Databricks, the data itself is stored in DBFS, which is an abstraction layer on top of the actual storage (like S3, ADLS, etc.). This data can be Parquet, ORC, CSV, JSON, etc.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So with your rm command you did indeed delete the data from DBFS. However, the &lt;B&gt;table definition&lt;/B&gt; still exists (it is stored in a metastore, which contains metadata about which databases and tables exist and where their data resides).&lt;/P&gt;&lt;P&gt;So now you have an empty table. To remove the table definition too, you have to drop it, exactly like you did.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For completeness: Delta Lake has nothing to do with this. Delta Lake is Parquet on steroids, giving you a lot more functionality, but the way of working stays identical.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Jan 2022 08:47:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30375#M22013</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-01-31T08:47:53Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between DBFS and Delta Lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30376#M22014</link>
      <description>&lt;P&gt;Hi @Werner Stinckens​&amp;nbsp;, this is exactly what I was looking for. Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;1) Follow-up question: do you need to set up an object-level storage connection on Databricks (i.e., to an S3 bucket or Azure Blob)?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2) Any folders in your /mnt path are external object stores (i.e., S3, Blob Storage, etc.), correct? Everything else is stored in the Databricks root? I ask because my organization has 2 folders in the /mnt folder: /mnt/aws &amp;amp; /mnt/delta... not sure if delta refers to Delta Lake?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;3) So Delta Lake and DBFS are independent of each other, correct? DBFS is where the data is actually stored (i.e., if I wrote a table, then the Parquet files). How does Delta Lake fit into this?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks so much!&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 14:45:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30376#M22014</guid>
      <dc:creator>pjp94</dc:creator>
      <dc:date>2022-02-01T14:45:02Z</dc:date>
    </item>
    <item>
      <title>Re: Difference between DBFS and Delta Lake?</title>
      <link>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30377#M22015</link>
      <description>&lt;P&gt;1) You don't have to, as a Databricks workspace has its own storage, but it certainly is a good idea.&lt;/P&gt;&lt;P&gt;2) Not all folders in /mnt are external. Only the ones you mounted in there yourself.&lt;/P&gt;&lt;P&gt;3) Correct. Delta Lake is just a file format like Parquet, but with more possibilities.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Feb 2022 16:32:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/difference-between-dbfs-and-delta-lake/m-p/30377#M22015</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-01T16:32:28Z</dc:date>
    </item>
  </channel>
</rss>

