Re: spark managed tables

Hemant · 07-09-2023

Hi @Raviiit, thanks for posting here.
Answering your questions in order:

Q1. The code below drops the Spark managed table. Does that mean it will delete my original S3 data? What does it mean that Spark deletes both the data and the metadata?

Ans: It will not delete your original S3 data. Because you are creating a managed table, the data is stored in DBFS under the /user/hive/warehouse/learn_spark_db.db/ folder. When you run the DROP statement, the data is deleted from /user/hive/warehouse/learn_spark_db.db/, not from S3.
If you provide a location when creating the table, it is treated as an unmanaged (external) table, and dropping it deletes only the metadata.
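
To make the difference concrete, here is a minimal sketch of the two cases. It assumes a notebook-style `spark` session is passed in; the table names, the two-column schema, and any S3 path you pass as `location` are hypothetical placeholders, not taken from your code. `USING DELTA` is written out explicitly so the sketch does not rely on the default format.

```python
def create_managed_table(spark, db, table):
    """No LOCATION clause: Spark owns both the data files and the metadata."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
    # Hypothetical schema, for illustration only.
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {db}.{table} (id INT, name STRING) USING DELTA"
    )


def create_external_table(spark, db, table, location):
    """LOCATION clause present: the table is unmanaged (external)."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {db}")
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS {db}.{table} (id INT, name STRING) "
        f"USING DELTA LOCATION '{location}'"
    )


def drop_table(spark, db, table):
    """Managed table: removes data and metadata. External table: metadata only."""
    spark.sql(f"DROP TABLE IF EXISTS {db}.{table}")
```

In a notebook you would call, for example, `create_managed_table(spark, "learn_spark_db", "demo_managed")` and later `drop_table(spark, "learn_spark_db", "demo_managed")`. In neither case does the DROP touch the CSV files you originally read from S3.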

Q2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3. Does that mean it will convert the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere else?

Ans: It will not change the original data in S3. Instead, it writes a copy of the same data to DBFS under /user/hive/warehouse/learn_spark_db.db/, in Delta format if you don't specify a format.
You can see the new data files using the Databricks utility:
dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")
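
As a rough sketch of that copy step, the helper below reads a CSV and saves it as a managed table, then asks Delta where the table lives. The function name and its arguments are hypothetical, and it assumes the CSV has a header row and that `spark` is your notebook session. No format is passed to the writer, so on Databricks the default described above applies.

```python
def csv_to_managed_table(spark, csv_path, table_name):
    """Copy CSV data into a managed table and return its Delta details."""
    # Read the source CSV; the files at csv_path are only read, never modified.
    df = spark.read.option("header", "true").csv(csv_path)

    # No path or LOCATION is given, so the table is managed and the copy is
    # written under the warehouse directory.
    df.write.mode("overwrite").saveAsTable(table_name)

    # DESCRIBE DETAIL is a Delta command; its result includes the
    # `format` and `location` columns.
    return spark.sql(f"DESCRIBE DETAIL {table_name}")
```

Calling `csv_to_managed_table(spark, <your S3 CSV path>, "learn_spark_db.<your_table>").select("format", "location").show(truncate=False)` should let you confirm the format and that the location is under the warehouse folder rather than your S3 path.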

Q3. If I create Spark managed tables, do they use the same underlying storage or a new location? Please explain in detail.

Ans: Whenever you create a Databricks resource, an underlying storage account is also created for storing data. It is typically known as the Databricks File System (DBFS).

You can browse it through the UI or with the Databricks utility:

dbutils.fs.ls("/")

Now to your question: whenever anyone creates a managed table, both its metadata and its data are stored in the underlying Databricks-managed storage account.

You can see this using:
dbutils.fs.ls("/user/hive/warehouse/")
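
If you would rather check a single table than list the whole folder, something like the sketch below may help. The helper name is hypothetical. It relies on `DESCRIBE TABLE EXTENDED`, whose output has `col_name` and `data_type` columns and, in the detailed section, rows named `Type` and `Location`.

```python
def table_type_and_location(spark, table_name):
    """Return (type, location) for a table, e.g. a MANAGED type and its path."""
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {table_name}").collect()

    # Each row is (col_name, data_type, comment). The detailed-information rows
    # come after the column rows, so they win if a data column shares a name.
    info = {row["col_name"]: row["data_type"] for row in rows}
    return info.get("Type"), info.get("Location")
```

For a managed table you would expect the type to be MANAGED and the location to sit under /user/hive/warehouse/; for a table created with a location, the type should be EXTERNAL and the location should be the path you supplied.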

Hemant Soni
