07-09-2023 07:55 PM
Hi, I recently started learning Spark and was studying Spark managed tables. As per the docs, "Spark manages both the data and the metadata." Assume that I have a CSV file in S3 and I read it into a DataFrame like below.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://databricks-learning-s333/temp/flights.csv"))
Now I create a Spark managed table in Databricks as below:
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")
spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT,
distance INT, origin STRING, destination STRING)")
df.write.saveAsTable("managed_us_delay_flights_tbl")
Now it is a Spark managed table, so Spark manages both the data and the metadata.
As per the docs, if we drop a managed table, Spark deletes both the metadata and the actual data.
Here are my questions:
1. The code below drops the Spark managed table. Does that mean it will delete my original S3 data, or what exactly does it mean that Spark deletes the data and metadata?
spark.sql('DROP TABLE managed_us_delay_flights_tbl')
2. I read that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3. Does it convert the CSV to Delta format, or does it duplicate the data and write it in Delta format somewhere else?
3. If I create Spark managed tables, do they use the same underlying storage or a new location? Please explain in detail.
Thank you so much for your help.
07-09-2023 08:31 PM
Hi @Raviiit, thanks for posting here.
To answer your questions:
Q1. The below code drops the Spark managed table, so does it mean it will delete my original S3 data, or what does it mean that Spark deletes the data and metadata?
Ans: It will not delete your original S3 data. Because you created a managed table, the data is stored in DBFS under the /user/hive/warehouse/learn_spark_db.db/ folder. After executing the DROP statement, the data is deleted from the /user/hive/warehouse/learn_spark_db.db/ directory, not from S3.
If you provide a LOCATION when creating the table, it is treated as an unmanaged (external) table, and only the metadata is deleted when the table is dropped.
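For illustration, here is a minimal sketch of an unmanaged (external) table; the table name is hypothetical and the LOCATION simply reuses the S3 folder from your example:
# Hypothetical external table: LOCATION points at the existing CSV folder in S3.
spark.sql("""
  CREATE TABLE external_us_delay_flights_tbl (
    date STRING, delay INT, distance INT, origin STRING, destination STRING)
  USING CSV
  OPTIONS (header 'true')
  LOCATION 's3a://databricks-learning-s333/temp/'
""")
# Dropping it removes only the metastore entry; the CSV files in S3 are untouched.
spark.sql("DROP TABLE external_us_delay_flights_tbl")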
Q2. I read that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3, so does it mean it will change the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere?
Ans: It will not change the original data in S3. It writes a copy of the same data to DBFS under the /user/hive/warehouse/learn_spark_db.db/ location, in Delta format if you don't specify another format.
You can see the new data files using the Databricks utilities:
dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")
Q3. If I create Spark managed tables, do they use the same underlying storage or a new location? Please explain in detail.
Ans: Whenever you create a Databricks workspace, an underlying storage account is also created for storing data, typically known as the Databricks File System (DBFS).
You can see it using the UI or the Databricks utilities:
dbutils.fs.ls("/")
Now to your question: whenever you create a managed table, both the metadata and the data are stored in the underlying Databricks-managed storage account.
You can see this using:
dbutils.fs.ls("/user/hive/warehouse/")
07-09-2023 09:33 PM
Thank you so much @Hemant for detailed answer. one follow up question is, does data bricks DBFS use our underlying cloud storage? For example, if we use AWS , then does it store data in AWS and DBFS provides logical view of that data? please correct me if something is wrong.
07-09-2023 10:08 PM
Thanks @Raviiit, yes, you are correct. Whenever you create a Databricks workspace in the cloud, a storage account is associated with it, and DBFS provides a logical view of that storage.
For more details, please go through the link below:
https://docs.databricks.com/getting-started/overview.html
07-09-2023 10:05 PM
Yes, @Raviiit
DBFS (Databricks File System) is a distributed file system available on Databricks clusters. It is an abstraction layer over cloud storage (e.g. S3 or Azure Blob Storage), allowing external storage buckets to be mounted as paths in the DBFS namespace.
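For example, a bucket can be mounted into the DBFS namespace roughly like this; the bucket name and mount point below are placeholders, and a real workspace would normally supply credentials via an instance profile or extra_configs:
# Hypothetical mount of the S3 bucket from the question into DBFS
dbutils.fs.mount(
    source = "s3a://databricks-learning-s333",
    mount_point = "/mnt/flights-data")
# The S3 objects now show up as ordinary DBFS paths
dbutils.fs.ls("/mnt/flights-data/temp/")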