spark managed tables

Raviiit
New Contributor II

Hi, I recently started learning Spark and have been studying Spark managed tables. As per the docs, "Spark manages both the data and the metadata." Assume I have a CSV file in S3 and I read it into a DataFrame like below:

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://databricks-learning-s333/temp/flights.csv"))

Now I create a Spark managed table in Databricks as below:

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT,  
  distance INT, origin STRING, destination STRING)")

# overwrite the empty table defined above with the DataFrame's data
df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("managed_us_delay_flights_tbl")

Now it is a Spark managed table, so Spark manages both the data and the metadata.

 

As per the docs, if we drop a managed table, Spark deletes both the metadata and the actual data (docs).

 

Here are my questions:

1. The code below drops the Spark managed table. Does that mean it will delete my original S3 data, or what exactly does it mean that Spark deletes the data and metadata?

spark.sql('DROP TABLE managed_us_delay_flights_tbl')

2. I read that when we create managed tables, Spark (on Databricks) uses the Delta format. My original data is in CSV format in S3. Does that mean it will convert the CSV to Delta format in place, or will it duplicate the data and write it in Delta format somewhere else?

 

3. If I create Spark managed tables, do they use the same underlying storage or some new location? Please explain in detail.

 

 

Thank you so much for your help.

 

1 ACCEPTED SOLUTION


Hemant
Valued Contributor II

Hi @Raviiit, thanks for posting here.
Taking your questions one by one:


Q1. The code below drops the Spark managed table. Does that mean it will delete my original S3 data, or what does it mean that Spark deletes the data and metadata?

Ans: It will not delete your original S3 data. Because you are creating a managed table, its data is stored in DBFS under the /user/hive/warehouse/learn_spark_db.db/ folder. After you execute the DROP statement, the data is deleted from the /user/hive/warehouse/learn_spark_db.db/ directory, not from S3.
If you provide a LOCATION when creating the table, it is treated as an unmanaged (external) table, and only the metadata is deleted when you drop it.
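If you want to verify this yourself before dropping, one quick check (a small sketch, assuming the database and table names from your example) is to look at the Type and Location rows returned by DESCRIBE EXTENDED:

spark.sql("USE learn_spark_db")
# Type should say MANAGED, and Location should point under
# dbfs:/user/hive/warehouse/learn_spark_db.db/, not at your s3a:// path
(spark.sql("DESCRIBE EXTENDED managed_us_delay_flights_tbl")
    .filter("col_name IN ('Type', 'Location')")
    .show(truncate=False))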

Q2. I read that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3. Does that mean it will convert the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere else?

Ans: It will not change the original data in S3. It writes a copy of the same data into DBFS under the /user/hive/warehouse/learn_spark_db.db/ location, in Delta format if you don't specify any other format.
You can see the new data files using the Databricks utility:
dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")

Q3. If I create Spark managed tables, do they use the same underlying storage or some new location? Please explain in detail.

Ans: Whenever you create a Databricks workspace, an underlying storage account is also created for storing data; this is exposed as the Databricks File System (DBFS).

You can browse it in the UI or with the Databricks utility:

dbutils.fs.ls("/")

Now to your question: whenever you create a managed table, both the metadata and the data are stored in this underlying Databricks-managed storage account.

You can see this using:
dbutils.fs.ls("/user/hive/warehouse/")

 

 

Hemant Soni


4 REPLIES

Raviiit
New Contributor II

Thank you so much @Hemant for the detailed answer. One follow-up question: does Databricks DBFS use our underlying cloud storage? For example, if we use AWS, does it store the data in AWS while DBFS provides a logical view of that data? Please correct me if I have something wrong.

Hemant
Valued Contributor II

Thanks @Raviiit, yes, you are correct: whenever you create a Databricks workspace in the cloud, a storage account is associated with it, and DBFS provides a logical view of that storage.
For a deeper understanding, please go through the link below:
https://docs.databricks.com/getting-started/overview.html
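One quick way to see that relationship from a notebook is to list the mount points DBFS knows about (dbutils.fs.mounts() is a standard Databricks utility; the entries you see will depend on your workspace):

# Each entry maps a DBFS path (mountPoint) to an underlying cloud storage URI (source)
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)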

Hemant Soni

Tharun-Kumar
Honored Contributor II

Yes, @Raviiit 

DBFS (Databricks File System) is a distributed file system used by Databricks clusters. DBFS is an abstraction layer over cloud storage (e.g. S3 or Azure Blob Storage), allowing external storage buckets to be mounted as paths in the DBFS namespace.
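As a rough sketch of what that mounting looks like in practice (the bucket name and mount point below are placeholders, and this assumes the cluster already has IAM/instance-profile access to the bucket):

# Hypothetical bucket and mount point, shown only to illustrate the idea
dbutils.fs.mount(
    source="s3a://my-example-bucket",
    mount_point="/mnt/my-example-bucket"
)
# After mounting, the bucket's objects appear as ordinary DBFS paths
display(dbutils.fs.ls("/mnt/my-example-bucket"))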
