Data Engineering

Tables disappear when I start a new cluster on Community Edition

Lizhi_Dong
New Contributor II

What would be the best plan for an independent course creator?

Hi folks! I want to use Databricks Community Edition as the platform to teach online courses. As you may know, on Community Edition you need to create a new cluster when the old one terminates. I found out, however, that tables created on the old cluster disappear as well, so I have to re-create the tables every time I start a new cluster.

I wonder if anyone knows how to fix this problem. I have a couple of tables of static data with 100+ questions. It would be too much repetitive work if students had to upload the data and re-create the steps every time they start a new cluster.

Looking forward to your advice!

6 REPLIES

Debayan
Databricks Employee

Hi, it would be helpful if you could provide screenshots from before and after the incident.

jose_gonzalez
Databricks Employee

Hi @Lizhi Dong,

This might be a limitation of Community Edition. When your cluster gets terminated, all your tables will be removed.

Hi @Lizhi Dong,

Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark the most helpful one as the best answer. Otherwise, please let us know if you still need help.

venkateshgunda
New Contributor III

I am also facing the same issue. Here is why it happens and how to work around it:

  1. Separation of Storage and Compute:

    • Databricks separates storage from compute. Data stored in DBFS (Databricks File System) or external storage systems is persistent and not tied to the lifecycle of a cluster.
    • When you create tables and databases in Databricks, the data itself is stored in DBFS or external storage, but the metadata (such as table schema and definitions) is stored in the cluster's local metastore or in an external Hive metastore if configured.
  2. Cluster Metadata:

    • The default Hive metastore that comes with Databricks clusters in the Community Edition is stored locally on the cluster.
    • When the cluster is terminated, the local Hive metastore is lost because it resides on the ephemeral storage of the cluster. This means that all metadata about tables and databases, such as their schemas and structure, is lost.
  3. Persistence of Data Files:

    • The actual data files are stored in DBFS, which is a persistent storage system. DBFS is designed to retain data even if the cluster is terminated. This is why your data files remain intact.
    • The loss of metadata only affects the ability to query the data through the previously defined tables and databases. The data itself remains safe in DBFS (see the short sketch after this list).
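
You can check this split yourself. A minimal sketch, assuming a hypothetical DBFS path (substitute wherever your table's files were actually written):

python
# Hypothetical path of a table's data files; adjust to your own location.
data_path = "dbfs:/user/hive/warehouse/my_table"

# The data files survive a cluster restart because DBFS is persistent storage.
display(dbutils.fs.ls(data_path))

# The metadata does not: on a freshly created cluster this list comes back empty
# until the tables are re-registered.
spark.sql("SHOW TABLES").show()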

Solutions to Persist Metadata

To persist the metadata (schemas, table definitions, etc.) across cluster restarts, you can use the following methods:

  1. External Hive Metastore:

    • Configure your Databricks cluster to use an external Hive metastore. This way, metadata is stored in a persistent external database (e.g., MySQL, PostgreSQL).
    • When you restart the cluster, it connects to the same external metastore, and all your metadata is preserved.

    Example of configuring an external Hive metastore:

    shell
    spark.sql.hive.metastore.jars /path/to/hive/jars
    spark.sql.hive.metastore.version 2.3.7
    spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<hostname>:<port>/metastore_db
    spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
    spark.hadoop.javax.jdo.option.ConnectionUserName <username>
    spark.hadoop.javax.jdo.option.ConnectionPassword <password>
  2. Metastore Persistence in DBFS:

    • Save your table definitions as scripts or commands in DBFS and re-run them when the cluster restarts.
    • This can be automated using initialization scripts or Databricks Jobs.

    Example of saving table definitions and recreating them:

    python
    # Save the table creation script to DBFS so it survives cluster restarts
    table_creation_script = """
    CREATE TABLE IF NOT EXISTS my_table (
      id INT,
      name STRING
    )
    USING parquet
    LOCATION '/path/to/data'
    """
    dbutils.fs.put("/path/to/scripts/create_table.sql", table_creation_script, overwrite=True)

    # Load and execute the script when the cluster starts
    # (DBFS paths are also mounted locally under /dbfs)
    with open("/dbfs/path/to/scripts/create_table.sql", "r") as f:
        spark.sql(f.read())

By using these methods, you can ensure that both your data and metadata are preserved across cluster restarts in Databricks Community Edition.

Shivanshu_
Contributor

I believe only the metadata gets removed from the HMS (Hive metastore), not the Delta files from DBFS. Instead of loading the data again and again, try using CTAS with that DBFS location.
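
A minimal sketch of that approach, assuming a hypothetical DBFS path where the old table's Delta files still live:

python
# Hypothetical DBFS path where the previous table's Delta files remain.
data_path = "dbfs:/user/hive/warehouse/my_table"

# Option 1: re-register an external table directly on the existing files.
spark.sql(f"CREATE TABLE IF NOT EXISTS my_table USING DELTA LOCATION '{data_path}'")

# Option 2 (CTAS): create a fresh table from the files without re-uploading anything.
spark.sql(f"CREATE TABLE IF NOT EXISTS my_table_copy AS SELECT * FROM delta.`{data_path}`")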
