Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Table is being dropped when cluster terminates in Community Edition

Sas
New Contributor II

Hi Experts,

I have created an external table in Databricks Community Edition. But when the cluster is terminated, I am no longer able to query the table. What is the reason, and what do I need to do so that the table is not dropped? The table creation script is below:

# Requires the Delta Lake Python API (available on Databricks clusters)
from delta.tables import DeltaTable

DeltaTable.createOrReplace(spark)\
    .tableName("delta_internal_demo")\
    .addColumn("emp_id", "INT")\
    .addColumn("emp_name", "STRING")\
    .addColumn("gender", "STRING")\
    .addColumn("Salary", "INT")\
    .addColumn("Dept", "STRING")\
    .property("description", "table created for demo purpose")\
    .location("/FileStore/tables/delta/archival_demo")\
    .execute()
3 REPLIES

jose_gonzalez
Databricks Employee (Accepted Solution)

This is expected behavior for Community Edition clusters. Upon termination of the cluster, the data is purged.

Sas
New Contributor II

Thanks for your reply. I can see that the data files are still there, but the table definition is dropped. When I recreate the table with the same script, I am able to access the old data.
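
For reference, since the files at /FileStore/tables/delta/archival_demo are still a valid Delta table, the definition can be re-registered over that location after a restart without re-running the full creation script. A minimal sketch, using the table name and path from the original post:

# Re-register the table over the existing Delta files after a cluster restart;
# the schema is read from the Delta transaction log, so no column list is needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS delta_internal_demo
    USING DELTA
    LOCATION '/FileStore/tables/delta/archival_demo'
""")

# The previously written data is queryable again.
spark.table("delta_internal_demo").show()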

venkateshgunda
New Contributor III
  1. Separation of Storage and Compute:

    • Databricks separates storage from compute. Data stored in DBFS (Databricks File System) or external storage systems is persistent and not tied to the lifecycle of a cluster.
    • When you create tables and databases in Databricks, the data itself is stored in DBFS or external storage, but the metadata (such as table schema and definitions) is stored in the cluster’s local metastore or in an external Hive metastore if configured.
  2. Cluster Metadata:

    • The default Hive metastore that comes with Databricks clusters in the Community Edition is stored locally on the cluster.
    • When the cluster is terminated, the local Hive metastore is lost because it resides on the ephemeral storage of the cluster. This means that all metadata about tables and databases, such as their schemas and structure, is lost.
  3. Persistence of Data Files:

    • The actual data files are stored in DBFS, which is a persistent storage system. DBFS is designed to retain data even if the cluster is terminated. This is why your data files remain intact.
    • The loss of metadata only affects the ability to query the data using the previously defined tables and databases. The data itself remains safe in DBFS, as the quick check after this list shows.
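
A quick way to confirm this after the cluster comes back up is to list the table location (the path from the original post):

# The Delta log and parquet files survive cluster termination; only the
# table definition in the cluster-local metastore is lost.
display(dbutils.fs.ls("/FileStore/tables/delta/archival_demo"))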

Solutions to Persist Metadata

To persist the metadata (schemas, table definitions, etc.) across cluster restarts, you can use the following methods:

  1. External Hive Metastore:

    • Configure your Databricks cluster to use an external Hive metastore. This way, metadata is stored in a persistent external database (e.g., MySQL, PostgreSQL).
    • When you restart the cluster, it connects to the same external metastore, and all your metadata is preserved.

    Example of configuring an external Hive metastore:

    spark.sql.hive.metastore.jars /path/to/hive/jars
    spark.sql.hive.metastore.version 2.3.7
    spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://<hostname>:<port>/metastore_db
    spark.hadoop.javax.jdo.option.ConnectionDriverName com.mysql.jdbc.Driver
    spark.hadoop.javax.jdo.option.ConnectionUserName <username>
    spark.hadoop.javax.jdo.option.ConnectionPassword <password>
  2. Metastore Persistence in DBFS:

    • Save your table definitions as scripts or commands in DBFS and re-run them when the cluster restarts.
    • This can be automated using initialization scripts or Databricks Jobs.

    Example of saving table definitions and recreating them:

    # Save the table creation script to DBFS (dbutils.fs paths resolve against dbfs:/)
    table_creation_script = """
    CREATE TABLE IF NOT EXISTS my_table (
        id INT,
        name STRING
    )
    USING parquet
    LOCATION '/dbfs/path/to/data'
    """
    dbutils.fs.put("dbfs:/path/to/scripts/create_table.sql", table_creation_script, overwrite=True)

    # Load and execute the script upon cluster start
    # (/dbfs/... is the local FUSE mount of the same dbfs:/ path)
    with open("/dbfs/path/to/scripts/create_table.sql", "r") as f:
        spark.sql(f.read())

By using these methods, you can ensure that both your data and metadata are preserved across cluster restarts in Databricks Community Edition.
