Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Maintaining clusters and databases in Databricks Community Edition

choi_2
New Contributor II

I am using Databricks Community Edition, but cluster usage is limited to 2 hours, after which the cluster automatically terminates, so I have to attach a new cluster every time I want to run my notebook again. From reading other discussions, I understand this cannot be changed in CE. But is there any way to at least keep the databases? The problem is that I also lose access to the data tables I created, and because the datasets are very large, it takes a really long time to upload them over and over again. Even though the cluster is terminated, all the notebooks in the workspace remain, so I am not sure why the same does not apply to the data. Can anyone suggest a way to keep the data in the databases?

To solve this problem, I even tried Azure Databricks and AWS Databricks, but when I ran my notebooks there I exceeded the free limits and was charged a lot, so I had to stop. I am only using this for a school project as a student, and it has been very difficult to work on because of this issue.

So I am trying to figure out whether there is a way to keep using CE while preserving the databases. Please let me know if you can suggest anything.

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @choi_2,

I understand the challenges you’re facing with Databricks Community Edition (CE) and the limitations it imposes on cluster usage. While CE provides a micro-cluster and a notebook environment, it does have some restrictions. 

 

Let’s address your concerns:

 

Cluster Termination:

  • CE clusters are limited to 2 hours of usage and automatically terminate after that time. Unfortunately, this behaviour cannot be changed in CE.

Data Retention:

  • Unlike notebooks, data tables and databases are not retained after cluster termination in CE. When a cluster terminates, any data cached in memory or held in temporary tables on that cluster is lost.
  • Notebooks remain intact because they are stored in the workspace, separately from the cluster. Data tables, however, are tied to the cluster’s storage, so they are not preserved when the cluster terminates.

Possible Workarounds:

 

  • External Storage:
    • Consider using external storage solutions such as Azure Blob Storage or Amazon S3. You can store your large datasets there and access them from your notebooks.
    • When you create tables in Databricks, you can store them in an external location (e.g., Azure Blob Storage or S3). This way, even if the cluster terminates, the data remains accessible.
  • Delta Lake:
    • If you’re working with large datasets, consider using Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.
    • Delta Lake provides features like versioning, schema enforcement, and data consistency. It allows you to manage large datasets efficiently.
  • Scheduled Jobs:
    • Set up scheduled jobs to run periodically. During cluster runtime, these jobs can load data into your tables from external storage (e.g., CSV files, Parquet files).
    • Doing this ensures that your data is always available, even if the cluster terminates.
  • Pin Clusters:
    • Pinning a cluster does not retain data or prevent auto-termination; it keeps the terminated cluster’s configuration in the cluster list so you can restart it with the same settings instead of reconfiguring it each time.
    • In CE the 2-hour limit still applies to pinned clusters, so this saves setup effort rather than processing time.
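To make the external-storage idea concrete, here is a minimal sketch. The bucket path, database name, and table name are hypothetical placeholders; the helper just builds the DDL that registers an existing Delta directory as an external table, and the commented lines show how you would use it from a Databricks notebook (assuming your storage credentials are configured):

```python
# Sketch: register data kept in external storage as a table, so it survives
# cluster termination. All names and paths below are hypothetical examples.

def external_delta_ddl(database: str, table: str, location: str) -> str:
    """Build the DDL that points a table at an existing Delta directory."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{location}'"
    )

# In a Databricks notebook you would write the data once, then register it:
#   df.write.format("delta").mode("overwrite").save("s3a://my-bucket/events")
#   spark.sql("CREATE DATABASE IF NOT EXISTS project_db")
#   spark.sql(external_delta_ddl("project_db", "events", "s3a://my-bucket/events"))

print(external_delta_ddl("project_db", "events", "s3a://my-bucket/events"))
```

Because the table definition only points at the external location, re-running the `CREATE TABLE IF NOT EXISTS` statement on a fresh cluster restores access without re-uploading anything.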

Considerations for Azure Databricks and AWS Databricks:

  • While these platforms offer more flexibility, they come with costs. As a student, budget constraints can be challenging.
  • If you decide to use Azure Databricks or AWS Databricks, be cautious about the resources you allocate to avoid unexpected charges.

In summary, while CE has limitations, leveraging external storage, Delta Lake, and scheduled jobs can help you manage your data effectively. Remember to balance convenience with cost considerations. Feel free to ask if you have specific use cases or need further assistance! 😊
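Putting the workarounds above together, one pattern that fits CE is a small "bootstrap" cell run at the top of your notebook after each new cluster attaches, re-registering every external table so your databases reappear without re-uploading data. The table names and storage paths here are hypothetical; the function simply generates the statements you would pass to `spark.sql`:

```python
# Sketch: regenerate the database/table registrations after every cluster
# restart. Table names and storage paths are hypothetical placeholders.

TABLES = {
    "project_db.events": "s3a://my-bucket/events",
    "project_db.users": "s3a://my-bucket/users",
}

def bootstrap_statements(tables: dict) -> list:
    """Return the SQL needed to re-register all external tables."""
    stmts = ["CREATE DATABASE IF NOT EXISTS project_db"]
    for name, path in tables.items():
        stmts.append(
            f"CREATE TABLE IF NOT EXISTS {name} USING DELTA LOCATION '{path}'"
        )
    return stmts

# On Databricks: for stmt in bootstrap_statements(TABLES): spark.sql(stmt)
for stmt in bootstrap_statements(TABLES):
    print(stmt)
```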


2 REPLIES 2

choi_2
New Contributor II

Thank you so much for the response, @Kaniz_Fatma 

So would Azure Blob Storage, Amazon S3, and Delta Lake be free to use? If so, which one is the easiest for a first-time user?
