
maintaining cluster and databases in Databricks Community Edition

choi_2
New Contributor II

I am using Databricks Community Edition, but cluster usage is limited to 2 hours, after which the cluster automatically terminates, so I have to attach a cluster every time I want to run my notebooks again. From other discussions I have learned this is not something that can be fixed in CE. But is there any way to at least keep the databases? The problem is that I also lose access to the data tables I created, and my datasets are very large, so uploading them over and over takes a really long time. Even though the cluster is terminated, all the notebooks in the workspace remain, so I am not sure why the same does not work for the data as well. Can anyone suggest a way to keep the data in the databases?

To solve this problem, I even signed up for Azure Databricks and AWS Databricks, but when I ran my notebooks there I exceeded the limits and was charged a lot, so I had to quit. I am only using this for a school project as a student, and this issue has made the project very difficult to work on.

So I am trying to figure out whether there is a way to use CE while keeping the databases. Please let me know if you can suggest anything.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @choi_2

I understand the challenges you're facing with Databricks Community Edition (CE) and the limitations it imposes on cluster usage. While CE provides a micro-cluster and a notebook environment, it does have some restrictions.

 

Let's address your concerns:

 

Cluster Termination:

  • CE clusters are limited to 2 hours of usage and automatically terminate after that time. Unfortunately, this behaviour cannot be changed in CE.

Data Retention:

  • Unlike notebooks, data tables and databases are not retained after cluster termination in CE. When a cluster is terminated, any data held in its memory or in temporary tables is lost.
  • Notebooks remain intact because they are stored in the workspace, separately from the cluster. Tables, however, are tied to the cluster's memory and storage, so they are not preserved when the cluster terminates.

Possible Workarounds:

 

  • External Storage:
    • Consider an external storage service such as Azure Blob Storage or Amazon S3. You can keep your large datasets there and read them from your notebooks, so nothing needs re-uploading after a cluster terminates (see the first sketch after this list).
    • When you create tables in Databricks, you can point them at an external location (e.g., Azure Blob Storage or S3). That way, even if the cluster terminates, the underlying data remains accessible.
  • Delta Lake:
    • If you're working with large datasets, consider Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark™ and big data workloads (see the second sketch below).
    • Delta Lake provides versioning, schema enforcement, and data consistency, and lets you manage large datasets efficiently.
  • Scheduled Jobs:
    • On the full platform, you can schedule jobs that reload your tables from durable files (e.g., CSV or Parquet files) whenever a cluster is running. CE does not include the Jobs scheduler, but a short setup cell run at the start of each session achieves the same effect (see the third sketch below).
    • Either way, your tables can be rebuilt quickly after a cluster terminates, without re-uploading the data.
  • Pin Clusters:
    • Pinning a cluster keeps its configuration in the cluster list after termination, so you can recreate an identical cluster quickly. Note that pinning does not prevent auto-termination and does not preserve data.
    • Treat it as a convenience for restarting work, not a substitute for durable storage.
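
Here is a minimal PySpark sketch of the external-storage idea. The bucket, paths, and table name are placeholders rather than real values, and it assumes S3 credentials are already configured for the cluster (for example, in the cluster's Spark configuration):

# Read a large dataset directly from external storage (hypothetical path).
df = spark.read.csv(
    "s3a://my-school-project/raw/dataset.csv",
    header=True,
    inferSchema=True,
)

# Register a table whose data lives in the external location, so only the
# (cheap) table definition needs recreating after a cluster terminates.
spark.sql("""
    CREATE TABLE IF NOT EXISTS project_raw
    USING CSV
    OPTIONS (header "true")
    LOCATION 's3a://my-school-project/raw/'
""")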
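
Building on that, a sketch of the Delta Lake approach, again with a placeholder path and reusing df from the sketch above:

# Persist the dataset once as Delta files at a durable path
# (external storage or DBFS; the path below is a placeholder).
delta_path = "s3a://my-school-project/delta/project_data"
df.write.format("delta").mode("overwrite").save(delta_path)

# Any later session can load it straight back -- no re-upload needed.
df_restored = spark.read.format("delta").load(delta_path)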
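
Finally, a sketch of a setup cell you could run at the start of each new session (or wire into a scheduled job on the full platform) to rebuild your tables from the durable files; the table name and path are again placeholders:

# Re-register the table against the durable Delta files after a fresh
# cluster starts; the data itself is never re-uploaded.
spark.sql("""
    CREATE TABLE IF NOT EXISTS project_data
    USING DELTA
    LOCATION 's3a://my-school-project/delta/project_data'
""")

# Quick sanity check that the table is back.
spark.sql("SELECT COUNT(*) FROM project_data").show()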

Considerations for Azure Databricks and AWS Databricks:

  • While these platforms offer more flexibility, they come with costs. As a student, budget constraints can be challenging.
  • If you decide to use Azure Databricks or AWS Databricks, be cautious about the resources you allocate to avoid unexpected charges.

In summary, while CE has limitations, leveraging external storage, Delta Lake, and a setup step that reloads your tables can help you manage your data effectively. Remember to balance convenience with cost considerations. Feel free to ask if you have specific use cases or need further assistance! 😊


2 REPLIES 2


choi_2
New Contributor II

Thank you so much for the response, @Kaniz 

So would Azure Blob Storage, Amazon S3, and Delta Lake be free to use? If so, which one is the easiest for a first-time user?
