
Optimising Clusters in Databricks on GCP

ashraf1395
New Contributor III

Hi there everyone,

We are trying to get hands-on with Databricks Lakehouse for a prospective client's project.

Our main aim for the project is to compare a data lakehouse on Databricks with a BigQuery data warehouse in terms of cost and the time needed to set up and run queries.

We have created projects and tested with multiple data sizes (250 GB and 1.3 TB). We had a great experience and are looking to build our expertise around Databricks Lakehouse.

We had some questions regarding cluster configurations. While working with the 1.3 TB dataset on a Personal Compute cluster (32 GB memory, 4 cores), reading the data (Parquet) from a GCP bucket and converting it into a Delta table took 5+ hours. We then optimised the code, partitioned the data, and read it in multiple chunks, which reduced the time to 3.5 hours. Compared to BigQuery, which takes 15 minutes, that is still a huge difference.

We figured out that BigQuery uses serverless compute, while in Databricks we are using a very small cluster. So, is there any way

- to find the correct cluster configuration for a specific amount of data (like calculators or rough estimates)

- any technical blogs where we can get a better idea about this

- or any other tips for reducing the time?

We found out about serverless Databricks compute for both SQL and notebooks, but I think it is only supported in paid accounts, and we are still in our trial period.


1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @ashraf1395, Comparing Databricks Lakehouse and Google BigQuery is essential to make an informed decision for your project.

Let’s address your questions:

  1. Cluster Configurations for Databricks:

    • Databricks provides flexibility in configuring compute clusters. To determine optimal settings, consider the following factors:
      • Total Executor Cores: The total number of cores across all executors, affecting parallelism.
      • Total Executor Memory: The total RAM across all executors, determining in-memory data storage capacity.
      • Executor Local Storage: The type and amount of local disk storage (used for spills during shuffles and caching).
      • Worker Instance Type and Size: These also influence the above factors.
    • Balancing the number of workers and worker instance types is crucial. For instance, configuring two workers with 40 cores and 100 GB RAM is equivalent to configuring ten ...
    • Databricks also offers serverless compute, which scales automatically with the workload and needs no manual configuration. Consider using serverless compute or predefined compute policies.
  2. Technical Blogs and Resources:

  3. Reducing Query Time:

    • Optimize your code further by considering:
    • Keep in mind that BigQuery's serverless architecture allocates compute automatically, so it can return results faster out of the box than a small fixed-size cluster, while Databricks offers more flexibility and additional features.
  4. Serverless Databricks Clusters:

    • While serverless clusters are available in paid accounts, you can still explore them during your trial period.
    • Evaluate whether the benefits of serverless computing justify the cost once you transition to a paid account.
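
As a rough way to reason about the "Total Executor Cores" factor above, the sketch below estimates how many read tasks a Parquet scan produces and how many sequential waves of tasks a given core count needs to finish them. It is only a back-of-envelope estimate: the function name `estimate_scan` is hypothetical, the 128 MB split size is assumed from Spark's default for `spark.sql.files.maxPartitionBytes`, and 1.3 TB is taken as decimal terabytes.

```python
def estimate_scan(data_bytes: int, total_cores: int,
                  partition_bytes: int = 128 * 1024**2) -> dict:
    """Rough estimate of read tasks and task waves for a file scan.

    Assumes one task per input split of `partition_bytes` and one task
    per core at a time. Ignores file layout, compression, and write cost.
    """
    if min(data_bytes, total_cores, partition_bytes) <= 0:
        raise ValueError("all inputs must be positive")
    tasks = -(-data_bytes // partition_bytes)  # ceiling division
    waves = -(-tasks // total_cores)           # batches of tasks run back to back
    return {"tasks": tasks, "waves": waves}


if __name__ == "__main__":
    data_bytes = 1_300 * 1000**3  # 1.3 TB (decimal), as in the question
    for cores in (4, 32, 128):
        est = estimate_scan(data_bytes, cores)
        print(f"{cores:>4} cores: {est['tasks']} tasks in {est['waves']} waves")
```

Under these assumptions, 1.3 TB comes to roughly 9,700 read tasks, which a 4-core cluster has to work through in about 2,400 sequential waves, versus about 300 waves on 32 cores. Actual run time also depends on how long each task takes, the file sizes, and the cost of writing the Delta table, so treat this as a starting point for sizing rather than a prediction.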

Remember that Databricks and BigQuery have different architectures and trade-offs. Databricks emphasizes flexibility, while BigQuery prioritizes ease of use and performance. Consider your specific use case and requirements when making your decision.

Good luck with your project! 🚀

 


3 REPLIES 3


ashraf1395
New Contributor III

Thank you so much, Kaniz.
These resources will really help me optimise my clusters. I will reach out if I face any issues.

Kaniz
Community Manager

Hi @ashraf1395, you're welcome! I'm glad the resources are helpful for optimizing your clusters. If you run into any issues or have questions in the future, feel free to reach out. I'm here to help.
