Hi there everyone,
We are trying to get hands-on experience with Databricks Lakehouse for a prospective client's project.
Our major aim for the project is to compare a Data Lakehouse on Databricks with a BigQuery data warehouse in terms of cost and the time to set up and run queries.
We have created projects and tested with multiple data sizes (250 GB and 1.3 TB); we had a great experience and are looking to build our expertise around Databricks Lakehouse.
We have some questions regarding cluster configurations. While working with the 1.3 TB dataset on a Personal Compute cluster (32 GB, 4 cores), reading the Parquet data from a GCP bucket and converting it into a Delta table took 5+ hours. After some code optimisations, partitioning the data and reading it in multiple chunks, we brought that down to 3.5 hours, but compared to BigQuery, which takes about 15 minutes, there is still a huge difference.
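For context, the conversion is essentially a read-and-write like the sketch below (the bucket path, column names and table name are placeholders, not our actual data):

```python
from pyspark.sql import functions as F

# Read the raw Parquet files from the GCS bucket
# ("spark" is the session already available in a Databricks notebook)
df = spark.read.parquet("gs://our-bucket/raw/events/")

# Write out as a Delta table, partitioned by a derived date column
(df.withColumn("event_date", F.to_date("event_ts"))
   .write.format("delta")
   .partitionBy("event_date")
   .mode("overwrite")
   .saveAsTable("bronze.events"))
```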
We figured out that BigQuery uses serverless compute, while in Databricks we are using a very small cluster. So, is there
- a way to find the correct cluster configuration for a given amount of data (a calculator or rough estimates),
- any technical blogs where we can learn more about this,
- or any other tips for reducing the time?
We found out about serverless Databricks compute for both SQL and notebooks, but I think it is only available on paid accounts, and we are still in our trial period.