Hi there,
I came across this Databricks demo at the link below.
https://youtu.be/BqB7YQ1-KKc
Kindly fast-forward to around 16:30 or 16:45 of the video and watch a few minutes of the section on cost. My understanding is that the data sits in the lake and Databricks performs the computation on top of it.
Question 1: What does he refer to as the "lake"? Does he mean a container and files in an Azure or AWS storage location? I know Databricks can read from any storage location.
Question 2:
Correct me if I'm wrong: is my understanding of the best practice for keeping cost minimal correct, i.e., the steps below?
1) Make the data files available in storage accounts (probably in Parquet format).
2) Create notebooks that compute everything on the fly.
3) Write the processed output file(s) back to the storage locations.
4) Add the notebook(s) to a pipeline and run the pipeline.
5) Automatically shut down all clusters.
This way the Databricks cost stays much lower, is that right? Again, please correct me if I'm wrong.
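To make Question 2 concrete, here is roughly the notebook pattern I have in mind for steps 1-3. The abfss:// paths and column names are placeholders I made up for illustration, not a real setup, so please correct me if this is not the pattern the video is describing.

```python
# Rough sketch of steps 1-3: read raw Parquet from the storage account,
# transform it on the fly, and write the result back to storage.
# The abfss:// paths and column names below are placeholders.
from pyspark.sql import functions as F

raw_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/"        # placeholder
output_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/sales/"  # placeholder

df = spark.read.parquet(raw_path)

# Example "compute everything on the fly" step: aggregate daily totals
daily = (df
         .groupBy("sale_date")
         .agg(F.sum("amount").alias("total_amount")))

# Write the processed output back to the storage location (step 3)
daily.write.mode("overwrite").parquet(output_path)
```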
Question 3:
Now, do the same methods above apply to Delta Lake as well? For example, Delta Live Tables, etc.? Or is Delta a feature that applies only as long as the data is inside Databricks, and not in container storage locations in Azure or AWS?
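In other words, would it just be a matter of swapping the format in the sketch above, something like this (same placeholder path)? My assumption is that Delta tables can live directly in the Azure/AWS container; please correct me if that is wrong.

```python
# Same placeholder output path as above, but writing in Delta format
# instead of plain Parquet. Assumes Delta can sit in the external container.
daily.write.format("delta").mode("overwrite").save(output_path)

# Reading it back later for reporting
reporting_df = spark.read.format("delta").load(output_path)
```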
Question 4:
I would appreciate it if you could share any articles or videos with step-by-step best practices for reducing cost in Databricks, so I can build a small PoC and share it with my client (ingest data from an API, store 30-50 GB of data, process that data in a pipeline, shut down all Databricks clusters automatically, and then have the data available for reporting from the containers).
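For the "shut down all clusters automatically" part of the PoC, my current understanding is that an auto-termination setting on the cluster is what I need. Below is a sketch of the cluster definition I have in mind; the workspace URL, token, and node type are placeholders, and the field names just reflect my reading of the Clusters API, so please correct me if I am off.

```python
# Sketch of creating a small cluster that shuts itself down when idle.
# Workspace URL, token, and node type are placeholders.
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "dapiXXXXXXXXXXXXXXXX"                                         # placeholder

cluster_spec = {
    "cluster_name": "poc-cost-test",
    "spark_version": "13.3.x-scala2.12",   # example runtime, not a recommendation
    "node_type_id": "Standard_DS3_v2",     # placeholder Azure node type
    "num_workers": 2,
    "autotermination_minutes": 20,         # shut down after 20 idle minutes
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())
```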
As for my skill set, I have a long working history with data warehouses, staging tables, facts, dimensions, incremental loads, partitions, indexes, etc. I'm just trying to help my client move into Databricks.
Any best-practice articles you could share would be helpful.
Thanks