We have to deliver a Databricks FinOps assessment project, and I am writing the proposal for it. I haven't done one before. I drafted a general process for what the assessment will look like and then restructured it using GPT.
Please give your feedback on it: what else could be done, and what changes does the process need? Anyone who has done a hands-on cost optimization project will be able to review it much better.
Client details
- 5-10 TB of data
- Cloud Platform: Azure
- Unity Catalog not set up, and poor FinOps practices
Here is the process
Databricks FinOps: Assessment & Discovery Process
---
1. Initial Kickoff and Stakeholder Engagement
Objective: Understand business goals and define the scope of the assessment.
- Stakeholder Identification: Identify key stakeholders from engineering, finance, and operations.
- Goal Setting: Define the objectives (e.g., reducing cloud spend, improving resource efficiency, cost attribution).
- Data Collection Timeline: Establish a timeline for collecting necessary usage and cost data from Databricks and the cloud provider.
2. Cost Visibility and Breakdown
Objective: Gain a granular understanding of current Databricks costs.
- Collect Cost Data:
- Pull reports from Databricks’ usage dashboard (DBUs, cluster usage).
- Extract billing data from the cloud provider (Azure for this client).
- Identify all Databricks Units (DBUs) being consumed by clusters, jobs, and notebooks.
- Review compute (VMs on the cloud) and storage costs (data stored in Delta Lake and backups).
- Tagging Audit:
- Ensure all resources (clusters, jobs, notebooks) are tagged with appropriate labels for cost attribution (project, department, environment).
- Identify gaps in tagging and misattributions.
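To make the tagging audit concrete, here is a minimal sketch of how the gap check could work. It assumes the resource list has already been exported into plain dictionaries; the `name` and `tags` fields, the sample resources, and the required tag keys are all hypothetical and would need to match whatever the client actually uses.

```python
# Required tag keys are taken from the audit bullet above (project, department, environment).
REQUIRED_TAGS = {"project", "department", "environment"}


def find_tagging_gaps(resources, required_tags=REQUIRED_TAGS):
    """Return {resource name: sorted missing tag keys} for resources with gaps."""
    gaps = {}
    for resource in resources:
        # A tag with an empty value is treated the same as a missing tag.
        present = {key.lower() for key, value in resource.get("tags", {}).items() if value}
        missing = set(required_tags) - present
        if missing:
            gaps[resource["name"]] = sorted(missing)
    return gaps


# Hypothetical export of clusters/jobs with their tags.
resources = [
    {"name": "etl-nightly", "tags": {"project": "sales", "department": "data-eng", "environment": "prod"}},
    {"name": "adhoc-analytics", "tags": {"project": "marketing"}},
    {"name": "ml-sandbox", "tags": {}},
]

gaps = find_tagging_gaps(resources)
for name, missing in gaps.items():
    print(f"{name}: missing {', '.join(missing)}")
```

The share of resources that show up in `gaps` is a simple headline number for the report.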
3. Architecture and Cluster Review
Objective: Understand the current Databricks setup and how resources are being used.
- Cluster Configuration Review:
- Examine cluster types (standard, autoscaling, high-concurrency).
- Analyze the cluster autoscaling behavior and determine if resources are overprovisioned or underutilized.
- Job Scheduling Audit:
- Review job runtimes, frequencies, and resource consumption.
- Identify jobs that are running inefficiently or redundantly.
- Determine whether jobs run on clusters with auto-termination enabled, to avoid paying for idle time.
- Storage Audit:
- Review storage consumption in Delta Lake and cloud object storage.
- Assess whether data retention policies and tiered storage strategies are being applied.
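The cluster configuration review in this step could start from a simple rule-based check like the sketch below. The field names (`autotermination_minutes`, `autoscale`, `num_workers`) are modeled on what the Databricks Clusters API returns, but treat them as assumptions and confirm them against a real export. The sample clusters and the 60-minute threshold are made up for illustration.

```python
def review_cluster(cluster, max_idle_minutes=60):
    """Return a list of cost-related findings for one cluster config dict."""
    findings = []
    # Assumption: 0 (or a missing value) means auto-termination is turned off.
    idle_minutes = cluster.get("autotermination_minutes", 0)
    if idle_minutes == 0:
        findings.append("auto-termination disabled")
    elif idle_minutes > max_idle_minutes:
        findings.append(f"auto-termination after {idle_minutes} min exceeds {max_idle_minutes} min")
    # A fixed-size cluster is not always wrong, so this is a prompt for review, not a verdict.
    if "autoscale" not in cluster:
        findings.append(f"fixed size ({cluster.get('num_workers', 0)} workers), no autoscaling")
    return findings


# Hypothetical cluster configs.
clusters = [
    {"cluster_name": "shared-analytics", "autotermination_minutes": 0, "num_workers": 8},
    {"cluster_name": "dev-notebooks", "autotermination_minutes": 120,
     "autoscale": {"min_workers": 1, "max_workers": 4}},
    {"cluster_name": "bi-reporting", "autotermination_minutes": 30,
     "autoscale": {"min_workers": 2, "max_workers": 6}},
]

report = {c["cluster_name"]: review_cluster(c) for c in clusters}
for name, findings in report.items():
    print(name, findings or "ok")
```

Configuration alone does not show whether a cluster is overprovisioned, so the flagged clusters would still need to be checked against actual utilization metrics.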
4. Benchmarking and Performance Review
Objective: Establish baseline performance metrics and identify variances.
- Cost Benchmarking:
- Compare current costs to historical data and establish trends in spend.
- Perform internal team benchmarking to identify departments/projects driving the most costs.
- Performance Benchmarking:
- Analyze cluster performance and identify bottlenecks (e.g., long warm-up times, underutilized resources).
- Assess whether Photon Engine or Delta Cache is being used effectively to speed up queries and lower costs.
- Industry Peer Benchmarking:
- Compare Databricks costs against industry benchmarks to see how the client's usage measures up to similar businesses.
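For the cost benchmarking part of this step, a rough sketch of the two calculations (spend trend over time, and which teams drive the most cost) might look like this. All figures and team names are invented, and it assumes monthly totals have already been aggregated from the billing data.

```python
def month_over_month(monthly_spend):
    """monthly_spend: chronological list of (month, cost). Returns (month, % change)."""
    changes = []
    for (_, prev_cost), (month, cost) in zip(monthly_spend, monthly_spend[1:]):
        # Percentage change is undefined when the previous month had zero spend.
        pct = round((cost - prev_cost) / prev_cost * 100, 1) if prev_cost else None
        changes.append((month, pct))
    return changes


def rank_teams(team_spend):
    """Return (team, cost, % share of total) sorted by cost, highest first."""
    total = sum(team_spend.values())
    if total == 0:
        return []
    ranked = sorted(team_spend.items(), key=lambda item: item[1], reverse=True)
    return [(team, cost, round(cost / total * 100, 1)) for team, cost in ranked]


# Hypothetical monthly totals and per-team spend for the latest month.
monthly_spend = [("2024-01", 10000.0), ("2024-02", 12000.0), ("2024-03", 9000.0)]
team_spend = {"analytics": 3000.0, "data-eng": 6000.0, "ml": 1000.0}

trend = month_over_month(monthly_spend)
ranking = rank_teams(team_spend)
print(trend)
print(ranking)
```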
5. Cost Attribution and Budgeting
Objective: Ensure costs are accurately allocated to teams and projects for proper financial accountability.
- Review of Current Budgets:
- Assess if team-level or project-level budgets are in place and being tracked.
- Analyze if current spend is aligned with forecasts or if there are deviations.
- Granular Cost Allocation:
- Ensure that all Databricks workloads are properly tagged for detailed cost reporting.
- Allocate costs based on department, team, or project, making sure there’s visibility into who is consuming resources.
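As an illustration of the allocation step, the sketch below rolls usage records up by a tag and keeps untagged spend in its own bucket so it stays visible. The record layout (`dbus`, `dbu_rate`, `vm_cost`, `tags`) and all numbers are hypothetical. It assumes DBU charges and the underlying VM charges arrive as separate line items, so both are added per record.

```python
from collections import defaultdict


def allocate_costs(usage_records, tag_key="department"):
    """Sum DBU cost plus VM cost per tag value; untagged usage goes to UNALLOCATED."""
    totals = defaultdict(float)
    for record in usage_records:
        owner = record.get("tags", {}).get(tag_key) or "UNALLOCATED"
        dbu_cost = record["dbus"] * record["dbu_rate"]
        totals[owner] += dbu_cost + record.get("vm_cost", 0.0)
    return dict(totals)


# Hypothetical usage records; rates are placeholders, not real list prices.
usage_records = [
    {"dbus": 100, "dbu_rate": 0.5, "vm_cost": 20.0, "tags": {"department": "data-eng"}},
    {"dbus": 200, "dbu_rate": 0.25, "vm_cost": 30.0, "tags": {"department": "analytics"}},
    {"dbus": 40, "dbu_rate": 0.5, "vm_cost": 10.0, "tags": {}},
]

allocation = allocate_costs(usage_records)
print(allocation)
```

The size of the `UNALLOCATED` bucket relative to total spend is a direct measure of how much the tagging gaps hurt cost attribution.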
6. Findings & Report Generation
Objective: Consolidate findings into a clear, actionable report.
- Cost Overview: Provide a detailed breakdown of Databricks spend, including DBUs, compute, and storage.
- Cluster & Job Analysis: Summarize inefficiencies in cluster usage and job scheduling.
- Benchmarking Results: Highlight key findings from performance benchmarking, comparing internal teams and industry peers.
- Recommendations: Provide preliminary optimization recommendations, including changes to cluster sizing, job scheduling, and data storage strategies.
- Next Steps: Define actionable steps for the optimization phase.
Thanks,
Ashraf