Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks FinOps Assessment

ashraf1395
Contributor II

We have to deliver a Databricks FinOps assessment project, and I am trying to write a proposal for it. I haven't done one before. I have drafted a general process for how the assessment will look and then restructured it using GPT.

Please give your feedback on it: what else can be done, or what changes are needed in the process? Someone who has done a hands-on cost optimization project will be able to review it much better.
Client details:
- 5-10 TB of data
- Cloud platform: Azure
- Unity Catalog not set up + poor FinOps practices


Here is the process:

Databricks FinOps: Assessment & Discovery Process

---

1. Initial Kickoff and Stakeholder Engagement

Objective: Understand business goals and define the scope of the assessment.

- Stakeholder Identification: Identify key stakeholders from engineering, finance, and operations.
- Goal Setting: Define the objectives (e.g., reducing cloud spend, improving resource efficiency, cost attribution).
- Data Collection Timeline: Establish a timeline for collecting necessary usage and cost data from Databricks and the cloud provider.

2. Cost Visibility and Breakdown

Objective: Gain a granular understanding of current Databricks costs.

- Collect Cost Data (see the query sketch below):
  - Pull reports from the Databricks usage dashboard and system tables (DBUs, cluster usage).
  - Extract billing data from the cloud provider (e.g., AWS, Azure, GCP; Azure Cost Management in this case).
  - Identify all Databricks Units (DBUs) consumed by clusters, jobs, and notebooks.
  - Review compute costs (cloud VMs) and storage costs (data stored in Delta Lake and backups).
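
A minimal PySpark sketch of that pull, assuming Databricks system tables are enabled (they require Unity Catalog, which this client would first need to set up). The join to list prices yields an estimated list cost, not the client's negotiated rate:

```python
# Sketch: DBU consumption and estimated list cost by SKU, last 30 days.
# Assumes system billing tables are enabled; run in a Databricks notebook
# where `spark` and `display` are predefined.
spend_by_sku = spark.sql("""
    SELECT
        u.sku_name,
        SUM(u.usage_quantity)                     AS dbus,
        SUM(u.usage_quantity * p.pricing.default) AS est_list_cost
    FROM system.billing.usage AS u
    JOIN system.billing.list_prices AS p
      ON u.sku_name = p.sku_name
     AND u.cloud = p.cloud
     AND u.usage_start_time >= p.price_start_time
     AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
    WHERE u.usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY u.sku_name
    ORDER BY est_list_cost DESC
""")
display(spend_by_sku)
```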

- Tagging Audit (a gap-check sketch follows):
  - Ensure all resources (clusters, jobs, notebooks) are tagged with appropriate labels for cost attribution (project, department, environment).
  - Identify gaps in tagging and misattributions.
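
A sketch of that gap check against the same system table; the tag keys (project, department, env) are assumptions to replace with the client's own convention:

```python
# Sketch: surface clusters/jobs whose usage carries none of the expected
# cost-attribution tags. Tag keys below are illustrative assumptions.
untagged = spark.sql("""
    SELECT
        usage_metadata.cluster_id,
        usage_metadata.job_id,
        SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
      AND custom_tags['project']    IS NULL
      AND custom_tags['department'] IS NULL
      AND custom_tags['env']        IS NULL
    GROUP BY usage_metadata.cluster_id, usage_metadata.job_id
    ORDER BY dbus DESC
""")
display(untagged)
```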

3. Architecture and Cluster Review

Objective: Understand the current Databricks setup and how resources are being used.

- Cluster Configuration Review (see the SDK sketch below):
  - Examine cluster types (standard, high-concurrency, single-node) and whether autoscaling is enabled.
  - Analyze autoscaling behavior and determine whether resources are overprovisioned or underutilized.
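
A sketch of that configuration sweep with the databricks-sdk Python package (assumed installed and authenticated); the 60-minute bar is an illustrative threshold, not a Databricks default:

```python
# Sketch: flag clusters with no autoscaling or weak auto-termination
# settings, via the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads credentials from env vars or ~/.databrickscfg

for c in w.clusters.list():
    issues = []
    if c.autoscale is None:
        issues.append("fixed size (no autoscaling)")
    if not c.autotermination_minutes:
        issues.append("auto-termination disabled")
    elif c.autotermination_minutes > 60:  # illustrative threshold
        issues.append(f"auto-termination high ({c.autotermination_minutes} min)")
    if issues:
        print(f"{c.cluster_name}: {'; '.join(issues)}")
```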

- Job Scheduling Audit (see the run-length sketch below):
  - Review job runtimes, frequencies, and resource consumption.
  - Identify jobs that run inefficiently or redundantly.
  - Confirm that jobs run on ephemeral job clusters, and that interactive clusters have auto-termination enabled, to avoid paying for idle time.
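
A sketch of the run-length scan with the same SDK; the four-hour bar is an assumed cutoff to tune per client:

```python
# Sketch: flag completed job runs longer than an assumed threshold.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
LONG_RUN_MS = 4 * 60 * 60 * 1000  # 4 hours, an illustrative cutoff

for run in w.jobs.list_runs(completed_only=True):
    if run.start_time and run.end_time:
        dur_ms = run.end_time - run.start_time
        if dur_ms > LONG_RUN_MS:
            print(f"job {run.job_id} run {run.run_id}: {dur_ms / 3.6e6:.1f} h")
```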

- Storage Audit (a sizing sketch follows):
  - Review storage consumption in Delta Lake and cloud object storage.
  - Assess whether data retention policies (e.g., VACUUM, table retention settings) and tiered storage strategies are being applied.
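
A sketch of the sizing pass over Delta tables; the table names are hypothetical placeholders to swap for the client's inventory:

```python
# Sketch: report size and file count per Delta table via DESCRIBE DETAIL.
tables = ["sales.orders", "sales.events"]  # hypothetical table names

for t in tables:
    d = spark.sql(f"DESCRIBE DETAIL {t}").collect()[0]
    print(f"{t}: {d['sizeInBytes'] / 1e9:.2f} GB across {d['numFiles']} files")
```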

4. Benchmarking and Performance Review

Objective: Establish baseline performance metrics and identify variances.

- Cost Benchmarking (a trend query follows):
  - Compare current costs to historical data and establish spend trends.
  - Benchmark internal teams to identify which departments/projects drive the most cost.
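
A sketch of the trend baseline, again against the billing system table:

```python
# Sketch: month-over-month DBU consumption per workspace, as the baseline
# for spend-trend benchmarking.
trend = spark.sql("""
    SELECT
        date_trunc('MONTH', usage_date) AS month,
        workspace_id,
        SUM(usage_quantity)             AS dbus
    FROM system.billing.usage
    GROUP BY 1, 2
    ORDER BY month, workspace_id
""")
display(trend)
```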

- Performance Benchmarking (a Photon check follows):
  - Analyze cluster performance and identify bottlenecks (e.g., long warm-up times, underutilized resources).
  - Assess whether the Photon engine and the Delta (disk) cache are being used where they would cut query times and cost.
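
A sketch of the Photon check via the SDK; it only surfaces clusters not on the Photon runtime engine, since whether Photon actually pays off is workload-dependent:

```python
# Sketch: list clusters that are not running the Photon engine.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import RuntimeEngine

w = WorkspaceClient()
for c in w.clusters.list():
    if c.runtime_engine != RuntimeEngine.PHOTON:
        print(f"{c.cluster_name}: runtime_engine={c.runtime_engine}")
```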

- Industry Peer Benchmarking:
  - Compare Databricks costs with industry benchmarks to see how the client's usage compares with similar businesses.

5. Cost Attribution and Budgeting

Objective: Ensure costs are accurately allocated to teams and projects for proper financial accountability.

- Review of Current Budgets:
  - Assess whether team-level or project-level budgets are in place and being tracked.
  - Analyze whether current spend is aligned with forecasts or deviating from them.

- Granular Cost Allocation (see the rollup sketch below):
  - Ensure that all Databricks workloads are properly tagged for detailed cost reporting.
  - Allocate costs by department, team, or project, ensuring visibility into who is consuming resources.
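
A sketch of the rollup by an assumed department tag, with untagged spend surfaced as its own line so the attribution gap stays visible:

```python
# Sketch: DBU rollup by an assumed 'department' tag over the last 30 days.
alloc = spark.sql("""
    SELECT
        COALESCE(custom_tags['department'], '(untagged)') AS department,
        SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY 1
    ORDER BY dbus DESC
""")
display(alloc)
```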

6. Findings & Report Generation

Objective: Consolidate findings into a clear, actionable report.

- Cost Overview: Provide a detailed breakdown of Databricks spend, including DBUs, compute, and storage.
- Cluster & Job Analysis: Summarize inefficiencies in cluster usage and job scheduling.
- Benchmarking Results: Highlight key findings from performance benchmarking, comparing internal teams and industry peers.
- Recommendations: Provide preliminary optimization recommendations, including changes to cluster sizing, job scheduling, and data storage strategies.
- Next Steps: Define actionable steps for the optimization phase.

Thanks,
Ashraf
