Introduction

Databricks provides a powerful platform for building and running big data analytics and AI workloads in the cloud. However, as with any complex system, issues can arise. Effective monitoring and observability are essential for maintaining the reliability and efficiency of Databricks operations. In this post, we will discuss common use cases of monitoring and observability across businesses and some key capabilities you can leverage within Databricks. 

Data Security and Compliance

Data security and compliance are essential in analytics, especially when dealing with sensitive or regulated data. Continuous monitoring is essential for identifying unauthorized access, unusual behavior, and data breaches, ensuring data security and compliance with industry standards and regulations, allowing businesses to act quickly against such issues.

Tip #1 Enable Unity Catalog to make use of System Tables 

With Unity Catalog, you can access audit logs in system tables (public preview) directly from Databricks. Once enabled, two system tables are particularly useful for security and compliance: table and column lineage, and audit logs.

-- Recent permission changes recorded in the Unity Catalog audit log
SELECT event_time,
       action_name,
       user_identity.email AS requester,
       request_params
FROM system.access.audit
WHERE action_name IN ('updatePermissions', 'updateSharePermissions')
  AND audit_level = 'ACCOUNT_LEVEL'
  AND service_name = 'unityCatalog'
ORDER BY event_time DESC
LIMIT 1000

Tip #2 Configure Audit Logs to Send to Storage

If you don't have Unity Catalog, you can configure audit log delivery straight to cloud storage for later analysis. Audit logs are written as JSON files to your storage sink, so you can consume and query them directly by loading the location into a Spark DataFrame.

// Load the delivered audit log JSON files and register a temporary view for SQL analysis
val df = spark.read.format("json").load("s3a://bucketName/path/to/auditLogs")
df.createOrReplaceTempView("audit_logs")
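
Once the view is registered, you can explore it with SQL (for example from a %sql cell). This is only a sketch: the field names below follow the audit log schema documented for log delivery, so verify them against your own files.

-- Most recent delivered audit events (field names follow the documented delivery schema)
SELECT timestamp, serviceName, actionName, userIdentity.email AS requester
FROM audit_logs
ORDER BY timestamp DESC
LIMIT 100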

If you want help setting this up, we recommend reading Monitoring Your Databricks Lakehouse Platform with Audit Logs. For more information on how to analyze audit logs, there are many examples in the official documentation. If you prefer, you can also export logs to Azure Event Hubs or Azure Log Analytics.

Tip #3 Enable Verbose Audit Logs to Monitor Notebook Commands

You can also enable verbose audit logs to monitor notebook commands, which can help prevent sensitive information from being stored in notebooks. To learn more, take a look at Monitoring Notebook Command Logs With Static Analysis Tools.
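
If Unity Catalog system tables are also enabled, you can query those command events directly from the audit system table. The sketch below assumes the documented notebook runCommand action and commandText request parameter; verify the names against your own logs.

-- Recent notebook commands captured by verbose audit logging
SELECT event_time,
       user_identity.email AS user_email,
       request_params['commandText'] AS command_text
FROM system.access.audit
WHERE service_name = 'notebook'
  AND action_name = 'runCommand'
ORDER BY event_time DESC
LIMIT 100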

Tip #4 Run the Security Analysis Tool 

The Security Analysis Tool (SAT) assesses your workspace configurations for potential security risks and gives recommendations. Data from all configured workspaces can be surfaced in a single SQL dashboard.

Tip #5 For companies with heightened security and compliance requirements, consider the Enhanced Security and Compliance add-on

The Enhanced Security and Compliance add-on includes the Enhanced Security Monitoring (ESM) feature. ESM adds additional monitoring agents to the compute hosts, and their events are included in system tables, enabling customers to monitor for any suspicious activity on the hosts.

Cost Control 

Effective cost control in a cloud environment is essential for budgeting purposes. It allows organizations to allocate financial resources efficiently and predict monthly expenses accurately. By managing cloud costs, businesses can establish and maintain a budget that aligns with their financial goals, preventing unexpected financial burdens and promoting fiscal responsibility.

Tip #1 Tag resources 

Use tags to assign clusters to cost centers for your business. This enables you to effectively monitor spend and perform team chargebacks. There are two options to ensure users only use clusters with the dedicated tag for their cost center. 

  • Either an admin spins up a cluster with the required tag and grants specific users access to this cluster only
  • Or the admin uses cluster policies to define the type and size of cluster a user can spin up. Cluster policies can also enforce rules for tags, ensuring that the correct tag is applied automatically. 

You should also use job tags to track costs of automated jobs/workflows.

Tip #2 Enable Unity Catalog to make use of System Tables 

As mentioned in Security and Compliance, you can enable System Tables if you are using Unity Catalog. There are specific datasets for Billing Usage and Pricing as well as some pre-created dashboards available on our demo site.
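
As a rough sketch of the kind of chargeback query these tables enable, the following aggregates DBU consumption by a cost-center tag. The columns follow the documented system.billing.usage schema, but the cost_center tag key is a hypothetical example; substitute the tag your organization applies. For monetary estimates, you can additionally join to system.billing.list_prices.

-- Daily DBU consumption per cost-center tag (tag key is an example)
SELECT usage_date,
       sku_name,
       custom_tags['cost_center'] AS cost_center,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
GROUP BY usage_date, sku_name, custom_tags['cost_center']
ORDER BY usage_date DESC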

Tip #3 Consider Overwatch for some Scenarios

Overwatch, a tool from Databricks Labs, is a great choice for customers who want to surface cost metrics but don't have Unity Catalog. The free tool sets up daily jobs to collect data and provides dashboards to surface it.

Tip #4 Leverage your Databricks Admin

Your account admin can access usage monitoring through the account console or API, where usage is aggregated to the workspace level. They can also get more granular cluster information by configuring daily usage log delivery, which arrives as a CSV file in storage and can be retained for as long as needed. Alternatively, you could configure billable usage events in your audit log delivery.

Tip #5 Be Proactive

To achieve effective cost control, Databricks recommends a proactive approach. Best Practices for Cost Management on Databricks has a number of tips to help here, along with several options to consider when collecting information around cost control.

Efficiency of Resources

To maximize the value delivered, monitoring and controlling costs alone is not sufficient; you also need to ensure your resources are utilized effectively. Monitor resource utilization across clusters and correlate it with job metrics to right-size clusters. Analyze historical usage patterns and performance trends to prevent over- or under-provisioning, and scale resources based on that data.

Tip #1 Use the metrics tab on the cluster details page 

For Databricks Runtime 13.0 and above, you can view historical metrics at the cluster and node level using the native Databricks UI. This includes metrics for:

  • Hardware: CPU, Memory, Filesystem, Network
  • Spark: Task status, duration and shuffles 
  • GPU, if the instance is GPU-enabled.

Below Databricks Runtime 13.0, cluster metrics are gathered using Ganglia. 
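
If you also want these hardware metrics outside the UI and Unity Catalog system tables are enabled, the node timeline system table (public preview) exposes per-node utilization. This is a minimal sketch assuming the documented system.compute.node_timeline schema; verify availability and column names in your workspace.

-- Average CPU and memory utilization per cluster over the last 24 hours
SELECT cluster_id,
       AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_percent,
       AVG(mem_used_percent) AS avg_mem_percent
FROM system.compute.node_timeline
WHERE start_time >= current_timestamp() - INTERVAL 1 DAY
GROUP BY cluster_id
ORDER BY avg_cpu_percent DESC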

Tip #2 Be aware SQL Warehouses have their own Monitoring tab

SQL Warehouses have their own monitoring tab where you can get statistics about the running workloads (see the sketch after this list), such as:

  • Running and Queued Queries
  • Cluster count
  • Query history including duration and status
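
If you prefer to track this programmatically and the query history system table (public preview) is available in your account, a query along the lines below surfaces the slowest recent statements. Treat the column names as assumptions taken from the documented system.query.history schema and verify them before relying on the query.

-- Slowest statements across SQL warehouses over the past week (column names assumed)
SELECT executed_by,
       execution_status,
       total_duration_ms,
       statement_text
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY total_duration_ms DESC
LIMIT 50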

Tip #3 Surface metrics using a Dashboard or 3rd Party

You can also install agents (e.g. Datadog) on cluster nodes to send metrics to a third-party account, and make use of cloud vendor tools like Azure Monitor, AWS CloudWatch, and GCP Monitoring to provide high-level monitoring for cluster nodes. 

Tip #4 Track Changes to Configuration over time

Enable Unity Catalog to make use of system tables, where you can check the history of your cluster configuration using the cluster and node system tables.

-- Configuration change history for clusters owned by a specific user
SELECT cluster_id, cluster_name, create_time, delete_time, change_time, tags
FROM system.compute.clusters
WHERE owned_by = 'emma.humphrey@databricks.com'
ORDER BY change_time DESC

Tip #5 Know where to get additional information

Use the ‘Driver logs’ and ‘Spark UI’ tabs on the cluster to dive into driver and worker logs. 

Consider delivering driver and worker logs to DBFS if you need to consume them via other tools or retain them for longer than 30 days.

If you are using Delta Live Tables (DLT), you can use the DLT event log to get additional information such as audit events, data quality check results, and pipeline progress.
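
As a minimal sketch, for a Unity Catalog-enabled pipeline you can query the event log with the event_log table-valued function; the table name below is a hypothetical placeholder for a table the pipeline publishes.

-- Warnings and errors from a DLT pipeline's event log (table name is a placeholder)
SELECT timestamp, event_type, level, message
FROM event_log(TABLE(my_catalog.my_schema.my_dlt_table))
WHERE level IN ('WARN', 'ERROR')
ORDER BY timestamp DESC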

Operational Issue Detection

Logs and metrics provide operational insights, such as visibility into the job failures, errors, and warnings that help diagnose and debug issues. Monitoring can help identify performance bottlenecks, slow-running jobs, and resource contention issues that impact job execution times. It is critical for meeting SLAs and for a system's ability to adapt to changes in load, and it can also help identify opportunities to optimize pipelines and improve platform stability and performance.

Tip #1 Design and manage notifications

An administrator can create and manage alert destinations, such as Slack, PagerDuty, and webhooks, to make notification delivery seamless. Additionally, the Databricks platform lets you expose supplementary data via APIs, event hooks in DLT, and custom code. Your business should have a clear plan for how notifications will be delivered.

Tip #2 Set up alerts for critical workflows and processes

Typically, orchestration tools schedule and trigger workloads, and any error messages encountered are raised and reported to the orchestrator. Databricks Workflows is the native orchestrator for Data & AI workloads and includes a number of features for monitoring jobs and workflows, including a real-time insights dashboard, advanced task tracking, and alerting capabilities. For example, you can configure an expected duration for a job and get a notification when it runs slow. 

You can read about all the latest work we have done in Never Miss a Beat: Announcing New Monitoring and Alerting capabilities in Databricks Workflows.

You don't need to be in a workflow to make use of notifications. You can use the DBSQL UI to create alerts on operational event logs and send a notification when a condition is met. 

Delta Live Tables (DLT) also surfaces a lot of information in the UI, as well as an event log you can query. You can also get email notifications for pipeline events.

Tip #3 Make sure you have a workflow dashboard that meets your needs

The job runs UI can break down the job into its respective tasks to help quickly troubleshoot or identify a stuck task. 

You can also get a high-level summary of your workflows.

If you have a large number of pipelines, you may also want to export data via the Jobs API to Delta tables and create your own custom dashboard. There you can show the SLAs of all pipelines and how many are breaching, and customize it to meet your needs.

Tip #4 Identify Bottlenecks and Optimize Configuration

You can use the driver and worker logs to track the execution of your code. Worker logs give insights such as:

  • whether a job or piece of code is bottlenecked by CPU, memory, or disk I/O
  • how different configuration settings affect the performance of your jobs

In addition to this, the cluster event log displays important lifecycle events like creation, termination, and configuration edits.

Tip #5 Use Query Profile to help optimize long-running queries and queries that run often

If you are using SQL warehouses, you can use the query profile to help understand performance. You can see each query task and its metrics, like time, rows processed, and memory use. This helps find slowdowns, assess query changes, and catch SQL anti-patterns like exploding joins or full table scans. It's important to focus both on long-running queries and on those that appear to run quickly but execute many times: saving 1 second on a query that runs every minute can have more impact than reducing a 30-minute query by 10 minutes when it only runs once per day.

Data Quality

As well as monitoring the platform itself, you will likely hold key metrics within the platform which connect to critical business Key Performance Indicators (KPIs). These metrics are the pulse of your operations, reflecting the success of your strategic objectives and business goals. High-quality data provides a solid foundation for making informed and accurate decisions. Decision-makers rely on data to assess trends, forecast outcomes, and set strategies. Poor data quality can lead to faulty decisions or compliance breaches, which can have significant consequences.

Tip #1 Monitor your data not just your processes

Databricks Lakehouse Monitoring lets you monitor the statistical properties and quality of the data in all of the tables in your account, as well as the performance of machine learning models and model-serving endpoints. This allows you to react to changes in the data, which could be due to data quality issues or drift.

If you are in a region where Lakehouse Monitoring isn’t yet supported, don’t forget you can also use DBSQL Alerts to send notifications when metrics fall below defined thresholds.

Tip #2 Set Expectation rules to detect quality issues in the data

If you are using Delta Live Tables (DLT), you can use expectations to manage data quality. You can choose whether to drop, warn on, or quarantine rows that violate the expectations, or fail the pipeline altogether. The violations are then reported in the logs so the results can be queried. 

-- Rows in validation_copy with no matching key in report violate the expectation
CREATE TEMPORARY LIVE TABLE report_compare_tests(
  CONSTRAINT no_missing_records EXPECT (r.key IS NOT NULL)
)
AS SELECT * FROM LIVE.validation_copy v
LEFT OUTER JOIN LIVE.report r ON v.key = r.key

If you don’t yet use DLT, you can also set up basic constraints to control quality.
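
For example, Delta tables support NOT NULL and CHECK constraints that reject bad writes outright; the table and column names below are hypothetical placeholders.

-- Enforce basic quality rules on a Delta table (names are placeholders)
ALTER TABLE main.sales.orders ALTER COLUMN order_id SET NOT NULL;
ALTER TABLE main.sales.orders ADD CONSTRAINT valid_amount CHECK (amount >= 0);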

Tip #3 Track Data Quality with a custom dashboard

If you use DLT, Databricks has a demo that includes a Data Quality Stats Dashboard. If you don't use DLT, you can do something similar using your own queries.
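
As a hypothetical sketch of the kind of query you might schedule and chart on such a dashboard (the table and column names are placeholders):

-- Simple daily quality metrics for a table (names are placeholders)
SELECT current_date() AS check_date,
       COUNT(*) AS row_count,
       COUNT_IF(customer_id IS NULL) AS null_customer_ids,
       COUNT_IF(email NOT RLIKE '^[^@]+@[^@]+$') AS invalid_emails
FROM main.crm.customers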

Tip #4 Embrace the Lakehouse architecture

The lakehouse is designed with data quality in mind; it helps prevent the data duplication and drift that come with maintaining separate data warehouses and lakes. This page is an excellent summary of the data quality principles and how they have been applied to the lakehouse. As data moves through the medallion architecture, it's important to monitor data quality at every step, applying the techniques suggested. 

Conclusion

Monitoring and observability are critical for operating Databricks efficiently, minimizing issues, and satisfying compliance requirements. 

Databricks provides a variety of tools and features to help you surface monitoring data. Make sure you are:

  • Collecting data that meets your monitoring and observability needs.
  • Making the data accessible via dashboards (built-in or custom). You can have both a real-time health view and historic trend analysis; trending metrics over time can help you identify long-term patterns, surfacing both problems and opportunities.
  • Turning on notifications where possible and using Databricks SQL to configure alerts, so that you can be notified of important events or trends. Alerts can help with early detection of issues or changes.

With the right observability in place, you can get the most out of your Databricks investment.