Federate AWS Cloudwatch logs to Databricks Unity Catalog

Vetrivel
Contributor

I am looking to integrate CloudWatch logs with Databricks. Our objective is not to monitor Databricks via CloudWatch, but rather to facilitate access to CloudWatch logs from within Databricks. If anyone has implemented a similar solution, kindly provide guidance.

#cloudwatch #federation

1 REPLY

mark_ott
Databricks Employee

To access CloudWatch logs from within Databricks, you can set up an integration that enables Databricks to fetch, query, and analyze AWS CloudWatch log data directly, without configuring CloudWatch to monitor Databricks clusters. This approach is increasingly popular for audit, operational, and troubleshooting purposes, especially when logs are stored in AWS and need to be analyzed in Databricks for downstream data engineering or analytics workflows.

Recommended Integration Methods

The most common methods to achieve CloudWatch logs access from within Databricks are:

  • Using AWS SDKs (boto3 for Python):

    • Install the boto3 Python library in your Databricks notebooks.

    • Set up appropriate AWS credentials via environment variables or Databricks secrets.

    • Use boto3 to programmatically query CloudWatch logs using the CloudWatch Logs APIs (filter_log_events, etc.), then load the results into a DataFrame for analysis.

    • Example snippet:

      python
      import boto3
      from datetime import datetime, timedelta, timezone

      # Credentials are picked up from the cluster's instance profile, environment
      # variables, or Databricks secrets; no keys need to be hard-coded here.
      client = boto3.client('logs', region_name='your-region')

      # CloudWatch expects epoch milliseconds; query the last hour as an example
      end_timestamp = int(datetime.now(timezone.utc).timestamp() * 1000)
      start_timestamp = end_timestamp - int(timedelta(hours=1).total_seconds() * 1000)

      response = client.filter_log_events(
          logGroupName='your-log-group',
          filterPattern='[pattern]',
          startTime=start_timestamp,
          endTime=end_timestamp,
      )

      # Convert response['events'] to a Pandas or Spark DataFrame for further processing
    • This method allows you to run queries, extract logs, and perform transformations using Databricks native tools.

  • Centralized Log Storage in S3:

    • Export CloudWatch logs to Amazon S3 using AWS native features or by creating log subscriptions.

    • Databricks can read logs directly from S3 using Spark's built-in capabilities.

    • This decouples log storage from CloudWatch and leverages Databricks' efficiency in reading large files from S3.

  • Federation with Databricks Unity Catalog:

    • In advanced setups, CloudWatch or S3-exported logs can be ingested into Databricks Unity Catalog tables via custom pipelines with Spark, or through partner tools such as Telegraf (which can push logs into Databricks using REST APIs); see the sketch after this list for the Spark-based route.
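
For example, a minimal sketch of the S3-to-Unity-Catalog path: it reads exported log files from S3 with Spark and appends them to a Unity Catalog managed Delta table. The bucket path, the main.observability.cloudwatch_logs table name, and the assumption that the export is newline-delimited JSON are all placeholders; adjust them to your export format and catalog layout.

python
# 'spark' is the SparkSession predefined in Databricks notebooks.
# Hypothetical S3 location where CloudWatch logs were exported or streamed.
logs_df = spark.read.json("s3://your-log-export-bucket/cloudwatch/exported-logs/")

# Append into a Unity Catalog managed Delta table so the logs can be queried
# with SQL and governed alongside other datasets (catalog/schema/table are placeholders).
(
    logs_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("main.observability.cloudwatch_logs")
)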

Best Practices and Configuration

  • Assign appropriate IAM roles to enable Databricks clusters to read logs from CloudWatch (via APIs) or S3.

  • Secure credentials and follow least-privilege principles when configuring cross-service access.

  • Consider batching and filtering queries, as CloudWatch API quotas can affect performance; a paginated sketch follows this list.

  • For near-real-time analysis, schedule regular jobs or use Delta Lake to store and refresh ingested logs within Databricks.
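
To illustrate the batching point: filter_log_events paginates results via nextToken, so a query is typically drained page by page. A minimal sketch, assuming a placeholder log group name and a 24-hour time window:

python
import boto3
from datetime import datetime, timedelta, timezone

client = boto3.client('logs', region_name='your-region')

# Placeholder window: the last 24 hours, in epoch milliseconds
end_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
start_ms = end_ms - int(timedelta(hours=24).total_seconds() * 1000)

events = []
kwargs = {'logGroupName': 'your-log-group', 'startTime': start_ms, 'endTime': end_ms}
while True:
    page = client.filter_log_events(**kwargs)
    events.extend(page.get('events', []))
    token = page.get('nextToken')
    if not token:
        break  # no more pages for this log group and window
    kwargs['nextToken'] = token

# 'events' now holds every matching record; convert it to a DataFrame or append to a Delta table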

Useful Resources

  • AWS documentation: "How to Monitor Databricks with Amazon CloudWatch" contains guidance on credentials and agent setup (even if you're not monitoring Databricks itself).

  • Databricks Community: Thread on sending logs from Databricks to CloudWatch, with comments on how native integrations are limited, and on why boto3 and S3 exports are preferred for custom use cases.

  • InfluxData/Telegraf: Describes a workflow to ingest logs from CloudWatch directly into Databricks for analytics.

By using one of these approaches, Databricks can access, process, and analyze CloudWatch logs securely and at scale, even without a native, direct integration.