Datadog, OpenTelemetry, and Databricks container service
11-18-2024 01:00 PM
We have successfully installed the Datadog agent on Databricks clusters via an init script, and that part is working fine. We are now instrumenting our jobs using the OpenTelemetry (OTLP) endpoint feature of the Datadog agent, which requires communicating with the agent over HTTP (there is also a Unix socket option, but we would prefer HTTP). This works fine when running directly on the driver and worker nodes.
However, we run our jobs using Databricks Container Services, and the processes inside the container seem unable to reach the host instance where the agent is running.
Has anyone found a solution or workaround for this?
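For context, a common Docker workaround for "container can't reach the host" is to target the container's default gateway, which on a bridge network is normally the host side of the bridge. The following is a minimal sketch, not a confirmed fix for Databricks Container Services; it assumes bridge networking and that the agent's OTLP receiver is bound to a non-loopback interface (the endpoint URL and port 4318, the conventional OTLP/HTTP port, are assumptions to verify against your agent config):

```python
import socket
import struct

def default_gateway(route_table: str) -> str:
    """Extract the default-gateway IP from the contents of /proc/net/route.

    On a Docker bridge network this is typically the host side of the
    bridge, so an agent listening on the host may be reachable there.
    """
    for line in route_table.splitlines()[1:]:
        fields = line.split()
        # Destination 00000000 with the RTF_GATEWAY flag (0x2) set marks
        # the default route.
        if len(fields) >= 4 and fields[1] == "00000000" and int(fields[3], 16) & 2:
            # The gateway column is a little-endian hex IPv4 address.
            return socket.inet_ntoa(struct.pack("<L", int(fields[2], 16)))
    raise RuntimeError("no default route found")

# Hypothetical usage inside the container:
#   with open("/proc/net/route") as f:
#       host_ip = default_gateway(f.read())
#   endpoint = f"http://{host_ip}:4318/v1/traces"  # 4318 = conventional OTLP/HTTP port
```

Whether this works depends on how Databricks wires up container networking, so treat it as something to test rather than a known-good answer.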
11-18-2024 03:49 PM - edited 11-18-2024 03:49 PM
When the init script installs the agents, it installs them inside the Spark containers (all user workloads and Spark processes run in the container). Users don't have direct access to the host machine and can't install agents there. You may need to enable cluster log delivery to inspect the init script execution logs. Also, try running simple commands from a notebook job to check the agent status and gather other diagnostics for these agents. Cluster access mode may matter here as well: Shared/User Isolation modes have more restrictions than single-user clusters, in case the mode differs between your test cluster and your job cluster.
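To narrow down where connectivity breaks, a quick notebook-cell check like the following can be run both on a plain cluster and inside the container image. It is a generic TCP reachability probe, not a Datadog-specific tool; the port list (4317 OTLP/gRPC, 4318 OTLP/HTTP, 8126 APM) reflects conventional defaults and should be matched to your actual agent configuration:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the agent's conventional ports on localhost; compare the results
# from the driver node vs. from inside the container to see where the
# connectivity gap is.
for port in (4317, 4318, 8126):
    print(port, port_reachable("localhost", port))
```

If the ports are open on the driver but not in the container, the issue is the container's network namespace rather than the agent itself.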

