
Databricks with CloudWatch metrics without InstanceId dimension

dceman
New Contributor

I have jobs running on job clusters, and I want to send their metrics to CloudWatch. I set up the CloudWatch (CW) agent by following this guide.

The issue is that I can't create useful metrics dashboards and alarms, because every metric carries the InstanceId dimension, and InstanceId is different on every job run. If you check the link above, you will find an init script; the part of the JSON that configures the CW agent is

{
    ...
    "append_dimensions": {
        "InstanceId": "${aws:InstanceId}"
    }

I removed this and added a custom dimension to each metric, something like this:

{
    "agent": {
        "metrics_collection_interval": 10,
        "logfile": "/var/log/amazon-cloudwatch-agent.log",
        "omit_hostname": true,
        "debug": true
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/databricks/spark/work/*/*/stderr",
                        "log_group_name": "/databricks/$NAMESPACE",
                        "log_stream_name": "executor-stderr"
                    },
                    {
                        "file_path": "/databricks/spark/work/*/*/stdout",
                        "log_group_name": "/databricks/$NAMESPACE",
                        "log_stream_name": "executor-stdout"
                    }
                ]
            }
        }
    },
    "metrics": {
        "namespace": "$NAMESPACE",
        "metrics_collected": {
            "statsd": {
                "service_address": ":8125"
            },
            "cpu": {
                "resources": [
                    "*"
                ],
                "measurement": [
                    {
                        "name": "cpu_usage_idle",
                        "rename": "EXEC_CPU_USAGE_IDLE",
                        "unit": "Percent"
                    },
                    {
                        "name": "cpu_usage_iowait",
                        "rename": "EXEC_CPU_USAGE_IOWAIT",
                        "unit": "Percent"
                    },
                    {
                        "name": "cpu_time_idle",
                        "rename": "EXEC_CPU_TIME_IDLE",
                        "unit": "None"
                    },
                    {
                        "name": "cpu_time_iowait",
                        "rename": "EXEC_CPU_TIME_IOWAIT",
                        "unit": "None"
                    }
                ],
                "totalcpu": true,
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "disk": {
                "resources": [
                    "/"
                ],
                "measurement": [
                    {
                        "name": "disk_free",
                        "rename": "EXEC_DISK_FREE",
                        "unit": "Bytes"
                    },
                    {
                        "name": "disk_inodes_free",
                        "rename": "EXEC_DISK_INODES_FREE",
                        "unit": "Count"
                    },
                    {
                        "name": "disk_inodes_total",
                        "rename": "EXEC_DISK_INODES_TOTAL",
                        "unit": "Count"
                    },
                    {
                        "name": "disk_inodes_used",
                        "rename": "EXEC_DISK_INODES_USED",
                        "unit": "Count"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "diskio": {
                "resources": [
                    "*"
                ],
                "measurement": [
                    {
                        "name": "diskio_iops_in_progress",
                        "rename": "EXEC_DISKIO_IOPS_IN_PROGRESS",
                        "unit": "Count"
                    },
                    {
                        "name": "diskio_read_time",
                        "rename": "EXEC_DISKIO_READ_TIME",
                        "unit": "Milliseconds"
                    },
                    {
                        "name": "diskio_write_time",
                        "rename": "EXEC_DISKIO_WRITE_TIME",
                        "unit": "Milliseconds"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "mem": {
                "measurement": [
                    {
                        "name": "mem_available",
                        "rename": "EXEC_MEM_AVAILABLE",
                        "unit": "Bytes"
                    },
                    {
                        "name": "mem_total",
                        "rename": "EXEC_MEM_TOTAL",
                        "unit": "Bytes"
                    },
                    {
                        "name": "mem_used",
                        "rename": "EXEC_MEM_USED",
                        "unit": "Bytes"
                    },
                    {
                        "name": "mem_used_percent",
                        "rename": "EXEC_MEM_USED_PERCENT",
                        "unit": "Percent"
                    },
                    {
                        "name": "mem_available_percent",
                        "rename": "EXEC_MEM_AVAILABLE_PERCENT",
                        "unit": "Percent"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "net": {
                "resources": [
                    "eth0"
                ],
                "measurement": [
                    {
                        "name": "net_bytes_recv",
                        "rename": "EXEC_NET_BYTES_RECV",
                        "unit": "Bytes"
                    },
                    {
                        "name": "net_bytes_sent",
                        "rename": "EXEC_NET_BYTES_SENT",
                        "unit": "Bytes"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            }
        }
    }
}
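
The `$NAMESPACE` and `$EXECUTOR_ID` placeholders in the config above are presumably filled in by the init script when it writes the file. A minimal sketch of that step, assuming the script uses an unquoted bash heredoc; the function name `render_config_fragment` and the sample values are hypothetical, and the config is shortened to a single metric:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a shortened agent config with the two
# shell variables expanded. The heredoc delimiter is unquoted (EOF, not
# 'EOF'), so bash substitutes $NAMESPACE and $EXECUTOR_ID in the body.
render_config_fragment() {
    local NAMESPACE="$1"
    local EXECUTOR_ID="$2"
    cat <<EOF
{
    "metrics": {
        "namespace": "$NAMESPACE",
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            }
        }
    }
}
EOF
}
```

For example, `render_config_fragment my-job executor-1` prints a config whose namespace is `my-job` and whose `node` dimension is `executor-1`.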

I use a similar JSON when the init script is running on the driver. In that case the dimension looks like

"append_dimensions": {
    "node": "driver"
}
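
The driver/executor split described here could be sketched as follows. This assumes the init script relies on `DB_IS_DRIVER`, the environment variable Databricks sets for init scripts (`TRUE` on the driver); the function name `node_dimension` and the fallback label `executor` are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose the value of the "node" dimension.
# Arguments are optional and fall back to the environment:
#   $1 - "TRUE" when running on the driver (default: $DB_IS_DRIVER)
#   $2 - label to use on executors        (default: $EXECUTOR_ID)
node_dimension() {
    local is_driver="${1:-${DB_IS_DRIVER:-FALSE}}"
    local executor_id="${2:-${EXECUTOR_ID:-executor}}"
    if [[ "$is_driver" == "TRUE" ]]; then
        echo "driver"
    else
        echo "$executor_id"
    fi
}
```

Calling `node_dimension` with no arguments inside an init script would then yield `driver` on the driver node and whatever `EXECUTOR_ID` holds (or the plain `executor` fallback) elsewhere.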

I would like the end result to look something like this:

image 

As you can see, though, I'm missing the executor number suffix.

I tried to parse this environment variable:

EXECUTOR_ID="executor-$(echo $SPARK_LOG_URL_STDOUT | cut -f2 -d'&' | cut -f2 -d'=')"

but it seems that this variable is not available on the executors.
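
A slightly more defensive version of that parsing is sketched below. It assumes the URL carries an `executorId=` query parameter, which is inferred from the `cut` pipeline above rather than confirmed; the function name `executor_id_from_url` and the sample URL are hypothetical. It looks the parameter up by name instead of by position and fails cleanly when the variable is empty:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the executorId query parameter from a
# log URL and print it as "executor-<id>". Returns 1 and prints nothing
# when the URL is empty or has no executorId parameter.
executor_id_from_url() {
    local url="$1"
    # Drop everything up to and including "executorId=".
    local rest="${url#*executorId=}"
    # If nothing was removed, the parameter was not present.
    if [[ -z "$url" || "$rest" == "$url" ]]; then
        return 1
    fi
    # Keep only the value by cutting at the next "&".
    echo "executor-${rest%%&*}"
}

# Usage: fall back to a plain label when the variable is not set.
EXECUTOR_ID="$(executor_id_from_url "${SPARK_LOG_URL_STDOUT:-}")" || EXECUTOR_ID="executor"
```

This does not make the variable appear on the executors; it only keeps the init script from producing a malformed dimension value when the variable is missing.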

Could you please suggest how to get the executor ID in an init bash script? Is there a useful environment variable I can read on the executors?

Why is this important to us?

We will have a lot of jobs running every hour. After each job run, the job cluster is terminated.

In that case the InstanceId dimension is not useful: it is different every time, so the metrics cannot be tracked across runs.

Maybe I could use a CW Metrics query to group by InstanceId, but it's not possible to set an alarm on a CW query.
