
Databricks with CloudWatch metrics without Instanceid dimension

dceman
New Contributor

I have jobs running on job clusters, and I want to send metrics to CloudWatch. I set up the CloudWatch agent following this guide.

The issue is that I can't create a useful metrics dashboard or alarms, because every metric carries the InstanceId dimension, and the InstanceId is different on every job run. If you check the link above, you will find an init script; the part of the JSON that configures this dimension on the agent is:

{
    ...
    "append_dimensions": {
        "InstanceId": "${aws:InstanceId}"
    }
}

I removed this and added a custom dimension to each metric, something like this:

{
    "agent": {
        "metrics_collection_interval": 10,
        "logfile": "/var/log/amazon-cloudwatch-agent.log",
        "omit_hostname": true,
        "debug": true
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/databricks/spark/work/*/*/stderr",
                        "log_group_name": "/databricks/$NAMESPACE",
                        "log_stream_name": "executor-stderr"
                    },
                    {
                        "file_path": "/databricks/spark/work/*/*/stdout",
                        "log_group_name": "/databricks/$NAMESPACE",
                        "log_stream_name": "executor-stdout"
                    }
                ]
            }
        }
    },
    "metrics": {
        "namespace": "$NAMESPACE",
        "metrics_collected": {
            "statsd": {
                "service_address": ":8125"
            },
            "cpu": {
                "resources": [
                    "*"
                ],
                "measurement": [
                    {
                        "name": "cpu_usage_idle",
                        "rename": "EXEC_CPU_USAGE_IDLE",
                        "unit": "Percent"
                    },
                    {
                        "name": "cpu_usage_iowait",
                        "rename": "EXEC_CPU_USAGE_IOWAIT",
                        "unit": "Percent"
                    },
                    {
                        "name": "cpu_time_idle",
                        "rename": "EXEC_CPU_TIME_IDLE",
                        "unit": "Percent"
                    },
                    {
                        "name": "cpu_time_iowait",
                        "rename": "EXEC_CPU_TIME_IOWAIT",
                        "unit": "Percent"
                    }
                ],
                "totalcpu": true,
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "disk": {
                "resources": [
                    "/"
                ],
                "measurement": [
                    {
                        "name": "disk_free",
                        "rename": "EXEC_DISK_FREE",
                        "unit": "Gigabytes"
                    },
                    {
                        "name": "disk_inodes_free",
                        "rename": "EXEC_DISK_INODES_FREE",
                        "unit": "Count"
                    },
                    {
                        "name": "disk_inodes_total",
                        "rename": "EXEC_DISK_INODES_TOTAL",
                        "unit": "Count"
                    },
                    {
                        "name": "disk_inodes_used",
                        "rename": "EXEC_DISK_INODES_USED",
                        "unit": "Count"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "diskio": {
                "resources": [
                    "*"
                ],
                "measurement": [
                    {
                        "name": "diskio_iops_in_progress",
                        "rename": "EXEC_DISKIO_IOPS_IN_PROGRESS",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "diskio_read_time",
                        "rename": "EXEC_DISKIO_READ_TIME",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "diskio_write_time",
                        "rename": "EXEC_DISKIO_WRITE_TIME",
                        "unit": "Megabytes"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "mem": {
                "measurement": [
                    {
                        "name": "mem_available",
                        "rename": "EXEC_MEM_AVAILABLE",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "mem_total",
                        "rename": "EXEC_MEM_TOTAL",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "mem_used",
                        "rename": "EXEC_MEM_USED",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "mem_used_percent",
                        "rename": "EXEC_MEM_USED_PERCENT",
                        "unit": "Megabytes"
                    },
                    {
                        "name": "mem_available_percent",
                        "rename": "EXEC_MEM_AVAILABLE_PERCENT",
                        "unit": "Megabytes"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            },
            "net": {
                "resources": [
                    "eth0"
                ],
                "measurement": [
                    {
                        "name": "net_bytes_recv",
                        "rename": "EXEC_NET_BYTES_RECV",
                        "unit": "Bytes"
                    },
                    {
                        "name": "net_bytes_sent",
                        "rename": "EXEC_NET_BYTES_SENT",
                        "unit": "Bytes"
                    }
                ],
                "append_dimensions": {
                    "node": "$EXECUTOR_ID"
                }
            }
        }
    }
}

and a similar JSON when the init script runs on the driver. In that case the dimension is:

"append_dimensions": {
                    "node": "driver"
                }

I would like the end result to look something like this:

[image: desired CloudWatch dashboard with per-node metric series]

but, as you can see, I'm missing the executor number suffix.

I tried to parse this environment variable:

EXECUTOR_ID="executor-$(echo $SPARK_LOG_URL_STDOUT | cut -f2 -d'&' | cut -f2 -d'=')"

but it seems that this variable is not set on executors.
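One likely reason: the Spark executor ID is assigned after init scripts have already run, so it cannot be read there. A hedged workaround is to build a stable per-node label from variables that init scripts do see, such as `DB_IS_DRIVER` (which Databricks sets to `TRUE` on the driver) and `DB_CONTAINER_IP`; the "last IP octet" suffix below is a made-up stand-in, not the real Spark executor number:

```shell
#!/bin/bash
# Sketch under assumptions: derive a per-node label at init-script time.
# DB_IS_DRIVER / DB_CONTAINER_IP are Databricks init-script variables;
# the "executor-<last IP octet>" suffix is a stand-in, NOT the Spark
# executor ID (which does not exist yet when init scripts run).
if [ "$DB_IS_DRIVER" = "TRUE" ]; then
  EXECUTOR_ID="driver"
else
  EXECUTOR_ID="executor-${DB_CONTAINER_IP##*.}"
fi
export EXECUTOR_ID
echo "node label: $EXECUTOR_ID"
```

Since the `node` dimension then takes a bounded, predictable set of values, dashboards and alarms can target e.g. `node=driver` without knowing instance IDs.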

Could you please suggest how to get the executor ID in the init bash script? Is there a useful environment variable I can read on executors?

Why is this important to us?

We will have many jobs running every hour, and each job cluster is terminated after its run.

In that case the InstanceId dimension is not useful: it is different on every run, so the metrics are hard to work with.

Maybe I could use a CloudWatch metrics query to group by InstanceId, but it's not possible to set an alarm on such a query.

0 REPLIES