Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi, I would like to request assistance on how to collect usage metrics and job execution data for my Databricks environment. We are currently not using Unity Catalog, but I would still like to monitor and analyze usage. Could you please provide guidance...
I'm having difficulty adding a mask function to columns while creating streaming tables with the DLT Python method create_streaming_table(). I do it like this, but it does not work: the streaming table is created, but no column is masked: def prepare_column_pro...
@NamNguyenCypher Delta Live Tables’ Python API does not currently honor column-mask metadata embedded in a PySpark StructType. Masking (and row filters) on DLT tables are only applied when you define your table with a DDL-style schema that includes a...
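For illustration, a minimal sketch of that DDL-schema approach, assuming Unity Catalog and a pre-created SQL mask UDF (the table, column, catalog, and function names here are placeholders):

```python
import dlt

# Passing a DDL string (not a StructType) to `schema` lets DLT apply the
# column mask. `main.security.mask_email` is a hypothetical SQL UDF that
# must already exist.
dlt.create_streaming_table(
    name="customers_masked",
    schema="""
        customer_id BIGINT,
        email STRING MASK main.security.mask_email
    """,
)
```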
How do I use the change data feed when the Delta table schema changes between table versions? I tried to read the change data feed in parts (in the code snippet I read version 1372, because versions 1371 and 1373 have different schemas), but I get the error Unsupporte...
@LasseL When you read from the change data feed in batch mode, Delta Lake always uses a single schema:
- By default, it uses the latest table version's schema, even if you're only reading an older version.
- On Databricks Runtime ≥ 12.2 LTS with column mapping e...
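As a concrete illustration of reading one version at a time in batch mode (the table name is a placeholder; the version number comes from the question):

```python
# Pin the batch CDF read to version 1372 only. Note that the schema used
# for the read is still resolved per the rules above, not per the pinned
# version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1372)
    .option("endingVersion", 1372)
    .table("my_catalog.my_schema.my_table")  # placeholder name
)
changes.show()
```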
Hello, suddenly since last night some of our DLT pipelines are failing, saying that our hive_metastore control table cannot be found. All of our DLTs are set up the same (serverless), plus one Shared Compute cluster on runtime version 15.4. For ...
I am using Delta Live Tables and Pub/Sub to ingest messages from 30 different topics in parallel. I noticed that initialization time can be very long, around 15 minutes. Does someone know how to reduce initialization time in DLT? Thank you.
Classic clusters can take up to seven minutes to be acquired, configured, and deployed, with most of this time spent waiting for the cloud service to allocate virtual machines. In contrast, serverless clusters typically start in under eight seconds. ...
I'm trying to calculate the cost of a job using the usage and list_prices system tables, but I'm encountering some unexpected behavior that I can't explain. When I run a job using a shared cluster, the sku_name in the usage table is PREMIUM_JOBS_SERVE...
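For reference, the usual starting point is to join the two system tables on SKU and price validity window; a hedged sketch (column names follow the documented system.billing schemas, and it does not address the SKU discrepancy described above):

```python
# Approximate per-job cost: DBUs consumed times the list price in effect
# at the time the usage was recorded.
job_costs = spark.sql("""
    SELECT
      u.usage_metadata.job_id AS job_id,
      u.sku_name,
      SUM(u.usage_quantity * p.pricing.default) AS estimated_cost
    FROM system.billing.usage u
    JOIN system.billing.list_prices p
      ON u.sku_name = p.sku_name
     AND u.usage_start_time >= p.price_start_time
     AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
    WHERE u.usage_metadata.job_id IS NOT NULL
    GROUP BY u.usage_metadata.job_id, u.sku_name
""")
job_costs.show()
```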
I need to get my compute metrics, but not from the UI... The system tables don't have much information, and node_timeline records metrics per minute, so it's difficult to calculate each compute's CPU usage per day. Is there any way we can get the CPU usage, CPU idle time, M...
To calculate CPU usage, CPU idle time, and memory usage per cluster per day, you can use the system.compute.node_timeline system table. However, since the data in this table is recorded at per-minute granularity, it’s necessary to aggregate the data ...
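A hedged sketch of that aggregation (column names follow the documented node_timeline schema; idle time is approximated as 100 minus the busy and wait percentages):

```python
# Roll per-minute node samples up to one row per cluster per day.
daily = spark.sql("""
    SELECT
      cluster_id,
      DATE(start_time) AS usage_date,
      AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_busy_pct,
      AVG(100 - cpu_user_percent - cpu_system_percent - cpu_wait_percent)
        AS avg_cpu_idle_pct,
      AVG(mem_used_percent) AS avg_mem_used_pct
    FROM system.compute.node_timeline
    GROUP BY cluster_id, DATE(start_time)
    ORDER BY cluster_id, usage_date
""")
daily.show()
```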
Hi guys, I am trying to use the DLT Publish Event Log to Metastore feature, and I noticed it creates a table with the logs for each DLT pipeline separately. Does it mean it maintains a separate log table for all the DLT tables (in our case, we have 100...
Hi @ankit001mittal Yes, you're right, when you enable the "Publish Event Log to Metastore" option for DLT pipelines, Databricks creates a separate event log table for each pipeline. So, if you have thousands of pipelines, you'll see thousands of log ...
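If a single consolidated view is needed, one option is to union the per-pipeline tables; a minimal sketch, assuming the logs land in one schema and share a name prefix (both the schema name and the prefix are placeholders for your naming convention):

```python
from functools import reduce

# Find every published event log table in the schema and union them.
log_tables = [
    t.name
    for t in spark.catalog.listTables("monitoring.dlt_event_logs")  # placeholder
    if t.name.startswith("event_log_")                              # placeholder
]
all_logs = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    [spark.table(f"monitoring.dlt_event_logs.{name}") for name in log_tables],
)
all_logs.createOrReplaceTempView("all_dlt_event_logs")
```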
I have a parent job that calls multiple child jobs in a workflow. Out of 10 child jobs, one has failed and the other nine are still running. I want to repair the failed child task. Can I do that while the other child jobs are running?
Hi holychs, how are you doing today? As per my understanding, yes, in Databricks Workflows, if you're running a multi-task job (like your parent job triggering multiple child tasks), you can repair only the failed task without restarting the entire j...
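For illustration, repairing a single failed task through the Databricks SDK for Python (the run ID and task key are placeholders):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Rerun only the failed task of an existing job run; tasks not listed in
# rerun_tasks are left untouched.
w.jobs.repair_run(
    run_id=123456789,             # placeholder: the parent job's run ID
    rerun_tasks=["child_job_3"],  # placeholder: task_key of the failed task
)
```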
If there are two different DABs, can we have a dependency so that a job from one DAB runs after a job from another DAB? Similar to how tasks can depend on each other to run one after the other in the same DAB. Can we have the same for two differ...
@vivi007 Yes, you can create dependencies between jobs in different DABs (Databricks Asset Bundles), but this requires a different approach than task dependencies within a single DAB. Since DABs are designed to be independently deployable units, direc...
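One common workaround is to look up the upstream job's ID and wrap it in a run_job_task in the downstream bundle; a hedged YAML sketch (the job names, the variable, and the notebook path are placeholders):

```yaml
# In the downstream bundle: run the other bundle's job first, then continue.
resources:
  jobs:
    downstream_job:
      name: downstream-job
      tasks:
        - task_key: run_upstream
          run_job_task:
            job_id: ${var.upstream_job_id}  # ID of the job deployed by the other DAB
        - task_key: main_work
          depends_on:
            - task_key: run_upstream
          notebook_task:
            notebook_path: ../src/main_work.py
```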
Hi all, I am trying to deploy a DBX app via DAB; however, source_code_path seems not to be parsed correctly into the app configuration.

dbx_dash/
  resources/
    app.yml
  src/
    app.yaml
    app.py
  databricks.yml

resources/app.yml:

resources:
  apps:
    m...
Hello, I am learning to create DLT pipelines with different graphs using a 14-day trial of premium Databricks. I currently have one graph: Mat view -> Streaming Table -> Mat view. When I ran this pipeline (serverless compute) the 1st time, ran...
dbutils.fs.mv with ADLS currently copies the file and then deletes the old one. This incurs costs and has a lot of overhead compared to the rename operation in ADLS, which is instant and doesn't incur the extra costs of writing the 'new' data...
The tool is really meant for DBFS and is only accessible from within Databricks. If I had to guess, the idea is that most folks will not be using DBFS for production or sensitive data (for a host of good reasons), and as such there has not been a big ...
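As a workaround, the ADLS Gen2 rename API can be called directly; a minimal sketch, assuming the azure-storage-file-datalake package, a DefaultAzureCredential that can reach the account, and placeholder account, container, and path names:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# ADLS Gen2 rename is a metadata-only operation: no data is rewritten.
service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("my-container")        # placeholder
file_client = fs.get_file_client("raw/old_name.parquet")   # placeholder
# The new name must be given as "<filesystem>/<full path>".
file_client.rename_file("my-container/raw/new_name.parquet")
```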
Hi everyone, I was looking into the databricks_workspace_conf Terraform resource to configure Verbose Audit Logs (and avoid changing it through the UI). However, when I attempted to apply this configuration I encountered the following error: Error: cannot...
@fedemgp I was able to turn the desired setting on and off with Terraform with this code: GitHub Gist. I'm using Databricks Terraform provider version 1.74.0, and my Databricks workspace runs on Google Cloud.
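The gist itself isn't reproduced here; for orientation, a minimal configuration along these lines matches the provider docs (the exact setting key is hedged as the commonly used one):

```hcl
resource "databricks_workspace_conf" "this" {
  custom_config = {
    "enableVerboseAuditLogs" = "true"
  }
}
```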
Hi guys, I can see that DLT pipelines have a query history section where we can see the duration of each table and the number of rows read. Is this information stored somewhere in the system catalog? Can I query it?