I have a Bronze -> Silver -> Gold architecture for my ETL pipelines and all tables are Delta. I'm trying to understand what updates flow downstream when I make changes to the source table. Most importantly, if I run optimize on the source, does every...
When using a MERGE INTO statement, if the source data to be merged into the target Delta table is small enough to fit into the memory of the worker nodes, it makes sense to broadcast the source data. By doing so, the execution can avoid the...
Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, the tasks that already completed the write operation will re-run and cause duplicate records. This is expected by design. When Apache Spa...
It's recommended to use the overwrite option. Overwrite the table data and run a VACUUM command. To delete the data from a managed Delta table, the DROP TABLE command can be used. If it's an external table, then run a DELETE query on the table and th...
Customer wants to understand our strategy for breaking cluster logs into different partitions and files. They want to be able to ingest these logs into a tool that needs to understand this. They have indicated that the logs used to all be in one file...
Best practices for Databricks pools (Databricks Documentation): Learn best practices for configuring and using Databricks pools. https://docs.databricks.com/clusters/instance-pools/pool-best-practices.html Best practices for Azure Databricks pools - Azur...
Pre-emption is turned on by default on Databricks clusters. Turning pre-emption on or off makes the most sense on a high concurrency cluster. Pre-emption ensures that a job starving for resources gets a fair share of the resources available on ...
1) Find the corresponding library definition from an existing cluster using "libraries/cluster-status?cluster_id=<cluster_id>".
$ curl -X GET 'https://cust-success.cloud.databricks.com/api/2.0/libraries/cluster-status?cluster_id=1226-232931-cuffs129' ...
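For anyone scripting this step instead of using curl, here is a minimal Python sketch that builds the same GET request against the "libraries/cluster-status" endpoint shown above. The function name `build_cluster_status_request` is made up for illustration, and the bearer-token header is an assumption, since the authentication flags of the original curl command are cut off.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_cluster_status_request(host: str, cluster_id: str, token: str) -> Request:
    """Build (but do not send) a GET request for a cluster's library status.

    Hypothetical helper for illustration; not part of any Databricks SDK.
    """
    query = urlencode({"cluster_id": cluster_id})
    url = f"https://{host}/api/2.0/libraries/cluster-status?{query}"
    # Assumption: authentication uses a personal access token sent as a bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


# Host and cluster ID are the ones from the curl example; the token is a placeholder.
req = build_cluster_status_request(
    "cust-success.cloud.databricks.com",
    "1226-232931-cuffs129",
    "<personal-access-token>",
)
print(req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` and parse the response body with `json.load`. Building the request separately from sending it makes the URL easy to inspect before anything hits the workspace.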
I have a high concurrency cluster with multiple users running queries. However, I see the queries are running very slowly. I debugged the logs and see that more time is spent on the Spark driver. On the Spark UI, I do not see any slowness.
It's possible that connectivity to the Hive metastore is causing the delay here, particularly when there is a high degree of concurrency and contention for metastore access. Interactive clusters in DBR are configured to use up to 5 (spark.databricks.hive.metastore.cli...
Yes, that is correct. Every commit to a Delta table creates a version, so each micro-batch creates a version. More info: https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
I have data written in Delta on ADLS. As I understand it, Delta also stores its files internally in Parquet format, but when I read the files in different formats I get different record counts: spark.read.parquet() or spark.read.format('delta').load() df = spark.read.for...
I think you have written to Delta twice using overwrite mode. But Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data. It just writes new files and doesn't delete the old files immediately; they are just marked as delete...
Databricks does not charge DBUs while instances are idle in the pool. Instance provider billing does apply. Please refer here for more information - https://docs.databricks.com/clusters/instance-pools/index.html
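To make the difference in counts concrete, here is a small pure-Python toy model of the behaviour described above. It is only a sketch and not the real Delta Lake log format: the class `ToyDeltaTable` and all of its methods are invented for illustration. Each overwrite adds a new file and records the old files as removed in a log, without touching them on storage.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy stand-in for a Delta table: data files plus a commit log."""
    files: dict = field(default_factory=dict)  # file name -> rows physically on storage
    log: list = field(default_factory=list)    # one entry per commit (one table version)

    def active_files(self):
        # Replay the log: a file is active if it was added and not later removed.
        active = set()
        for commit in self.log:
            active -= set(commit["remove"])
            active |= set(commit["add"])
        return active

    def overwrite(self, rows):
        name = f"part-{len(self.files):05d}.parquet"
        self.files[name] = list(rows)
        # Old files are only marked as removed in the log; they stay on storage.
        self.log.append({"add": [name], "remove": sorted(self.active_files())})

    def read_as_delta(self):
        # Reading through the log sees only the files of the latest version.
        return [r for f in sorted(self.active_files()) for r in self.files[f]]

    def read_as_parquet(self):
        # Reading the directory directly sees every file, including stale ones.
        return [r for f in sorted(self.files) for r in self.files[f]]

    def vacuum(self):
        # Toy VACUUM: drops unreferenced files at once (the real command honours a retention period).
        active = self.active_files()
        self.files = {f: rows for f, rows in self.files.items() if f in active}


table = ToyDeltaTable()
table.overwrite([1, 2, 3])
table.overwrite([4, 5, 6])
delta_count = len(table.read_as_delta())
parquet_count = len(table.read_as_parquet())
print(delta_count, parquet_count)  # 3 6

table.vacuum()
print(len(table.read_as_parquet()))  # 3
```

In this model the plain Parquet read returns twice as many rows after two overwrites, and the counts only agree again once the stale files are physically removed.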