Data Engineering

Forum Posts

by anhnnguyen • New Contributor II

Saturday

56 Views
3 replies
2 kudos

Materialized view always loads the full table instead of an incremental refresh

My Delta tables are stored in HANA Data Lake Files, and I have an ETL pipeline configured like below:

    @dp.materialized_view(temporary=True)
    def source():
        return spark.read.format("delta").load("/data/source")

    @dp.materialized_view
    def sink():
        return spark.re...

Latest Reply

anhnnguyen
New Contributor II

2 hours ago

2 kudos

Hi @Yogesh_Verma_ @GaweL, after removing temporary=True, the pipeline still does a full recompute on every run even though there is no change in the source.

by Richard3 • Visitor

2 hours ago

27 Views
0 replies
0 kudos

IDENTIFIER in SQL Views not supported?

Dear community, We are phasing out the dollar param `${catalog_name}` because it has been deprecated since runtime 15.2. We use this parameter in many queries, and it should now be replaced by the IDENTIFIER clause. In the query below where we retrieve data...

by hidden • New Contributor II

3 hours ago

11 Views
0 replies
0 kudos

replicate the behaviour of DLT create auto cdc flow

I want to write a custom implementation that replicates the behaviour of DLT's create auto CDC flow. How can we do it?

by ismaelhenzel • Contributor II

Friday

109 Views
5 replies
4 kudos

Resolved! delta live tables - collaborative development

I would like to know the best practice for collaborating on a Delta Live Tables pipeline. I was thinking that each developer should have their own DLT pipeline in the development workspace. Currently, each domain has its development catalog, like sal...

Latest Reply

Poorva21
New Contributor

Friday

4 kudos

Yes, each developer should have their own DLT pipeline and their own schema. It’s the correct paradigm. It keeps DLT ownership clean and prevents pipeline conflicts. Dev naming doesn’t need to be pretty; QA/Prod are where structure matters.

by excavator-matt • Contributor

2 weeks ago

173 Views
3 replies
1 kudos

ABAC tag support for Streaming tables (Spark Lakeflow Declarative Pipelines)?

Hi! We're using Spark Lakeflow Declarative Pipelines for ingesting data from various data sources. However, in order to achieve compliance with GDPR, we are planning to start using ABAC tagging. However, I don't understand how we are supposed to use th...

abac

LakeFlow

Streaming tables

Error: oidc: fetch .well-known: Get "https://%E2%80%93host/oidc/.well-known/oauth-authorization-serv

I'm trying to authenticate Databricks using WSL but am suddenly getting this error.

    /databricks-asset-bundle$ databricks auth login –host https://<XXXXXXXXX>.12.azuredatabricks.net
    Databricks Profile Name: <XXXXXXXXX>
    Error: oidc: fetch .well-known: Get "ht...
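
A note on the error text: the `%E2%80%93` in the failing URL is the percent-encoded UTF-8 form of an en dash (U+2013). That suggests the `–host` flag was typed or pasted with an en dash rather than two ASCII hyphens, so the CLI may have treated the whole token as the host name. This is a reading of the error message, not a confirmed diagnosis from the thread. A minimal Python sketch that decodes the fragment:

```python
from urllib.parse import unquote

# Host fragment copied from the error URL in the thread title.
fragment = "%E2%80%93host"

# Percent-decode the fragment (UTF-8 is the default encoding).
decoded = unquote(fragment)

# The first character is an en dash, not an ASCII hyphen.
first_char_codepoint = ord(decoded[0])

print(decoded)
print(hex(first_char_codepoint))
print(decoded.startswith("--"))
```

If the decoded text starts with an en dash, retyping the flag by hand as two ASCII hyphens is a reasonable first thing to try.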

2703 Views
2 replies
2 kudos

06-27-2024 2:14:22 AM

Latest Reply

guptadeepak
Visitor

yesterday

2 kudos

Great, these are amazing resources! I'm using them to test my IAM apps and flow.

by tak0519 • New Contributor II

Saturday

166 Views
5 replies
4 kudos

How can I pass parameters from DABs to something (like notebooks)?

I'm implementing DABs, Jobs, and Notebooks. For configuration management, I set parameters in databricks.yml, but I can't get the parameters in the notebook after executing a job successfully. What I implemented and steps to reproduce the issue: Created "dev-catalog" on WEB U...

Latest Reply

Taka-Yayoi
Databricks Employee

yesterday

4 kudos

Hi @tak0519 I think I found the issue! Don't worry - your DABs configuration looks correct. The problem is actually about how you're verifying the results, not the configuration itself. What's happening In your last comment, you mentioned: "Manuall...

by seefoods • Valued Contributor

Thursday

94 Views
1 reply
1 kudos

Set up Databricks Connect on VS Code and PyCharm

Hello guys, does anyone know the best practices for setting up Databricks Connect for PyCharm and VS Code using Docker, a Justfile, and a .env file? Cordially, Seefoods

Latest Reply

Gecofer
Contributor

yesterday

1 kudos

Hi @seefoods! I’ve worked with Databricks Connect and VS Code in different projects, and although your question mentions Docker, Justfile and .env, the “best practices” really depend on what you’re trying to do. Here’s what has worked best for me: 1.- D...

by saicharandeepb • Contributor

3 weeks ago

144 Views
1 reply
2 kudos

Decision Tree for Selecting the Right VM Types in Databricks – Looking for Feedback & Improvements!

Hi everyone, I’ve been working on an updated VM selection decision tree for Azure Databricks, designed to help teams quickly identify the most suitable worker types based on workload behavior. I’m sharing the latest version (In this updated version I’...

Latest Reply

Sahil_Kumar
Databricks Employee

yesterday

2 kudos

Hi saicharandeepb, You can enrich your chart by adding GPU-accelerated VMs. For computationally challenging tasks that demand high performance, like those associated with deep learning, Azure Databricks supports compute resources that are accelerated...

by singhanuj2803 • Contributor

Saturday

132 Views
4 replies
1 kudos

Troubleshooting Azure Databricks Cluster Pools & spot_bid_max_price Validation Error

Hope you’re doing well! I’m reaching out for some guidance on an issue I’ve encountered while setting up Azure Databricks Cluster Pools to reduce cluster spin-up and scale times for our jobs. Background: To optimize job execution wait times, I’ve create...

Latest Reply

Poorva21
New Contributor

Saturday

1 kudos

Possible reasons:
1. Setting spot_bid_max_price = -1 is not accepted by Azure pools
Azure Databricks only accepts:
0 → on-demand only
positive numbers → max spot price
-1 is allowed in cluster policies, but not inside pools, so validation never completes....

by Eduard • New Contributor II

08-23-2023 1:30:35 AM

118363 Views
6 replies
1 kudos

Cluster xxxxxxx was terminated during the run.

Hello, I have a problem with the autoscaling of a cluster. Every time the autoscaling is activated I get this error. Does anyone have any idea why this could be? "Cluster xxxxxxx was terminated during the run (cluster state message: Lost communication ...

Latest Reply

marykline
New Contributor

Saturday

1 kudos

Hello Databricks Community, According to the error message, the driver node was lost, which might occur as a result of network problems or malfunctioning instances. Here are some potential causes and remedies: Instance Instability: Consider switching t...

by molopocho • New Contributor

2 weeks ago

147 Views
1 reply
0 kudos

Can't create a new ETL because of compute (?)

I just created a Databricks workspace on GCP with the "Use existing cloud account (Storage & compute)" option. I already added a few clusters for my task, but when I try to create an ETL pipeline, I always get this error notification. The file is created on the specifi...

Latest Reply

Saritha_S
Databricks Employee

Saturday

0 kudos

Hi @molopocho, the feature needs to be enabled in the workspace. If you don't see the option, you need to reach out to your account team or create a ticket with the Databricks support team to get it enabled at the workspace level.

by Poorva21 • New Contributor

Friday

133 Views
1 reply
1 kudos

Best Practices for Optimizing Databricks Costs in Production Workloads?

Hi everyone, I'm working on optimizing Databricks costs for a production-grade data pipeline (Spark + Delta Lake) on Azure. I’m looking for practical, field-tested strategies to reduce compute and storage spend without impacting performance. So far, I’...

Latest Reply

K_Anudeep
Databricks Employee

Friday

1 kudos

Hello @Poorva21, Below are the answers to your questions: Q1. What are the most impactful cost optimisations for production pipelines? I have worked with multiple customers, and based on my experience, below are the high-level optimisations one must have: The ...

by Jpeterson • New Contributor III

11-04-2022 2:21:09 PM

6273 Views
9 replies
4 kudos

Databricks SQL Warehouse, Tableau and spark.driver.maxResultSize error

I'm attempting to create a Tableau extract on Tableau Server with a connection to a large Databricks SQL warehouse. The extract process fails due to a spark.driver.maxResultSize error. Using a Databricks interactive cluster in the data science & engineer...

Latest Reply

CallumDean
New Contributor

Friday

4 kudos

I ran into a similar issue exporting data from Databricks to a BI tool. What helped was limiting columns, aggregating before export, and splitting large extracts into smaller chunks instead of one massive pull. I also test such tweaks in a safer envi...
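
One way to picture the "smaller chunks" suggestion, as a minimal sketch rather than the poster's actual approach: split a numeric key range into contiguous half-open ranges and issue one bounded query per range. The table `sales.orders`, the columns `order_id` and `amount`, and the helper `chunk_ranges` are hypothetical names used only for illustration.

```python
from typing import Iterator, Tuple


def chunk_ranges(start: int, stop: int, size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [lo, hi) ranges that together cover [start, stop)."""
    if size <= 0:
        raise ValueError("size must be positive")
    lo = start
    while lo < stop:
        hi = min(lo + size, stop)
        yield lo, hi
        lo = hi


# Hypothetical table and key column; each query pulls one bounded slice
# and only the columns the extract actually needs.
queries = [
    f"SELECT order_id, amount FROM sales.orders "
    f"WHERE order_id >= {lo} AND order_id < {hi}"
    for lo, hi in chunk_ranges(0, 250, 100)
]

for q in queries:
    print(q)
```

Half-open ranges keep the slices from overlapping, so no row is exported twice when the chunks are combined downstream.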

by mordex • New Contributor

Friday

201 Views
4 replies
1 kudos

Resolved! Why is spark creating 5 jobs and 200 tasks?

I am trying to read 1,000 small CSV files, each about 30 KB, which are stored in a Databricks volume. Below is the query I am running:

    df = spark.read.format("csv").options(header=True).load('/path')
    df.collect()

Why is it creating 5 jobs? Why do jobs 1-3 have 200 tasks, 4 ha...

Latest Reply

Raman_Unifeye
Contributor III

Friday

1 kudos

@mordex - yes, Spark caps the parallelism for file listing at 200 tasks, regardless of whether you have 1,000 or 10,000 files. It is controlled by spark.sql.sources.parallelPartitionDiscovery.parallelism. Run the command below to get its value: spark.c...
