Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

murtadha_s
by Databricks Partner
  • 367 Views
  • 1 reply
  • 0 kudos

Default ACL for Jobs and Clusters

Hi, I want to set a default ACL that applies to all created jobs and clusters (according to a cluster policy, for example), but currently I need to apply my ACL to every created job/cluster separately. Is there a way to do that? BR,

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @murtadha_s, can you please clarify what you are after? The second part of your question sounded more like a statement: "but currently I need to apply my ACL at every created job/cluster separately," and that confused me a bit. To make sure we poi...

  • 0 kudos
sai_sakhamuri
by Databricks Partner
  • 4034 Views
  • 1 reply
  • 1 kudos

Resolved! Databricks optimization for query performance and pipeline run

I am currently working on optimizing several Spark pipelines and wanted to gather community insights on advanced performance tuning. Typically, my workflow for traditional SQL optimization involves a deep dive into the execution plan to identify bott...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 1 kudos

Hi @sai_sakhamuri You're clearly past the basics. Let me give you a practitioner-level breakdown of each layer you mentioned, plus a few things that often get overlooked.

Spark Catalyst Optimizer: Working With the Rules Engine

Catalyst operates in fou...

  • 1 kudos
databrciks
by New Contributor III
  • 939 Views
  • 3 replies
  • 1 kudos

Resolved! Parametrize the DLT pipeline for dynamic loading of many tables

I need to load many tables from a SQL Server database into the Bronze layer. How can I pass the table names dynamically in DLT? In other words, I want one piece of code that takes many table names and loads each of them into the Bronze layer.

Latest Reply
databrciks
New Contributor III
  • 1 kudos

Hi Ashwin, thanks for the quick response. Yes, I want to pass all the tables through a config parameter/param file and load them into the Bronze layer. I will try this approach. Thanks
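
For readers following this thread, the config-driven approach mentioned above can be sketched in plain Python. This is only an illustration under assumptions: the comma-separated config value, the `bronze_` naming scheme, and the helper names are all hypothetical, and the loader returns a query string as a stand-in for the real read. In an actual pipeline the table list would typically come from the pipeline configuration and each loader would be registered as a pipeline table.

```python
from typing import Callable, Dict, List


def parse_table_list(raw: str) -> List[str]:
    """Split a comma-separated config value into clean table names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def make_bronze_loader(source_table: str) -> Callable[[], str]:
    """Build one loader per source table.

    A factory function is used so that each loader captures its own table
    name. Defining the inner function directly in a loop would make every
    loader see the last name, because Python closures bind late.
    """
    def load() -> str:
        # Placeholder for the real read of `source_table` (hypothetical).
        return f"SELECT * FROM {source_table}"

    return load


def register_bronze_tables(raw_config: str) -> Dict[str, Callable[[], str]]:
    """Map each (hypothetical) bronze table name to its loader."""
    return {
        f"bronze_{name.replace('.', '_')}": make_bronze_loader(name)
        for name in parse_table_list(raw_config)
    }


# Hypothetical config value listing the source tables.
tables = register_bronze_tables("dbo.customers, dbo.orders, dbo.products")
for bronze_name, loader in tables.items():
    print(bronze_name, "<-", loader())
```

The factory function is the important design choice: it keeps one code path for all tables while making sure each generated definition reads its own source.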

  • 1 kudos
2 More Replies
ittzzmalind
by New Contributor III
  • 452 Views
  • 2 replies
  • 0 kudos

DLT Pipeline Error - key not found: all_info_dlt_cx_utils_cod resulting in a NoSuchElementException.

We have a Databricks ETL pipeline in which the @dp.expect_or_fail decorator causes the pipeline update to fail. The error message indicates 'key not found: all_info_dlt_cx_utils_cod', resulting in a NoSuchElementException. Note: if we commen...

Latest Reply
ittzzmalind
New Contributor III
  • 0 kudos

@MoJaMa Thanks for the reply. The issue was in the code; the corrected code worked.

  • 0 kudos
1 More Replies
IM_01
by Valued Contributor
  • 2983 Views
  • 19 replies
  • 3 kudos

Resolved! Lakeflow SDP failed with DELTA_STREAMING_INCOMPATIBLE_SCHEMA_CHANGE_USE_LOG

Hi, a column was deleted on the source table. When I ran LSDP, it failed with the error DELTA_STREAMING_INCOMPATIBLE_SCHEMA_CHANGE_USE_LOG: Streaming read is not supported on tables with read-incompatible schema changes (e.g. rename or drop or datatype ch...

Latest Reply
gullsher98743
New Contributor II
  • 3 kudos

This looks like a very practical template, especially for teams trying to structure their Data & AI strategy without overcomplicating things. The step-by-step format and examples should be really helpful for workshops and collaborative sessions. Curi...

  • 3 kudos
18 More Replies
stemill
by New Contributor II
  • 2347 Views
  • 7 replies
  • 0 kudos

Update on Iceberg table creating duplicate records

We are using Databricks to connect to a Glue catalog which contains Iceberg tables. We are using DBR 17.2 and adding the jars
org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0
org.apache.iceberg:iceberg-aws-bundle:1.10.0
The Spark config is then...

Latest Reply
aleksandra_ch
Databricks Employee
  • 0 kudos

Hi @stemill, the way of connecting to Iceberg tables managed by a Glue catalog that you described is not officially supported, because spark_catalog is not a generic catalog slot; it's a special, tightly-wired session catalog with a lot of assumptio...

  • 0 kudos
6 More Replies
beaglerot
by Databricks Partner
  • 1561 Views
  • 4 replies
  • 6 kudos

Resolved! Python Data Source API — worth using?

Hi all, I’ve been looking into the Python Data Source API and wanted to get some feedback from others who may be experimenting with it. One of the more common challenges I run into is working with applications that expose APIs but don’t have out-of-the...

Latest Reply
Louis_Frolio
Databricks Employee
  • 6 kudos

Adding on to @edonaire's points, which are accurate. @beaglerot, your contacts project is the right use case for the pattern you have. Small data, infrequent changes, direct read into bronze. That works. The real question you're asking is what happens when t...

  • 6 kudos
3 More Replies
maikel
by Contributor III
  • 795 Views
  • 2 replies
  • 2 kudos

Running Spark Tests

Hello Community! I'm writing to ask what the best way is to run Spark unit tests in Databricks. Currently we have a set of notebooks which are responsible for doing the operations on the data (joins, merging etc.). Of course to do ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Great suggestions @lingareddy_Alva regarding Databricks Connect v2! @maikel, a few things to layer on top of that. First, the fact that you already have your functions in a separate directory outside of notebooks is exactly the right foundation. T...

  • 2 kudos
1 More Replies
Malthe
by Valued Contributor II
  • 914 Views
  • 1 reply
  • 0 kudos

Observable API and Delta Table merge

Using the Observable API on the source dataframe to a Delta Table merge seems to hang indefinitely.
Steps to reproduce:
1. Create one or more pyspark.sql.Observation objects.
2. Use DataFrame.observe on the merge source.
3. Run merge.
Accessing Observation.get blo...

Latest Reply
AnthonyAnand
Databricks Partner
  • 0 kudos

Hi @Malthe, You have hit a very specific, known behavioral gap in how Apache Spark and Delta Lake interact. To answer your question directly: Yes, the Observable API is effectively incompatible with Delta Table merges when used directly.

Why It ...

  • 0 kudos
Ashwin_DSA
by Databricks Employee
  • 573 Views
  • 1 reply
  • 2 kudos

Is Address Line 4 the place where data goes to die?

I’ve spent the last few years jumping between insurance, healthcare, and retail, and I’ve come to a very painful conclusion: we should never have let humans type their own addresses into a text box. For a pet project, I’m currently looking at a ...

Latest Reply
pradeep_singh
Contributor III
  • 2 kudos

I have never worked on this problem, but based on previous posts from other community users, I have learned that fuzzy matching can help find records that are most likely the same or similar. Here are some links where this has been discussed i g...
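
To make the fuzzy-matching idea concrete, here is a minimal sketch using only Python's standard library. It is not taken from the linked discussions: the normalization rules, the 0.85 threshold, and the sample addresses are assumptions for illustration, and a real address-matching job would likely need a dedicated library plus blocking to avoid comparing every pair.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = address.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag two addresses as probable duplicates (threshold is an assumption)."""
    return similarity(a, b) >= threshold


# Made-up sample addresses.
typed = "12 High Street, Flat 4, London"
variant = "12 high street  flat 4 london"
other = "99 Park Road, Leeds"

print(likely_same(typed, variant))
print(likely_same(typed, other))
```

Normalizing before comparing matters as much as the similarity measure itself: most differences between two spellings of the same address are case, punctuation, and spacing.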

  • 2 kudos
kevinzhang29
by New Contributor III
  • 955 Views
  • 1 reply
  • 1 kudos

Resolved! Issue with create_auto_cdc_flow Not Updating Business Columns for DELETE Events

We're currently working with Databricks AUTO CDC in a data pipeline and have encountered an issue with create_auto_cdc_flow (AUTO CDC) when using SCD Type 2. We are using the following configuration:
stored_as_scd_type = 2
apply_as_deletes = expr("op...

Latest Reply
pradeep_singh
Contributor III
  • 1 kudos

Operation type DELETE means the record is supposed to disappear. If you were using SCD Type 1, the record would be removed from the silver table. When using SCD Type 2, AUTO CDC only updates the lifecycle metadata columns to make the record inactive;...
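
The difference described in this reply can be illustrated with a small plain-Python model. This is a conceptual sketch only, not how AUTO CDC is implemented: the `start_at` / `end_at` column names, the row layout, and both helper functions are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def apply_delete_scd1(table: List[Row], key: str, key_value: object) -> List[Row]:
    """SCD Type 1: the matching record is removed from the table."""
    return [row for row in table if row[key] != key_value]


def apply_delete_scd2(
    table: List[Row], key: str, key_value: object, deleted_at: int
) -> List[Row]:
    """SCD Type 2: close the active version; business columns stay as they were."""
    result = []
    for row in table:
        if row[key] == key_value and row["end_at"] is None:
            # Only the lifecycle column changes; a copy keeps the input intact.
            row = {**row, "end_at": deleted_at}
        result.append(row)
    return result


# Made-up silver table with two active records.
silver: List[Row] = [
    {"id": 1, "name": "Ann", "start_at": 1, "end_at": None},
    {"id": 2, "name": "Bob", "start_at": 1, "end_at": None},
]

after_scd1 = apply_delete_scd1(silver, "id", 1)
after_scd2 = apply_delete_scd2(silver, "id", 1, deleted_at=5)
print(after_scd1)
print(after_scd2)
```

In the Type 2 result the record for `id` 1 is still present with its original `name`; only the end marker is set, which matches the behaviour described above where business columns carried by the DELETE event are not applied.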

  • 1 kudos
GarciaJorge
by New Contributor III
  • 2082 Views
  • 3 replies
  • 5 kudos

Resolved! DLT with CDC and schema changes in streaming pipelines

Hi everyone, I’m dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I’ve hit a challenge that I haven’t seen clearly addressed in the docs.
Some Context:
Source is an upstream system emitting CDC events (ins...

Latest Reply
edonaire
Contributor III
  • 5 kudos

In practice, the impact of adding a normalization layer is usually small compared to the gains in stability and control. At scale, the key is how you implement that layer. If it is designed to operate incrementally and aligned with your partitioning s...

  • 5 kudos
2 More Replies
alexu4798644233
by New Contributor III
  • 3092 Views
  • 2 replies
  • 0 kudos

ETL or Transformations Testing Framework for Databricks

Hi! I'm looking for an ETL or transformations testing framework for Databricks. It needs to support automation of the following steps:
1) create/store test datasets (mock inputs and a golden copy of the output),
2) run the ETL (notebook) being tested,
3) compar...

Latest Reply
rameshcsert
New Contributor II
  • 0 kudos

Hi Rjdudley, it is tough for me to understand the README file and run the framework. Could you post a video showing how to install and use it for a custom data source with customized test cases?

  • 0 kudos
1 More Replies
rplazaman
by New Contributor II
  • 1043 Views
  • 2 replies
  • 2 kudos

Resolved! How to update an untracked column only in the new row version in create_auto_cdc_flow

Hi, I'm using create_auto_cdc_flow with SCD type 2. In the source I have a metadata column which tells the origin of the row. This column should not trigger a new row version, so it is added to track_history_except_column_list. I don't want to add it to exception col...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 2 kudos

@rplazaman This is a well-known limitation of create_auto_cdc_flow / AUTO CDC INTO, and unfortunately there is no native way to achieve exactly what you want within the API's parameters. Here's why, and what you can do about it:

The Core Problem

The t...

  • 2 kudos
1 More Replies
twbde
by New Contributor II
  • 620 Views
  • 2 replies
  • 1 kudos

Resolved! OversizedAllocationException with transformWithStateInPandas

Hello, I have a process that uses transformWithStateInPandas on a dataframe that has the content of entire files in one of the columns. Recently, the exception OversizedAllocationException has started happening. I have tried setting these configs in th...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 1 kudos

Hi @twbde This is a genuinely tricky problem. Here's the diagnosis and the best available workarounds:

Root Cause: useLargeVarTypes Is Not Wired Into transformWithStateInPandas

Your instinct is correct. The spark.sql.execution.arrow.useLargeVarTypes co...

  • 1 kudos
1 More Replies