Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Ashwin_DSA
by Databricks Employee
  • 94 Views
  • 1 reply
  • 1 kudos

Is Address Line 4 the place where data goes to die?

I’ve spent the last few years jumping between insurance, healthcare, and retail, and I’ve come to the very painful conclusion that we should never have let humans type their own addresses into a text box. For a pet project, I’m currently looking at a ...

Latest Reply
pradeep_singh
Contributor
  • 1 kudos

I haven't worked on this problem myself, but based on previous posts from other community users, I've learned that fuzzy matching can help find records that are most likely to be the same or similar. Here are some links where this has been discussed ...
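To make the fuzzy-matching idea concrete, here is a minimal sketch using only the Python standard library. `difflib.SequenceMatcher` is just a simple baseline; dedicated libraries (e.g. rapidfuzz or recordlinkage) are usually faster and more accurate, and all function names below are illustrative, not from any Databricks API.

```python
# Baseline fuzzy matching for free-text addresses (stdlib only).
from difflib import SequenceMatcher

def normalize(addr: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences don't dominate the similarity score.
    return " ".join(addr.lower().split())

def similarity(a: str, b: str) -> float:
    # Returns a ratio in [0.0, 1.0]; 1.0 means identical strings.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def likely_duplicates(addresses, threshold=0.85):
    # Naive O(n^2) pairwise comparison -- fine for a pet project,
    # but real pipelines block/bucket records first to cut the cost.
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            score = similarity(addresses[i], addresses[j])
            if score >= threshold:
                pairs.append((addresses[i], addresses[j], score))
    return pairs
```

On real data you would normalize much more aggressively first (abbreviation expansion, unit/line reordering) before scoring.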

Manjusha
by New Contributor II
  • 44 Views
  • 1 reply
  • 0 kudos

Running python functions (written using polars) on databricks

Hi, we are planning to rewrite our application (which was originally running in R) in Python. We chose Polars as it seems to be faster than pandas. We have functions written in R which we are planning to convert to Python. However, in one of ...

Latest Reply
pradeep_singh
Contributor
  • 0 kudos

 Polars and pandas don’t run on the worker nodes, so you won’t get the benefits of Databricks/Spark parallelism. If your data is small enough to fit on a single driver node, you can continue to use them. If you don’t want to do any refactoring, you m...

kevinzhang29
by New Contributor II
  • 33 Views
  • 1 reply
  • 0 kudos

Issue with create_auto_cdc_flow Not Updating Business Columns for DELETE Events

We're currently working with Databricks AUTO CDC in a data pipeline and have encountered an issue with create_auto_cdc_flow (AUTO CDC) when using SCD Type 2. We are using the following configuration: stored_as_scd_type = 2, apply_as_deletes = expr("op...

Latest Reply
pradeep_singh
Contributor
  • 0 kudos

Operation type DELETE means the record is supposed to disappear. If you were using SCD Type 1, the record would be removed from the silver table. When using SCD Type 2, AUTO CDC only updates the lifecycle metadata columns to make the record inactive;...
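The SCD Type 2 delete behavior described above can be illustrated in plain Python. This is a conceptual sketch, not the AUTO CDC implementation itself; `__START_AT`/`__END_AT` are the metadata column names DLT generates for SCD Type 2 history, and everything else here is illustrative.

```python
# SCD Type 2: a DELETE closes out the active row version by setting its
# end timestamp; the business columns stay exactly as they were.
def apply_delete_scd2(history, key, ts):
    """Mark the currently active version of `key` as inactive at `ts`."""
    for row in history:
        if row["id"] == key and row["__END_AT"] is None:
            row["__END_AT"] = ts  # business columns are left untouched
    return history

# SCD Type 1, by contrast: the record is physically removed.
def apply_delete_scd1(table, key):
    return [row for row in table if row["id"] != key]
```

If you need business columns updated on delete, that generally has to happen in the source event itself (as an UPDATE before the DELETE), not via the SCD Type 2 flow.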

GarciaJorge
by Visitor
  • 86 Views
  • 3 replies
  • 3 kudos

Resolved! DLT with CDC and schema changes in streaming pipelines

Hi everyone, I'm dealing with a scenario combining Delta Live Tables, CDC ingestion, and streaming pipelines, and I've hit a challenge that I haven't seen clearly addressed in the docs. Some context: the source is an upstream system emitting CDC events (ins...

Latest Reply
edonaire
New Contributor
  • 3 kudos

In practice, the impact of adding a normalization layer is usually small compared to the gains in stability and control. At scale, the key is how you implement that layer. If it is designed to operate incrementally and aligned with your partitioning s...

2 More Replies
alexu4798644233
by New Contributor III
  • 2468 Views
  • 2 replies
  • 0 kudos

ETL or Transformations Testing Framework for Databricks

Hi! I'm looking for an ETL or transformations testing framework for Databricks. It needs to support automation of the following steps: 1) create/store test datasets (mock inputs and a golden copy of the output), 2) run the ETL (notebook) being tested, 3) compar...
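The golden-copy comparison in step 3 can be sketched with the standard library alone. This is a toy illustration, not any particular framework's API; real Spark-oriented tools handle schemas, types, and ordering far more robustly, and the function names and JSON file format here are assumptions.

```python
# Compare ETL output rows against a stored golden copy (stdlib only).
import json

def load_rows(path):
    # Assumes a JSON file containing a list of row dicts.
    with open(path) as f:
        return json.load(f)

def diff_rows(actual, golden, key):
    # Order-insensitive comparison keyed on a unique column.
    a = {row[key]: row for row in actual}
    g = {row[key]: row for row in golden}
    return {
        "missing": sorted(set(g) - set(a)),      # in golden, not in output
        "unexpected": sorted(set(a) - set(g)),   # in output, not in golden
        "changed": sorted(k for k in set(a) & set(g) if a[k] != g[k]),
    }
```

A test then passes only when all three lists are empty; anything else pinpoints exactly which keys regressed.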

Latest Reply
rameshcsert
  • 0 kudos

Hi Rjdudley, it's tough for me to understand the README file and execute the framework. Can you post a video of how to install it and use it with a custom data source and customized test cases?

1 More Replies
kevinleindecker
by New Contributor
  • 80 Views
  • 3 replies
  • 0 kudos

SQL Warehouse error: "Cannot read properties of undefined (reading 'data')" when querying system tab

Queries that previously worked started failing in SQL Warehouse (Dashboards) without any changes on our side. The query succeeds, but fails to render results with the error: "Cannot read properties of undefined (reading 'data')". This happens with: - system.b...

Latest Reply
emma_s
Databricks Employee
  • 0 kudos

Hi, I've just had a look at this and I'm trying to replicate it on my end. Can you confirm what type of compute you're using? And is it the SQL editor it's failing in? Also, what region and cloud are you using? I ran the cost query on serverless and it ran f...

2 More Replies
rplazaman
by New Contributor II
  • 79 Views
  • 2 replies
  • 1 kudos

Resolved! how to update not tracked column only in new row version in create_auto_cdc_flow

Hi, I'm using create_auto_cdc_flow with SCD Type 2. In the source I have a metadata column which tells the origin of the row. This column should not trigger a new version of the row, so it is added to track_history_except_column_list. I don't want to add it to the exception col...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 1 kudos

@rplazaman This is a well-known limitation of create_auto_cdc_flow / AUTO CDC INTO, and unfortunately there is no native way to achieve exactly what you want within the API's parameters. Here's why, and what you can do about it.
The core problem
The t...

1 More Replies
twbde
by New Contributor
  • 67 Views
  • 2 replies
  • 0 kudos

OversizedAllocationException with transformWithStateInPandas

Hello, I have a process that uses transformWithStateInPandas on a DataFrame that has the content of entire files in one of the columns. Recently, the exception OversizedAllocationException has started happening. I have tried setting these configs in th...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @twbde This is a genuinely tricky problem. Here's the diagnosis and the best available workarounds.
Root cause: useLargeVarTypes is not wired into transformWithStateInPandas
Your instinct is correct. The spark.sql.execution.arrow.useLargeVarTypes co...

1 More Replies
DineshOjha
by New Contributor III
  • 259 Views
  • 5 replies
  • 3 kudos

Resolved! Service Principal access notebooks created under /Workspace/Users

What permissions does a Service Principal need to run Databricks jobs that reference notebooks created by a user and stored in Git? Hi everyone, we are exploring the notebooks-first development approach with Databricks Bundles, and we've run into a wor...

Latest Reply
DineshOjha
New Contributor III
  • 3 kudos

Thank you so much Ashwin, this provides a lot of clarity.
1. Where to deploy Bundles in the workspace
We plan to deploy the bundle using a service principal, so we plan to deploy the bundle under /Workspace/<service_principal>
1. Create notebooks under...

4 More Replies
stemill
by New Contributor II
  • 373 Views
  • 5 replies
  • 0 kudos

update on iceberg table creating duplicate records

We are using Databricks to connect to a Glue catalog which contains Iceberg tables. We are using DBR 17.2 and adding the JARs org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0 and org.apache.iceberg:iceberg-aws-bundle:1.10.0; the Spark config is then...

Latest Reply
aleksandra_ch
Databricks Employee
  • 0 kudos

Hi @stemill, the way of connecting to Iceberg tables managed by a Glue catalog that you described is not officially supported, because spark_catalog is not a generic catalog slot – it's a special, tightly wired session catalog with a lot of assumptio...
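For reference, the pattern Iceberg's own documentation describes is to register Glue under a separate named catalog rather than overriding spark_catalog. A sketch of the relevant Spark conf, where the catalog name `glue` and the warehouse path are illustrative placeholders:

```
spark.sql.catalog.glue                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl   org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.io-impl        org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.glue.warehouse      s3://your-bucket/warehouse
```

Tables are then addressed three-part as glue.<database>.<table>, leaving the session catalog untouched. Whether this combination is supported on a given DBR version is a separate question to verify with Databricks.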

4 More Replies
maikel
by Contributor II
  • 35 Views
  • 1 reply
  • 0 kudos

Running Spark Tests

Hello Community! Writing to you with a question about the best way to run Spark unit tests in Databricks. Currently we have a set of notebooks which are responsible for doing operations on the data (joins, merging, etc.). Of course, to do ...

Latest Reply
lingareddy_Alva
Esteemed Contributor
  • 0 kudos

Hi @maikel
1. Databricks Connect (best fit for your situation)
This is likely your best path. It lets you run Spark code locally or in CI against a real Databricks cluster/serverless compute, meaning:
- Real Spark behavior, no mocking
- Tests run from...
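Whichever runner you choose, the refactoring step is the same: pull the join/merge logic out of the notebook into plain functions so they can be unit-tested on tiny inputs. A sketch with plain dict rows to keep it self-contained (with Databricks Connect or a local SparkSession, the same shape works on real DataFrames); the function name is illustrative.

```python
# Notebook logic extracted into a testable function: an inner join on
# lists of row dicts, mirroring what the notebook does with DataFrames.
def inner_join(left, right, key):
    """Inner-join two lists of row dicts on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            # Right-side columns overwrite left-side ones on collision,
            # so keep column names disjoint apart from the join key.
            joined.append({**lrow, **rrow})
    return joined
```

Once the logic lives in a function like this, the notebook shrinks to orchestration, and pytest can exercise the function directly in CI.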

beaglerot
by Databricks Partner
  • 84 Views
  • 2 replies
  • 3 kudos

Python Data Source API — worth using?

Hi all, I've been looking into the Python Data Source API and wanted to get some feedback from others who may be experimenting with it. One of the more common challenges I run into is working with applications that expose APIs but don't have out-of-the...

Latest Reply
beaglerot
Databricks Partner
  • 3 kudos

My use case is for a personal project. I'm pulling all my contacts into Databricks from the Google People API. I don't have a huge list of contacts and they don't change very often, so using the Python Data Source API and landing the data directly in...

1 More Replies
demo-user
by New Contributor III
  • 214 Views
  • 2 replies
  • 0 kudos

S3A Connector Trying to Use AWS STS on Non-AWS S3 Endpoint

Hi everyone, I'm trying to write Delta tables to my S3-compatible (non-AWS) endpoint, and it was writing perfectly fine last week with the same setup. Now, without any changes on my end, it's failing and giving me an UnknownException: (com.amazonaws.se...

Latest Reply
aleksandra_ch
Databricks Employee
  • 0 kudos

Hi @demo-user, can you share more information about your setup: cluster type and DBR version; S3-compatible storage implementation (MinIO or something else)? AFAIK this is not supposed to work, as the Delta client in DBR relies on AWS STS to perform S3 comm...

1 More Replies
BennyBoyW
by New Contributor
  • 155 Views
  • 4 replies
  • 3 kudos

Resolved! How to Convert a Lateral View to a Table Reference

Hi all, I have a view creation script in Databricks which uses a lateral view to access columns in a structure held within an array field. It is working fine, but I have noted that LATERAL VIEW is now deprecated and that I should be using a TABLE RE...

Latest Reply
balajij8
Contributor
  • 3 kudos

You can use:
CREATE OR REPLACE VIEW newview AS
SELECT
  t1.field1,
  item.field2,
  item.field3
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.id = t2.id,
LATERAL EXPLODE(t1.structure) AS structureitem(item)

3 More Replies
MaartenH
by New Contributor III
  • 3682 Views
  • 11 replies
  • 4 kudos

Lakehouse federation for SQL server: database name with spaces

We're currently using Lakehouse Federation for various sources (Snowflake, SQL Server), usually successfully. However, we've encountered a case where one of the databases on the SQL Server has spaces in its name, e.g. 'My Database Name'. We've tried vari...

Latest Reply
QueryingQuail
New Contributor III
  • 4 kudos

Hello all, we have a good number of tables from an external ERP system that are being replicated to an existing DWH in an Azure SQL Server database. We have set up a foreign connection for this database and we can connect to the server and database. Sa...

10 More Replies