Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

shubham_007
by Contributor III
  • 4006 Views
  • 1 reply
  • 0 kudos

Dear experts, need urgent help on logic.

Dear experts, I am facing difficulty while developing PySpark automation logic on “Developing automation logic to delete/remove display() and cache() method calls used in scripts in multiple Databricks notebooks (tasks)”. Kindly advise on developing automati...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To automate the removal of display() and cache() method calls from multiple PySpark scripts in Databricks notebooks, develop a script that programmatically processes exportable notebook source files (usually in .dbc or .ipynb format) using text-based...
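A minimal sketch of that text-based approach, assuming the notebooks have been exported as plain Python source (function and pattern names are illustrative; naive regexes like these will miss multi-line display(...) calls):

```python
import re

# Matches a whole line that is just a display(...) call, and inline .cache() calls.
DISPLAY_CALL = re.compile(r"^\s*display\s*\(.*\)\s*$", re.MULTILINE)
CACHE_CALL = re.compile(r"\.cache\s*\(\s*\)")

def clean_notebook_source(source: str) -> str:
    """Remove standalone display(...) lines and inline .cache() calls."""
    source = DISPLAY_CALL.sub("", source)   # drop whole display(...) lines
    source = CACHE_CALL.sub("", source)     # turn df.cache() into df
    return source
```

The same function can then be applied to every file pulled down with the Workspace Export API before re-importing the cleaned sources.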

ADuma
by New Contributor III
  • 4267 Views
  • 1 reply
  • 0 kudos

Structured Streaming with queue in separate storage account

Hello, we are running a structured streaming job which consumes zipped JSON files that arrive in our Azure Prod storage account. We are using AutoLoader and have set up an Event Grid queue which we pass to the streaming job using cloudFiles.queueName. ...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are attempting to have your Test Databricks streaming job consume files that arrive in your Prod storage, using AutoLoader and EventGrid notifications, without physically copying the data or EventGrid queue to Test. The core challenge is that Eve...
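An untested configuration sketch of the setup in question (account, container, queue, and secret names are placeholders). One caveat worth noting: a storage queue supports only one Auto Loader consumer, so pointing a Test stream at Prod's queue would compete with the Prod job for the same events:

```python
# Auto Loader in file-notification mode, reading Prod-landed files from a
# Test workspace. Requires network/credential access to the Prod account.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.queueName", "prod-eventgrid-queue")  # queue lives in Prod
    .option("cloudFiles.connectionString",
            dbutils.secrets.get("prod-scope", "queue-connection-string"))
    .load("abfss://landing@prodstorageaccount.dfs.core.windows.net/input/")
)
```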

turagittech
by Contributor
  • 4404 Views
  • 1 reply
  • 0 kudos

Identify source of data in query

Hi All, I have an issue. I have several databases with the same schemas I need to source data from. Those databases are going to end up aggregated in a data warehouse. The problem is the id column in each means different things. Example: a client id i...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Migrating from Data Factory to Databricks for ETL and warehousing is a solid choice, especially for flexibility and cost-effectiveness in data engineering projects. The core issue—disambiguating “id” fields that are only unique within each source dat...
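One common way to disambiguate such ids is a composite key built from a source-system tag plus the local id, sketched below (names are illustrative, not from the thread):

```python
def make_global_id(source_system: str, local_id) -> str:
    """Composite warehouse key: the same local id from different source
    databases stays distinct once paired with its source-system tag."""
    return f"{source_system}-{local_id}"

# PySpark equivalent when tagging each source before the warehouse union:
#   from pyspark.sql import functions as F
#   df.withColumn("source_system", F.lit("db1")) \
#     .withColumn("global_id", F.concat_ws("-", F.lit("db1"), F.col("id")))
```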

jeremy98
by Honored Contributor
  • 6227 Views
  • 2 replies
  • 0 kudos

Best practice on how to set up a medallion architecture pipelines inside DAB

Hi Community, My team and I are working on refactoring our folder repository structure. Currently, I have been placing pipelines related to the Medallion architecture inside a folder named notebook/. However, I believe they should be moved to src/ sin...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Refactoring your folder structure and naming conventions for Medallion architecture pipelines is an essential step to keep code maintainable and intuitive. Based on your context, shifting these pipelines from notebook/ to src/ is a solid move, especi...

1 More Replies
MaximeGendre
by New Contributor III
  • 4361 Views
  • 2 replies
  • 0 kudos

RLS function : concat vs list

Hello all, I'm designing a function to implement RLS on Unity Catalog for multiple tables of different sizes (1k to 10G rows). RLS will be applied on two columns and 150+ groups. I wonder what would be more performant: Solution 1: exhaustive (boring) li...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The more performant solution for Row-Level Security (RLS) in Unity Catalog, when applying to two columns and 150+ groups, generally depends on how much of the access check logic can be pushed into efficient, indexable predicates versus computed at ru...
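As an illustration of the mapping-table alternative to a 150-branch literal list, here is an untested sketch (catalog, schema, table, and column names are assumptions, not from the thread); `is_account_group_member()` is the standard Unity Catalog group check:

```python
# Drive the row filter from a small group-to-region mapping table instead of
# hard-coding 150+ groups into the function body.
spark.sql("""
CREATE OR REPLACE FUNCTION main.security.region_filter(region STRING)
RETURN EXISTS (
  SELECT 1 FROM main.security.rls_map m
  WHERE m.region = region_filter.region
    AND is_account_group_member(m.group_name)
)
""")
spark.sql("ALTER TABLE main.sales.orders "
          "SET ROW FILTER main.security.region_filter ON (region)")
```

Keeping the mapping table small lets the optimizer treat the EXISTS probe cheaply, and adding a group becomes an INSERT rather than a function rewrite.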

1 More Replies
kenmyers-8451
by Contributor II
  • 4715 Views
  • 2 replies
  • 0 kudos

Long runtimes on simple copying of data

Hi, my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that have multiple joins and aggregations. To create a baseline we have been running tests on a simple select and write operation...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your slow Spark runtime and unexpectedly long WholeStageCodeGen compute times are likely tied to a mix of Delta Lake features (especially deletion vectors), Spark’s physical plan, and partition handling. Here’s a detailed breakdown and advice based o...
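A hedged sketch of how one might confirm and clear deletion vectors on the source table (table name is a placeholder; this is a diagnostic suggestion, not a guaranteed fix):

```python
# Inspect table properties to see whether deletion vectors are enabled.
spark.sql("DESCRIBE DETAIL main.schema.big_table") \
     .select("properties").show(truncate=False)

# Rewrite files to physically apply (purge) accumulated deletion vectors.
spark.sql("REORG TABLE main.schema.big_table APPLY (PURGE)")

# Optionally stop new deletion vectors from accumulating for this workload.
spark.sql("ALTER TABLE main.schema.big_table "
          "SET TBLPROPERTIES (delta.enableDeletionVectors = false)")
```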

1 More Replies
saadi
by New Contributor
  • 4465 Views
  • 1 reply
  • 0 kudos

Resolved! Could not connect to Self-Hosted MySQL Database in Azure Databricks

Hi, I am trying to connect a self-hosted MySQL database in Databricks but keep encountering errors. Database setup: the MySQL database is hosted on a VM. We use DBeaver or Navicat to query it. Connection to the database requires an active Azure VPN Client...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To connect a self-hosted MySQL database (on a VM, Azure VPN required) to Databricks, you need several components to align: network access from Databricks to MySQL, proper JDBC connector configuration, and correct authentication. This setup is common ...
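An untested configuration sketch of the JDBC read (host, database, table, and secret names are placeholders). The key prerequisite is that the *cluster* can reach the VM's address, e.g. via VNet peering or injection into the VPN's network; the VPN client on a laptop does not help the cluster:

```python
# Standard Spark JDBC read against a self-hosted MySQL instance.
jdbc_url = "jdbc:mysql://10.0.0.12:3306/mydb?useSSL=true"
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.cj.jdbc.Driver")   # MySQL Connector/J on the cluster
    .option("dbtable", "customers")
    .option("user", dbutils.secrets.get("db-scope", "mysql-user"))
    .option("password", dbutils.secrets.get("db-scope", "mysql-pass"))
    .load()
)
```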

nishg
by New Contributor III
  • 4058 Views
  • 1 reply
  • 0 kudos

Upgraded cluster to 16.1/16.2 and upload data (append) to Elastic index is failing

I have updated the compute cluster to both Databricks versions 16.1 and 16.2 and run the workflow to append data into an Elastic index, but it started failing with the below error. The same job is working fine with Databricks version 15. Let me know if anyone co...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your error is a known issue appearing after upgrading Databricks clusters to versions 16.1 and 16.2, specifically when running workflows to append data into an Elasticsearch index. This error—"Path must be absolute: myindex/_delta_log"—indicates a ch...
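For reference, a typical elasticsearch-spark append write looks like the sketch below (host, port, and index name are placeholders); per the reply, the 16.x failure surfaces inside the connector rather than in this calling code:

```python
# Standard append write with the elasticsearch-spark (elasticsearch-hadoop) connector.
(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "my-es-host")
   .option("es.port", "9243")
   .option("es.nodes.wan.only", "true")   # typical for cloud-hosted Elasticsearch
   .mode("append")
   .save("myindex"))
```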

Sujith_i
by New Contributor
  • 4296 Views
  • 1 reply
  • 1 kudos

databricks sdk for python authentication failing

I am trying to use the Databricks SDK for Python to do some account-level operations, like creating groups, and created a Databricks config file locally and provided the profile name as an argument to AccountClient, but authentication keeps failing. The same con...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Authentication for account-level operations with Databricks SDK for Python requires more than just referencing the profile name in your local .databrickscfg file. While the CLI consults .databrickscfg for profiles and can use them directly, the SDK's...
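A sketch of the two usual shapes of account-level authentication with the SDK (profile name, account id, and service-principal credentials below are placeholders):

```python
from databricks.sdk import AccountClient

# Option A: a .databrickscfg profile that itself contains account-level fields
# (host must be the accounts console URL, plus account_id).
a = AccountClient(profile="ACCOUNT_PROFILE")

# Option B: pass the account-level fields explicitly.
a = AccountClient(
    host="https://accounts.azuredatabricks.net",   # accounts console, not a workspace
    account_id="00000000-0000-0000-0000-000000000000",
    client_id="sp-client-id",        # service principal OAuth credentials
    client_secret="sp-client-secret",
)
groups = list(a.groups.list())
```

A workspace-host profile that works for the CLI's workspace commands will fail here, which is the most common cause of this symptom.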

AvneeshSingh
by New Contributor
  • 4206 Views
  • 2 replies
  • 1 kudos

Autoloader Data Reprocess

Hi, if possible can anyone please help me with some Autoloader options. I have 2 open queries: (i) Let's assume I am running some Autoloader stream and my job fails; instead of resetting the whole checkpoint, I want to run the stream from a specified timest...

Data Engineering
autoloader
Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

In Databricks Autoloader, controlling the starting point for streaming data after a job failure requires careful management of checkpoints and configuration options. By default, Autoloader uses checkpoints to remember where the stream last left off, ...
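An untested sketch of the replay-from-a-timestamp pattern: start a *fresh* checkpoint and use `modifiedAfter` so discovery only picks up files modified after the chosen point (paths, table name, and timestamp are placeholders):

```python
# Fresh checkpoint + modifiedAfter: reprocess only files landed after the cutoff.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("modifiedAfter", "2025-01-15 00:00:00")
    .load("abfss://landing@account.dfs.core.windows.net/input/")
)

query = (
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/replay_2025_01_15")  # new location
    .toTable("bronze.events")
)
```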

1 More Replies
Nidhig
by Databricks Partner
  • 681 Views
  • 1 reply
  • 2 kudos

Resolved! Global Parameter at the Pipeline level in Lakeflow Job

Hi, is there any workaround, or can Databricks enable a global parameters feature at the pipeline level in Lakeflow Jobs? Currently I am working on migrating an ADF pipeline schedule setup to a Lakeflow Job.

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

Databricks Lakeflow Declarative Pipelines do not currently support truly global parameters at the pipeline level in the same way that Azure Data Factory (ADF) allows, but there are workarounds that enable parameterization to streamline migration from...
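One such workaround is per-pipeline `configuration` key-value settings read at runtime with `spark.conf.get`, sketched below (the key name, default, and catalog naming are illustrative):

```python
# In the pipeline settings (or the DAB/job definition):
#   configuration:
#     env: prod
import dlt

env = spark.conf.get("env", "dev")  # falls back to "dev" if the key is unset

@dlt.table
def bronze_orders():
    # Parameterize source locations by environment instead of a global parameter.
    return spark.read.table(f"{env}_catalog.raw.orders")
```

The trade-off versus ADF global parameters is that the value must be repeated (or templated via a bundle variable) in each pipeline's settings.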

VaDim
by New Contributor III
  • 382 Views
  • 1 reply
  • 0 kudos

Resolved! transformWithStateInPandas. Invalid pickle opcode when updating ValueState with large (float) array

I am getting an error when the entity I need to store in a ValueState is a large array (over 15k-20k items). No error (and works correctly) if I trim the array to under 10k samples. The same error is raised when using it as a value for MapState or as...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error you’re facing, specifically PySparkRuntimeError: Error updating value state: invalid pickle opcode, usually points to a serialization (pickling) problem when storing large arrays in Flink/Spark state such as ValueState, ListState, or MapStat...
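A generic workaround sketch (not the transformWithStateInPandas API itself): serialize the large array yourself and store fixed-size byte chunks, e.g. one chunk per ListState element or MapState key, instead of one oversized value. The chunk size is an illustrative choice:

```python
import pickle

CHUNK = 64 * 1024  # bytes per stored chunk

def to_chunks(values: list) -> list:
    """Pickle the whole array once, then split the blob into CHUNK-sized pieces."""
    blob = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
    return [blob[i:i + CHUNK] for i in range(0, len(blob), CHUNK)]

def from_chunks(chunks: list) -> list:
    """Reassemble the blob and unpickle it back into the original array."""
    return pickle.loads(b"".join(chunks))
```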

SamAdams
by Contributor
  • 302 Views
  • 1 reply
  • 0 kudos

Time window for "All tables are updated" option in job Table Update Trigger

I've been using the Table Update Trigger for some SQL alert workflows. I have a job that uses 3 tables with an "All tables updated" trigger: Table 1 was updated at 07:20 UTC, Table 2 was updated at 16:48 UTC, Table 3 was updated at 16:50 UTC -> Job is trig...

Data Engineering
jobs
TableUpdateTrigger
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

There is no fixed or documented “window” time for the interval between updates to all monitored tables before a job with an "All tables updated" trigger runs in Databricks. The job is triggered as soon as every table in the set has seen at least one ...

ak5har
by New Contributor II
  • 4940 Views
  • 9 replies
  • 2 kudos

Databricks connection to on-prem cloudera

Hello, we are trying to evaluate the Databricks solution to extract the data from an existing Cloudera schema hosted on a physical server. We are using the Databricks serverless compute provided by the Databricks express setup and we assume we will not need t...

Latest Reply
Adrian_Ashley
Databricks Partner
  • 2 kudos

I work for a Databricks partner called Cirata. Our Data Migrator offering allows both data and metadata replication from Cloudera to be delivered to the Databricks environment, whether this is just delivering it to the ADLS2 object storage or to ...

8 More Replies
pepco
by New Contributor II
  • 1017 Views
  • 2 replies
  • 2 kudos

Resolved! Environment in serverless

I'm playing a little bit with the Databricks Free environment and I'm super confused by the documentation vs. actual behavior. Maybe you could help me understand better. For the workspace I can define a base environment which I can use in serverless ...

Data Engineering
base environment
serverless
Latest Reply
K_Anudeep
Databricks Employee
  • 2 kudos

Hello @pepco, Is it possible to use environments with notebook tasks? Yes, but only in a very specific way. Notebook tasks can use base environments, but you don't attach them in the job's YAML. You pick the base env in the notebook's Environment sid...

1 More Replies