Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

SrihariB
by New Contributor
  • 3683 Views
  • 1 reply
  • 0 kudos

Read from multiple sources in a single stream

Hey all, I am trying to read data from multiple S3 locations using a single-stream DLT pipeline and load the data into a single target. Here is the scenario. S3 locations: below are my S3 raw locations, which differ only in the directory name at the end. Ba...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are using Databricks Autoloader (cloudFiles) within a Delta Live Tables (DLT) pipeline to ingest streaming Parquet data from multiple S3 directories with a wildcard pattern, and you want to ensure all matching directories’ data is included in a s...
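A minimal sketch of the wildcard approach (bucket, directory, and table names are placeholders, not from the thread):

```python
import dlt

# A trailing wildcard lets Auto Loader pick up every matching raw directory,
# e.g. s3://my-bucket/raw/events_a/, s3://my-bucket/raw/events_b/, ...
SOURCE_GLOB = "s3://my-bucket/raw/events_*"  # placeholder path

@dlt.table(name="bronze_events")
def bronze_events():
    # `spark` is provided by the DLT runtime in a pipeline notebook.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(SOURCE_GLOB)
    )
```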

Mauro
by New Contributor II
  • 3237 Views
  • 1 reply
  • 0 kudos

DLT change from Hive metastore destination to Unity Catalog

A change recently came out in which Databricks now requires using Unity Catalog as the output of a DLT pipeline, whereas previously it was the Hive metastore. At first I was working with CDC plus expectations, which resulted in the "allow_expectations_c...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks has recently enforced Unity Catalog as the output target for Delta Live Tables (DLT), replacing the legacy Hive Metastore approach. As a result, the familiar "allow_expectations_col" column, which was automatically added to help track and ...
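Since the helper column is gone, data quality rules can instead be declared as DLT expectations, which Unity Catalog pipelines record in the event log. A hedged sketch (table and rule names are illustrative):

```python
import dlt

@dlt.table(name="silver_customers")
@dlt.expect("valid_id", "customer_id IS NOT NULL")      # keep rows, count violations
@dlt.expect_or_drop("valid_email", "email LIKE '%@%'")  # drop violating rows
def silver_customers():
    # Placeholder upstream table from the same pipeline.
    return dlt.read_stream("bronze_customers")
```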

Yuppp
by New Contributor
  • 3762 Views
  • 1 reply
  • 0 kudos

Need help with setting up ForEach task in Databricks

Hi everyone, I have a workflow involving two notebooks: Notebook A and Notebook B. At the end of Notebook A, we generate a variable number of files, let's call it N. I want to run Notebook B for each of these N files. I know Databricks has a Foreach ta...

Data Engineering
ForEach
Workflows
Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You can use Databricks Workflows' foreach task to handle running Notebook B for each file generated in Notebook A. The key is to pass each path as a parameter to Notebook B using Databricks task values and workflows features, not widgets set manually...
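A sketch of the moving parts, assuming a first task named notebook_a and a notebook parameter named file_path (both hypothetical names):

```python
# Notebook A (task key "notebook_a"): publish the list of generated files.
files = [f"/Volumes/main/default/landing/file_{i}.csv" for i in range(3)]  # placeholder paths
dbutils.jobs.taskValues.set(key="file_paths", value=files)

# In the job definition, a For each task wraps Notebook B with
#   Inputs: {{tasks.notebook_a.values.file_paths}}
# and passes {{input}} as the "file_path" notebook parameter.

# Notebook B: read the per-iteration parameter it was given.
dbutils.widgets.text("file_path", "")
file_path = dbutils.widgets.get("file_path")
df = spark.read.csv(file_path, header=True)
```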

ironv
by New Contributor
  • 3688 Views
  • 1 reply
  • 0 kudos

Using concurrent.futures for parallelization

Hi, trying to copy a table with billions of rows from an enterprise data source into my Databricks table. To do this, I need to use a homegrown library which handles auth etc., runs the query, and returns a dataframe. I am partitioning the table using...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The "SparkSession$ does not exist in the JVM" error in your scenario is almost always due to the use of multiprocessing (like ProcessPoolExecutor) with Spark. Spark contexts and sessions cannot safely be shared across processes, especially in Databri...

umahesb3
by New Contributor
  • 3839 Views
  • 1 reply
  • 0 kudos

Facing issues with Databricks Asset Bundles: all jobs are getting deployed to all targets instead of the defined target

Facing issues with Databricks Asset Bundles: all jobs are getting deployed to all targets instead of the defined target. The following are the files I am using: the resources YAML and the databricks.yml file. I am using Databricks CLI v0.240.0, and I am using databricks b...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The issue you’re facing—where all Databricks Asset Bundle jobs are being deployed to all targets instead of only the defined target(s)—appears to be a known limitation in how the bundle resource inclusion and target mapping works in the Databricks CL...

shubham_007
by Contributor III
  • 3248 Views
  • 1 reply
  • 0 kudos

Dear experts, need urgent help on logic.

Dear experts, I am facing difficulty while developing PySpark automation logic for "developing automation logic to delete/remove the display() and cache() method calls used in scripts in multiple Databricks notebooks (tasks)". Kindly advise on developing automati...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To automate the removal of display() and cache() method calls from multiple PySpark scripts in Databricks notebooks, develop a script that programmatically processes exportable notebook source files (usually in .dbc or .ipynb format) using text-based...
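One way to sketch this, assuming the notebooks are exported as .py source files (the folder name is a placeholder, and these regexes only cover whole-line display() calls and chained .cache() calls):

```python
import re
from pathlib import Path

# Whole lines that are just display(...), and inline .cache() chain calls.
DISPLAY_LINE = re.compile(r"^[ \t]*display\(.*\)[ \t]*\n?", re.MULTILINE)
CACHE_CALL = re.compile(r"\.cache\(\)")

for path in Path("exported_notebooks").rglob("*.py"):  # placeholder export folder
    src = path.read_text()
    cleaned = CACHE_CALL.sub("", DISPLAY_LINE.sub("", src))
    if cleaned != src:
        path.write_text(cleaned)
        print(f"cleaned {path}")
```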

akuma643
by New Contributor II
  • 3464 Views
  • 2 replies
  • 0 kudos

The authentication value "ActiveDirectoryManagedIdentity" is not valid.

Hi Team, I am trying to connect to a SQL Server hosted in an Azure VM using Entra ID authentication from Databricks ("authentication", "ActiveDirectoryManagedIdentity"). Below is the notebook script I am using: driver = "com.microsoft.sqlserver.jdbc.SQLServe...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are encountering an error because the default SQL Server JDBC driver bundled with Databricks may not fully support the authentication value "ActiveDirectoryManagedIdentity"—this option requires at least version 10.2.0 of the Microsoft SQL Server ...
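With a new-enough driver installed on the cluster, the connection can be sketched roughly like this (server, database, and table names are placeholders):

```python
# Assumes the Microsoft SQL Server JDBC driver 10.2+ and its Azure identity
# dependencies are installed on the cluster, and a managed identity is assigned.
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=mydb"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("authentication", "ActiveDirectoryManagedIdentity")
    .load()
)
```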

ADuma
by New Contributor III
  • 3399 Views
  • 1 reply
  • 0 kudos

Structured Streaming with queue in separate storage account

Hello, we are running a Structured Streaming job which consumes zipped JSON files that arrive in our Azure Prod storage account. We are using Auto Loader and have set up an Event Grid queue which we pass to the streaming job using cloudFiles.queueName. ...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are attempting to have your Test Databricks streaming job consume files that arrive in your Prod storage, using AutoLoader and EventGrid notifications, without physically copying the data or EventGrid queue to Test. The core challenge is that Eve...
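A hedged sketch of pointing a Test stream at the Prod queue (queue name, secret scope, and paths are placeholders; the connection string must grant the Test workspace access to the queue's storage account):

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.queueName", "prod-flow-queue")  # placeholder queue
    .option(
        "cloudFiles.connectionString",
        dbutils.secrets.get("test-scope", "prod-queue-connection"),  # placeholder secret
    )
    .load("abfss://landing@prodaccount.dfs.core.windows.net/input/")  # placeholder path
)
```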

turagittech
by Contributor
  • 3369 Views
  • 1 reply
  • 0 kudos

Identify source of data in query

Hi All, I have an issue. I have several databases with the same schemas that I need to source data from. Those databases are going to end up aggregated in a data warehouse. The problem is that the id column in each means different things. Example: a client id i...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Migrating from Data Factory to Databricks for ETL and warehousing is a solid choice, especially for flexibility and cost-effectiveness in data engineering projects. The core issue—disambiguating “id” fields that are only unique within each source dat...
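A common pattern is to tag every extract with its origin and build a composite key, sketched here with placeholder source and column names (df_emea and df_apac stand for DataFrames read earlier in the pipeline):

```python
from pyspark.sql import functions as F

sources = {"emea_db": df_emea, "apac_db": df_apac}  # placeholder source DataFrames

tagged = [
    df.withColumn("source_system", F.lit(name))
      .withColumn("client_key", F.concat_ws("-", F.lit(name), F.col("id")))
    for name, df in sources.items()
]

# Union the tagged extracts; "id" stays meaningful via source_system/client_key.
unified = tagged[0]
for df in tagged[1:]:
    unified = unified.unionByName(df)
```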

jeremy98
by Honored Contributor
  • 3978 Views
  • 2 replies
  • 0 kudos

Best practice on how to set up a medallion architecture pipelines inside DAB

Hi Community, my team and I are working on refactoring our folder repository structure. Currently, I have been placing pipelines related to the Medallion architecture inside a folder named notebook/. However, I believe they should be moved to src/ sin...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Refactoring your folder structure and naming conventions for Medallion architecture pipelines is an essential step to keep code maintainable and intuitive. Based on your context, shifting these pipelines from notebook/ to src/ is a solid move, especi...

MaximeGendre
by New Contributor III
  • 3351 Views
  • 2 replies
  • 0 kudos

RLS function: concat vs list

Hello all, I'm designing a function to implement RLS on Unity Catalog for multiple tables of different sizes (1k to 10G rows). RLS will be applied on two columns and 150+ groups. I wonder what would be more performant: Solution 1: an exhaustive (boring) li...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The more performant solution for Row-Level Security (RLS) in Unity Catalog, when applying to two columns and 150+ groups, generally depends on how much of the access check logic can be pushed into efficient, indexable predicates versus computed at ru...
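One way to keep the predicate simple is a mapping table joined inside the RLS function, so the membership check never becomes a 150-branch concat/CASE expression. A sketch with placeholder catalog, schema, and column names:

```python
# Mapping table: one row per (region, bu, group_name) grant.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.security.rls_filter(region STRING, bu STRING)
    RETURN EXISTS (
        SELECT 1
        FROM main.security.group_access g
        WHERE g.region = rls_filter.region   -- qualify to reference the parameter
          AND g.bu = rls_filter.bu
          AND is_account_group_member(g.group_name)
    )
""")

spark.sql("ALTER TABLE main.sales.orders SET ROW FILTER main.security.rls_filter ON (region, bu)")
```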

kenmyers-8451
by Contributor
  • 3582 Views
  • 2 replies
  • 0 kudos

Long runtimes on simple copying of data

Hi, my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that have multiple joins and aggregations. To create a baseline, we have been running tests on a simple select and write operation...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your slow Spark runtime and unexpectedly long WholeStageCodeGen compute times are likely tied to a mix of Delta Lake features (especially deletion vectors), Spark’s physical plan, and partition handling. Here’s a detailed breakdown and advice based o...
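If deletion vectors turn out to be the culprit, one hedged experiment is to purge and disable them on the source table and re-run the baseline copy (table name is a placeholder):

```python
# Inspect current table properties and file layout first.
spark.sql("DESCRIBE DETAIL main.default.big_table") \
    .select("properties", "numFiles").show(truncate=False)

# Stop new deletion vectors, then rewrite files to purge existing ones.
spark.sql("ALTER TABLE main.default.big_table "
          "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')")
spark.sql("REORG TABLE main.default.big_table APPLY (PURGE)")
```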

saadi
by New Contributor
  • 3342 Views
  • 1 reply
  • 0 kudos

Could not connect to self-hosted MySQL database from Azure Databricks

Hi, I am trying to connect to a self-hosted MySQL database from Databricks but keep encountering errors. Database setup: the MySQL database is hosted on a VM. We use DBeaver or Navicat to query it. Connecting to the database requires an active Azure VPN Client...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To connect a self-hosted MySQL database (on a VM, Azure VPN required) to Databricks, you need several components to align: network access from Databricks to MySQL, proper JDBC connector configuration, and correct authentication. This setup is common ...
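Once a network path exists (for example, the VM's VNet peered with or reachable from the Databricks VNet, since cluster nodes cannot use the Azure VPN client), the JDBC side is a standard read; host, database, and secret names below are placeholders:

```python
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://10.0.0.5:3306/mydb")  # placeholder host/db
    .option("driver", "com.mysql.cj.jdbc.Driver")      # MySQL driver on the cluster
    .option("dbtable", "customers")
    .option("user", dbutils.secrets.get("my-scope", "mysql-user"))
    .option("password", dbutils.secrets.get("my-scope", "mysql-password"))
    .load()
)
```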

nishg
by New Contributor II
  • 3245 Views
  • 1 reply
  • 0 kudos

Upgraded cluster to 16.1/16.2 and uploading data (append) to Elastic index is failing

I have updated the compute cluster to both Databricks versions 16.1 and 16.2 and run the workflow to append data into an Elastic index, but it started failing with the error below. The same job works fine with Databricks version 15. Let me know if anyone co...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your error is a known issue appearing after upgrading Databricks clusters to versions 16.1 and 16.2, specifically when running workflows to append data into an Elasticsearch index. This error—"Path must be absolute: myindex/_delta_log"—indicates a ch...
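If the cause is the writer falling back to Delta (which would explain it treating myindex as a table path and looking for _delta_log), one hedged check is to pin the Elasticsearch Spark connector format explicitly, with df being the DataFrame to append; the host is a placeholder, and the connector version must match the cluster's Spark version:

```python
(
    df.write.format("org.elasticsearch.spark.sql")  # explicit ES format, not Delta
    .option("es.nodes", "my-es-host")  # placeholder
    .option("es.port", "9200")
    .mode("append")
    .save("myindex")
)
```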

Sujith_i
by New Contributor
  • 3331 Views
  • 1 reply
  • 1 kudos

Databricks SDK for Python authentication failing

I am trying to use the Databricks SDK for Python to do some account-level operations like creating groups. I created a Databricks config file locally and provided the profile name as an argument to AccountClient, but authentication keeps failing. The same con...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Authentication for account-level operations with Databricks SDK for Python requires more than just referencing the profile name in your local .databrickscfg file. While the CLI consults .databrickscfg for profiles and can use them directly, the SDK's...
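A minimal sketch of explicit account-level configuration (account ID and credentials are placeholders; the same values can live in the .databrickscfg profile instead):

```python
from databricks.sdk import AccountClient

a = AccountClient(
    host="https://accounts.cloud.databricks.com",  # accounts host, not a workspace URL
    account_id="00000000-0000-0000-0000-000000000000",  # placeholder account ID
    client_id="<service-principal-client-id>",          # placeholder OAuth credentials
    client_secret="<service-principal-secret>",
)

# Simple smoke test: list account-level groups.
for group in a.groups.list():
    print(group.display_name)
```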

