Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Klusener
by Contributor
  • 1735 Views
  • 7 replies
  • 11 kudos

Resolved! Out of Memory after adding distinct operation

I have a Spark pipeline which reads selected data from table_1 as a view, performs a few aggregations via group by in the next step, and writes to a target table. table_1 holds large data, ~30 GB of compressed CSV. Step 1: create or replace temporary view base_data...

Latest Reply
MadhuB
Valued Contributor
  • 11 kudos

Hi @Klusener, distinct is a very expensive operation. For your case, I recommend using either of the deduplication strategies below. Most efficient method: df_deduped = df.dropDuplicates(subset=['unique_key_columns']). For a complex dedupe process - Partition...
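A minimal PySpark sketch of the dropDuplicates approach suggested above; the key column names are placeholders, not from the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# table_1 is the source table named in the question.
df = spark.table("table_1")

# dropDuplicates over a small key subset deduplicates without comparing
# every column, which is cheaper than a full distinct over wide rows.
df_deduped = df.dropDuplicates(subset=["customer_id", "event_date"])

df_deduped.write.mode("overwrite").saveAsTable("target_table")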

6 More Replies
lauraxyz
by Contributor
  • 1902 Views
  • 5 replies
  • 0 kudos

Notebook in path workspace/repos/.internal/**_commits/** was unable to be accessed

I have a workflow job (source is git) that accesses a notebook and executes it. From the job, it failed with the error: Py4JJavaError: An error occurred while calling o466.run. : com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAI...

Latest Reply
lauraxyz
Contributor
  • 0 kudos

Just some clarification: the caller notebook can be found with no issues, no matter whether the task's source is GIT or WORKSPACE. However, the callee notebook, which is called by the caller notebook with dbutils.notebook.run(), cannot be found if the call...
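For reference, a minimal sketch of the caller pattern described; the relative path is a placeholder and assumes the callee lives in the same repo checkout as the caller:

# Caller notebook: invokes the callee and waits up to 600 seconds.
# With a GIT-sourced task, a relative path like this is expected to
# resolve within the checked-out repository.
result = dbutils.notebook.run("./callee_notebook", 600)
print(result)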

4 More Replies
majo2
by New Contributor II
  • 3631 Views
  • 2 replies
  • 2 kudos

tqdm progressbar in Databricks jobs

Hi, I'm using Databricks workflows to run a training job using `pytorch` + `lightning`. `lightning` has a built-in progress bar built on `tqdm` that tracks progress. It works OK when I run the notebook outside of a workflow. But when I try to run n...

Latest Reply
ludovicc
New Contributor II
  • 2 kudos

I have found that only progressbar2 can work in both interactive notebooks and workflow notebooks. It's limited, but better than nothing. Tqdm is broken in workflows.
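A minimal sketch of the progressbar2 usage the reply refers to (assumes the progressbar2 package is installed on the cluster):

# progressbar2 exposes a simple wrapper that, per the reply above,
# renders in both interactive notebooks and workflow runs.
import progressbar

for i in progressbar.progressbar(range(100)):
    pass  # placeholder for a training step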

1 More Replies
Kayla
by Valued Contributor II
  • 901 Views
  • 3 replies
  • 0 kudos

GCP Serverless SQL Warehouse Tag Propagation?

I have a serverless SQL warehouse with a tag on it that is not making it to GCP. We have various job and AP clusters with tags that I can see in GCP - trying to have everything tagged for the purpose of monitoring billing/usage centrally. Do serverless...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey @Kayla, this is because serverless infrastructure is fully managed by Databricks, and you do not have direct control over the underlying resources as you do with standard clusters (non-serverless SQL warehouses). You can track SQL Warehouse usage wi...
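The reply is truncated, but a hedged sketch of one way to track tagged serverless SQL warehouse usage is to query the billing system table (assumes system tables are enabled; 'cost_center' is a placeholder tag key):

# Aggregate DBUs per day per custom tag for SQL warehouse usage.
usage = spark.sql("""
    SELECT usage_date,
           custom_tags['cost_center'] AS cost_center,
           SUM(usage_quantity)        AS dbus
    FROM system.billing.usage
    WHERE billing_origin_product = 'SQL'
    GROUP BY usage_date, custom_tags['cost_center']
    ORDER BY usage_date
""")
usage.display()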

2 More Replies
ideal_knee
by New Contributor III
  • 6464 Views
  • 6 replies
  • 8 kudos

Reading an Iceberg table with AWS Glue Data Catalog as metastore

I have created an Iceberg table using AWS Glue; however, whenever I try to read it using a Databricks cluster, I get `java.lang.InstantiationException`. I have tried every combination of Spark configs for my Databricks compute cluster that I can think...

Latest Reply
ideal_knee
New Contributor III
  • 8 kudos

In case someone happens upon this in the future, I ended up using Unity Catalog with Hive metastore federation for Glue. The Iceberg support is currently "coming soon in Public Preview."

5 More Replies
kasiviss42
by New Contributor III
  • 2765 Views
  • 10 replies
  • 2 kudos

Unity Credential Scope id not found in thread locals

I am facing the issue: [UNITY_CREDENTIAL_SCOPE_MISSING_SCOPE] Missing Credential Scope. Unity Credential Scope id not found in thread locals. The issue occurs when we try to list files using dbutils.fs.ls, and also at times when we try to write o...

Latest Reply
ashishCh
New Contributor II
  • 2 kudos

Thanks for the reply. It's working in DBR 15.4, but I want to use it with 13.3 - is there a workaround?

9 More Replies
Greg_c
by New Contributor II
  • 3124 Views
  • 4 replies
  • 0 kudos

Best practices for ensuring data quality in batch pipelines

Hello everyone, I couldn't find a topic on this - what are your best practices for ensuring data quality in batch pipelines? I've got a big pipeline processing data once per day. We thought about going with either DBT or DLT, but DLT seems more directed f...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey Greg_c, I use DBT daily for batch data ingestion, and I believe it's a great option. However, it's important to consider that adopting DBT introduces additional complexity, and the team should carefully evaluate the impact of adding a new tool to t...

3 More Replies
Phani1
by Valued Contributor II
  • 2824 Views
  • 5 replies
  • 1 kudos

Cluster idle time and usage details

How can we find out the usage details of the Databricks cluster? Specifically, we need to know how many nodes are in use, how long the cluster is idle, the time it takes to start up, and the jobs it is running along with their durations. Is there a q...

Latest Reply
Isi
Honored Contributor III
  • 1 kudos

Hey @hboleto, it's difficult to accurately estimate the final cost of a serverless cluster, as it is fully managed by Databricks. In contrast, classic clusters allow for finer resource tuning since you can define spot instances and other instance type...
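A hedged sketch of one way to approximate node counts and utilization per cluster from the compute system tables (assumes system.compute.node_timeline is enabled in the workspace):

node_usage = spark.sql("""
    SELECT cluster_id,
           date_trunc('HOUR', start_time)             AS hour,
           COUNT(DISTINCT instance_id)                AS nodes,
           AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_pct
    FROM system.compute.node_timeline
    GROUP BY cluster_id, date_trunc('HOUR', start_time)
""")
# Hours with near-zero avg_cpu_pct suggest the cluster was sitting idle.
node_usage.display()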

4 More Replies
alexu4798644233
by New Contributor III
  • 1285 Views
  • 1 reply
  • 0 kudos

ETL or Transformations Testing Framework for Databricks

Hi! I'm looking for an ETL or transformations testing framework for Databricks. It needs to support automation of the following steps: 1) create/store test datasets (mock inputs and a golden copy of the output), 2) run the ETL (notebook) being tested, 3) compar...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

You can do all of this yourself with a testing workflow. You can create your data in a notebook or keep a backup copy of tables, and copy them fresh for your tests. This would be the first step of the workflow. Then call your notebooks. Your c...
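A hedged sketch of the compare step in such a testing workflow; the table names are placeholders:

# Diff the notebook's actual output against the stored golden copy.
actual = spark.table("test.etl_output")
golden = spark.table("test.etl_output_golden")

missing = golden.exceptAll(actual)  # rows expected but absent
extra = actual.exceptAll(golden)    # rows produced but unexpected

assert missing.isEmpty() and extra.isEmpty(), "ETL output diverged from golden copy"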

Divya_sreeE
by New Contributor
  • 539 Views
  • 1 reply
  • 0 kudos

Unable to pass the task variables from Python Wheel to ForEach task

I understand that task variables are supported in Databricks notebooks, but there is a requirement from the client to use a Python wheel package in a Databricks workflow. We are not able to set the task variables using dbutils in the Python wheel file. Kindly s...

Latest Reply
saurabh18cs
Honored Contributor II
  • 0 kudos

Hi @Divya_sreeE, you can pass dynamic variables between tasks using Databricks' job parameters. 1) In your first Python wheel task, generate the dynamic variables and use the Databricks REST API to update the job parameters. 2) In the For Each loop, ret...
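The reply is truncated; as a hedged alternative sketch (not the REST API route the reply suggests), task values can also be set from inside a wheel by importing dbutils from the SDK runtime shim. The key and task names below are placeholders:

# Inside the Python wheel's entry point:
from databricks.sdk.runtime import dbutils

items = ["a", "b", "c"]  # placeholder dynamic values
dbutils.jobs.taskValues.set(key="items", value=items)

# The For Each task's input can then reference:
# {{tasks.<wheel_task_name>.values.items}}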

jasperputs
by New Contributor III
  • 9709 Views
  • 5 replies
  • 3 kudos

Resolved! Add Identity Column to Existing Table

Hello everyone. I am working with tables that need an identity column. I currently have a view in which I cast the different columns to the data type that I want. Now I want the result of this view to be inserted or merged into a table. The schema of...

Latest Reply
ramankr48
Contributor II
  • 3 kudos

Hello @Jasper Puts, how did you solve this issue of adding an identity column to an existing table? I'm also getting the same error you got.
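Not from the thread, but a hedged sketch of the commonly cited pattern: Delta identity columns must be declared at table creation, so the table is recreated with the identity column and the view's rows are inserted into it. Table and column names are placeholders:

spark.sql("""
    CREATE OR REPLACE TABLE target_tbl (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        name STRING,
        amount DECIMAL(10, 2)
    )
""")
# Insert everything except the identity column, which Delta generates.
spark.sql("""
    INSERT INTO target_tbl (name, amount)
    SELECT name, amount FROM my_casted_view
""")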

4 More Replies
WYO
by New Contributor II
  • 1989 Views
  • 3 replies
  • 1 kudos

Export data from databricks to prem

Hello everyone, I need to export some data to SQL Server Management Studio on premise. I need to verify that the new data on Databricks is aligned with the older data that we have on premise. Is it possible to export data as an Excel sheet or .csv file? R...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 1 kudos

You can compare your Databricks data with on-prem SQL Server data in two ways. First, you have to make a connection between SQL Server and Databricks using volumes. Using volumes, we can mount SQL Server data into Databricks Unity Catalog. 1. We can comp...
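A hedged sketch of the comparison itself, reading the on-prem table over JDBC and diffing it against the Databricks table; the host, credentials, and table names are placeholders:

onprem = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://onprem-host:1433;databaseName=sales")
          .option("dbtable", "dbo.orders")
          .option("user", "reader")
          .option("password", dbutils.secrets.get("scope", "sqlserver-pwd"))
          .load())

# An empty diff in both directions means the datasets are aligned.
diff = spark.table("main.sales.orders").exceptAll(onprem)
diff.display()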

2 More Replies
sahil_s_jain
by New Contributor III
  • 4575 Views
  • 2 replies
  • 1 kudos

Resolved! Databricks 13.3LTS to 15.4 LTS Migration - Spark job with source DB2 database not working

I'm trying to migrate a Spark job from Databricks 13.3 LTS to 15.4 LTS. The Spark job uses db2jcc4.jar for the DB2 database connection. Below is the Spark code: %scala // Import Spark SQL import org.apache.spark.sql.SparkSession // Create a Spark session...
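The Scala excerpt is truncated; a hedged Python equivalent of the DB2 JDBC read it describes, with placeholder connection details:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/SAMPLE")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")  # class shipped in db2jcc4.jar
      .option("dbtable", "SCHEMA.TABLE")
      .option("user", "db2user")
      .option("password", dbutils.secrets.get("scope", "db2-pwd"))
      .load())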

Latest Reply
sahil_s_jain
New Contributor III
  • 1 kudos

Thanks for the solution.

1 More Replies
mikeagicman
by New Contributor
  • 2456 Views
  • 1 reply
  • 0 kudos

Handling Unknown Fields in DLT Pipeline

Hi, I'm working on a DLT pipeline where I read JSON files stored in S3. I'm using Auto Loader to identify the file schema and adding schema hints for some fields to specify their type. When running it against a single data file that contains addition...

Latest Reply
jb1z
Contributor
  • 0 kudos

Hi community and @mikeagicman, I saw this error when trying to load a JSON file. I discovered the problem was that the schemaLocation I was using was pointing to a different table schema, so it was trying to match columns that did not exist. When I se...
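A hedged sketch of the Auto Loader options discussed in this thread, with a per-table schemaLocation plus schema hints; the paths and column hints are placeholders:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Each table needs its own schemaLocation; pointing it at another
      # table's schema caused the column mismatch described above.
      .option("cloudFiles.schemaLocation", "s3://bucket/_schemas/my_table")
      .option("cloudFiles.schemaHints", "price DECIMAL(10,2), ts TIMESTAMP")
      .load("s3://bucket/raw/my_table/"))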

tessaickx
by New Contributor III
  • 4274 Views
  • 4 replies
  • 4 kudos

Using ipywidgets latest versions

Hello everyone, I upgraded my cluster to DBR 13.0, which comes with ipywidgets version 7.7.2 installed. However, I want to use the TagsInput widget, which is new since version 8.0.4. If I upgrade the ipywidgets package to version 8.0.4, none of the widg...

Latest Reply
pmd84
New Contributor II
  • 4 kudos

I can confirm that installing a newer ipywidgets library version at a cluster level does not resolve these issues. The arcgis library relies on ipywidgets v8 to render maps. Even when I install ipywidgets > 8 at the cluster level, the widgets still d...

3 More Replies
