Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Brahmareddy
by Esteemed Contributor
  • 159 Views
  • 1 reply
  • 4 kudos

I Tried Teaching Databricks About Itself — Here’s What Happened

Hi All, how are you doing today? I wanted to share something interesting from my recent Databricks work: I've been playing around with an idea I call "Real-Time Metadata Intelligence." Most of us focus on optimizing data pipelines, query performance,...

Latest Reply
ruicarvalho_de
New Contributor II
  • 4 kudos

I like the core idea. You are mining signals the platform already emits. I would start rules-first: track the small-files ratio and average file size trend, and watch skew per partition and shuffle bytes per input gigabyte. Compare job time to input size to c...

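A minimal sketch of that rules-first check, assuming a placeholder table name, an assumed 128 MB small-file target, and the ambient spark session of a Databricks notebook:

# Flag tables whose average Delta file size drifts below a threshold.
# "main.sales.orders" and the 128 MB target are assumptions.
detail = spark.sql("DESCRIBE DETAIL main.sales.orders").first()
avg_file_mb = detail["sizeInBytes"] / max(detail["numFiles"], 1) / (1024 * 1024)
if avg_file_mb < 128:
    print(f"Small-file pressure: {detail['numFiles']} files, avg {avg_file_mb:.1f} MB")
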
Bhavana_Y
by New Contributor
  • 101 Views
  • 1 reply
  • 1 kudos

Learning Path for Spark Developer Associate

Hello Everyone, happy to be part of the Virtual Journey! I enrolled in Associate Spark Developer and completed the learning path in Databricks Academy. Can anyone please confirm whether completing the learning path is enough to obtain the 50% off voucher for certifi...

Latest Reply
Advika
Databricks Employee
  • 1 kudos

Hello @Bhavana_Y! To be eligible for the incentives, you’ll need to complete one of the pathways mentioned in the Learning Festival post. Based on your screenshot, it looks like you’ve completed all four modules of LEARNING PATHWAY 7: APACHE SPARK DE...

donlxz
by New Contributor III
  • 271 Views
  • 4 replies
  • 3 kudos

Resolved! Deadlock occurs with USE statement

When issuing a query from Informatica using a Delta connection, the statement use catalog_name.schema_name is executed first. At that time, the following error appeared in the query history: Query could not be scheduled: (conn=5073499) Deadlock found w...

Latest Reply
donlxz
New Contributor III
  • 3 kudos

I'll try making adjustments on the Informatica side. Thank you for your help.

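One way to avoid the per-session USE statement entirely is to pin the defaults on the connection itself. A sketch, assuming the Databricks JDBC driver's ConnCatalog and ConnSchema properties; the host, HTTP path, and catalog/schema names are placeholders:

# Build a JDBC URL that sets the default catalog and schema, so concurrent
# Informatica sessions never race on `USE catalog_name.schema_name`.
jdbc_url = (
    "jdbc:databricks://<workspace-host>:443/default;"
    "transportMode=http;ssl=1;httpPath=<warehouse-http-path>;"
    "ConnCatalog=my_catalog;ConnSchema=my_schema"
)
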
3 More Replies
mikvaar
by New Contributor III
  • 727 Views
  • 8 replies
  • 6 kudos

Resolved! DAB + DLT destroy fails due to ownership/permissions mismatch

Hi all, we are running into an issue with Databricks Asset Bundles (DAB) when trying to destroy a DLT pipeline. The setup is as follows: two separate service principals. Deployment SP: used by Azure DevOps for deploying bundles. Run_as SP: used for running t...

Data Engineering
Databricks
Databricks Asset Bundles
DevOps
Latest Reply
denis-dbx
Databricks Employee
  • 6 kudos

We just released https://github.com/databricks/cli/releases/tag/v0.273.0 with a mitigation for this; the error should disappear if you upgrade. Please try it and let us know how it goes. The Terraform fix is in https://github.com/databricks/terraform-provid...

7 More Replies
Dimitry
by Contributor III
  • 83 Views
  • 1 reply
  • 0 kudos

Serverless - can't parallelize UDF in applyInPandas

Hi all, Serverless V3 solved an error about mismatched Python versions between driver and worker that I had on V2 (can't remember the exact wording), so I'd been running this on classic compute so far. Today I tried it on serverless, with partial success: un...

Latest Reply
Dimitry
Contributor III
  • 0 kudos

I was wrong in interpreting the results. threading.get_native_id() does not work on serverless as on classic, so different threads return the same ID. The time it takes to execute the test is obviously less than 40 seconds, if it was running on a sin...

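For anyone landing here, a minimal sketch of getting applyInPandas to fan out across tasks, with toy data and an arbitrary 16-group split (spark is the ambient session):

import pandas as pd
from pyspark.sql import functions as F

# Repartition on the grouping key so each pandas function call runs in
# its own task and the groups execute in parallel across workers.
df = spark.range(1_000_000).withColumn("grp", F.col("id") % 16)

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # One output row per group: the group key and its row count.
    return pd.DataFrame({"grp": [pdf["grp"].iloc[0]], "n": [len(pdf)]})

out = (df.repartition(16, "grp")
         .groupBy("grp")
         .applyInPandas(summarize, schema="grp long, n long"))
out.show()
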
bunny1174
by New Contributor
  • 142 Views
  • 2 replies
  • 1 kudos

Spark Streaming loading only 1k to 5k rows into Delta table

Hi Team, I have 4-5 million files in S3, around 1.5 GB of data in total with 9 million records. When I try to use Auto Loader to read the data with readStream and write to a Delta table, the processing takes too much time; it is loading from 1k t...

Latest Reply
Prajapathy_NKR
New Contributor
  • 1 kudos

@bunny1174 It is a common issue that small files get created during streaming. Since you are using the Delta file format, I would suggest two solutions: 1. Try using liquid clustering. This auto-compacts small files into bigger chunks, mostly of 1...

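On the ingestion side, Auto Loader caps each micro-batch at roughly 1000 files by default, which matches the small batches described above. A hedged sketch of raising those caps; the paths, table name, and limits are assumptions:

# Let each micro-batch pull far more than the default ~1000 files.
stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 50000)
    .option("cloudFiles.maxBytesPerTrigger", "10g")
    .load("s3://my-bucket/landing/"))

(stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing")
    .trigger(availableNow=True)
    .toTable("main.bronze.events"))
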
1 More Replies
SuMiT1
by New Contributor III
  • 499 Views
  • 9 replies
  • 4 kudos

Flattening the JSON in Databricks

I have chatbot data. I read an ADLS JSON file in Databricks and stored the output in a DataFrame. In that table, two columns contain JSON data but their data type is string: 1. content, 2. metadata. Now I have to flatten the data but I am not sure how to do tha...

Latest Reply
Prajapathy_NKR
New Contributor
  • 4 kudos

@szymon_dybczak your solution was crisp. @SuMiT1 since you have mentioned your JSON is dynamic, get one of your JSON bodies into a variable: json_body = df.select("content").take(1)[0][0]. Then get the schema of the JSON: schema = schema_of_json(json_...

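A short end-to-end sketch of that approach, assuming df is the thread's DataFrame with a string column named content (the nested field names will differ per dataset):

from pyspark.sql import functions as F

# Sample one JSON string, infer its schema, then parse and flatten.
sample = df.select("content").take(1)[0][0]
schema = F.schema_of_json(F.lit(sample))
parsed = df.withColumn("c", F.from_json("content", schema))
flat = parsed.select("*", "c.*").drop("c", "content")
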
8 More Replies
Hritik_Moon
by New Contributor II
  • 154 Views
  • 2 replies
  • 2 kudos

Reading snappy.parquet

I stored a DataFrame as Delta in the catalog. It created multiple folders with snappy.parquet files. Is there a way to read these snappy.parquet files? They read with pandas, but with Spark I get the error "incompatible format".

Latest Reply
Prajapathy_NKR
New Contributor
  • 2 kudos

@Hritik_Moon Try to read the file as Delta. The layout is path/delta_file_name/ containing the parquet files and _delta_log/. Since you are using Spark, use spark.read.format("delta").load("path/delta_file_name"). Delta internally stores the data as parquet, and the delta log contain...

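A quick sketch of the two working read paths; the path and table name are placeholders:

# Load the table root (the folder containing _delta_log/), not an
# individual snappy.parquet part file inside it.
df = spark.read.format("delta").load("/path/to/delta_table")

# Since the table is registered in the catalog, reading by name also works:
df2 = spark.read.table("main.default.my_table")
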
1 More Replies
Hritik_Moon
by New Contributor II
  • 299 Views
  • 6 replies
  • 8 kudos

Stop cache in Free Edition

Hello, I am using Databricks Free Edition. Is there a way to turn off IO caching? I am trying to learn optimization and can't see any difference in query run time with caching enabled.

Latest Reply
Prajapathy_NKR
New Contributor
  • 8 kudos

@Hritik_Moon 1. Check if your data is cached; you can see this in the Spark UI > Storage tab. 2. If it is not cached, try adding an action statement after you cache, e.g. df.count(). Data is cached with the first action statement it encounters. Now check in...

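On classic compute the disk (IO) cache is controlled by a session conf, as sketched below; whether Free Edition's serverless compute honors it is not guaranteed:

# Disable the Databricks disk (IO) cache for this session so repeated
# scans go back to cloud storage and timings become comparable.
spark.conf.set("spark.databricks.io.cache.enabled", "false")
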
5 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 2880 Views
  • 8 replies
  • 2 kudos

Databricks Job cluster for continuous run

Hi All, I have a situation where I want to run a job with a continuous trigger using a job cluster, but the cluster terminates and is re-created on every run within the continuous trigger. I just wanted to know if we have any option where I can use the same job cluster...

Latest Reply
Zaranders
New Contributor
  • 2 kudos

This is a great initiative! As a data engineer, I always appreciate learning new optimization strategies. Recently, I stumbled upon Monkey Mart while researching resource-efficient architectures—funny how inspiration comes from unexpected places. Loo...

7 More Replies
xx123
by New Contributor III
  • 1784 Views
  • 1 reply
  • 1 kudos

Comparing Databricks Serverless Warehouse with Snowflake Virtual Warehouse for specific query

Hey, I would like to compare the runtime of one specific query by running it on a Databricks Serverless Warehouse and a Snowflake Virtual Warehouse. I created a table with the exact same structure and the exact same dataset in both warehouses. The dataset is ...

Latest Reply
Krishna_S
Databricks Employee
  • 1 kudos

You're running into a Databricks SQL results delivery limit: the UI (and even "Download results") isn't meant to stream 1.5M rows of (id, name, 5,000-double array) back to your browser. That's why SELECT * "works" on Snowflake's console but not in the DBS...

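A minimal sketch of benchmarking the server-side work without shipping rows to the client: shrink the result to one aggregate row so the delivery limit never applies (table and column names are assumptions):

# Forces a full scan and computation but returns a single row.
spark.sql("""
    SELECT count(*) AS row_cnt, sum(hash(id)) AS checksum
    FROM main.bench.wide_table
""").show()
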
KKo
by Contributor III
  • 101 Views
  • 1 reply
  • 1 kudos

DDL script to upper environment

I have multiple databases created in Unity Catalog in a DEV Databricks workspace; I used the Databricks UI/notebook and ran scripts to do it. Now I want to have those databases in the QA and PROD workspaces as well. What is the best way to run those DDLs in...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @KKo, the simplest way is to have a parametrized notebook to which you can pass the name of your catalog as a parameter. Then you can use that parameter to prepare the appropriate SQL statements responsible for creating catalogs/schemas/tables. Alternati...

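A minimal sketch of that parametrized notebook, with placeholder object names (dbutils and spark are ambient in a Databricks notebook):

# The catalog name arrives as a widget/job parameter; the DDL interpolates it.
dbutils.widgets.text("catalog", "dev")
catalog = dbutils.widgets.get("catalog")

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.sales")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {catalog}.sales.orders (
        order_id BIGINT,
        amount DECIMAL(10, 2)
    )
""")
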
ckough
by New Contributor III
  • 54901 Views
  • 47 replies
  • 25 kudos

Resolved! Cannot sign in at databricks partner-academy portal

Hi there, I used my company email to register an account for customer-academy.databricks.com a while back. Now I need to create an account with partner-academy.databricks.com using my company email too. However, when I register at partner...

Latest Reply
cpelletier360
New Contributor
  • 25 kudos

Also facing the same issue. I will log a ticket.

46 More Replies
elliottatreef
by New Contributor
  • 186 Views
  • 3 replies
  • 1 kudos

Serverless environment not respecting environment spec on run_job_task

When running a job via a `run_job_task`, the job triggered is not using the specified serverless environment. I've configured my job to use serverless `environment_version` "3" with a dependency built into my workspace, but whenever I run the job, it...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 1 kudos

@elliottatreef Can you try setting the environment version on the source notebook and then triggering the job? On the notebook: Serverless -> Configuration -> Environment version drop-down. Then, in your job, make sure it's assigned to the serverless com...

2 More Replies
georgemichael40
by New Contributor III
  • 239 Views
  • 4 replies
  • 5 kudos

Resolved! Python Wheel in Serverless Job in DAB

Hey, I am trying to run a job with serverless compute that runs Python scripts. I need the paramiko package to get my scripts to work. I managed to get it working by doing:

environments:
  - environment_key: default
    # Full documentation of this spec can be...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 5 kudos

Hi @georgemichael40, put your whl file in a volume and then you can reference it in the following way in your DAB file:

dependencies:
  - "/Volumes/workspace/default/my_volume/hellopkg-0.0.1-py3-none-any.whl"

https://docs.databricks.com/aws/en/compute/s...

3 More Replies
