Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Jonathan_
by New Contributor II
  • 129 Views
  • 4 replies
  • 6 kudos

Slow PySpark operations after long DAG that contains many joins and transformations

We are using PySpark and have noticed that after many transformations, aggregations, and joins, the execution time of simple tasks (count, display, union of 2 tables, ...) at some point becomes very slow, even when we have small data (ex...

Latest Reply
tarunnagar
New Contributor
  • 6 kudos

This is a pretty common issue with PySpark when working on large DAGs with lots of joins and transformations. As the DAG grows, Spark has to maintain a huge execution plan, and performance can drop due to shuffling, serialization, and memory overhead...

3 More Replies
mikvaar
by New Contributor III
  • 577 Views
  • 8 replies
  • 3 kudos

DAB + DLT destroy fails due to ownership/permissions mismatch

Hi all, we are running into an issue with Databricks Asset Bundles (DAB) when trying to destroy a DLT pipeline. The setup is as follows: two separate service principals. Deployment SP: used by Azure DevOps for deploying bundles. Run_as SP: used for running t...

Data Engineering
Databricks
Databricks Asset Bundles
DevOps
Latest Reply
denis-dbx
Databricks Employee
  • 3 kudos

We just released https://github.com/databricks/cli/releases/tag/v0.273.0 with a mitigation for this; the error should disappear if you upgrade. Please try it and let us know how it goes. The Terraform fix is in https://github.com/databricks/terraform-provid...

7 More Replies
Dimitry
by Contributor III
  • 13 Views
  • 1 replies
  • 0 kudos

Serverless - can't parallelize UDF in applyInPandas

Hi all, Serverless V3 solved an error of mismatched Python versions between driver and worker that I had on V2 (I can't remember the exact wording). So I'd been running this on classic compute so far. Today I tried it on serverless with partial success - un...

Latest Reply
Dimitry
Contributor III
  • 0 kudos

I was wrong in interpreting the results. threading.get_native_id() does not behave on serverless as it does on classic, so different threads return the same ID. The time it takes to execute the test is obviously less than 40 seconds, if it was running on a sin...
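Since thread IDs can be misleading in some environments, one way to check for parallelism is to compare wall-clock time against a serial estimate. The following is a minimal sketch using only the Python standard library; it does not use Spark or applyInPandas, and the names slow_task and measure_parallelism are hypothetical, chosen for illustration only.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def slow_task(task_id, seconds=0.2):
    """Simulate a slow call and record which thread ran it and when."""
    started = time.perf_counter()
    time.sleep(seconds)
    return {
        "task_id": task_id,
        "thread_ident": threading.get_ident(),    # Python-level thread identifier
        "native_id": threading.get_native_id(),   # OS-level ID; may be unreliable in some environments
        "started": started,
        "finished": time.perf_counter(),
    }


def measure_parallelism(n_tasks=4, seconds=0.2):
    """Run n_tasks concurrently; return results, elapsed time, and a serial estimate."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(lambda i: slow_task(i, seconds), range(n_tasks)))
    elapsed = time.perf_counter() - t0
    serial_estimate = n_tasks * seconds
    return results, elapsed, serial_estimate


if __name__ == "__main__":
    results, elapsed, serial_estimate = measure_parallelism()
    print(f"elapsed={elapsed:.2f}s, serial estimate={serial_estimate:.2f}s")
    print("distinct thread idents:", len({r["thread_ident"] for r in results}))
```

If the elapsed time is well below the serial estimate, the tasks overlapped in time, regardless of what the ID functions report.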

bunny1174
by New Contributor
  • 93 Views
  • 2 replies
  • 1 kudos

Spark Streaming loading only 1k to 5k rows into Delta table

Hi Team, I have 4-5 million files in S3, only around 1.5 GB of data with 9 million records. When I try to use Auto Loader to read the data with a read stream and write it to a Delta table, the processing takes too much time; it is loading from 1k t...

Latest Reply
Prajapathy_NKR
  • 1 kudos

@bunny1174 It is a common issue that small files get created during streaming. Since you are using the Delta format, I would suggest two solutions: 1. Try using liquid clustering. This auto-compacts small files into bigger chunks, mostly of 1...

1 More Replies
SuMiT1
by New Contributor III
  • 427 Views
  • 9 replies
  • 4 kudos

Flattening JSON in Databricks

I have chatbot data. I read a JSON file from ADLS in Databricks and stored the output in a DataFrame. In that table, two columns contain JSON data but their data type is string: 1. content 2. metadata. Now I have to flatten the data, but I can't figure out how to do tha...

Latest Reply
Prajapathy_NKR
  • 4 kudos

@szymon_dybczak your solution was crisp. @SuMiT1 since you have mentioned that your JSON is dynamic, get one of your JSON bodies into a variable: json_body = df.select("content").first()[0]. Then get the schema of the JSON: schema = schema_of_json(json_...

8 More Replies
Hritik_Moon
by New Contributor II
  • 93 Views
  • 2 replies
  • 2 kudos

Reading snappy.parquet

I stored a DataFrame as Delta in the catalog. It created multiple folders with snappy.parquet files. Is there a way to read these snappy.parquet files? They can be read with pandas, but Spark gives the error "incompatible format".

Latest Reply
Prajapathy_NKR
  • 2 kudos

@Hritik_Moon Try to read the file as Delta. The folder layout is: path/delta_file_name/ containing the parquet files and _delta_log/. Since you are using Spark, use spark.read.format("delta").load("path/delta_file_name"). Delta internally stores the data as Parquet, and the delta log contain...

1 More Replies
Hritik_Moon
by New Contributor II
  • 219 Views
  • 6 replies
  • 8 kudos

Stop Cache in free edition

Hello, I am using Databricks Free Edition. Is there a way to turn off IO caching? I am trying to learn optimization and can't see any difference in query run time with caching enabled.

Latest Reply
Prajapathy_NKR
  • 8 kudos

@Hritik_Moon 1. Check whether your data is cached; you can see this in the Spark UI > Storage tab. 2. If it is not cached, try adding an action statement after you cache, e.g. df.count(). Data is cached by the first action it encounters. Now check in...

5 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 2838 Views
  • 8 replies
  • 2 kudos

Databricks Job cluster for continuous run

Hi all, I have a situation where I want to run a job with a continuous trigger on a job cluster, but the cluster terminates and is re-created on every run within the continuous trigger. I just wanted to know if we have any option where I can use the same job cluster...

Latest Reply
Zaranders
Visitor
  • 2 kudos

This is a great initiative! As a data engineer, I always appreciate learning new optimization strategies. Recently, I stumbled upon Monkey Mart while researching resource-efficient architectures—funny how inspiration comes from unexpected places. Loo...

7 More Replies
xx123
by New Contributor III
  • 1732 Views
  • 1 replies
  • 0 kudos

Comparing Databricks Serverless Warehouse with Snowflake Virtual Warehouse for specific query

Hey, I would like to compare the runtime of one specific query by running it on Databricks Serverless Warehouse and Snowflake Virtual Warehouse. I created a table with the exact same structure and the exact same dataset in both warehouses. The dataset if ...

Latest Reply
Krishna_S
Databricks Employee
  • 0 kudos

  You’re running into a Databricks SQL results delivery limit—the UI (and even “Download results”) isn’t meant to stream 1.5M × (id, name, 5,000-double array) back to your browser. That’s why SELECT * “works” on Snowflake’s console but not in the DBS...

KKo
by Contributor III
  • 29 Views
  • 1 replies
  • 0 kudos

DDL script to upper environment

I have multiple databases created in Unity Catalog in a DEV Databricks workspace; I used the Databricks UI/notebook and ran scripts to create them. Now I want to have those databases in the QA and PROD workspaces as well. What is the best way to run those DDLs in...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @KKo, the simplest way is to have a parametrized notebook to which you pass the name of your catalog as a parameter. Then you can use that parameter to prepare the appropriate SQL statements responsible for creating catalogs/schemas/tables. Alternati...
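The parametrized approach might look roughly like the sketch below, which only builds the SQL strings from a catalog name. The helper names quote_ident and build_ddl, and the example catalog and schema names, are hypothetical and not taken from the thread; the identifier check is a deliberately strict assumption.

```python
import re

# Assumption for this sketch: only simple identifiers are allowed.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote_ident(name):
    """Validate an identifier and wrap it in backticks."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f"`{name}`"


def build_ddl(catalog, schemas):
    """Return CREATE statements for one catalog and its schemas."""
    cat = quote_ident(catalog)
    statements = [f"CREATE CATALOG IF NOT EXISTS {cat}"]
    for schema in schemas:
        statements.append(
            f"CREATE SCHEMA IF NOT EXISTS {cat}.{quote_ident(schema)}"
        )
    return statements


if __name__ == "__main__":
    # The catalog name is the per-environment parameter (e.g. dev, qa, prod).
    for stmt in build_ddl("qa", ["sales", "finance"]):
        print(stmt)
```

In a notebook, the catalog name would typically come from a notebook parameter and each statement would be passed to spark.sql; both steps are left out here so the sketch runs without a cluster.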

Bhavana_Y
by New Contributor
  • 32 Views
  • 0 replies
  • 0 kudos

Learning Path for Spark Developer Associate

Hello everyone, happy to be a part of the Virtual Journey! I enrolled in Associate Spark Developer and completed the learning path in Databricks Academy. Can anyone please confirm whether completing the learning path is enough to obtain the 50% off voucher for certifi...

ckough
by New Contributor III
  • 54782 Views
  • 47 replies
  • 25 kudos

Resolved! Cannot sign in at databricks partner-academy portal

Hi there, I used my company email to register an account for customer-academy.databricks.com a while back. Now I need to create an account with partner-academy.databricks.com using my company email too. However, when I register at partner...

Latest Reply
cpelletier360
New Contributor
  • 25 kudos

Also facing the same issue. I will log a ticket.

46 More Replies
elliottatreef
by New Contributor
  • 77 Views
  • 3 replies
  • 1 kudos

Serverless environment not respecting environment spec on run_job_task

When running a job via a `run_job_task`, the job triggered is not using the specified serverless environment. I've configured my job to use serverless `environment_version` "3" with a dependency built into my workspace, but whenever I run the job, it...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 1 kudos

@elliottatreef Can you try to set the Environment version on the source notebook and then trigger the job? On notebook -> Serverless -> configuration -> Environment version drop down. Then, in your job, making sure it’s assigning to the Serverless com...

2 More Replies
donlxz
by New Contributor III
  • 103 Views
  • 3 replies
  • 3 kudos

Deadlock occurs with USE statement

When issuing a query from Informatica using a Delta connection, the statement use catalog_name.schema_name is executed first. At that time, the following error appeared in the query history: Query could not be scheduled: (conn=5073499) Deadlock found w...

Latest Reply
donlxz
New Contributor III
  • 3 kudos

Hi @ManojkMohan, thank you for your response. I understand that adjustments are needed on the Informatica side, and I’ll ask them to review the deadlock retry settings. Is there anything that can be changed or configured on the Databricks side to help w...

2 More Replies
Mous92i
by New Contributor
  • 105 Views
  • 2 replies
  • 0 kudos

Liquid Clustering With Merge

Hello, I’m facing severe performance issues with a MERGE INTO in Databricks. merge_condition = """ source.data_hierarchy = target.data_hierarchy AND source.sensor_id = target.sensor_id AND source.timestamp = target.timestamp """ The target Delt...

Latest Reply
K_Anudeep
Databricks Employee
  • 0 kudos

Hi @Mous92i, DFP (dynamic file pruning) is what pushes source filters down to the target to skip files. For MERGE/UPDATE/DELETE, DFP only works on Photon-enabled compute. If you’re not on Photon, MERGE will scan everything. Enabling Liquid Clustering doesn’t recluster past ...

1 More Replies
