Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Dimitry
by Contributor III
  • 44 Views
  • 1 reply
  • 0 kudos

Serverless - can't parallelize UDF in applyInPandas

Hi all, Serverless V3 solved an error about mismatched Python versions between driver and worker which I had on V2 (can't remember the exact wording), so I'd been running this on classic compute so far. Today I tried on serverless with partial success - un...

Latest Reply
Dimitry
Contributor III
  • 0 kudos

I was wrong in interpreting the results. threading.get_native_id() does not work on serverless the way it does on classic, so different threads return the same ID. The time it takes to execute the test is obviously less than 40 seconds, if it was running on a sin...
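
For anyone checking the same thing, here is a minimal sketch of that kind of probe, assuming a notebook with an active SparkSession (the key column and group count are made up):

    import os
    import threading
    import pandas as pd

    # Eight groups so applyInPandas has several independent units of work
    df = spark.range(8).withColumnRenamed("id", "key")

    def tag_worker(pdf: pd.DataFrame) -> pd.DataFrame:
        # Record which OS thread/process handled this group
        pdf["native_thread_id"] = threading.get_native_id()
        pdf["pid"] = os.getpid()
        return pdf

    (df.groupBy("key")
       .applyInPandas(tag_worker, schema="key long, native_thread_id long, pid long")
       .show())

As the reply notes, identical IDs on serverless do not necessarily mean single-threaded execution; wall-clock time is the more reliable signal there.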

bunny1174
by New Contributor
  • 119 Views
  • 2 replies
  • 1 kudos

Spark Streaming loading only 1k to 5k rows into Delta table

Hi Team, I have 4-5 million files in S3, around 1.5 GB of data with 9 million records. When I try to use Auto Loader to read the data using readStream and write to a Delta table, the processing takes too much time - it loads from 1k t...

Latest Reply
Prajapathy_NKR
New Contributor
  • 1 kudos

@bunny1174 It is a common issue that small files get created during streaming. Since you are using the Delta file format, I would suggest two solutions: 1. try using liquid clustering. This auto-compacts small files into a bigger chunk, mostly of 1...
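
For reference, a minimal sketch of option 1, with a placeholder table name and clustering column:

    # Liquid clustering is declared with CLUSTER BY on the Delta table
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events (
            event_id BIGINT,
            event_ts TIMESTAMP,
            payload  STRING
        )
        USING DELTA
        CLUSTER BY (event_ts)
    """)

    # Compact the small files the stream has written; on a clustered table
    # OPTIMIZE also applies the clustering layout
    spark.sql("OPTIMIZE events")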

1 More Reply
SuMiT1
by New Contributor III
  • 460 Views
  • 9 replies
  • 4 kudos

Flattening JSON in Databricks

I have chatbot data. I read an ADLS JSON file in Databricks and stored the output in a DataFrame. In that table, two columns contain JSON data but the data type is string: 1. content, 2. metadata. Now I have to flatten the data but I am not getting how to do tha...

Latest Reply
Prajapathy_NKR
New Contributor
  • 4 kudos

@szymon_dybczak your solution was crisp. @SuMiT1 since you have mentioned your JSON is dynamic, get one of your JSON bodies into a variable: json_body = df.select("content").first()[0]. Then get the schema of the JSON: schema = schema_of_json(json_...
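
A minimal end-to-end sketch of that approach; the content column name comes from the thread, everything else is illustrative, and it assumes all rows share one JSON shape:

    from pyspark.sql.functions import col, from_json, lit, schema_of_json

    # Sample one JSON document from the string column
    json_body = df.select("content").first()[0]

    # Derive a DDL schema from that sample
    ddl = df.limit(1).select(schema_of_json(lit(json_body)).alias("ddl")).first()["ddl"]

    # Parse the string column into a struct and flatten it
    flat = (df.withColumn("content_struct", from_json(col("content"), ddl))
              .select("*", "content_struct.*"))
    flat.printSchema()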

8 More Replies
Hritik_Moon
by New Contributor II
  • 132 Views
  • 2 replies
  • 2 kudos

Reading snappy.parquet

I stored a DataFrame as Delta in the catalog. It created multiple folders with snappy.parquet files. Is there a way to read these snappy.parquet files? It reads with pandas, but with Spark it gives the error "incompatible format".

Latest Reply
Prajapathy_NKR
New Contributor
  • 2 kudos

@Hritik_Moon Try to read the directory as Delta. The layout is path/delta_file_name/ containing the parquet files plus _delta_log/. Since you are using Spark, use spark.read.format("delta").load("path/delta_file_name"). Delta internally stores the data as parquet, and the delta log contain...
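
A short sketch of the difference, with a placeholder path:

    # Read the table root as Delta; Spark consults _delta_log and returns
    # only the live files
    df = spark.read.format("delta").load("/path/to/delta_table")
    df.show()

    # Reading the same directory as plain parquet trips Delta's safety check
    # ("incompatible format detected") because _delta_log is present; pandas
    # has no such guard, which is why it appears to work there.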

1 More Reply
Hritik_Moon
by New Contributor II
  • 259 Views
  • 6 replies
  • 8 kudos

Stop Cache in free edition

Hello, I am using Databricks Free Edition. Is there a way to turn off IO caching? I am trying to learn optimization and can't see any difference in query run time with caching enabled.

Latest Reply
Prajapathy_NKR
New Contributor
  • 8 kudos

@Hritik_Moon 1. Check whether your data is cached; you can see this in the Spark UI > Storage tab. 2. If it is not cached, try adding an action statement after you cache, e.g. df.count(). Data is cached on the first action statement it encounters. Now check in...
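
On the original question of turning IO caching off, a minimal sketch; spark.databricks.io.cache.enabled controls the disk cache on classic compute, and whether Free Edition compute honors it is worth verifying:

    # Disable the Databricks disk (IO) cache for this session
    spark.conf.set("spark.databricks.io.cache.enabled", "false")

    # Spark-level caching is separate and explicit
    df = spark.read.table("samples.nyctaxi.trips")  # built-in sample table
    df.cache()       # lazily marks the DataFrame for caching
    df.count()       # first action materializes it (Spark UI > Storage)
    df.unpersist()   # release it before re-timing the query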

5 More Replies
Ajay-Pandey
by Esteemed Contributor III
  • 2855 Views
  • 8 replies
  • 2 kudos

Databricks Job cluster for continuous run

Hi All, I have a situation where I want to run a job with a continuous trigger using a job cluster; the cluster terminates and is re-created on every run within the continuous trigger. I just wanted to know if we have any option where I can use the same job cluster...

Latest Reply
Zaranders
New Contributor
  • 2 kudos

This is a great initiative! As a data engineer, I always appreciate learning new optimization strategies. Recently, I stumbled upon Monkey Mart while researching resource-efficient architectures—funny how inspiration comes from unexpected places. Loo...

7 More Replies
xx123
by New Contributor III
  • 1770 Views
  • 1 reply
  • 1 kudos

Comparing Databricks Serverless Warehouse with Snowflake Virtual Warehouse for specific query

Hey, I would like to compare the runtime of one specific query by running it on a Databricks Serverless Warehouse and a Snowflake Virtual Warehouse. I created a table with the exact same structure and the exact same dataset in both warehouses. The dataset is ...

Latest Reply
Krishna_S
Databricks Employee
  • 1 kudos

You're running into a Databricks SQL results delivery limit - the UI (and even "Download results") isn't meant to stream 1.5M × (id, name, 5,000-double array) back to your browser. That's why SELECT * "works" on Snowflake's console but not in the DBS...
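
One hedged way to keep the comparison fair is to avoid shipping rows to the client at all and time a server-side materialization instead; the table names are placeholders, and the same CTAS can be run as plain SQL in either warehouse's editor:

    import time

    t0 = time.time()
    # Both engines must do the full scan/compute, but no wide rows are
    # delivered to the browser
    spark.sql("CREATE OR REPLACE TABLE bench_sink AS SELECT * FROM source_table")
    print(f"elapsed: {time.time() - t0:.1f}s")

    # Inspect a small sample rather than SELECT * on the full result
    spark.sql("SELECT * FROM bench_sink LIMIT 100").show()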

KKo
by Contributor III
  • 57 Views
  • 1 reply
  • 1 kudos

DDL script to upper environment

I have multiple databases created in Unity Catalog in a DEV Databricks workspace; I used the Databricks UI/notebook and ran scripts to do it. Now I want to have those databases in the QA and PROD workspaces as well. What is the best way to run those DDLs in...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @KKo, the simplest way is to have a parameterized notebook to which you can pass the name of your catalog as a parameter. Then you can use that parameter to prepare the appropriate SQL statements responsible for creating catalogs/schemas/tables. Alternati...
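
A minimal sketch of such a parameterized notebook; the catalog, schema, and table names are placeholders:

    # A job/notebook parameter selects the target environment (dev/qa/prod)
    dbutils.widgets.text("catalog", "dev")
    catalog = dbutils.widgets.get("catalog")

    spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.bronze")
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {catalog}.bronze.orders (
            order_id BIGINT,
            order_ts TIMESTAMP
        )
    """)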

ckough
by New Contributor III
  • 54797 Views
  • 47 replies
  • 25 kudos

Resolved! Cannot sign in at databricks partner-academy portal

Hi there, I used my company email to register an account for customer-academy.databricks.com a while back. Now I need to create an account with partner-academy.databricks.com using my company email too. However, when I register at partner...

Latest Reply
cpelletier360
New Contributor
  • 25 kudos

Also facing the same issue. I will log a ticket.

46 More Replies
elliottatreef
by New Contributor
  • 119 Views
  • 3 replies
  • 1 kudos

Serverless environment not respecting environment spec on run_job_task

When running a job via a `run_job_task`, the job triggered is not using the specified serverless environment. I've configured my job to use serverless `environment_version` "3" with a dependency built into my workspace, but whenever I run the job, it...

Latest Reply
MuthuLakshmi
Databricks Employee
  • 1 kudos

@elliottatreef Can you try setting the Environment version on the source notebook and then triggering the job? On the notebook: Serverless -> Configuration -> Environment version drop-down. Then, in your job, make sure it's assigned to the serverless com...

2 More Replies
georgemichael40
by New Contributor III
  • 172 Views
  • 4 replies
  • 5 kudos

Resolved! Python Wheel in Serverless Job in DAB

Hey, I am trying to run a job with serverless compute that runs Python scripts. I need the paramiko package to get my scripts to work. I managed to get it working by doing: environments: - environment_key: default # Full documentation of this spec can be...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 5 kudos

Hi @georgemichael40, put your .whl file in a volume and then you can reference it in the following way in your DAB file: dependencies: - "/Volumes/workspace/default/my_volume/hellopkg-0.0.1-py3-none-any.whl" https://docs.databricks.com/aws/en/compute/s...

3 More Replies
dndeng
by New Contributor
  • 65 Views
  • 2 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per task in each job for every daily execution, but I am currently getting very large numbers due to duplicates. Can someone help me? WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
nayan_wylde
Honored Contributor III
  • 0 kudos

It seems the duplicates are caused by task_change_time from the job_tasks table. Even though the table definition says task_change_time is the last time the task was modified, it is capturing different times, and it is an SCD Type 2 table. ...
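
A hedged sketch of the dedup step that follows from this; the table and column names are taken from the thread, so verify them against your system tables:

    dedup_sql = """
        WITH latest_tasks AS (
            SELECT *
            FROM system.lakeflow.job_tasks
            QUALIFY ROW_NUMBER() OVER (
                PARTITION BY workspace_id, job_id, task_key
                ORDER BY task_change_time DESC
            ) = 1
        )
        SELECT * FROM latest_tasks
    """
    # Join this deduplicated snapshot to the billing data instead of the raw
    # SCD Type 2 table to avoid multiplying cost rows
    spark.sql(dedup_sql).show()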

1 More Reply
thib
by New Contributor III
  • 8674 Views
  • 5 replies
  • 3 kudos

Can we use multiple git repos for a job running multiple tasks?

I have a job running multiple tasks: Task 1 runs a machine learning pipeline from git repo 1; Task 2 runs an ETL pipeline from git repo 1. Task 2 is actually a generic pipeline and should not be checked into repo 1, and will be made available in another re...

Latest Reply
tors_r_us
New Contributor II
  • 3 kudos

Had this same problem. The fix was to have two workflows with no triggers, each pointing to the respective git repo, then set up a 3rd workflow with the appropriate triggers/schedule which calls the first 2 workflows. A workflow can run other workflows.
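
A hedged sketch of that setup with the Databricks Python SDK; the job IDs and names are placeholders for the two untriggered child workflows:

    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()

    # Parent workflow carries the schedule; each child job (111, 222) has no
    # trigger and points at its own git repo
    w.jobs.create(
        name="orchestrator",
        tasks=[
            jobs.Task(task_key="ml_pipeline",
                      run_job_task=jobs.RunJobTask(job_id=111)),
            jobs.Task(task_key="etl_pipeline",
                      run_job_task=jobs.RunJobTask(job_id=222),
                      depends_on=[jobs.TaskDependency(task_key="ml_pipeline")]),
        ],
    )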

4 More Replies
shreya24
by New Contributor II
  • 1823 Views
  • 1 reply
  • 2 kudos

Geometry Type not converted into proper binary format when reading through Federated Catalog

Hi, when reading a geometry column from a SQL Server into Databricks through a foreign/federated catalog, the transformation of the geometry type to binary type is not in a proper format, or I am not able to find a way to decode that binary. For example, for p...

Latest Reply
AbhaySingh
New Contributor
  • 2 kudos

Give this a shot. Create a view in SQL Server that converts geometry to Well-Known Text before federating:

-- Create view in SQL Server
CREATE VIEW dbo.vw_spatial_converted AS
SELECT
    id,
    location_name,
    location.STAsText() AS geom_wkt,
    location.STSrid AS sri...
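
And on the Databricks side, a hedged sketch of consuming that view; the federated catalog name is a placeholder, and it assumes shapely is installed and the geometries are points:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    df = spark.read.table("sqlserver_fed.dbo.vw_spatial_converted")

    @udf(DoubleType())
    def wkt_x(geom_wkt):
        # Parse the Well-Known Text produced by STAsText()
        from shapely import wkt  # imported on the workers
        return wkt.loads(geom_wkt).x if geom_wkt else None

    df.select("id", wkt_x("geom_wkt").alias("x")).show()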

chanukya-pekala
by Contributor III
  • 181 Views
  • 4 replies
  • 4 kudos

Resolved! Lost access to Databricks account console on Free Edition

Hi everyone, I'm having trouble accessing the Databricks account console and need some guidance. Background: I successfully set up Databricks Free Edition with Terraform using my personal account. I was able to access accounts.cloud.databricks.com to obta...

Latest Reply
chanukya-pekala
Contributor III
  • 4 kudos

I just double-checked: I was able to manage my personal workspace through Terraform without the account console. Thanks again.

3 More Replies
