Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lecarusin
by Visitor
  • 47 Views
  • 4 replies
  • 1 kudos

Help regarding a Python notebook and S3 file structure

Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if mods move the post or tell me where to post). I am looking for help optimizing some Python code I have. This python notebook...

Latest Reply
arunpalanoor
New Contributor II
  • 1 kudos

I am not sure if I fully understand how your data pipeline is set up, but have you considered incremental data loading, say using something similar to the "COPY INTO" method, which would only read your incremental load, and then apply a 90-day filter on top...
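The suggestion above can be sketched in Databricks SQL. This is a minimal illustration, assuming a Parquet landing zone; the table name and S3 path are hypothetical:

```sql
-- Incrementally load only new files from S3 into a Delta table;
-- files already ingested are tracked by COPY INTO and skipped.
COPY INTO main.analytics.events          -- hypothetical target table
FROM 's3://my-bucket/events/'            -- hypothetical source path
FILEFORMAT = PARQUET;

-- Then apply the 90-day window downstream instead of rescanning everything:
SELECT *
FROM main.analytics.events
WHERE event_date >= current_date() - INTERVAL 90 DAYS;
```

This way each run reads only the newly arrived files, and the 90-day filter operates on an already-loaded Delta table rather than on raw S3 listings.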

3 More Replies
shubham007
by New Contributor III
  • 35 Views
  • 1 reply
  • 0 kudos

Urgent: How to do a data migration task using the Databricks Lakebridge tool?

Dear community experts, I have completed two phases of Databricks Lakebridge (Analyzer & Converter) but am stuck at migrating data from source to target using Lakebridge. I have watched the BrickBites series on Lakebridge but did not find out how to migrate data...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

Lakebridge doesn’t copy data. It covers Assessment → Conversion (Analyzer/Converter) → Reconciliation. The fastest way is to use Lakehouse Federation: create a Snowflake connection in Unity Catalog and run federated queries from Databricks. For perman...
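The federation setup described above can be sketched in SQL. Connection, catalog, secret-scope, and table names below are hypothetical placeholders:

```sql
-- One-time setup: register Snowflake in Unity Catalog.
CREATE CONNECTION snowflake_conn TYPE snowflake
OPTIONS (
  host 'myorg-myaccount.snowflakecomputing.com',   -- hypothetical account host
  port '443',
  sfWarehouse 'COMPUTE_WH',
  user 'svc_databricks',
  password secret('snowflake_scope', 'svc_password')
);

-- Expose a Snowflake database as a foreign catalog.
CREATE FOREIGN CATALOG snowflake_cat
USING CONNECTION snowflake_conn
OPTIONS (database 'SALES_DB');

-- Query federated data, or materialize it permanently as a Delta table.
CREATE TABLE main.bronze.orders AS
SELECT * FROM snowflake_cat.public.orders;
```

The final CTAS is one simple way to make the migrated copy permanent on the Databricks side.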

DatabricksEngi1
by New Contributor III
  • 25 Views
  • 1 reply
  • 0 kudos

MERGE operation not performing data skipping with liquid clustering on key columns

Hi, I need some help understanding a performance issue. I have a table that reads approximately 800K records every 30 minutes in an incremental manner. Let’s say its primary key is: timestamp, x, y. This table is overwritten every 30 minutes and serves ...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

MERGE is not a pure read-plus-filter operation. Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches. That means the query planner must: scan all candidate...
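A common mitigation, consistent with the explanation above, is to add a redundant range predicate on the clustering key to the ON clause so the planner can use file-level min/max stats to prune the target scan. A sketch (table, column names, and the one-hour window are hypothetical):

```sql
MERGE INTO target t
USING updates s
  ON  t.`timestamp` = s.`timestamp`
  AND t.x = s.x
  AND t.y = s.y
  -- Redundant predicate on the clustering key: lets file skipping
  -- discard target files entirely outside the incremental window.
  AND t.`timestamp` >= current_timestamp() - INTERVAL 1 HOUR
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Without such a predicate the planner has no bound on which target files might contain matches, so it must consider all of them.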

Akshay_Petkar
by Valued Contributor
  • 43 Views
  • 1 reply
  • 0 kudos

Advanced Data Engineering Event and Free Certification Voucher

Hi everyone, In the past couple of years, Databricks has organized an Advanced Data Engineering event where attendees received a 100% free certification voucher under their organization account after attending the session. I wanted to check if this eve...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

I’m only aware of the Databricks Learning Festival, which typically offers a 50% discount voucher for certification, rather than a full voucher. I couldn’t find any official confirmation of a 100% free voucher for an “Advanced Data Engineering” event ...

cdn_yyz_yul
by New Contributor II
  • 17 Views
  • 1 reply
  • 0 kudos

Delta as streaming source: can the reader read only newly appended rows?

Hello everyone, In our implementation of the Medallion Architecture, we want to stream changes with Spark Structured Streaming. I would like some advice on how to use a Delta table as a source correctly, and whether there is a performance (memory usage) concern in t...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

First of all, you are using append-only reads, which means that every time your stream triggers, Spark will process the entire Delta snapshot rather than just the changes. That’s why you’re observing the memory usage increase after each run; it’s not ...
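For reference, a minimal checkpointed Structured Streaming read of a Delta source looks like this; after the initial snapshot, each trigger processes only newly committed files recorded in the Delta log. Paths and table names are hypothetical, and `skipChangeCommits` assumes update/delete commits on the source can safely be ignored:

```python
# Stream from a Delta table: the checkpoint tracks the last processed
# table version, so subsequent triggers read only newly appended files.
df = (spark.readStream
      .format("delta")
      .option("skipChangeCommits", "true")   # skip non-append commits
      .load("/mnt/bronze/events"))           # hypothetical source path

(df.writeStream
   .option("checkpointLocation", "/mnt/_chk/silver_events")  # hypothetical
   .trigger(availableNow=True)
   .toTable("silver.events"))
```

If the source is rewritten (overwritten) every 30 minutes rather than appended to, this pattern does not apply as-is, since an overwrite is not an incremental commit.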

vikram_p
by Visitor
  • 16 Views
  • 1 reply
  • 0 kudos

Generate embeddings for 50 million rows in a DataFrame

Hello All, I have a DataFrame with 5 million rows, and before we can set up a vector search endpoint against an index, we want to generate an embeddings column for each of those rows. Please suggest what's an optimal way to do this. We are in the development phase, so w...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

The easiest and most reliable way to generate embeddings for millions of rows is to let Databricks Vector Search compute them automatically during synchronization from a Delta table. Vector Search can generate embeddings for you, keep them updated whe...
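A sketch of that managed-embeddings path with the `databricks-vectorsearch` client follows. Endpoint, table, column, and model names are hypothetical, and exact arguments may differ by client version, so treat this as an outline to check against the docs rather than a definitive call:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Delta Sync index: Databricks computes and refreshes embeddings from the
# source Delta table, so the 50M embedding vectors are never built by hand.
index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                      # hypothetical endpoint
    index_name="main.ml.docs_index",                  # hypothetical index
    source_table_name="main.ml.docs",                 # hypothetical table
    pipeline_type="TRIGGERED",                        # sync on demand
    primary_key="id",
    embedding_source_column="text",                   # raw text column
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```

With `pipeline_type="TRIGGERED"` the sync (and thus embedding computation) runs when you invoke it, which suits a development phase where cost control matters.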

Divya_Bhadauria
by New Contributor III
  • 22 Views
  • 1 reply
  • 0 kudos

Does Databricks Runtime 7.3+ include built-in Hadoop S3 connector configurations?

I came across the KB article S3 connection reset error, which mentions not using the following Spark settings for the Hadoop S3 connector for DBR 7.3 and above: spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem spark.hadoop.fs.s3n.impl com.data...

Latest Reply
hasnat_unifeye
  • 0 kudos

No, you don’t need to set those on DBR 7.3 and above. From 7.3+, Databricks already uses the newer Hadoop S3A connector by default, so those com.databricks.s3a.S3AFileSystem settings are not part of the default config and shouldn’t be added. If they are...
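As a quick audit, the legacy overrides can be spotted in a cluster's Spark conf dict with plain Python; the helper name is hypothetical:

```python
# Legacy Databricks S3 connector overrides that should be removed on DBR 7.3+.
LEGACY_S3_KEYS = {
    "spark.hadoop.fs.s3.impl",
    "spark.hadoop.fs.s3n.impl",
    "spark.hadoop.fs.s3a.impl",
}

def find_legacy_s3_overrides(spark_conf: dict) -> list:
    """Return any legacy fs.s3* impl keys present in a cluster conf."""
    return sorted(k for k in spark_conf if k in LEGACY_S3_KEYS)

conf = {
    "spark.hadoop.fs.s3.impl": "com.databricks.s3a.S3AFileSystem",
    "spark.sql.shuffle.partitions": "200",
}
print(find_legacy_s3_overrides(conf))  # ['spark.hadoop.fs.s3.impl']
```

Any keys it reports can simply be deleted from the cluster's Spark config.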

ShivangiB1
by New Contributor III
  • 26 Views
  • 1 reply
  • 0 kudos

Databricks Lakeflow SQL Server ingestion pipeline error

Hey Team, I am getting the below error while creating a pipeline: com.databricks.pipelines.execution.extensions.managedingestion.errors.ManagedIngestionNonRetryableException: [INGESTION_GATEWAY_DDL_OBJECTS_MISSING] DDL objects missing on table 'coedb.dbo.so...

Latest Reply
ShivangiB1
New Contributor III
  • 0 kudos

Hey Team, can anyone help with this?

Surya-Prathap
by New Contributor
  • 80 Views
  • 2 replies
  • 1 kudos

Output Not Displaying in Databricks Notebook on All-Purpose Compute Cluster

Hello All, I’m encountering an issue where output from standard Python commands such as print() or display(df) is not showing up correctly when running notebooks on an All-Purpose Compute cluster. Cluster details: Cluster Type: All-Purpose Compute; Runtime...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 1 kudos

Hi Surya, do you face this issue only with DBR 17.3 all-purpose clusters? Did you try with lower DBRs? If not, please try and let me know. Also, from the Run menu, try “Clear state and outputs,” then re-run the cell on the same cluster to rule out st...

1 More Replies
spd_dat
by New Contributor III
  • 3543 Views
  • 2 replies
  • 0 kudos

Can you default to `execution-count: none` when stripping notebook outputs?

When committing to a git folder, IPYNB outputs are usually stripped, unless allowed by an admin setting and toggled by .databricks/commit_outputs. This sets {"execution-count": 0, ... } within the IPYNB metadata. Is there a way to set it instead to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks does not currently allow you to default to "execution_count": null (or "none") when stripping notebook outputs during a commit. The platform sets "execution_count": 0 as the default when outputs are stripped through their Git integration, ...
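Since the platform pins the value to 0, one local workaround is a small post-processing script (hypothetical, not a platform setting) that rewrites the stripped notebook so every code cell's `execution_count` serializes as `null`:

```python
import json

def null_execution_counts(nb: dict) -> dict:
    """Set execution_count to None on every code cell and its outputs."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None
            for out in cell.get("outputs", []):
                if "execution_count" in out:
                    out["execution_count"] = None
    return nb

# json.dumps renders Python None as JSON null, which is what nbformat expects.
nb = {"cells": [{"cell_type": "code", "execution_count": 0, "outputs": []}]}
print(json.dumps(null_execution_counts(nb)["cells"][0]["execution_count"]))  # null
```

Run against each .ipynb after commit-time stripping (e.g. in a pre-commit hook on the local clone), it normalizes the files without touching Databricks settings.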

1 More Replies
pooja_bhumandla
by New Contributor III
  • 229 Views
  • 2 replies
  • 1 kudos

When to Use and When Not to Use Liquid Clustering?

Hi everyone, I’m looking for some practical guidance and experiences around when to choose Liquid Clustering versus sticking with traditional partitioning + Z-ordering. From what I’ve gathered so far: for small tables (<10TB), Liquid Clustering gives s...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Deciding between Liquid Clustering and traditional partitioning with Z-ordering depends on table size, query patterns, number of clustering columns, and file optimization needs. For tables under 10TB with queries consistently filtered on 1–2 columns,...
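For reference, opting into Liquid Clustering is a per-table DDL choice, and the clustering keys can evolve later without manually rewriting the layout. A sketch with hypothetical table and column names:

```sql
-- New table clustered on the two most common filter columns.
CREATE TABLE main.sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(18, 2)
) CLUSTER BY (order_date, customer_id);

-- Keys can change as query patterns change; OPTIMIZE then
-- applies the new layout incrementally to existing data.
ALTER TABLE main.sales.orders CLUSTER BY (customer_id);
OPTIMIZE main.sales.orders;
```

This key-evolution step is the main operational difference from static partitioning, where changing the partition column means rewriting the table.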

1 More Replies
ayush667787878
by New Contributor
  • 3301 Views
  • 2 replies
  • 1 kudos

Not able to install a library in the standard workspace, while in Community Edition it works; please help

I am not able to install a library in the normal version, while in Community Edition I am able to add a library using Compute. How do I install libraries in normal Databricks the same way as in Community Edition?

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

To install libraries in the normal (paid) version of Databricks, use the cluster management interface to add libraries to your compute resources. The process is similar to the Community Edition, but workspace policies and cluster access mode may rest...

1 More Replies
ask005
by New Contributor
  • 2050 Views
  • 1 reply
  • 0 kudos

How to write an ObjectId value using Spark connector 10.2.2

In the PySpark Mongo connector, while updating records, how do I handle _id as ObjectId? Spark 3.2.4, Scala 2.13, Spark Mongo Connector 2.12-10.2.2.

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To write an ObjectId value using Spark Mongo Connector 10.2.2 in PySpark while updating records, you must convert the ObjectId string into a special format. The Spark Mongo Connector does not automatically recognize a string as an ObjectId; it will o...
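If memory serves, the 10.x connector can parse MongoDB extended-JSON strings into BSON types when the writer's `convertJson` option is enabled; verify this against the connector docs for your version. A pure-Python sketch of the extended-JSON shape for `_id` (the helper name is hypothetical):

```python
import re

def to_extended_json_oid(hex_str: str) -> str:
    """Render a 24-char hex string as a MongoDB extended-JSON ObjectId."""
    if not re.fullmatch(r"[0-9a-fA-F]{24}", hex_str):
        raise ValueError(f"not a valid ObjectId hex string: {hex_str!r}")
    return '{"$oid": "%s"}' % hex_str

print(to_extended_json_oid("64d2f8a1b3c4d5e6f7081920"))
# {"$oid": "64d2f8a1b3c4d5e6f7081920"}
```

In PySpark the `_id` column would then carry this string, with `convertJson` enabled on the writer (again, an assumption to confirm against the 10.2.2 configuration reference) so the connector stores it as a real ObjectId rather than a plain string.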

dndeng
by New Contributor II
  • 254 Views
  • 4 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per task in each job every time it was executed (daily), but I'm currently getting very huge numbers due to duplicates. Can someone help me?   WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing inflated cost numbers because your query groups by many columns—especially run_id, task_key, usage_start_time, and usage_end_time—without addressing possible duplicate row entries that arise from your joins, especially with the system....
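One common fix, consistent with that diagnosis, is to deduplicate the usage rows before any joins so each (run, task, usage window) is counted once. A sketch; the `usage_metadata` field names follow the query in the post and the system-table schema, so treat them as assumptions to verify:

```sql
-- Collapse duplicate usage rows first, then aggregate; join dimension
-- tables (workspace, job names, prices) only after this step.
WITH usage_dedup AS (
  SELECT *
  FROM system.billing.usage
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY usage_metadata.job_run_id,
                 usage_metadata.task_key,
                 usage_start_time,
                 usage_end_time,
                 sku_name
    ORDER BY usage_start_time
  ) = 1
)
SELECT
  usage_metadata.job_id,
  usage_metadata.task_key,
  DATE(usage_start_time) AS usage_date,
  SUM(usage_quantity)    AS dbus
FROM usage_dedup
GROUP BY ALL;
```

Joining dimensions after the aggregation (rather than before) is what prevents the fan-out that inflates the totals.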

3 More Replies
