Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lecarusin
by New Contributor II
  • 436 Views
  • 4 replies
  • 1 kudos

Help regarding a python notebook and s3 file structure

Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if the post is moved by the mods or I'm told where to post). I am looking for help with optimizing some Python code I have. This python notebook...

Latest Reply
arunpalanoor
New Contributor II
  • 1 kudos

I am not sure if I fully understand how your data pipeline is set up, but have you considered incremental data loading, say using something similar to the "COPY INTO" method, which would only read your incremental load, and then apply a 90-day filter on top...
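For illustration, a minimal sketch of that incremental pattern (the catalog, S3 path, and column names are placeholders, not from the original post, and the target table must already exist):

from pyspark.sql import functions as F

# COPY INTO only ingests files it has not loaded before, so the heavy S3 read
# happens once per new file rather than on every run.
spark.sql("""
    COPY INTO my_catalog.my_schema.raw_events
    FROM 's3://my-bucket/events/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")

# Downstream, read only the last 90 days from the ingested Delta table.
recent = (
    spark.table("my_catalog.my_schema.raw_events")
         .where(F.col("event_date") >= F.date_sub(F.current_date(), 90))
)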

3 More Replies
vikram_p
by Databricks Partner
  • 1057 Views
  • 1 replies
  • 0 kudos

Resolved! Generate embeddings for 50 million rows in dataframe

Hello All, I have a dataframe with 5 million rows, and before we can set up a vector search endpoint against an index, we want to generate an embeddings column for each of those rows. Please suggest what's an optimal way to do this. We are in the development phase so w...

Latest Reply
bianca_unifeye
Databricks MVP
  • 0 kudos

The easiest and most reliable way to generate embeddings for millions of rows is to let Databricks Vector Search compute them automatically during synchronization from a Delta table. Vector Search can generate embeddings for you, keep them updated whe...
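As a rough sketch with the databricks-vectorsearch Python SDK (the endpoint, table, and column names here are placeholders, and the exact arguments should be checked against the current docs):

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Delta-sync index with an embedding source column: Vector Search computes the
# embeddings itself and refreshes them whenever the source Delta table is synced.
index = client.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",
    index_name="main.ml.docs_index",
    source_table_name="main.ml.docs",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)

With pipeline_type="TRIGGERED" you control when the (potentially expensive) sync runs, which suits a development phase.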

Divya_Bhadauria
by New Contributor III
  • 256 Views
  • 1 replies
  • 0 kudos

Does Databricks Runtime 7.3+ include built-in Hadoop S3 connector configurations?

I came across the KB article S3 connection reset error, which mentions not using the following Spark settings for the Hadoop S3 connector on DBR 7.3 and above: spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem, spark.hadoop.fs.s3n.impl com.data...

Latest Reply
hasnat_unifeye
Databricks Partner
  • 0 kudos

No, you don't need to set those on DBR 7.3 and above. From DBR 7.3 onward, Databricks already uses the newer Hadoop S3A connector by default, so those com.databricks.s3a.S3AFileSystem settings are not part of the default config and shouldn't be added. If they are...
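If you want to confirm what a running cluster actually resolved, a quick diagnostic sketch like this prints the relevant keys (it relies on the internal _jsc handle, so treat it as a check, not an API guarantee):

# Inspect the Hadoop configuration the cluster is using; on recent DBRs these
# should not point at com.databricks.s3a.S3AFileSystem.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3.impl", "fs.s3n.impl", "fs.s3a.impl"):
    print(key, "=", hconf.get(key))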

bruce17
by Databricks Partner
  • 1560 Views
  • 2 replies
  • 1 kudos

Output Not Displaying in Databricks Notebook on All-Purpose Compute Cluster

Hello All, I'm encountering an issue where output from standard Python commands such as print() or display(df) is not showing up correctly when running notebooks on an All-Purpose Compute cluster. Cluster details: Cluster Type: All-Purpose Compute; Runtime...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 1 kudos

Hi Surya, do you face this issue only with DBR 17.3 all-purpose clusters? Have you tried lower DBRs? If not, please try and let me know. Also, from the Run menu, try "Clear state and outputs," then re-run the cell on the same cluster to rule out st...

1 More Replies
spd_dat
by New Contributor III
  • 4477 Views
  • 2 replies
  • 0 kudos

Can you default to `execution-count: none` when stripping notebook outputs?

When committing to a git folder, IPYNB outputs are usually stripped, unless allowed by an admin setting and toggled by .databricks/commit_outputs. This sets {"execution-count": 0, ... } within the IPYNB metadata. Is there a way to set it instead to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks does not currently allow you to default to "execution_count": null (or "none") when stripping notebook outputs during a commit. The platform sets "execution_count": 0 as the default when outputs are stripped through their Git integration, ...
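If you need null instead of 0 today, one hedged workaround is to post-process the .ipynb yourself (for example in a pre-commit hook) before the file is committed; the path below is a placeholder:

import json

path = "notebooks/my_notebook.ipynb"  # placeholder path
with open(path) as f:
    nb = json.load(f)

# Rewrite every cell's execution_count; None is serialized as null in JSON.
for cell in nb.get("cells", []):
    if "execution_count" in cell:
        cell["execution_count"] = None

with open(path, "w") as f:
    json.dump(nb, f, indent=1)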

1 More Replies
ayush667787878
by New Contributor
  • 4018 Views
  • 2 replies
  • 1 kudos

Not able to install a library in the standard workspace while it works in Community Edition. Please help

I am not able to install a library in the standard version, while in Community Edition I am able to add a library using Compute. How do I install libraries in standard Databricks the same way as in Community Edition?

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

To install libraries in the normal (paid) version of Databricks, use the cluster management interface to add libraries to your compute resources. The process is similar to the Community Edition, but workspace policies and cluster access mode may rest...
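For a quick start, notebook-scoped installs also behave the same way in paid workspaces (subject to cluster policies); the package name below is a placeholder:

# Notebook-scoped install, run in a notebook cell attached to your cluster:
%pip install some-package==1.2.3

# Cluster-wide alternative: Compute > your cluster > Libraries > Install new,
# then choose PyPI, Maven, or an uploaded wheel.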

1 More Replies
ask005
by New Contributor
  • 2779 Views
  • 1 replies
  • 0 kudos

How to write ObjectId value using Spark connector 10.2.2

In the PySpark Mongo connector, while updating records, how do I handle _id as an ObjectId? Spark 3.2.4, Scala 2.13, Spark Mongo Connector 2.12-10.2.2

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To write an ObjectId value using Spark Mongo Connector 10.2.2 in PySpark while updating records, you must convert the ObjectId string into a special format. The Spark Mongo Connector does not automatically recognize a string as an ObjectId; it will o...
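As a rough, unverified sketch of that idea with the 10.x connector (the URI, database, collection, and column names are placeholders; please confirm the convertJson, operationType, and idFieldList options against the connector docs for your exact version):

from pyspark.sql import functions as F

# Placeholder input: a string column holding the 24-char hex ObjectId.
df = spark.createDataFrame(
    [("64b64c1f2f9b5c0012345678", "example")], ["id_str", "payload"]
)

# Represent _id as MongoDB extended JSON so the connector can parse it into an ObjectId.
updates = df.withColumn(
    "_id", F.concat(F.lit('{"$oid": "'), F.col("id_str"), F.lit('"}'))
)

(updates.write
    .format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb://host:27017")   # placeholder URI
    .option("database", "mydb")
    .option("collection", "mycoll")
    .option("convertJson", "objectOrArrayOnly")          # parse extended-JSON strings on write
    .option("operationType", "update")                   # update rather than insert
    .option("idFieldList", "_id")                        # match documents on _id
    .save())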

dndeng
by New Contributor II
  • 805 Views
  • 4 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per task in each job every time it was executed (daily), but I am currently getting very large numbers due to duplicates. Can someone help me? WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing inflated cost numbers because your query groups by many columns—especially run_id, task_key, usage_start_time, and usage_end_time—without addressing possible duplicate row entries that arise from your joins, especially with the system....
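One hedged way to structure this is to aggregate usage per job per day before joining the price dimension, so the join cannot multiply rows; task-level attribution via the system.lakeflow tables would follow the same pattern. Field names below follow system.billing.usage and system.billing.list_prices but should be checked against your schema:

# Sketch: pre-aggregate usage, keep one current price row per SKU, then join.
daily_job_cost = spark.sql("""
    WITH usage AS (
        SELECT workspace_id,
               usage_metadata.job_id AS job_id,
               usage_date,
               sku_name,
               SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        WHERE usage_metadata.job_id IS NOT NULL
        GROUP BY ALL
    ),
    prices AS (
        SELECT sku_name, pricing.default AS unit_price
        FROM system.billing.list_prices
        WHERE price_end_time IS NULL   -- current price only, avoids join fan-out
    )
    SELECT u.workspace_id, u.job_id, u.usage_date,
           SUM(u.dbus * p.unit_price) AS estimated_cost
    FROM usage u
    JOIN prices p USING (sku_name)
    GROUP BY ALL
""")
daily_job_cost.display()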

3 More Replies
lmorrissey
by New Contributor II
  • 5329 Views
  • 1 replies
  • 0 kudos

GC Allocation Failure

There are a couple of related posts here and here. Seeing a similar issue with a long-running job. Processes are in a "RUNNING" state, the cluster is active, but the stdout log shows the dreaded GC Allocation Failure. Env: I've set the following on the config:...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

A persistent "GC Allocation Failure" in Spark jobs, where processes are stuck in the RUNNING state even after attempts to clear cache and enforce GC, typically indicates ongoing memory pressure, possible data skew, or excessive memory use on the driv...

itt
by New Contributor II
  • 5405 Views
  • 3 replies
  • 0 kudos

Graceful shutdown - stopping stream at the end of microbatch

I'm trying to create a system where I let Spark finish the current microbatch and let it know it should stop after it. The reason is that I don't want to re-calculate a microbatch by "forcefully" stopping a stream. Is there a way Spark/Databricks...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

There is no built-in Spark or Databricks method to gracefully stop a Structured Streaming query specifically at the end of the current microbatch, but several community and expert discussions propose common strategies to achieve this: Official and Co...
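One community-style pattern is to poll for an external stop signal and only call stop() while no trigger is active, so the query halts between microbatches. The paths and table names below are placeholders, and the check is best effort because a new trigger can begin between the status check and the stop call:

import time

# Placeholder streaming query: bronze -> silver table copy.
query = (spark.readStream.table("main.bronze.events")
              .writeStream
              .option("checkpointLocation", "/Volumes/main/ops/chk_events")
              .toTable("main.silver.events"))

def stop_requested():
    # Placeholder signal: a marker path created by whoever requests the shutdown.
    try:
        dbutils.fs.ls("/Volumes/main/ops/stop_flag")
        return True
    except Exception:
        return False

while query.isActive:
    if stop_requested() and not query.status["isTriggerActive"]:
        query.stop()   # nothing is mid-flight, so no microbatch is recomputed on restart
        break
    time.sleep(10)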

2 More Replies
Austin1
by New Contributor
  • 4600 Views
  • 1 replies
  • 0 kudos

VSCode Integration for Data Science Analysts

Probably not posting this in the right forum, but I can't find a good fit. This is a bit convoluted because we make things hard at work. I have access to a single LLM via VSCode (Amazon Q). Since I can't use that within Databricks but I want my team to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

It’s a smart move to raise this question before investing lots of time—because with the Databricks VSCode extension, there are indeed specific limitations when it comes to accessing shared workspace folders that weren't originally created by the exte...

thomas-totter
by New Contributor III
  • 2008 Views
  • 5 replies
  • 4 kudos

NativeADLGen2RequestComparisonHandler: Error in request comparison (when running DLT)

For at least two weeks (but probably even longer), our DLT pipeline has been posting error messages to log4j (driver logs) like the one below. I tried with both channels (preview, current), switched between serverless and classic compute, and started the pipeli...

Latest Reply
mark_ott
Databricks Employee
  • 4 kudos

The error message you are observing in your DLT pipeline logs, specifically: java.lang.NumberFormatException: For input string: "Fri, 29 Aug 2025 09:02:07 GMT", suggests that something in your pipeline (likely a library or code respo...

4 More Replies
chinmay0924
by New Contributor III
  • 1715 Views
  • 4 replies
  • 2 kudos

mapInPandas not working in serverless compute

Using `mapInPandas` in serverless compute (Environment version 2) gives the following error,```Py4JError: An error occurred while calling o543.mapInPandas. Trace: py4j.Py4JException: Method mapInPandas([class org.apache.spark.sql.catalyst.expressions...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

The error you are seeing when using mapInPandas in serverless compute with Environment version 2 is due to an incompatibility in the environment's supported Spark features. Specifically, Environment version 2 on serverless compute does not support ma...
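For reference, a minimal mapInPandas example to re-test once you are on classic compute or a newer serverless environment version:

import pandas as pd

df = spark.range(10)

def double_ids(batches):
    # Receives an iterator of pandas DataFrames and yields transformed ones.
    for pdf in batches:
        pdf["id"] = pdf["id"] * 2
        yield pdf

df.mapInPandas(double_ids, schema="id long").show()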

3 More Replies
ChsAIkrishna
by Contributor
  • 5445 Views
  • 2 replies
  • 1 kudos

Vnet Gateway issues on Power bi Conn

Team, we are getting frequent VNet gateway failures on a Power BI dataset using DAX (simple DAX, not complex), and upon rerun it works. Is there any permanent fix for this? Error: {"error":{"code":"DM_GWPipeline_Gateway_MashupDataAccessError","pbi.error...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Frequent VNet gateway errors in Power BI related to “DM_GWPipeline_Gateway_MashupDataAccessError” and memory allocation issues often stem from resource limits, configuration problems, or inefficient modeling—even with simple DAX. No single “permanent...

1 More Replies
swapnilmd
by New Contributor II
  • 5106 Views
  • 2 replies
  • 0 kudos

How to handle "Error parsing WKT: Invalid coordinate value '180' found at position"

DBR Version: 16.2; spark.databricks.geo.st.enabled true. SQL query I am running: %sql WITH points ( SELECT st_astext(st_point(30D, 10D)) AS point_geom UNION SELECT st_astext(st_point(10D, 90D)) AS point_geom UNION SELECT st_astext(st_point(4...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error occurs because Databricks (based on GEOS/OGC standards) expects coordinates in Well-Known Text (WKT) to fall into valid ranges: longitude (X, the first coordinate) must satisfy −180 ≤ X ≤ 180, and latitude (Y, the second coordinate) must satisfy −90 ≤ Y ≤ 90 ...
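A quick way to sanity-check values, as a sketch (note that st_point takes longitude first, then latitude, so a call like st_point(10D, 180D) likely fails because 180 is out of range for latitude):

spark.sql("""
    SELECT st_astext(st_point(30D, 10D))    AS ok_point,
           st_astext(st_point(-180D, 90D))  AS boundary_point
""").show()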

1 More Replies