Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lecarusin
by New Contributor II
  • 436 Views
  • 4 replies
  • 1 kudos

Help regarding a python notebook and s3 file structure

Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if the post is moved by the mods or I'm told where to post). I am looking for help with optimizing some Python code I have. This python notebook...

Latest Reply
arunpalanoor
New Contributor II
  • 1 kudos

I am not sure if I fully understand how your data pipeline is set up, but have you considered incremental data loading, say using something similar to the "COPY INTO" method, which would only read your incremental load, and then apply a 90-day filter on top...
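For illustration, a minimal sketch of that incremental pattern (the catalog, S3 path, and column names are placeholders, not from the original post, and the target table must already exist):

from pyspark.sql import functions as F

# COPY INTO only ingests files it has not loaded before, so the heavy S3 read
# happens once per new file rather than on every run.
spark.sql("""
    COPY INTO my_catalog.my_schema.raw_events
    FROM 's3://my-bucket/events/'
    FILEFORMAT = PARQUET
    COPY_OPTIONS ('mergeSchema' = 'true')
""")

# Downstream, read only the last 90 days from the ingested Delta table.
recent = (
    spark.table("my_catalog.my_schema.raw_events")
         .where(F.col("event_date") >= F.date_sub(F.current_date(), 90))
)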

3 More Replies
vikram_p
by Databricks Partner
  • 1057 Views
  • 1 replies
  • 0 kudos

Resolved! Generate embeddings for 50 million rows in dataframe

Hello All, I have a dataframe with 5 million rows, and before we can set up a vector search endpoint against an index, we want to generate an embeddings column for each of those rows. Please suggest what's an optimal way to do this. We are in the development phase so w...

Latest Reply
bianca_unifeye
Databricks MVP
  • 0 kudos

The easiest and most reliable way to generate embeddings for millions of rows is to let Databricks Vector Search compute them automatically during synchronization from a Delta table. Vector Search can generate embeddings for you, keep them updated whe...
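As a rough sketch with the databricks-vectorsearch Python SDK (the endpoint, table, and column names here are placeholders, and the exact arguments should be checked against the current docs):

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Delta-sync index with an embedding source column: Vector Search computes the
# embeddings itself and refreshes them whenever the source Delta table is synced.
index = client.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",
    index_name="main.ml.docs_index",
    source_table_name="main.ml.docs",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-gte-large-en",
)

With pipeline_type="TRIGGERED" you control when the (potentially expensive) sync runs, which suits a development phase.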

Divya_Bhadauria
by New Contributor III
  • 256 Views
  • 1 replies
  • 0 kudos

Does Databricks Runtime 7.3+ include built-in Hadoop S3 connector configurations?

I came across the KB article S3 connection reset error, which mentions not using the following Spark settings for the Hadoop S3 connector on DBR 7.3 and above: spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem, spark.hadoop.fs.s3n.impl com.data...

Latest Reply
hasnat_unifeye
Databricks Partner
  • 0 kudos

No, you don't need to set those on DBR 7.3 and above. From DBR 7.3 onward, Databricks already uses the newer Hadoop S3A connector by default, so those com.databricks.s3a.S3AFileSystem settings are not part of the default config and shouldn't be added. If they are...
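If you want to confirm what a running cluster actually resolved, a quick diagnostic sketch like this prints the relevant keys (it relies on the internal _jsc handle, so treat it as a check, not an API guarantee):

# Inspect the Hadoop configuration the cluster is using; on recent DBRs these
# should not point at com.databricks.s3a.S3AFileSystem.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3.impl", "fs.s3n.impl", "fs.s3a.impl"):
    print(key, "=", hconf.get(key))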

bruce17
by Databricks Partner
  • 1560 Views
  • 2 replies
  • 1 kudos

Output Not Displaying in Databricks Notebook on All-Purpose Compute Cluster

Hello All, I'm encountering an issue where output from standard Python commands such as print() or display(df) is not showing up correctly when running notebooks on an All-Purpose Compute cluster. Cluster details: Cluster Type: All-Purpose Compute; Runtime...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 1 kudos

Hi Surya, do you face this issue only with DBR 17.3 all-purpose clusters? Have you tried lower DBRs? If not, please try and let me know. Also, from the Run menu, try "Clear state and outputs," then re-run the cell on the same cluster to rule out st...

1 More Replies
spd_dat
by New Contributor III
  • 4477 Views
  • 2 replies
  • 0 kudos

Can you default to `execution-count: none` when stripping notebook outputs?

When committing to a git folder, IPYNB outputs are usually stripped, unless allowed by an admin setting and toggled by .databricks/commit_outputs. This sets {"execution-count": 0, ... } within the IPYNB metadata. Is there a way to set it instead to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks does not currently allow you to default to "execution_count": null (or "none") when stripping notebook outputs during a commit. The platform sets "execution_count": 0 as the default when outputs are stripped through their Git integration, ...
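If you need null instead of 0 today, one hedged workaround is to post-process the .ipynb yourself (for example in a pre-commit hook) before the file is committed; the path below is a placeholder:

import json

path = "notebooks/my_notebook.ipynb"  # placeholder path
with open(path) as f:
    nb = json.load(f)

# Rewrite every cell's execution_count; None is serialized as null in JSON.
for cell in nb.get("cells", []):
    if "execution_count" in cell:
        cell["execution_count"] = None

with open(path, "w") as f:
    json.dump(nb, f, indent=1)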

1 More Replies
ayush667787878
by New Contributor
  • 4018 Views
  • 2 replies
  • 1 kudos

Not able to install a library in the standard workspace while it works in Community Edition. Please help

I am not able to install a library in the standard version, while in Community Edition I am able to add a library using Compute. How do I install libraries in standard Databricks the same way as in Community Edition?

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

To install libraries in the normal (paid) version of Databricks, use the cluster management interface to add libraries to your compute resources. The process is similar to the Community Edition, but workspace policies and cluster access mode may rest...
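For a quick start, notebook-scoped installs also behave the same way in paid workspaces (subject to cluster policies); the package name below is a placeholder:

# Notebook-scoped install, run in a notebook cell attached to your cluster:
%pip install some-package==1.2.3

# Cluster-wide alternative: Compute > your cluster > Libraries > Install new,
# then choose PyPI, Maven, or an uploaded wheel.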

1 More Replies
ask005
by New Contributor
  • 2779 Views
  • 1 replies
  • 0 kudos

How to write ObjectId value using Spark connector 10.2.2

In the PySpark Mongo connector, while updating records, how do I handle _id as an ObjectId? Spark 3.2.4, Scala 2.13, Spark Mongo Connector 2.12-10.2.2

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To write an ObjectId value using Spark Mongo Connector 10.2.2 in PySpark while updating records, you must convert the ObjectId string into a special format. The Spark Mongo Connector does not automatically recognize a string as an ObjectId; it will o...
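As a rough, unverified sketch of that idea with the 10.x connector (the URI, database, collection, and column names are placeholders; please confirm the convertJson, operationType, and idFieldList options against the connector docs for your exact version):

from pyspark.sql import functions as F

# Placeholder input: a string column holding the 24-char hex ObjectId.
df = spark.createDataFrame(
    [("64b64c1f2f9b5c0012345678", "example")], ["id_str", "payload"]
)

# Represent _id as MongoDB extended JSON so the connector can parse it into an ObjectId.
updates = df.withColumn(
    "_id", F.concat(F.lit('{"$oid": "'), F.col("id_str"), F.lit('"}'))
)

(updates.write
    .format("mongodb")
    .mode("append")
    .option("connection.uri", "mongodb://host:27017")   # placeholder URI
    .option("database", "mydb")
    .option("collection", "mycoll")
    .option("convertJson", "objectOrArrayOnly")          # parse extended-JSON strings on write
    .option("operationType", "update")                   # update rather than insert
    .option("idFieldList", "_id")                        # match documents on _id
    .save())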

dndeng
by New Contributor II
  • 805 Views
  • 4 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per task in each job every time it was executed (daily), but I am currently getting very large numbers due to duplicates. Can someone help me? WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing inflated cost numbers because your query groups by many columns—especially run_id, task_key, usage_start_time, and usage_end_time—without addressing possible duplicate row entries that arise from your joins, especially with the system....
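One hedged way to structure this is to aggregate usage per job per day before joining the price dimension, so the join cannot multiply rows; task-level attribution via the system.lakeflow tables would follow the same pattern. Field names below follow system.billing.usage and system.billing.list_prices but should be checked against your schema:

# Sketch: pre-aggregate usage, keep one current price row per SKU, then join.
daily_job_cost = spark.sql("""
    WITH usage AS (
        SELECT workspace_id,
               usage_metadata.job_id AS job_id,
               usage_date,
               sku_name,
               SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        WHERE usage_metadata.job_id IS NOT NULL
        GROUP BY ALL
    ),
    prices AS (
        SELECT sku_name, pricing.default AS unit_price
        FROM system.billing.list_prices
        WHERE price_end_time IS NULL   -- current price only, avoids join fan-out
    )
    SELECT u.workspace_id, u.job_id, u.usage_date,
           SUM(u.dbus * p.unit_price) AS estimated_cost
    FROM usage u
    JOIN prices p USING (sku_name)
    GROUP BY ALL
""")
daily_job_cost.display()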

3 More Replies
lmorrissey
by New Contributor II
  • 5329 Views
  • 1 replies
  • 0 kudos

GC Allocation Failure

There are a couple of related posts here and here. Seeing a similar issue with a long-running job. Processes are in a "RUNNING" state, the cluster is active, but the stdout log shows the dreaded GC Allocation Failure. Env: I've set the following on the config:...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

A persistent "GC Allocation Failure" in Spark jobs, where processes are stuck in the RUNNING state even after attempts to clear cache and enforce GC, typically indicates ongoing memory pressure, possible data skew, or excessive memory use on the driv...

itt
by New Contributor II
  • 5405 Views
  • 3 replies
  • 0 kudos

Graceful shutdown - stopping stream at the end of microbatch

I'm trying to create a system where I let Spark finish the current microbatch and let it know it should stop after it. The reason is that I don't want to re-calculate a microbatch by "forcefully" stopping a stream. Is there a way Spark/Databricks...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

There is no built-in Spark or Databricks method to gracefully stop a Structured Streaming query specifically at the end of the current microbatch, but several community and expert discussions propose common strategies to achieve this: Official and Co...
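One community-style pattern is to poll for an external stop signal and only call stop() while no trigger is active, so the query halts between microbatches. The paths and table names below are placeholders, and the check is best effort because a new trigger can begin between the status check and the stop call:

import time

# Placeholder streaming query: bronze -> silver table copy.
query = (spark.readStream.table("main.bronze.events")
              .writeStream
              .option("checkpointLocation", "/Volumes/main/ops/chk_events")
              .toTable("main.silver.events"))

def stop_requested():
    # Placeholder signal: a marker path created by whoever requests the shutdown.
    try:
        dbutils.fs.ls("/Volumes/main/ops/stop_flag")
        return True
    except Exception:
        return False

while query.isActive:
    if stop_requested() and not query.status["isTriggerActive"]:
        query.stop()   # nothing is mid-flight, so no microbatch is recomputed on restart
        break
    time.sleep(10)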

2 More Replies
Austin1
by New Contributor
  • 4600 Views
  • 1 replies
  • 0 kudos

VSCode Integration for Data Science Analysts

Probably not posting this in the right forum, but I can't find a good fit. This is a bit convoluted because we make things hard at work. I have access to a single LLM via VSCode (Amazon Q). Since I can't use that within Databricks but I want my team to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

It’s a smart move to raise this question before investing lots of time—because with the Databricks VSCode extension, there are indeed specific limitations when it comes to accessing shared workspace folders that weren't originally created by the exte...

thomas-totter
by New Contributor III
  • 2008 Views
  • 5 replies
  • 4 kudos

NativeADLGen2RequestComparisonHandler: Error in request comparison (when running DLT)

For at least two weeks (but probably even longer), our DLT pipeline has been posting error messages to log4j (driver logs) like the one below. I tried with both channels (preview, current), switched between serverless and classic compute, and started the pipeli...

Latest Reply
mark_ott
Databricks Employee
  • 4 kudos

The error message you are observing in your DLT pipeline logs, specifically: java.lang.NumberFormatException: For input string: "Fri, 29 Aug 2025 09:02:07 GMT", suggests that something in your pipeline (likely a library or code respo...

4 More Replies
chinmay0924
by New Contributor III
  • 1715 Views
  • 4 replies
  • 2 kudos

mapInPandas not working in serverless compute

Using `mapInPandas` in serverless compute (Environment version 2) gives the following error,```Py4JError: An error occurred while calling o543.mapInPandas. Trace: py4j.Py4JException: Method mapInPandas([class org.apache.spark.sql.catalyst.expressions...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

The error you are seeing when using mapInPandas in serverless compute with Environment version 2 is due to an incompatibility in the environment's supported Spark features. Specifically, Environment version 2 on serverless compute does not support ma...
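For reference, a minimal mapInPandas example to re-test once you are on classic compute or a newer serverless environment version:

import pandas as pd

df = spark.range(10)

def double_ids(batches):
    # Receives an iterator of pandas DataFrames and yields transformed ones.
    for pdf in batches:
        pdf["id"] = pdf["id"] * 2
        yield pdf

df.mapInPandas(double_ids, schema="id long").show()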

3 More Replies
ChsAIkrishna
by Contributor
  • 5445 Views
  • 2 replies
  • 1 kudos

Vnet Gateway issues on Power bi Conn

Team, we are getting frequent VNet gateway failures on a Power BI dataset using DAX (simple DAX, not complex), and upon rerun it works. Is there any permanent fix for this? Error: {"error":{"code":"DM_GWPipeline_Gateway_MashupDataAccessError","pbi.error...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Frequent VNet gateway errors in Power BI related to “DM_GWPipeline_Gateway_MashupDataAccessError” and memory allocation issues often stem from resource limits, configuration problems, or inefficient modeling—even with simple DAX. No single “permanent...

1 More Replies
swapnilmd
by New Contributor II
  • 5106 Views
  • 2 replies
  • 0 kudos

How to handle "Error parsing WKT: Invalid coordinate value '180' found at position"

DBR Version: 16.2; spark.databricks.geo.st.enabled true. SQL query I am running: %sql WITH points ( SELECT st_astext(st_point(30D, 10D)) AS point_geom UNION SELECT st_astext(st_point(10D, 90D)) AS point_geom UNION SELECT st_astext(st_point(4...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error occurs because Databricks (based on GEOS/OGC standards) expects coordinates in Well-Known Text (WKT) to fall into valid ranges: longitude (X, the first coordinate) must satisfy −180 ≤ X ≤ 180, and latitude (Y, the second coordinate) must satisfy −90 ≤ Y ≤ 90 ...
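A quick way to sanity-check values, as a sketch (note that st_point takes longitude first, then latitude, so a call like st_point(10D, 180D) likely fails because 180 is out of range for latitude):

spark.sql("""
    SELECT st_astext(st_point(30D, 10D))    AS ok_point,
           st_astext(st_point(-180D, 90D))  AS boundary_point
""").show()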

1 More Replies