Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

dndeng
by New Contributor II
  • 267 Views
  • 4 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per Task in each Job every time it was executed (daily), but I am currently getting hugely inflated numbers due to duplicates. Can someone help me? WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing inflated cost numbers because your query groups by many columns—especially run_id, task_key, usage_start_time, and usage_end_time—without addressing possible duplicate row entries that arise from your joins, particularly with the system....
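
As a minimal sketch of that fix, assuming the standard system.billing.usage schema (job_id and job_run_id live in the usage_metadata struct; the exact dedup key is an assumption to adapt to your joins), de-duplicate the usage rows before joining and aggregating so each record is counted once:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: count each usage row once before joining dimension tables.
usage = spark.sql("""
    SELECT DISTINCT
        workspace_id,
        usage_metadata.job_id     AS job_id,
        usage_metadata.job_run_id AS run_id,
        usage_date,
        sku_name,
        usage_quantity
    FROM system.billing.usage
    WHERE usage_metadata.job_id IS NOT NULL
""")

daily_cost = (usage
              .groupBy("workspace_id", "job_id", "run_id", "usage_date")
              .sum("usage_quantity"))
daily_cost.show()
```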

3 More Replies
lmorrissey
by New Contributor II
  • 4007 Views
  • 1 reply
  • 0 kudos

GC Allocation Failure

There are a couple of related posts here and here. Seeing a similar issue with a long-running job. Processes are in a "RUNNING" state, the cluster is active, but the stdout log shows the dreaded GC Allocation Failure. Env: I've set the following on the config:...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

A persistent "GC Allocation Failure" in Spark jobs, where processes are stuck in the RUNNING state even after attempts to clear cache and enforce GC, typically indicates ongoing memory pressure, possible data skew, or excessive memory use on the driv...
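
A minimal sketch of those mitigations, with illustrative (not tuned) values; the table name and partition count are assumptions:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable adaptive execution so skewed shuffle partitions get split.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Spill to disk instead of keeping everything on-heap, easing GC pressure.
df = spark.read.table("my_catalog.my_schema.big_table")  # hypothetical table
df = df.repartition(400)                                 # illustrative count
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.count())
```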

itt
by New Contributor II
  • 4176 Views
  • 3 replies
  • 0 kudos

Graceful shutdown - stopping stream at the end of microbatch

I'm trying to create a system where I let Spark finish the current microbatch and let it know it should stop after it. The reason is that I don't want to re-calculate a microbatch by "forcefully" stopping a stream. Is there a way Spark/Databricks...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

There is no built-in Spark or Databricks method to gracefully stop a Structured Streaming query specifically at the end of the current microbatch, but several community and expert discussions propose common strategies to achieve this: Official and Co...
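
One widely shared community pattern (not an official API) is a marker-file loop: the driver polls for a flag and, once it appears, drains pending data and stops the query between microbatches. The source, sink, and paths below are assumptions:

```python
import os
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stream; replace source/sink/paths with your own.
query = (spark.readStream.format("rate").load()
         .writeStream.format("noop")
         .option("checkpointLocation", "/tmp/_graceful_ckpt")
         .start())

while query.isActive:
    if os.path.exists("/dbfs/tmp/stop_stream"):   # external "please stop" flag
        query.processAllAvailable()               # finish queued microbatches
        query.stop()                              # then shut down cleanly
    time.sleep(10)
```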

2 More Replies
Austin1
by New Contributor
  • 3818 Views
  • 1 reply
  • 0 kudos

VSCode Integration for Data Science Analysts

Probably not posting this in the right forum, but I can't find a good fit. This is a bit convoluted because we make things hard at work. I have access to a single LLM via VSCode (Amazon Q). Since I can't use that within Databricks but I want my team to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

It’s a smart move to raise this question before investing lots of time—because with the Databricks VSCode extension, there are indeed specific limitations when it comes to accessing shared workspace folders that weren't originally created by the exte...

thomas-totter
by New Contributor III
  • 1139 Views
  • 5 replies
  • 4 kudos

NativeADLGen2RequestComparisonHandler: Error in request comparison (when running DLT)

For at least two weeks (but probably even longer), our DLT pipeline has been posting error messages like the one below to log4j (driver logs). I tried both channels (preview, current), switched between serverless and classic compute, and started the pipeli...

Latest Reply
mark_ott
Databricks Employee
  • 4 kudos

The error message you are observing in your DLT pipeline logs, specifically: java.lang.NumberFormatException: For input string: "Fri, 29 Aug 2025 09:02:07 GMT" suggests that something in your pipeline (likely the library or code respo...
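
As an illustration of the failure mode (not the internal Databricks code path), here is a Python analogue: a field that expects a number is handed an RFC 1123 timestamp, which only a date parser can handle:

```python
from datetime import datetime, timezone

s = "Fri, 29 Aug 2025 09:02:07 GMT"

# int(s) raises ValueError here, the analogue of java.lang.NumberFormatException.
dt = datetime.strptime(s, "%a, %d %b %Y %H:%M:%S %Z").replace(tzinfo=timezone.utc)
print(int(dt.timestamp() * 1000))  # epoch millis, the kind of value a numeric parser expects
```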

4 More Replies
chinmay0924
by New Contributor III
  • 856 Views
  • 4 replies
  • 1 kudos

mapInPandas not working in serverless compute

Using `mapInPandas` in serverless compute (Environment version 2) gives the following error: ```Py4JError: An error occurred while calling o543.mapInPandas. Trace: py4j.Py4JException: Method mapInPandas([class org.apache.spark.sql.catalyst.expressions...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

The error you are seeing when using mapInPandas in serverless compute with Environment version 2 is due to an incompatibility in the environment's supported Spark features. Specifically, Environment version 2 on serverless compute does not support ma...
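
For reference, a minimal mapInPandas call that should reproduce the issue; per the reply, run it on classic compute or a serverless environment version that supports the API (exactly which version that is, is an assumption to verify against the docs):

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

def double_id(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Receives rows as pandas DataFrame batches; yields transformed batches.
    for pdf in batches:
        pdf["id"] = pdf["id"] * 2
        yield pdf

df.mapInPandas(double_id, schema="id long").show()
```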

3 More Replies
ChsAIkrishna
by Contributor
  • 4240 Views
  • 2 replies
  • 1 kudos

Vnet Gateway issues on Power bi Conn

Team, we are getting frequent VNet gateway failures on a Power BI dataset using DAX (simple DAX, not complex); on rerun it works. Is there a permanent fix for this? Error: {"error":{"code":"DM_GWPipeline_Gateway_MashupDataAccessError","pbi.error...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Frequent VNet gateway errors in Power BI related to “DM_GWPipeline_Gateway_MashupDataAccessError” and memory allocation issues often stem from resource limits, configuration problems, or inefficient modeling—even with simple DAX. No single “permanent...

1 More Replies
swapnilmd
by New Contributor II
  • 3841 Views
  • 2 replies
  • 0 kudos

How to handle , Error parsing WKT: Invalid coordinate value '180' found at position

DBR Version: 16.2
spark.databricks.geo.st.enabled true
SQL query I am running:
%sql WITH points ( SELECT st_astext(st_point(30D, 10D)) AS point_geom UNION SELECT st_astext(st_point(10D, 90D)) AS point_geom UNION SELECT st_astext(st_point(4...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error occurs because Databricks (based on GEOS/OGC standards) expects coordinates in Well-Known Text (WKT) to fall within valid ranges: Longitude (X, the first coordinate): −180 ≤ X ≤ 180. Latitude (Y, the second coordinate): −90 ≤ Y ≤ 90...
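
A minimal sketch of guarding against out-of-range coordinates before constructing points (values are illustrative; note ST_Point takes x = longitude first, y = latitude second):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows with lon outside [-180, 180] or lat outside [-90, 90] are filtered out
# before ST_Point ever sees them, avoiding the WKT parse error.
df = spark.sql("""
    SELECT st_astext(st_point(lon, lat)) AS point_geom
    FROM VALUES (30.0, 10.0), (10.0, 90.0), (181.0, 45.0) AS t(lon, lat)
    WHERE lon BETWEEN -180 AND 180
      AND lat BETWEEN -90 AND 90
""")
df.show(truncate=False)
```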

1 More Replies
mkEngineer
by New Contributor III
  • 5105 Views
  • 3 replies
  • 0 kudos

How to Version & Deploy Databricks Workflows with Azure DevOps (CI/CD)?

Hi everyone,I’m trying to set up versioning and CI/CD for my Databricks workflows using Azure DevOps and Git. While I’ve successfully versioned notebooks in a Git repo, I’m struggling with handling workflows (which define orchestration, dependencies,...

Latest Reply
mkEngineer
New Contributor III
  • 0 kudos

As of now, my current approach is to manually copy/paste YAMLs across workspaces and version them using Git/Azure DevOps by saving them as DBFS files. The CD process is then handled using Databricks DBFS File Deployment by Data Thirst Ltd. While this ...
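
One scriptable alternative to hand-copying, sketched with the Databricks Python SDK (the job ID and output path are hypothetical): pull the job's settings and commit the JSON to Git.

```python
import json

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()                 # auth from env vars or a config profile
job = w.jobs.get(job_id=123456)       # hypothetical job ID

# Serialize the workflow definition so it can be versioned in Git.
with open("jobs/my_workflow.json", "w") as f:
    json.dump(job.settings.as_dict(), f, indent=2, sort_keys=True)
```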

2 More Replies
KurtWang
by New Contributor
  • 3668 Views
  • 1 reply
  • 0 kudos

UCX error with databricks labs ucx create-table-mapping

Hi, I am using UCX for Unity Catalog migration and am up to the table migration step. When I run the command databricks labs ucx create-table-mapping, it returns the error message 'ERROR [src/databricks/labs/ucx.create-table-mapping] ValueError: Pleas...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing two related errors during table migration with Databricks Labs UCX for Unity Catalog:
  • For create-table-mapping: “ValueError: Please run as account-admin: databricks labs ucx sync-workspace-info”
  • For sync-workspace-info: “Error: entrypo...

sgaud
by New Contributor
  • 3649 Views
  • 1 reply
  • 0 kudos

(java.util.NoSuchElementException) key not found: date_of_birth#1554

I have a SQL function which calls another function. For some reason, when I run the notebook that calls that SQL function standalone, it works fine. When I run that notebook as part of a Workflow job, it fails every time with that error. What is caus...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Your error is caused by a missing field or column (date_of_birth#1554) in your Spark SQL logical plan, which is needed during query optimization. This issue happens only when running the notebook in a Workflow job because of subtle differences betwee...

tp992
by New Contributor II
  • 3626 Views
  • 2 replies
  • 0 kudos

Using pyspark databricks UDFs with outside function imports

Problem with minimal example: the minimal example below does not run locally with databricks-connect==15.3 but does run within the Databricks workspace.
main.py
from databricks.connect import DatabricksSession from module.udf import send_message, send_compl...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The core issue is that PySpark UDFs require their entire closure—including any helper functions they call, such as _get_greeting—to be serializable and available on the worker nodes. In Databricks Workspace, the module distribution and packaging are ...
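
A minimal sketch of the self-contained-closure workaround, assuming a helper like _get_greeting from the original module: define the helper inside the UDF so databricks-connect has nothing external to ship.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, udf

spark = SparkSession.builder.getOrCreate()

@udf("string")
def send_message(name: str) -> str:
    # Helper lives inside the closure, so it is pickled with the UDF and
    # needs no module distribution on the cluster.
    def _get_greeting(n: str) -> str:
        return f"Hello, {n}!"
    return _get_greeting(name)

spark.range(1).select(send_message(lit("world"))).show()
```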

1 More Replies
eyalholzmann
by New Contributor
  • 69 Views
  • 2 replies
  • 1 kudos

Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I’m trying to understand how metadata cleanup works in this setup.Specifically, does the VACUUM operation—which removes old Delta Lake metadata based ...

Latest Reply
eyalholzmann
New Contributor
  • 1 kudos

Which actions should be used to clean up and maintain Iceberg metadata?
  • expireSnapshots: Is it recommended to delete old snapshots using the same retention period as the Delta table?
  • deleteOrphanFiles: This deletes unreferenced Iceberg metadata as well...

1 More Replies
minhhung0507
by Valued Contributor
  • 1747 Views
  • 5 replies
  • 1 kudos

DeltaFileNotFoundException: [DELTA_TRUNCATED_TRANSACTION_LOG] Error in Streaming Table with Minimal

Dear Databricks Experts,I am encountering a recurring issue while working with Delta streaming tables in my system. The error message is as follows: com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_TRUNCATED_TRANSACTION_LOG] gs...

Latest Reply
gbrueckl
Contributor II
  • 1 kudos

I would assume it is trying to read v899 because you read up until v898 in the last [streaming] batch and stored the state in the streaming checkpoint. Now, if you run the code again and continue the stream, it tries to pick up from the first versi...
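
If the root cause is the Delta log being cleaned up faster than the stream resumes, one common mitigation (table name and intervals are illustrative, not recommendations) is to lengthen the source table's retention so checkpointed versions stay resolvable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep more Delta log history so a resumed stream can still find its
# checkpointed start version; tune the intervals to your restart cadence.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.source_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 14 days'
    )
""")
```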

4 More Replies
GANAPATI_HEGDE
by New Contributor III
  • 73 Views
  • 2 replies
  • 0 kudos

Unable to configure custom compute for DLT pipeline

I am trying to configure a cluster for a pipeline like the above; however, DLT keeps using the small cluster as usual. How do I resolve this?

Latest Reply
GANAPATI_HEGDE
New Contributor III
  • 0 kudos

I updated my CLI and deployed the job, but I still don't see the cluster updates in the pipeline.

1 More Replies
