Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

minhhung0507
by Valued Contributor
  • 170 Views
  • 1 reply
  • 1 kudos

Could not reach driver of cluster

I am running a pipeline job in Databricks and it failed with the following message: "Run failed with error message Could not reach driver of cluster 5824-145411-p65jt7uo." This message is not very descriptive, and I am not able to identify the root ca...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @minhhung0507, typically this error appears when there is a high load on the driver node. Another cause is heavy garbage collection on the driver node, as well as high memory and CPU usage that leads to throttling and prevents the driv...

elgeo
by Valued Contributor II
  • 6204 Views
  • 7 replies
  • 8 kudos

Clean up _delta_log files

Hello experts. We are trying to clarify how to clean up the large number of files that accumulate in the _delta_log folder (JSON, CRC and checkpoint files). We went through the related posts in the forum and followed the below: SET spark.da...

Latest Reply
michaeljac1986
New Contributor II

What you’re seeing is expected behavior — the _delta_log folder always keeps a history of JSON commit files, checkpoint files, and CRCs. Even if you lower delta.logRetentionDuration and run VACUUM, cleanup won’t happen immediately. A couple of points...
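For anyone landing here later, a minimal sketch of the retention setting discussed above, run from a notebook where spark is the provided SparkSession (the table name and the 7-day value are placeholders; log files are only pruned when a new checkpoint is written after the retention window, so the cleanup is not immediate):

# Placeholder table name; shortens how long Delta keeps _delta_log history (default is 30 days).
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_table
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')
""")

# VACUUM removes unreferenced data files; the JSON/CRC/checkpoint files in _delta_log
# are cleaned up later, when a checkpoint is written after the retention window passes.
spark.sql("VACUUM my_catalog.my_schema.my_table")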

6 More Replies
erigaud
by Honored Contributor
  • 9796 Views
  • 7 replies
  • 6 kudos

Resolved! SFTP Autoloader

Hello, I am wondering whether it is possible to ingest files from an SFTP server using Auto Loader, or do I have to first copy the files to DBFS and then use Auto Loader on that location? Thank you!
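If a copy step does turn out to be needed, here is a rough sketch of the copy-first approach, assuming paramiko for the SFTP download and a Unity Catalog volume as the landing path (host, credentials, format and paths are all placeholders):

import paramiko

# Pull new files from the SFTP server into a landing location Auto Loader can read.
transport = paramiko.Transport(("sftp.example.com", 22))      # placeholder host/port
transport.connect(username="user", password="secret")         # placeholder credentials
sftp = paramiko.SFTPClient.from_transport(transport)
for name in sftp.listdir("/outgoing"):
    sftp.get(f"/outgoing/{name}", f"/Volumes/main/raw/landing/{name}")
sftp.close()
transport.close()

# Then point Auto Loader at the landing path as usual (spark is the notebook session).
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/sftp")
      .load("/Volumes/main/raw/landing/"))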

Latest Reply
Anonymous
Not applicable

Hi @erigaud, we haven't heard from you since the last response from @BriceBuso, and I was checking back to see if her suggestions helped you. Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others. Al...

6 More Replies
chiruinfo5262
by New Contributor II
  • 564 Views
  • 4 replies
  • 0 kudos

Trying to convert Oracle SQL to Databricks SQL but not getting the desired output

ORACLE SQL: COUNT(CASE WHEN TRUNC(WORKORDER.REPORTDATE) BETWEEN SELECTED_PERIOD_START_DATE AND SELECTED_PERIOD_END_DATE THEN 1 END) SELECTED_PERIOD_BM, COUNT(CASE WHEN TRUNC(WORKORDER.REPORTDATE) BETWEEN COMPARISON_PERIOD_START_DATE AND COMPARISON_...

Latest Reply
Granty
New Contributor II

This is a helpful comparison! I've definitely run into similar date formatting issues when migrating queries. The Oracle TRUNC function and Databricks' DATE_FORMAT/CAST combo can be tricky to reconcile. Speaking of needing a break after debugging SQL...
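For what it's worth, a minimal sketch of one way the TRUNC-based counts from the question can be expressed in Databricks SQL using CAST to DATE (identifiers are taken from the snippet above; the period boundaries are assumed to be columns or substituted parameters):

counts = spark.sql("""
    SELECT
      COUNT(CASE WHEN CAST(workorder.reportdate AS DATE)
                 BETWEEN selected_period_start_date AND selected_period_end_date
                 THEN 1 END) AS selected_period_bm,
      COUNT(CASE WHEN CAST(workorder.reportdate AS DATE)
                 BETWEEN comparison_period_start_date AND comparison_period_end_date
                 THEN 1 END) AS comparison_period_bm
    FROM workorder
""")
counts.show()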

3 More Replies
james_
by New Contributor II
  • 293 Views
  • 5 replies
  • 0 kudos

Low worker utilisation in Spatial SQL

I am finding low worker node utilization when using Spatial SQL features. My cluster is DBR 17.1 with 2x workers and Photon enabled. When I view the cluster metrics, they consistently show one worker around 30-50% utilized, the driver around 15-20%, a...

Latest Reply
james_
New Contributor II

Thank you again, @-werners- . I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!

4 More Replies
Yousry_Ibrahim
by New Contributor II
  • 763 Views
  • 8 replies
  • 4 kudos

Resolved! Directories added to the Python sys.path do not always work on executors in shared access mode

Let's assume we have a workspace folder containing two Python files. module1 with a simple addition function: def add_numbers(a, b): return a + b; and module2 with a dummy PySpark custom data source: from pyspark.sql.datasource import DataSource, DataSource...

Latest Reply
Yousry_Ibrahim
New Contributor II

Hi all, thanks for the feedback and proposed ideas. @szymon_dybczak, your idea of relative imports works when the module is hosted in a child directory of the currently running notebook. It does not work if we need to go up one or two directories and navi...
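For context, the pattern under discussion looks roughly like the sketch below (the workspace path is a placeholder). Appending to sys.path this way takes effect on the driver, but code imported by the executors (UDFs, custom data sources) is not guaranteed to see the extra entry in standard/shared access mode, which is why packaging the modules as a wheel tends to be the more robust route:

import sys

# Placeholder path: a shared folder that is not a child of the running notebook's directory.
sys.path.append("/Workspace/Users/someone@example.com/shared_libs")

from module1 import add_numbers   # resolves fine on the driver
print(add_numbers(1, 2))

# Anything imported inside a UDF or the custom data source runs on the executors,
# where this sys.path entry may not be present under shared access mode.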

7 More Replies
ScottH
by New Contributor II
  • 526 Views
  • 3 replies
  • 3 kudos

Resolved! Installing Marketplace Listing via Python SDK...

I am trying to use the Databricks Python SDK to install a Databricks Marketplace listing to Unity Catalog. I am getting stuck on how to provide a valid consumer terms version when passing the "accepted_consumer_terms" parameter to the w.consumer_inst...

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @ScottH, it took me about two hours to get it right, but here it is. You need to provide a valid date. And where does that date come from? It comes from the consumer listing: listings = w.consumer_listings.get(id='e913bea3-9a37-446c...
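Piecing together only the calls quoted in this thread, the flow looks roughly like the sketch below; the listing id is a placeholder, and how the returned consumer-terms date is threaded into w.consumer_installations.create(...) via accepted_consumer_terms is an assumption based on the description above, so verify it against your SDK version:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Fetch the Marketplace listing; the consumer terms version (a date string) lives on this object.
listing = w.consumer_listings.get(id="<your-listing-id>")   # placeholder id
print(listing)   # inspect the payload to locate the consumer terms version date

# That date is what the accepted_consumer_terms parameter of the installation call
# (w.consumer_installations.create, truncated in the original post) expects.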

2 More Replies
der
by Contributor
  • 380 Views
  • 2 replies
  • 2 kudos

DBR 17.1 Spatial SQL Functions and Apache Sedona

I noticed in the DBR 17.1 release notes that ST geospatial functions are now in public preview - great news for us since this means native support in Databricks. https://docs.databricks.com/aws/en/release-notes/runtime/17.1#expanded-spatial-sql-expres...

Latest Reply
mjohns
Databricks Employee

Here are a few answers, feel free to hit me up on LinkedIn (michaeljohns2) if you want to discuss more particulars wrt Databricks geospatial. Looks like Sedona 1.8.0 is the release to watch for with Spark 4.0 support, see https://github.com/apache/se...
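Not from the thread itself, but as a quick taste of the preview: assuming DBR 17.1 exposes OGC-style ST expressions such as st_point and st_distance, as the linked release notes suggest (double-check the exact function names there), a trivial smoke test could look like this:

# Requires a DBR 17.1+ cluster with the spatial SQL preview; the function names used here
# are assumptions based on the "expanded spatial SQL expressions" release note.
spark.sql("""
    SELECT st_distance(st_point(12.49, 41.89), st_point(2.35, 48.86)) AS planar_distance
""").show()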

1 More Replies
mikvaar
by New Contributor III
  • 866 Views
  • 4 replies
  • 1 kudos

Resolved! DLT Pipelines with DABs - Support for tags field?

Hi all, I'm working with DABs and trying to define tags for DLT pipelines in the bundle YAML config. However, adding a `tags:` block under the pipeline results in the following warning: Warning: unknown field: tags. This suggests that tags might not be...

Latest Reply
nikhilj0421
Databricks Employee

Hi @mikvaar, yes, tags are not supported in DABs yet, but this is on the roadmap. The ETA is around the first or second week of June.

3 More Replies
DRock
by New Contributor II
  • 3200 Views
  • 7 replies
  • 0 kudos

Resolved! ODBC data source to connect to a Databricks catalog.database via MS Access Not Working

When using an ODBC data source to connect to a Databricks catalog database via Microsoft Access, the tables are not listed/do not appear in the MS Access database for selection. However, when using the same ODBC data source to connect to Microsoft Excel,...

Latest Reply
Senefelder
New Contributor II

Why do «Databricks employees» keep answering with the same AI-generated reply when that obviously is not the solution? Has anyone been able to come up with a solution that actually works?

6 More Replies
noorbasha534
by Valued Contributor II
  • 212 Views
  • 2 replies
  • 0 kudos

Databricks job calling DBT - persist job name

Hello all, is it possible to persist the Databricks job name into the Brooklyn audit tables data model when a Databricks job calls a DBT model? Currently, my colleagues persist audit information into fact & dimensional tables of the Brooklyn data model....

Latest Reply
Yogesh_378691
Contributor

Yes, it’s possible to include the Databricks job name in your Brooklyn audit tables, but it won’t happen automatically. Right now, only the job run ID is being logged, so you’d need to extend your audit logic a bit. One common approach is to pass the...

1 More Replies
auso
by New Contributor
  • 2209 Views
  • 3 replies
  • 2 kudos

Asset Bundles: Shared libraries and notebooks in monorepo multi-bundle setup

I am part of a small team of Data Engineers which started using Databricks Asset Bundles one year ago. Our code base consists of typical ETL-workloads written primarily in Jupyter notebooks (.ipynb), and jobs (.yaml) with our codebase spanning across...

Latest Reply
-werners-
Esteemed Contributor III

1. The easiest way to do this is to package your shared libraries into a wheel (assuming you use Python). That way you do not have to mess with the PYTHONPATH, and you can install these libs automatically on any cluster (via policies or DABs or what...
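To make the wheel suggestion concrete, a minimal sketch of a setup.py for the shared-libraries repo (names and layout are placeholders); build it with python -m build or pip wheel ., then attach the resulting .whl to clusters via policies, DABs, or the libraries API:

# setup.py at the root of the shared-libraries repo (src layout assumed).
from setuptools import setup, find_packages

setup(
    name="my_shared_libs",            # placeholder package name
    version="0.1.0",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    install_requires=[],              # add runtime dependencies here
)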

2 More Replies
yit
by Contributor
  • 244 Views
  • 3 replies
  • 2 kudos

Resolved! Autoloader: Trigger batch vs micro-batch (as in .forEachBatch)

Hey everyone, I'm trying to clarify a confusion in Auto Loader regarding trigger batches and micro-batches when using .forEachBatch. Here's what I understand so far: Trigger batch – controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTr...

Data Engineering
autoloader
batch
micro-batch
spark
Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @yit, 1. They are not quite the same. A trigger batch defines how many new files Auto Loader lists for ingestion per streaming trigger (this is controlled, as you correctly pointed out, by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigge...
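To ground the distinction, a minimal sketch (paths, format and the file limit are placeholders): the cloudFiles options cap how much Auto Loader lists per trigger, and each resulting micro-batch is exactly what foreachBatch hands to your function:

def process_batch(batch_df, batch_id):
    # One micro-batch per streaming trigger lands here.
    print(f"batch {batch_id}: {batch_df.count()} rows")

(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.maxFilesPerTrigger", "100")          # caps files listed per trigger
      .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
      .load("/Volumes/main/raw/events/")
      .writeStream
      .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")
      .foreachBatch(process_batch)
      .start())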

2 More Replies
xavier_db
by New Contributor II
  • 133 Views
  • 1 reply
  • 1 kudos

Postgres Lakeflow Connect

I want to get data from Postgres using Lakeflow Connect every 10 minutes. How do I set up Lakeflow Connect? Can you give a step-by-step process for creating a Lakeflow Connect pipeline?

Latest Reply
szymon_dybczak
Esteemed Contributor III

Hi @xavier_db, the Postgres Lakeflow connector is currently in private preview according to the thread below: Solved: Lakeflow Connect - Postgres connector - Databricks Community - 127633. But the thing is, I cannot see it in Workspace Preview and Account Previe...

ck7007
by New Contributor III
  • 275 Views
  • 3 replies
  • 3 kudos

Advanced Technique

Reduced Monthly Databricks Bill from $47K to $12.7K
The Problem: We were scanning 2.3TB for queries needing only 8GB of data.
Three Quick Wins
1. Multi-dimensional Partitioning (30% savings)
# Before
df.write.partitionBy("date").parquet(path)
# After: parti...
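For readers skimming, the before/after in the excerpt presumably boils down to something like the following sketch (the second partition column, "region", is just an illustrative guess at what the truncated snippet uses):

# Before: a single partition column, so most queries still scan far more data than they need.
df.write.partitionBy("date").parquet(path)

# After (sketch): partition by the columns your filters actually use,
# e.g. date plus a second low-cardinality column.
df.write.partitionBy("date", "region").parquet(path)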

Latest Reply
BS_THE_ANALYST
Esteemed Contributor

@ck7007 no worries. I asked a question on the other thread: https://community.databricks.com/t5/data-engineering/cost/td-p/130078 . I'm not sure if you're classing this thread as the duplicate or the other one, so I'll repost. I didn't see you mention ...

2 More Replies
