Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Danish11052000
by New Contributor II
  • 27 Views
  • 2 replies
  • 0 kudos

Looking for Advice: Robust Backup Strategy for Databricks System Tables

Hi, I'm planning to build a backup system for all Databricks system tables (audit, usage, price, history, etc.) to preserve data beyond retention limits. Currently, I'm using Spark Streaming with readStream + writeStream and checkpointing in LakeFlow ...
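For context, a minimal sketch of the readStream + writeStream approach described above, assuming the system.billing.usage table and placeholder backup table and checkpoint locations (not from the thread); `spark` is the ambient SparkSession of a Databricks notebook or job:

    # Sketch: incrementally copy one system table into a long-retention backup Delta table.
    # Target table and checkpoint path below are placeholders.
    (
        spark.readStream
            .table("system.billing.usage")                 # source system table
            .writeStream
            .option("checkpointLocation", "/Volumes/backup/system_tables/_checkpoints/usage")
            .trigger(availableNow=True)                    # process new rows, then stop (run on a schedule)
            .toTable("backup.system_tables.usage")         # copy retained beyond the free retention window
    )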

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Greetings @Danish11052000, here's a pragmatic way to choose, based on the nature of Databricks system tables and the guarantees you want. Bottom line: for ongoing replication to preserve data beyond free retention, a Lakeflow Declarative Pipeline w...

1 More Replies
pabloratache
by New Contributor
  • 49 Views
  • 4 replies
  • 2 kudos

Resolved! [FREE TRIAL] Missing All-Purpose Clusters Access - New Account

Issue Description: I created a new Databricks Free Trial account ("For Work" plan with $400 credits) but I don't have access to All-Purpose Clusters or PySpark compute. My workspace only shows SQL-only features. Current Setup: - Account Email: ronel.ra...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

Ah, got it @pabloratache, I did some digging and here is what I found (learned a few things myself). Thanks for the detailed context — this behavior is expected for the current Databricks 14-day Free Trial ("For Work" plan). What's happening with ...

3 More Replies
FarhanM
by Visitor
  • 28 Views
  • 1 replies
  • 0 kudos

Resolved! Databricks Streaming: Recommended Cluster Types and Best Practices

Hi Community, I recently built some streaming pipelines (Autoloader-based) that extract JSON data from the Data Lake and, after parsing and logging, dump it into the Delta Lake bronze layer. Since these are streaming pipelines, they are supposed to r...
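For reference, a minimal sketch of an Auto Loader stream of this shape (JSON from a landing path into a bronze Delta table); all paths and table names are placeholders, and `spark` is the notebook's ambient SparkSession:

    # Sketch: Auto Loader (cloudFiles) ingestion of raw JSON into the bronze layer.
    (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/Volumes/lake/bronze/_schemas/events")
            .load("/Volumes/lake/landing/events/")
            .writeStream
            .option("checkpointLocation", "/Volumes/lake/bronze/_checkpoints/events")
            .trigger(availableNow=True)    # batch-style trigger; use processingTime="1 minute" to keep it running
            .toTable("lake.bronze.events")
    )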

Latest Reply
biancadoesdata1
New Contributor II
  • 0 kudos

When running streaming pipelines, the key is to design for stability and isolation, not to rely on restart jobs. The first thing to do is run your streams on Jobs Compute, not All-Purpose clusters. If available, use Serverless Jobs. Each pipeline shou...

fundat
by New Contributor II
  • 148 Views
  • 2 replies
  • 2 kudos

Resolved! Course - Introduction to Apache Spark

Hi, in the course Introduction to Apache Spark (Apache Spark Runtime Architecture, page 6 of 15), it says: "The cluster manager allocates resources and assigns tasks... Workers perform tasks assigned by the driver." Can you help me plea...

[screenshot attachment: fundat_3-1761596488970.png]
Latest Reply
BS_THE_ANALYST
Esteemed Contributor III
  • 2 kudos

Hi @fundat, perhaps the picture is useful here. Give this blog a read; I think it will answer some of your questions: https://medium.com/@knoldus/understanding-the-working-of-spark-driver-and-executor-4fec0e669399. All the best, BS

1 More Replies
mkwparth
by New Contributor III
  • 70 Views
  • 2 replies
  • 1 kudos

Resolved! DLT | Communication lost with driver | Cluster was not reachable for 120 seconds

Hey Community, I'm facing this error. It says: "com.databricks.pipelines.common.errors.deployment.DeploymentException: Communication lost with driver. Cluster 1030-205818-yu28ft9s was not reachable for 120 seconds". This issue occurred in producti...

[screenshot attachment: mkwparth_0-1761892686441.png]
Latest Reply
nayan_wylde
Esteemed Contributor
  • 1 kudos

This is actually a known intermittent issue in Databricks, particularly with streaming or Delta Live Tables (DLT) pipelines. This isn't a logical failure in your code — it's an infrastructure-level timeout between the Databricks control plane and the ...

1 More Replies
CaptainJack
by New Contributor III
  • 64 Views
  • 1 replies
  • 0 kudos

Pull workspace URL and workspace name using databricks-sdk / programmatically in notebook

1. How could I pull the workspace URL (https://adb-XXXXX.XX.....net)? 2. How could I get the workspace name visible in the top right corner? I know the easiest solution is dbutils.notebook.entry_point.... browserHostName, but unfortunately it is not working in job c...

Latest Reply
AbhaySingh
Databricks Employee
  • 0 kudos

Can you give this a shot? Not sure if you have a hard requirement of using the SDK: workspace_url = spark.conf.get('spark.databricks.workspaceUrl'). Getting the name is trickier. You could potentially get it from tags if there is a tagging strategy in place...
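A short sketch combining the Spark conf above with the Databricks SDK, which also works in job context; treat the SDK attribute names as an assumption to verify against your databricks-sdk version:

    # Sketch: two ways to get the workspace URL from a notebook or job.
    # (`spark` is the ambient SparkSession on classic compute.)
    from databricks.sdk import WorkspaceClient
    
    # 1) Spark config
    workspace_url = spark.conf.get("spark.databricks.workspaceUrl")
    
    # 2) Databricks SDK, using the notebook/job's ambient authentication
    w = WorkspaceClient()
    host = w.config.host   # e.g. https://adb-XXXXX.XX.....net
    
    print(workspace_url, host)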

bidek56
by Contributor
  • 45 Views
  • 1 replies
  • 1 kudos

Resolved! Stack traces as standard error in job logs

When using DBR 16.4, I am seeing a lot of stack traces as standard error in jobs. Any idea why they are showing up and how to turn them off? Thx. "FlagSettingCacheMetricsTimer" id=18 state=WAITING - waiting on <0x2d1573c6> (a java.util.TaskQueue) - locke...

Latest Reply
bidek56
Contributor
  • 1 kudos

Setting spark.databricks.driver.disableJvmThreadDump=true will remove the stack traces.

bidek56
by Contributor
  • 73 Views
  • 0 replies
  • 0 kudos

Location of spark.scheduler.allocation.file

In DBR 16.4 LTS, I am trying to add the following Spark config: spark.scheduler.allocation.file: file:/Workspace/init/fairscheduler.xml. But the all-purpose cluster is throwing this error: Spark error: Driver down cause: com.databricks.backend.daemon.dri...

GiriSreerangam
by New Contributor III
  • 124 Views
  • 2 replies
  • 1 kudos

Resolved! org.apache.spark.SparkRuntimeException: [UDF_USER_CODE_ERROR.GENERIC]

Hi everyone, I am writing a small function with a Spark read from a CSV and a Spark write into a table. I could execute this function within the notebook. But when I register the same function as a Unity Catalog function and call it from Playground, i...

[screenshot attachment: GiriSreerangam_0-1761761391719.png]
Latest Reply
KaushalVachhani
Databricks Employee
  • 1 kudos

Hi @GiriSreerangam, You cannot use a Unity Catalog user-defined function (UDF) in Databricks to perform Spark read from a CSV and write to a table. Unity Catalog Python UDFs execute in a secure, isolated environment without access to the file system ...
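A minimal sketch of the same logic run as ordinary notebook or job code instead of a Unity Catalog UDF; the path and table name are placeholders, and `spark` is the notebook's SparkSession:

    # Sketch: CSV -> table load as plain notebook/job code, where Spark and file-system
    # access are available (unlike inside a UC Python UDF sandbox).
    def load_csv_to_table(csv_path: str, target_table: str) -> None:
        df = (
            spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(csv_path)
        )
        df.write.mode("append").saveAsTable(target_table)
    
    load_csv_to_table("/Volumes/main/raw/files/input.csv", "main.bronze.input_data")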

1 More Replies
JuliandaCruz
by New Contributor II
  • 365 Views
  • 4 replies
  • 0 kudos

Access to Databricks Volumes via Databricks Connect not working anymore

Hi all, I use the extension to debug my Python code regularly, and since yesterday accessing files in the Databricks Volume isn't working anymore. The situation in the UI of Databricks is as follows: When I execute a glob statement to list all zip-file...

Latest Reply
mmayorga
Databricks Employee
  • 0 kudos

Hi @JuliandaCruz, thank you for reaching out! I was able to reproduce your case while using Databricks Connect. The "Upload and Run file" option worked fine and returned results, which is essentially the same as running from the Databricks UI. Thou...
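For anyone hitting the same thing, a hedged sketch of listing Volume files from a local Databricks Connect session through dbutils via the SDK (a local glob() only sees the local file system, not /Volumes paths); the Volume path is a placeholder and the SDK attribute names should be verified against your databricks-sdk version:

    # Sketch: list files in a Unity Catalog Volume from a Databricks Connect session.
    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()                               # picks up the same auth as Databricks Connect
    entries = w.dbutils.fs.ls("/Volumes/main/raw/zips/")
    zip_paths = [e.path for e in entries if e.path.endswith(".zip")]
    print(zip_paths)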

3 More Replies
smoortema
by Contributor
  • 217 Views
  • 5 replies
  • 4 kudos

Resolved! When automatic liquid clustering is enabled, how to know which columns are used for clustering?

Let's say a table is configured to have automatic liquid clustering: ALTER TABLE table1 CLUSTER BY AUTO; How do we know which columns were chosen by Databricks?

Latest Reply
smoortema
Contributor
  • 4 kudos

From the documentation, it seems that in Python there is such an option, but only when creating or replacing a table: # To set clustering columns and auto, which serves as a way to give a hint for the initial selection. df.writeTo(...).using("delta") ...
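A quick way to check what was actually picked, as a sketch: DESCRIBE DETAIL on a Delta table reports a clusteringColumns field, which should reflect the columns chosen under CLUSTER BY AUTO (`spark` is the notebook's SparkSession).

    # Sketch: inspect which columns the table is currently clustered by.
    detail = spark.sql("DESCRIBE DETAIL table1")
    detail.select("clusteringColumns").show(truncate=False)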

4 More Replies
Jonathan_
by New Contributor III
  • 637 Views
  • 7 replies
  • 7 kudos

Slow PySpark operations after long DAG that contains many joins and transformations

We are using PySpark and notice that when we do many transformations/aggregations/joins of the data, at some point the execution time of simple tasks (count, display, union of 2 tables, ...) becomes very slow, even if we have small data (ex...

Latest Reply
Jonathan_
New Contributor III
  • 7 kudos

It's a cluster with 128 GB of memory; looking in the Spark UI, there is 54 GB for storage memory. Honestly, I don't think it's a memory issue. Like I said, it's small data, and if we do a checkpoint at some point and then continue, we don't have the problem afte...
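A minimal sketch of the checkpoint workaround mentioned above, which truncates the accumulated query plan so later simple actions stay fast; the checkpoint directory and the upstream function are placeholders, and `spark` is the notebook's SparkSession:

    # Sketch: break a very long lineage after a heavy chain of joins/aggregations.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder location
    
    df = build_heavy_join_pipeline()     # hypothetical upstream transformations
    df = df.checkpoint(eager=True)       # materializes the data and drops the old plan
    
    df.count()   # subsequent counts/unions/displays no longer re-analyze the full DAG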

6 More Replies
Saf4Databricks
by New Contributor III
  • 420 Views
  • 3 replies
  • 2 kudos

Resolved! Cannot import pyspark.pipelines module

Question: What could be a cause of the following error in my code in a Databricks notebook, and how can we fix the error? I'm using the latest Free Edition of Databricks, which has runtime version 17.2 and PySpark version 4.0.0. Error: ImportError: cannot im...

Latest Reply
dkushari
Databricks Employee
  • 2 kudos

Hi @Saf4Databricks - Are you trying to use it from a standalone Databricks notebook? You should only use it from within a Lakeflow Declarative Pipeline (LDP). The link you shared is about LDP. Here is an example where I used it.
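For readers landing here, a hedged sketch of pipeline code as it would live inside a Lakeflow Declarative (DLT) pipeline rather than a standalone notebook, shown with the classic dlt decorator API as an assumption (the newer pyspark.pipelines surface varies by runtime); paths and table names are placeholders:

    # Sketch: source file attached to a Lakeflow Declarative Pipeline, not run as a plain notebook.
    import dlt
    
    @dlt.table(name="bronze_events", comment="Raw JSON ingested by the pipeline")
    def bronze_events():
        return (
            spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/Volumes/main/landing/events/")
        )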

2 More Replies
Mous92i
by New Contributor II
  • 327 Views
  • 3 replies
  • 2 kudos

Resolved! Liquid Clustering With Merge

Hello, I'm facing severe performance issues with a MERGE INTO on Databricks. merge_condition = """source.data_hierarchy = target.data_hierarchy AND source.sensor_id = target.sensor_id AND source.timestamp = target.timestamp""" The target Delt...
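For reference, a sketch of that merge via the DeltaTable Python API; the merge condition is copied from the post, while the target table name and source_df are placeholders:

    # Sketch: MERGE into the liquid-clustered target table.
    from delta.tables import DeltaTable
    
    target = DeltaTable.forName(spark, "main.silver.sensor_readings")   # placeholder name
    
    (
        target.alias("target")
            .merge(
                source_df.alias("source"),   # source_df: incoming batch DataFrame (hypothetical)
                """source.data_hierarchy = target.data_hierarchy
                   AND source.sensor_id = target.sensor_id
                   AND source.timestamp = target.timestamp""",
            )
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
    )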

Latest Reply
Mous92i
New Contributor II
  • 2 kudos

Thanks for your response

2 More Replies
Bhavana_Y
by New Contributor
  • 231 Views
  • 1 replies
  • 1 kudos

Resolved! Learning Path for Spark Developer Associate

Hello everyone, happy to be a part of the Virtual Journey!! I enrolled in the Associate Spark Developer learning path and completed it in Databricks Academy. Can anyone please confirm whether completing the learning path is enough for obtaining the 50% off voucher for the certifi...

[screenshot attachment: Screenshot (15).png]
Latest Reply
Advika
Databricks Employee
  • 1 kudos

Hello @Bhavana_Y! To be eligible for the incentives, you’ll need to complete one of the pathways mentioned in the Learning Festival post. Based on your screenshot, it looks like you’ve completed all four modules of LEARNING PATHWAY 7: APACHE SPARK DE...
