Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

bhargavabasava
by New Contributor III
  • 3877 Views
  • 3 replies
  • 0 kudos

Resolved! Job compute is taking longer even after using pool

Hi team, We created a workflow and attached it to a job cluster (which is configured to use a compute pool). When we run the pipeline, it takes up to 5 minutes to go into clusterReady state and this is adding latency to our use case. Even with subsequen...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey @bhargavabasava,
Job Cluster + Compute Pools: Long Startup Times
If you're using Job Clusters backed by compute pools, the initial delay (~5 minutes) is usually due to cluster provisioning. While compute pools are designed to reduce cold start tim...
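For reference, a rough sketch of the setup being described (IDs and values are placeholders; check field names against the Jobs API docs):

# Hypothetical job_clusters entry for the Jobs API / asset bundle.
job_cluster_spec = {
    "job_cluster_key": "pooled_cluster",
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "instance_pool_id": "pool-1234567890",         # placeholder worker pool
        "driver_instance_pool_id": "pool-1234567890",  # reuse the pool for the driver
        "num_workers": 2,
    },
}

# The pool only skips VM provisioning if it keeps warm instances available:
pool_settings = {"min_idle_instances": 2}   # idle VMs kept ready between runs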

2 More Replies
dbernabeuplx
by New Contributor II
  • 2646 Views
  • 5 replies
  • 0 kudos

Resolved! How to delete/empty notebook output

I need to clear cell output in Databricks notebooks using dbutils or the API; my requirement is to clear it for data security reasons. That is, given a notebook's PATH, I would like to be able to clear all its outputs, as is done through...

Data Engineering
API
Data
issue
Notebooks
Latest Reply
srinum89
New Contributor III
  • 0 kudos

For a programmatic approach, you can also clear each cell's output individually using the IPython package. Unfortunately, you need to do this in each and every cell:
from IPython.display import clear_output
clear_output(wait=True)
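For the path-based requirement in the original question, one possible (untested) approach is a Workspace API export/import round trip: exporting in SOURCE format keeps only the code, and re-importing it with overwrite clears the stored outputs. A sketch, assuming a Python notebook and a workspace token:

import requests

HOST = "https://<workspace-url>"   # placeholder workspace URL
TOKEN = "<token>"                  # placeholder PAT or OAuth token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def clear_notebook_outputs(path: str) -> None:
    # Export the notebook as SOURCE (code only, no cell outputs), base64-encoded.
    exported = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers=HEADERS,
        params={"path": path, "format": "SOURCE"},
    )
    exported.raise_for_status()
    # Re-import the same source over the original notebook, which drops the outputs.
    resp = requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json={
            "path": path,
            "format": "SOURCE",
            "language": "PYTHON",          # assumption: adjust to the notebook's language
            "content": exported.json()["content"],
            "overwrite": True,
        },
    )
    resp.raise_for_status()

clear_notebook_outputs("/Users/someone@example.com/my_notebook")  # placeholder path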

4 More Replies
amitkamthane
by New Contributor II
  • 1848 Views
  • 3 replies
  • 0 kudos

Resolved! Delete files from databricks Volumes based on trigger

Hi, I noticed there's a file arrival trigger option in the workflow but can't see a delete trigger option. However, let's say I want to delete files from the Databricks volume based on this trigger, and also remove the corresponding records from the bron...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

Currently, Databricks doesn’t offer a built-in file deletion trigger mechanism similar to the file arrival trigger. The file arrival trigger only monitors for new files being added to a location, not for files being deleted.
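Since there's no deletion trigger, a common workaround is a small scheduled reconciliation job. A rough sketch, assuming the bronze table records each row's source file in a source_file column (the path and table names are placeholders):

# Runs inside a Databricks notebook, where `spark` and `dbutils` are predefined.
VOLUME_PATH = "/Volumes/main/landing/raw"     # placeholder Volume path
BRONZE_TABLE = "main.bronze.events"           # placeholder table with a source_file column

# File names currently present in the Volume.
existing_files = {f.name for f in dbutils.fs.ls(VOLUME_PATH)}

# File names the bronze table has already ingested.
ingested_files = {
    row["source_file"]
    for row in spark.table(BRONZE_TABLE).select("source_file").distinct().collect()
}

# Anything ingested but no longer on disk was deleted from the Volume.
orphaned = ingested_files - existing_files
if orphaned:
    in_list = ", ".join(f"'{name}'" for name in orphaned)
    spark.sql(f"DELETE FROM {BRONZE_TABLE} WHERE source_file IN ({in_list})")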

2 More Replies
Ambesh
by New Contributor III
  • 21322 Views
  • 8 replies
  • 1 kudos

Reading external Iceberg table

Hi all, I am trying to read an external Iceberg table. A separate Spark SQL script creates my Iceberg table and now I need to read the Iceberg tables (created outside of Databricks) from my Databricks notebook. Could someone tell me the approach for ...

Latest Reply
Sash
New Contributor II
  • 1 kudos

Hi, I'm facing the same problem. However, when I set the access mode to "No isolation shared" I lose access to the external location where the Iceberg table resides. Is there a way to force Spark to NOT use the catalog even when in the "Standard (formerly ...
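If the cluster has the Apache Iceberg Spark runtime library installed and the access mode allows custom catalogs (which is the sticking point in this thread), a generic Iceberg Hadoop-catalog read looks roughly like this. A sketch only; catalog name, warehouse path, and table are placeholders, and these settings usually belong in the cluster's Spark config:

# Assumes the iceberg-spark-runtime library is available on the cluster.
spark.conf.set("spark.sql.catalog.ext_iceberg", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.ext_iceberg.type", "hadoop")
spark.conf.set("spark.sql.catalog.ext_iceberg.warehouse", "s3://my-bucket/iceberg/warehouse")  # placeholder

# Read the externally created table through that catalog (placeholder names).
df = spark.table("ext_iceberg.db.my_table")
df.show()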

7 More Replies
nielsehlers
by New Contributor
  • 1380 Views
  • 1 replies
  • 1 kudos

from_utc_time gives strange results

I don't understand why from_utc_time(col("original_time"), "Europe/Berlin") changes the timestamp instead of just setting the timezone. That's a non-intuitive behaviour.
spark.conf.set("spark.sql.session.timeZone", "UTC")
from pyspark.sql import Row...

Latest Reply
Advika
Community Manager
  • 1 kudos

Hello @nielsehlers! Just to clarify, PySpark's from_utc_timestamp converts a UTC timestamp to the specified timezone (in this case it's Europe/Berlin), adjusting the actual timestamp value rather than just setting timezone metadata. This happens beca...
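A small illustration of that behaviour (the timestamp value is made up):

from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = (spark.createDataFrame([("2024-01-01 12:00:00",)], ["original_time"])
      .withColumn("original_time", F.to_timestamp("original_time")))

# from_utc_timestamp treats the input as a UTC instant and returns the Berlin
# wall-clock time, so the stored value shifts by +1 hour (CET in January).
df.withColumn("berlin_wall_clock",
              F.from_utc_timestamp("original_time", "Europe/Berlin")).show(truncate=False)
# original_time: 2024-01-01 12:00:00  ->  berlin_wall_clock: 2024-01-01 13:00:00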

al_rammos
by New Contributor II
  • 2314 Views
  • 2 replies
  • 0 kudos

DROP VIEW IF EXISTS Failing on Dynamically Generated Temporary View in Databricks 15.4 LTS

Hello everyone, I'm experiencing a very strange issue with temporary views in Databricks 15.4 LTS that did not occur in 13.3. I have a workflow where I create a temporary view, run a query against it, and then drop it using a DROP VIEW IF EXISTS comma...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @al_rammos, Thanks for your detailed comments and replication of the issue. There have been known issues in recent DBR versions where dynamically created temporary views are not being properly resolved during certain operations due to incorrect sess...
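As a stopgap while that's investigated, routing the create/drop through the session catalog API rather than SQL DDL keeps everything on one SparkSession. A sketch with placeholder names, not an official fix:

# Creation, query, and drop all go through the same SparkSession's catalog.
spark.range(10).createOrReplaceTempView("tmp_dynamic_view")

row_count = spark.sql("SELECT COUNT(*) AS n FROM tmp_dynamic_view").first()["n"]

# dropTempView returns False instead of raising if the view is already gone,
# which gives the same safety as DROP VIEW IF EXISTS.
spark.catalog.dropTempView("tmp_dynamic_view")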

1 More Replies
Volker
by Contributor
  • 2510 Views
  • 4 replies
  • 1 kudos

Retention Period for Parquet Data in e.g. S3 After Dropping a Managed Delta Table

Hey community, I have a question regarding the data retention policy for managed Delta tables stored e.g. in Amazon S3. Specifically: when a managed Delta table is dropped, what is the retention period for the underlying Parquet data files in S3 befor...

Latest Reply
Volker
Contributor
  • 1 kudos

Thanks for the resources! So, to adjust how long Parquet files are stored in the S3 bucket after I drop a table, I would need to adjust the delta.logRetentionDuration, right? And since dropping a Delta table marks the files for deletion after 7 days, I...
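For reference, the two Delta retention properties discussed in this thread are ordinary table properties; whether they apply after a DROP of a managed table is exactly what's being asked here, so treat this as a sketch of the syntax only (table name and intervals are placeholders):

spark.sql("""
    ALTER TABLE main.myschema.my_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 14 days'
    )
""")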

3 More Replies
sandy311
by New Contributor III
  • 6872 Views
  • 5 replies
  • 5 kudos

if else conditions in databricks asset bundles

Can I use if-else conditions in databricks.yml and parameterize my asset bundles similarly to Azure Pipelines YAML?

Latest Reply
davidcardoner
New Contributor II
  • 5 kudos

Can we define a task based on if-else logic, using a variable passed at bundle deploy time?

4 More Replies
Oliver_Angelil
by Valued Contributor II
  • 16792 Views
  • 10 replies
  • 1 kudos

How to use the git CLI in databricks?

After making some changes in my feature branch, I have committed and pushed (to Azure Devops) some work (note I have not yet raised a PR or merge to any other branch). Many of the files I committed are data files and so I would like to reverse the co...

Latest Reply
turagittech
Contributor
  • 1 kudos

I would love to get an update on this. Git commands in some form would be outstanding. I have the same issue: I have changed directory to the workspace, ls shows the files in the repository, but git status fails. -rwxrwxrwx 1 root root 2386 Feb 13 01:1...

9 More Replies
Katalin555
by New Contributor II
  • 843 Views
  • 1 replies
  • 0 kudos

Found a potential bug in Job Details/Schedule and Trigger section

One of our jobs is scheduled to run at 4:30 AM based on the GMT+1 timezone, which is visible if we click on Edit trigger (Picture 1), but under job details it is shown as if it were scheduled to run at 4:30 AM UTC (Picture 2). Based on previous runs ...

Katalin555_0-1743586717525.jpeg Katalin555_1-1743586768998.jpeg Katalin555_3-1743587093136.png
Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey @Katalin555, Even though in the “Edit Trigger” panel (Picture 2) the time is shown in the local timezone (e.g. GMT+1), once the schedule is saved and viewed under job details (Picture 1), Databricks always displays it as UTC — without making it visual...
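If you want to confirm what is actually stored, the timezone is explicit on the job's schedule object. A quick sketch via the Jobs API (URL, token, and job ID are placeholders):

import requests

HOST = "https://<workspace-url>"   # placeholder
TOKEN = "<token>"                  # placeholder

job = requests.get(
    f"{HOST}/api/2.1/jobs/get",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"job_id": 123},        # placeholder job ID
).json()

schedule = job["settings"]["schedule"]
# e.g. {'quartz_cron_expression': '0 30 4 * * ?', 'timezone_id': 'Europe/Paris', ...}
print(schedule["quartz_cron_expression"], schedule["timezone_id"])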

Guigui
by New Contributor II
  • 875 Views
  • 2 replies
  • 0 kudos

Package installation for multi-tasks job

I have a job with the same task to be executed twice with two sets of parameters. Each task runs by cloning a git repo, then installing it locally and running a notebook from this repo. However, as each task clones the same repo, I was wonderi...

Latest Reply
Guigui
New Contributor II
  • 0 kudos

That's what I've done, but I find it less elegant than setting up an environment and sharing it across multiple tasks. It seems to be impossible (unless I build a wheel file, which I don't want to do) as tasks do not share environments, but anyway, as they run in p...
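One pattern that avoids cloning the repo in every task (a sketch, assuming the repo is pip-installable and reachable from the cluster; URL and ref are placeholders) is installing it straight from Git at the top of each task's notebook:

# At the top of each task's notebook:
%pip install "git+https://github.com/my-org/my-repo.git@main"

# Alternatively, if the tasks share one job cluster, attach the package as a
# cluster-scoped library so every task sees the same environment.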

1 More Replies
Eric_Kieft
by New Contributor III
  • 2758 Views
  • 5 replies
  • 4 kudos

Centralized Location of Table History/Timestamps in Unity Catalog

Is there a centralized location in Unity Catalog that retains the table history, specifically the last timestamp, for managed Delta tables? DESCRIBE HISTORY will provide it for a specific table, but I would like to get it for a number of tables. inform...

Latest Reply
Priyanka_Biswas
Databricks Employee
  • 4 kudos

Hi @Eric_Kieft @noorbasha534, system.access.table_lineage includes a record for each read or write event on a Unity Catalog table or path. This includes but is not limited to job runs, notebook runs, and dashboards updated with the read or write even...
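A sketch of the kind of query this enables, taking the latest write event per table (verify column names against the table_lineage schema in your workspace):

last_write = spark.sql("""
    SELECT target_table_full_name,
           MAX(event_time) AS last_write_time
    FROM system.access.table_lineage
    WHERE target_table_full_name IS NOT NULL
    GROUP BY target_table_full_name
""")
last_write.show(truncate=False)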

4 More Replies
William_Scardua
by Valued Contributor
  • 2050 Views
  • 1 replies
  • 1 kudos

Resolved! Upsert from Databricks to CosmosDB

Hi guys, I'm adjusting a data upsert process from Databricks to CosmosDB using the .jar connector. As the load is very large, do you know if it's possible to change only the fields that have been modified? Best regards

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

Yes, you can update only the modified fields in your Cosmos DB documents from Databricks using the Partial Document Update feature (also known as Patch API). This is particularly useful for large documents where sending the entire document for update...
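With the Cosmos DB Spark connector, partial updates are selected through the write strategy options. A sketch only: account, container, and column names are placeholders, changed_fields_df stands for a DataFrame holding just the modified fields plus id and partition key, and the option names/patch syntax should be checked against your connector version:

cosmos_cfg = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",  # placeholder
    "spark.cosmos.accountKey": "<key>",                                            # placeholder
    "spark.cosmos.database": "mydb",                                               # placeholder
    "spark.cosmos.container": "mycontainer",                                       # placeholder
    # Patch (partial update) instead of replacing whole documents.
    "spark.cosmos.write.strategy": "ItemPatch",
    "spark.cosmos.write.patch.defaultOperationType": "Set",
    "spark.cosmos.write.patch.columnConfigs": "[col(price).op(set), col(quantity).op(set)]",
}

(changed_fields_df            # must still carry the id and partition key columns
    .write
    .format("cosmos.oltp")
    .options(**cosmos_cfg)
    .mode("append")
    .save())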

397973
by New Contributor III
  • 3549 Views
  • 1 replies
  • 1 kudos

Resolved! What's the best way to get from Python dict > JSON > PySpark and apply as a mapping to a dataframe?

I'm migrating code from Python Linux to Databricks PySpark. I have many mappings like this:
{"main": {"honda": 1.0, "toyota": 2.9, "BMW": 5.77, "Fiat": 4.5}}
I exported using json.dump, saved to s3 and was able to import with sp...

397973_0-1743620626332.png
Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

For migrating your Python dictionary mappings to PySpark, you have several good options. Let's examine the approaches and identify the best solution. Using F.create_map (Your Current Approach) Your current approach using `F.create_map` is actually qu...
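For concreteness, a minimal version of the create_map approach being discussed (names and values mirror the question, not exact code):

from itertools import chain
from pyspark.sql import functions as F

mapping = {"honda": 1.0, "toyota": 2.9, "BMW": 5.77, "Fiat": 4.5}

# create_map takes alternating key/value literals: lit(k1), lit(v1), lit(k2), ...
mapping_col = F.create_map(*[F.lit(x) for x in chain.from_iterable(mapping.items())])

df = spark.createDataFrame([("honda",), ("Fiat",), ("tesla",)], ["brand"])
df = df.withColumn("score", F.element_at(mapping_col, F.col("brand")))
df.show()
# honda -> 1.0, Fiat -> 4.5, tesla -> NULL (no entry in the mapping)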

srinum89
by New Contributor III
  • 972 Views
  • 1 replies
  • 0 kudos

Resolved! Workflow job failing with source as Git Provider (remote github repo) with SP

Facing an issue using the GitHub App when running a job with source as "Git provider" under a Service Principal. Since we can't use a PAT with an SP on GitHub, I am using the GitHub App for authentication. I followed the documentation below but it still gives a permission issue. ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 0 kudos

When running a Databricks workflow with a Git provider source using a Service Principal, you’re encountering permission issues despite using the GitHub App for authentication. This is a common challenge because Service Principals cannot use Personal ...
