Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Rajdeepak
by New Contributor
  • 1717 Views
  • 1 reply
  • 0 kudos

How to restart failed spark stream job from the failure point

I am setting up an ETL process using PySpark. My input is a Kafka stream and I am writing output to multiple sinks (one into Kafka and another into cloud storage). I am writing checkpoints to cloud storage. The issue I am facing is that, whenever m...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @Rajdeepak, To address data redundancy issues caused by reprocessing during application restarts, consider these strategies: Ensure proper checkpointing by configuring and protecting your checkpoint directory; manage Kafka offsets correctly by set...
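A minimal Structured Streaming sketch of the checkpointing setup described above; the broker address, topic names, and storage paths are placeholders, not taken from the thread. Each sink gets its own checkpoint location, so a restarted query resumes from the last committed offsets instead of reprocessing from the beginning.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read from Kafka; startingOffsets only applies on the very first run --
# after that, offsets come from the checkpoint.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "input-topic")                  # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Sink 1: back to Kafka, with its own checkpoint directory.
kafka_sink = (
    events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/kafka-sink/")    # placeholder path
    .start()
)

# Sink 2: cloud storage (Delta), with a separate checkpoint directory.
storage_sink = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/storage-sink/")  # placeholder path
    .start("s3://my-bucket/bronze/events/")
)

spark.streams.awaitAnyTermination()
```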

reachrishav
by New Contributor II
  • 1679 Views
  • 1 reply
  • 0 kudos

What is the equivalent of "if exists()" in databricks sql?

What is the equivalent of the below SQL Server syntax in Databricks SQL? There are cases where I need to execute a block of SQL code on certain conditions. I know this can be achieved with spark.sql, but the problem with spark.sql() is it does not p...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @reachrishav, In Databricks SQL, you can replicate SQL Server's conditional logic using `CASE` statements and `MERGE` operations. Since Databricks SQL doesn't support `IF EXISTS` directly, you can create a temporary view to check your condition an...
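A sketch of the temp-view approach mentioned above; the table names, key, and condition are invented for illustration. The MERGE source is a view that is empty unless the condition holds, so the statement becomes a no-op when the check fails.

```python
# Hypothetical example of gating a MERGE on an existence-style condition.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW pending_orders AS
    SELECT * FROM staging_orders
    WHERE status = 'NEW'            -- the "IF EXISTS" condition
""")

spark.sql("""
    MERGE INTO target_orders AS t
    USING pending_orders AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```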

ADB0513
by New Contributor III
  • 2669 Views
  • 1 reply
  • 0 kudos

Pass variable from one notebook to another

I have a main notebook where I am setting a Python variable to the name of the catalog I want to work in. I then call another notebook, using %run, which runs an INSERT INTO via a SQL command where I want to specify the catalog using the catalog v...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @ADB0513, To pass variables between notebooks in Databricks, you can use three main methods: **Widgets**, where you create and retrieve parameters using `dbutils.widgets` in both notebooks; **spark.conf**, where you set and get configuration param...
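A brief sketch of the widgets and `spark.conf` options described above; the notebook structure, catalog name, and table names are assumptions for illustration.

```python
# --- Main notebook ---
catalog = "dev_catalog"                          # hypothetical catalog name
spark.conf.set("my.pipeline.catalog", catalog)   # spark.conf: shared because %run reuses the session
# %run ./child_notebook

# --- Child notebook ---
dbutils.widgets.text("catalog", "")              # widgets: useful when calling via dbutils.notebook.run
catalog = dbutils.widgets.get("catalog") or spark.conf.get("my.pipeline.catalog")
spark.sql(f"INSERT INTO {catalog}.bronze.events SELECT * FROM {catalog}.staging.events")
```

With %run the child notebook also shares the Python namespace, so the plain variable defined in the main notebook is visible there as well.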

semsim
by Contributor
  • 3590 Views
  • 1 reply
  • 1 kudos

List and iterate over files in Databricks workspace

Hi DE Community, I need to be able to list/iterate over a set of files in a specific directory within the Databricks workspace. For example: "/Workspace/SharedFiles/path/to/file_1" ... "/Workspace/SharedFiles/path/to/file_n". Thanks for your direction and ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 1 kudos

Hi @semsim, you can use the file system utility (dbutils.fs):
Databricks Utilities (dbutils) reference | Databricks on AWS
Work with files on Databricks | Databricks on AWS
dbutils.fs.ls("file:/Workspace/Users/<user-folder>/")
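And a small follow-up sketch for the iteration part; the folder path and the filter are placeholders.

```python
# List workspace files and iterate; the file:/ prefix targets the workspace filesystem.
for f in dbutils.fs.ls("file:/Workspace/SharedFiles/path/to/"):
    if f.name.endswith(".csv"):       # filter however you need
        print(f.path, f.size)
```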

Zeruno
by New Contributor II
  • 1810 Views
  • 1 reply
  • 0 kudos

DLT - Get pipeline_id and update_id

I need to insert pipeline_id and update_id in my Delta Live Table (DLT), the point being to know which pipeline created which row. How can I obtain this information? I know you can get job_id and run_id from widgets, but I don't know if these are the s...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 0 kudos

Hi @Zeruno, those values are rather static. Maybe you can design a process that, as a first step, extracts that information from the List Pipelines API and saves it to a Delta table: List pipelines | Pipelines API | REST API reference | Databricks on AWS. Then in...
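A rough sketch of that first step, assuming a token stored in a secret scope and the GET /api/2.0/pipelines endpoint; the response field names follow the List Pipelines reference as I understand it (verify against the docs), and the target table is hypothetical.

```python
import requests

host = "https://<workspace-url>"                                        # assumed workspace URL
token = dbutils.secrets.get(scope="my-scope", key="databricks-token")   # assumed secret

resp = requests.get(f"{host}/api/2.0/pipelines",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
pipelines = resp.json().get("statuses", [])        # pagination omitted for brevity

rows = [
    (p.get("pipeline_id"),
     p.get("name"),
     (p.get("latest_updates") or [{}])[0].get("update_id"))
    for p in pipelines
]

# Persist a snapshot so DLT code can join pipeline metadata onto its rows.
(spark.createDataFrame(rows, "pipeline_id string, name string, latest_update_id string")
      .write.mode("overwrite").saveAsTable("ops.pipeline_registry"))    # hypothetical table
```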

vadi
by New Contributor
  • 670 Views
  • 2 replies
  • 0 kudos

csv file processing

What's the best possible solution to process a CSV file in Databricks? Please consider scalability, optimization, and QA, and give me the best solution...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @vadi, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community. If the response resolves your issue, kindly mark it as the accepted solution. This will help...

1 More Reply
Shazaamzaa
by New Contributor III
  • 1915 Views
  • 1 reply
  • 0 kudos

Setup dbt-core with Azure Entra ID

Hey team, I'm trying to standardize the development environment setup in our team. I've written up a shell script that I want our devs to run in WSL2 after setup. The shell script does the following: 1. Set up Azure CLI - install and authenticate; 2. Ins...

Latest Reply
Shazaamzaa
New Contributor III
  • 0 kudos

Hey @Retired_mod thanks for the response. I persisted a little more with the logs and the issue appears to be related to WSL2 not having a backend credential manager to handle management of tokens supplied by the OAuth process. To be honest, this is ...

acj1459
by New Contributor
  • 595 Views
  • 0 replies
  • 0 kudos

Azure Databricks Data Load

Hi All, I have 10 tables on an on-prem MS SQL DB and want to load the data from those 10 tables incrementally into Bronze Delta tables as append-only. From Bronze to Silver, I want to use a merge query to load the latest records into the Silver Delta tables. Whatever latest...
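A minimal sketch of that Bronze-to-Silver step; the table names, the `customer_id` key, and the `load_ts` ordering column are assumptions for illustration. It keeps the newest Bronze row per key, then merges it into Silver.

```python
from pyspark.sql import functions as F, Window

bronze = spark.table("bronze.customer")

# Newest row per business key from the appended Bronze data.
w = Window.partitionBy("customer_id").orderBy(F.col("load_ts").desc())
latest = (bronze.withColumn("rn", F.row_number().over(w))
                .filter("rn = 1")
                .drop("rn"))
latest.createOrReplaceTempView("bronze_latest")

spark.sql("""
    MERGE INTO silver.customer AS t
    USING bronze_latest AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```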

MRTN
by New Contributor III
  • 6424 Views
  • 3 replies
  • 2 kudos

Resolved! Configure multiple source paths for auto loader

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Morten Stakkeland: Yes, it's possible to configure an autoloader to read from multiple locations. You can define multiple CloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since ...
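A minimal sketch of that setup, with two cloudFiles sources unioned into one sink; the container names, paths, and the JSON file format are placeholders, not from the thread.

```python
def read_container(path, schema_loc):
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")            # placeholder file format
            .option("cloudFiles.schemaLocation", schema_loc)
            .load(path))

stream_a = read_container(
    "abfss://container-a@mystorage.dfs.core.windows.net/events/",
    "abfss://meta@mystorage.dfs.core.windows.net/_schemas/events_a",
)
stream_b = read_container(
    "abfss://container-b@mystorage.dfs.core.windows.net/events/",
    "abfss://meta@mystorage.dfs.core.windows.net/_schemas/events_b",
)

# Identical schemas, so a union feeds a single sink with one checkpoint.
(stream_a.unionByName(stream_b)
    .writeStream
    .option("checkpointLocation", "abfss://meta@mystorage.dfs.core.windows.net/_chk/events")
    .toTable("bronze.events"))
```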

2 More Replies
N_M
by Contributor
  • 21737 Views
  • 7 replies
  • 4 kudos

Resolved! use job parameters in scripts

Hi Community, I did some research, but I wasn't lucky, and I'm a bit surprised I can't find anything about it. So, I would simply like to access the job parameters when using Python scripts (not notebooks). My flow doesn't use notebooks, but I still need to dri...

Latest Reply
N_M
Contributor
  • 4 kudos

The only working workaround I found was provided in another thread: Re: Retrieve job-level parameters in Python - Databricks Community - 44720. I will repost it here (thanks @julio_resende). You need to push your parameters down to the task level. E.g.: C...
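A small sketch of what that can look like for a Python script task; the parameter name and the `{{job.parameters.run_date}}` dynamic value reference in the task's parameters list are assumptions, not taken from the thread.

```python
# my_script.py -- a Python script task, with task parameters assumed to be set as:
#   ["--run-date", "{{job.parameters.run_date}}"]
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--run-date", required=True)
args = parser.parse_args()

print(f"Processing data for {args.run_date}")
```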

6 More Replies
Shiva3
by New Contributor III
  • 1313 Views
  • 2 replies
  • 0 kudos

How to know the actual size of Delta and non-Delta tables, and the number of files that actually exist on S3.

I have a set of Delta and non-Delta tables whose data is on AWS S3. I want to know the actual total size of my Delta and non-Delta tables, excluding files belonging to operations like DELETE, VACUUM, etc., and I also need to know how many files each Delta versi...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @Shiva3, To manage the size of Delta and non-Delta tables on AWS S3, excluding irrelevant files, start by using `DESCRIBE HISTORY` to monitor Delta table metrics and `VACUUM` to clean up old files, setting a retention period as needed. For non-Del...
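A short sketch of pulling those numbers per table (table names are placeholders): `DESCRIBE DETAIL` reports only the current version's files, while `DESCRIBE HISTORY` exposes per-version operation metrics.

```python
tables = ["main.sales.orders", "main.sales.customers"]   # hypothetical table names

for t in tables:
    # Current-version size and file count (removed/vacuumable files are excluded).
    detail = spark.sql(f"DESCRIBE DETAIL {t}").select("numFiles", "sizeInBytes").first()
    print(t, detail.numFiles, "files,", round(detail.sizeInBytes / 1024**3, 2), "GiB")

    # Per-version file counts and sizes live in operationMetrics.
    spark.sql(f"DESCRIBE HISTORY {t}").select("version", "operationMetrics").show(truncate=False)
```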

1 More Reply
a-sky
by New Contributor II
  • 2013 Views
  • 1 reply
  • 1 kudos

Databricks job stalls without error, unable to pinpoint the cause, all compute metrics seem OK

I have a job that gets stuck on "Determining DBIO File fragment" and I have not been able to figure out why this job keeps getting stuck. I monitor the job cluster metrics throughout the job and it doesn't seem like it's hitting any bottlenecks with m...

[Attached screenshots: job cluster metrics]
Latest Reply
Retired_mod
Esteemed Contributor III
  • 1 kudos

Hi @a-sky, This message indicates that Databricks is figuring out which file fragments are cached, which can be slow, especially with frequent cluster scaling. To address this, you can try disabling delta caching with `spark.conf.set("spark.databrick...
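Assuming the truncated setting above refers to the Databricks disk (DBIO) cache, the toggle would look like the sketch below; confirm the exact key against the caching docs for your runtime before relying on it.

```python
# Assumption: the disk/DBIO cache toggle is what the truncated reply refers to.
spark.conf.set("spark.databricks.io.cache.enabled", "false")
```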

DMehmuda
by New Contributor
  • 2114 Views
  • 1 reply
  • 0 kudos

Issue with round off value while loading to delta table

I have a float datatype column in a Delta table, and the data to be loaded should be rounded off to 2 decimal places. I'm casting the column to the DECIMAL(18,10) type and then using the round function from pyspark.sql.functions for rounding off values to 2 decimal p...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @DMehmuda, The issue arises because floating-point numbers in Delta tables can retain more decimal places than expected. To ensure values are stored with the correct precision, explicitly cast the column to `DECIMAL(18,2)` before writing to the De...
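A small PySpark sketch of that fix (the `amount` column and target table are invented for illustration): round first, then cast so the stored type carries exactly two decimal places.

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 12.3456789), (2, 0.005)], "id INT, amount DOUBLE")

# Round, then cast to DECIMAL(18,2) so the table column matches the rounded value.
df = df.withColumn("amount", F.round("amount", 2).cast("decimal(18,2)"))

df.write.format("delta").mode("append").saveAsTable("finance.payments")   # hypothetical target
```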

prem14f
by New Contributor II
  • 1340 Views
  • 1 reply
  • 0 kudos

Handling Concurrent Writes to a Delta Table by delta-rs and Databricks Spark Job

Hi @dennyglee, @Retired_mod. If I am writing data into a Delta table using delta-rs and a Databricks job, but I lose some transactions, how can I handle this? Given that Databricks runs a commit service and delta-rs uses DynamoDB for transaction logs, ...

Latest Reply
Retired_mod
Esteemed Contributor III
  • 0 kudos

Hi @prem14f, To manage lost transactions, implement retry logic with automatic retries and ensure idempotent writes to avoid duplication. For concurrent writers, use optimistic concurrency control, which allows for conflict detection and resolution d...
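One possible shape for the retry part, hedged heavily: conflict errors from concurrent Delta writers surface in PySpark as Py4J exceptions, so this sketch matches on the message text, and the DataFrame and table names are invented.

```python
import time

def write_with_retry(df, target_table, max_retries=3):
    """Append to a Delta table, retrying on apparent concurrent-write conflicts."""
    for attempt in range(1, max_retries + 1):
        try:
            df.write.format("delta").mode("append").saveAsTable(target_table)
            return
        except Exception as e:
            if "Concurrent" in str(e) and attempt < max_retries:
                time.sleep(2 ** attempt)     # simple backoff before retrying
            else:
                raise

# write_with_retry(batch_df, "events.landing")   # hypothetical usage
```

For true idempotence, pairing the retry with a MERGE keyed on a batch or transaction id (rather than a blind append) prevents duplicates when a retry lands after a partially applied attempt.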

