Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Rajdeepak
by New Contributor
  • 261 Views
  • 1 reply
  • 0 kudos

How to restart a failed Spark stream job from the failure point

I am setting up an ETL process using PySpark. My input is a Kafka stream and I am writing output to multiple sinks (one into Kafka and another into cloud storage). I am writing checkpoints on the cloud storage. The issue I am facing is that, whenever m...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Rajdeepak, To address data redundancy issues caused by reprocessing during application restarts, consider these strategies: Ensure proper checkpointing by configuring and protecting your checkpoint directory; manage Kafka offsets correctly by set...

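For readers looking for a concrete starting point, here is a minimal PySpark sketch of the checkpointing advice above, assuming a Databricks notebook where `spark` is predefined; the broker, topic, and paths are placeholders. As long as each sink keeps its own stable checkpoint location, a restarted query resumes from the last committed offsets rather than reprocessing the stream.

```python
# Illustrative sketch only - broker, topic, and paths are hypothetical.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
      .option("subscribe", "input-topic")                  # placeholder
      .load())

# One writeStream per sink, each with its own fixed checkpointLocation.
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "s3://my-bucket/checkpoints/storage_sink")  # keep stable across restarts
   .start("s3://my-bucket/output/events"))
```
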
reachrishav
by New Contributor II
  • 312 Views
  • 1 reply
  • 0 kudos

What is the equivalent of "if exists()" in databricks sql?

What is the equivalent of the below SQL Server syntax in Databricks SQL? There are cases where I need to execute a block of SQL code on certain conditions. I know this can be achieved with spark.sql, but the problem with spark.sql() is it does not p...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @reachrishav, In Databricks SQL, you can replicate SQL Server's conditional logic using `CASE` statements and `MERGE` operations. Since Databricks SQL doesn't support `IF EXISTS` directly, you can create a temporary view to check your condition an...

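As a hedged illustration of the pattern described above (driving the conditional logic from Python with spark.sql, since Databricks SQL has no direct IF EXISTS), the table and predicate below are hypothetical:

```python
# Probe for a matching row, then run the conditional statement only if one exists.
row_exists = spark.sql("""
    SELECT 1 FROM my_catalog.my_schema.orders
    WHERE status = 'pending' LIMIT 1
""").count() > 0

if row_exists:
    spark.sql("""
        UPDATE my_catalog.my_schema.orders
        SET status = 'processed'
        WHERE status = 'pending'
    """)
```
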
ADB0513
by New Contributor III
  • 420 Views
  • 1 reply
  • 0 kudos

Pass variable from one notebook to another

I have a main notebook where I am setting a Python variable to the name of the catalog I want to work in. I then call another notebook, using %run, which runs an INSERT INTO using a SQL command where I want to specify the catalog using the catalog v...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @ADB0513, To pass variables between notebooks in Databricks, you can use three main methods: **Widgets**, where you create and retrieve parameters using `dbutils.widgets` in both notebooks; **spark.conf**, where you set and get configuration param...

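A minimal sketch of the spark.conf approach mentioned in the reply; the configuration key, catalog, and table names are hypothetical:

```python
# In the main notebook: stash the catalog name in the shared Spark conf.
spark.conf.set("my.pipeline.catalog", "dev_catalog")   # hypothetical key/value

# In the notebook invoked via %run: read it back and use it in SQL.
catalog = spark.conf.get("my.pipeline.catalog")
spark.sql(f"INSERT INTO {catalog}.my_schema.target_table "
          f"SELECT * FROM {catalog}.my_schema.staging_table")
```
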
Prashanth24
by New Contributor III
  • 329 Views
  • 1 reply
  • 0 kudos

Error connecting Databricks Notebook using managed identity from Azure Data Factory

I am trying to connect to a Databricks notebook using the managed identity authentication type from Azure Data Factory. Below are the settings used. The error message is appended at the bottom of this message. With the same settings but with a different authenticat...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Prashanth24, To resolve this, ensure the resource URL is correctly set, grant the Data Factory Managed Identity Contributor role in the Databricks workspace, verify the Databricks workspace is registered in the correct Azure AD tenant, confirm th...

semsim
by Contributor
  • 269 Views
  • 1 reply
  • 0 kudos

List and iterate over files in Databricks workspace

Hi DE Community, I need to be able to list/iterate over a set of files in a specific directory within the Databricks workspace. For example: "/Workspace/SharedFiles/path/to/file_1" ... "/Workspace/SharedFiles/path/to/file_n". Thanks for your direction and ...

Latest Reply
szymon_dybczak
Contributor
  • 0 kudos

Hi @semsim, you can use the file system utility (dbutils.fs):
Databricks Utilities (dbutils) reference | Databricks on AWS
Work with files on Databricks | Databricks on AWS
dbutils.fs.ls("file:/Workspace/Users/<user-folder>/")

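Building on the reply, a short sketch with placeholder paths; on recent runtimes the workspace tree is also visible on the driver's local filesystem, so plain Python works too:

```python
import os

# Option 1: dbutils.fs with the file:/ scheme, as in the reply.
for entry in dbutils.fs.ls("file:/Workspace/SharedFiles/path/to/"):
    print(entry.path)

# Option 2: plain Python over the same (placeholder) directory.
for name in os.listdir("/Workspace/SharedFiles/path/to/"):
    print(name)
```
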
Zeruno
by New Contributor
  • 215 Views
  • 1 reply
  • 0 kudos

DLT - Get pipeline_id and update_id

I need to insert pipeline_id and update_id in my Delta Live Table (DLT), the point being to know which pipeline created which row. How can I obtain this information? I know you can get job_id and run_id from widgets but I don't know if these are the s...

Latest Reply
szymon_dybczak
Contributor
  • 0 kudos

Hi @Zeruno, those values are rather static. Maybe you can design a process that, as a first step, extracts the information from the List Pipelines API and saves it in a Delta table.
List pipelines | Pipelines API | REST API reference | Databricks on AWS
Then in...

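A hedged sketch of the approach in the reply: call the List Pipelines REST endpoint and persist the result to a Delta table for later joins. The host, secret scope, and target table are hypothetical, and the response is parsed defensively since its exact shape may vary.

```python
import requests

host = "https://<workspace-host>"                               # placeholder
token = dbutils.secrets.get("my-scope", "databricks-pat")       # hypothetical secret scope/key

resp = requests.get(f"{host}/api/2.0/pipelines",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
pipelines = resp.json().get("statuses", [])

if pipelines:
    # Keep a snapshot of pipeline metadata for joining against DLT output later.
    spark.createDataFrame(pipelines).write.mode("overwrite").saveAsTable("ops.dlt_pipelines")
```
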
vadi
by New Contributor
  • 175 Views
  • 2 replies
  • 0 kudos

CSV file processing

What's the best possible solution to process CSV files in Databricks? Please consider scalability, optimization, and QA, and give me the best solution...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @vadi, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community. If the response resolves your issue, kindly mark it as the accepted solution. This will help...

1 More Reply
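
Since the question above asks for a scalable way to process CSV files, here is a hedged starting point; the schema, path, and table name are placeholders. Declaring the schema explicitly avoids a full inferSchema scan, and landing the result in a Delta table keeps downstream QA and optimization simple.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("order_date", DateType()),
])

df = (spark.read
      .format("csv")
      .option("header", "true")
      .schema(schema)                      # explicit schema instead of inferSchema
      .load("s3://my-bucket/raw/orders/")) # placeholder path

df.write.mode("append").saveAsTable("main.bronze.orders")
```
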
Shazaamzaa
by New Contributor III
  • 381 Views
  • 2 replies
  • 0 kudos

Resolved! Setup dbt-core with Azure Entra ID

Hey team, I'm trying to standardize the development environment setup in our team. I've written up a shell script that I want our devs to run in WSL2 after setup. The shell script does the following: 1. Set up Azure CLI - install and authenticate. 2. Ins...

Latest Reply
Shazaamzaa
New Contributor III
  • 0 kudos

Hey @Kaniz_Fatma thanks for the response. I persisted a little more with the logs and the issue appears to be related to WSL2 not having a backend credential manager to handle management of tokens supplied by the OAuth process. To be honest, this is ...

1 More Reply
ckwan48
by New Contributor III
  • 13672 Views
  • 6 replies
  • 3 kudos

Resolved! How to prevent my cluster to shut down after inactivity

Currently, I am running a cluster that is set to terminate after 60 minutes of inactivity. However, in one of my notebooks, one of the cells is still running. How can I prevent this from happening, if I want my notebook to run overnight without monito...

Latest Reply
AmanSehgal
Honored Contributor III
  • 3 kudos

If a cell is already running (I assume it's a streaming operation), then I think it doesn't mean that the cluster is inactive. The cluster should be running if a cell is running on it. On the other hand, if you want to keep running your clusters for ...

5 More Replies
acj1459
by New Contributor
  • 153 Views
  • 0 replies
  • 0 kudos

Azure Databricks Data Load

Hi All, I have 10 tables present in an on-prem MS SQL DB and want to load the data from those 10 tables incrementally into Bronze Delta tables as append-only. From Bronze to Silver, using a merge query, I want to load the latest records into the Silver Delta tables. Whatever latest...

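Since this post has no replies yet, here is a hedged sketch of the Bronze-to-Silver step it describes: keep only the latest record per key and MERGE it into the Silver table. Table and column names are placeholders.

```python
# Dedupe Bronze to the newest record per business key, then upsert into Silver.
spark.sql("""
  MERGE INTO silver.customer AS s
  USING (
    SELECT customer_id, name, load_ts
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY load_ts DESC) AS rn
      FROM bronze.customer
    ) ranked
    WHERE rn = 1
  ) AS b
  ON s.customer_id = b.customer_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```
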
MRTN
by New Contributor III
  • 3904 Views
  • 3 replies
  • 2 kudos

Resolved! Configure multiple source paths for auto loader

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Morten Stakkeland: Yes, it's possible to configure an autoloader to read from multiple locations. You can define multiple CloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since ...

2 More Replies
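
A hedged sketch of the reply's suggestion: one Auto Loader stream per container, both writing to the same target table but each with its own checkpoint and schema location. Storage paths, file format, and table name are placeholders.

```python
def start_autoloader(source_path: str, checkpoint_path: str):
    # One cloudFiles source per container; separate checkpoints keep the streams independent.
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
            .load(source_path)
            .writeStream
            .option("checkpointLocation", checkpoint_path)
            .trigger(availableNow=True)
            .toTable("main.bronze.events"))

start_autoloader("abfss://container-a@mystorage.dfs.core.windows.net/data",
                 "abfss://checkpoints@mystorage.dfs.core.windows.net/events_a")
start_autoloader("abfss://container-b@mystorage.dfs.core.windows.net/data",
                 "abfss://checkpoints@mystorage.dfs.core.windows.net/events_b")
```
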
jfvizoso
by New Contributor II
  • 7288 Views
  • 5 replies
  • 0 kudos

Can I pass parameters to a Delta Live Table pipeline at running time?

I need to execute a DLT pipeline from a Job, and I would like to know if there is any way of passing a parameter. I know you can have settings in the pipeline that you use in the DLT notebook, but it seems you can only assign values to them when crea...

Latest Reply
lprevost
Contributor
  • 0 kudos

This seems to be the key to this question: parameterize for dlt. My understanding of this is that you can add the parameter either in the DLT settings UI via the Advanced Config / Add Configuration key-value dialog, or via the corresponding pipeline set...

4 More Replies
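
A hedged sketch of the configuration approach in the reply: a key added under the pipeline's Configuration (Advanced settings, or the equivalent pipeline settings JSON) can be read inside the DLT notebook with spark.conf. The key name, default value, and tables are hypothetical.

```python
import dlt

# Read the pipeline configuration value, with a fallback default.
start_date = spark.conf.get("mypipeline.start_date", "2024-01-01")  # hypothetical key

@dlt.table
def filtered_events():
    # Use the parameter to drive the transformation.
    return spark.read.table("bronze.events").where(f"event_date >= '{start_date}'")
```
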
rushi29
by New Contributor II
  • 356 Views
  • 2 replies
  • 0 kudos

sparkContext in Runtime 15.3

Hello All, our Azure Databricks cluster is running under the "Legacy Shared Compute" policy with the 15.3 runtime. One of the Python notebooks is used to connect to an Azure SQL database to read/insert data. The following snippet of code is responsible for r...

Latest Reply
rushi29
New Contributor II
  • 0 kudos

Thanks @Kaniz_Fatma for your response. Since I also need to call stored procedures in the Azure SQL databases from Azure Databricks, I don't think the DataFrames solution would work. When using py4j, how would I create a connection object in Azure D...

1 More Reply
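
For the stored-procedure part of this thread, a hedged sketch of going through the JVM's JDBC DriverManager via py4j is shown below; note that the JVM bridge is generally unavailable on shared access mode clusters, so this typically requires single-user (dedicated) compute. Server, database, credentials, and the procedure name are placeholders.

```python
# Placeholder connection string - prefer secrets over hard-coded credentials.
jdbc_url = ("jdbc:sqlserver://<server>.database.windows.net:1433;"
            "database=<db>;user=<user>;password=<password>")

driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url)
try:
    stmt = conn.prepareCall("{call dbo.my_stored_proc(?)}")  # hypothetical procedure
    stmt.setInt(1, 42)
    stmt.execute()
finally:
    conn.close()
```
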
N_M
by Contributor
  • 2133 Views
  • 8 replies
  • 3 kudos

Resolved! use job parameters in scripts

Hi Community, I did some research, but I wasn't lucky, and I'm a bit surprised I can't find anything about it. So, I would simply like to access the job parameters when using Python scripts (not notebooks). My flow doesn't use notebooks, but I still need to dri...

Latest Reply
N_M
Contributor
  • 3 kudos

The only working workaround I found has been provided in another thread: Re: Retrieve job-level parameters in Python - Databricks Community - 44720. I will repost it here (thanks @julio_resende). You need to push down your parameters to a task level. E.g.: C...

7 More Replies
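
A hedged sketch of the workaround described above: the Python script task's parameters pass the job parameter down (e.g. ["--run-date", "{{job.parameters.run_date}}"], where the parameter name is hypothetical), and the script reads it with argparse.

```python
import argparse

# The value arrives as an ordinary command-line argument supplied by the task.
parser = argparse.ArgumentParser()
parser.add_argument("--run-date", dest="run_date")
args = parser.parse_args()

print(f"Running for {args.run_date}")
```
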
Shiva3
by New Contributor III
  • 231 Views
  • 2 replies
  • 0 kudos

How to know the actual size of Delta and non-Delta tables, and the number of files that actually exist on S3

I have a set of Delta and non-Delta tables whose data is on AWS S3. I want to know the actual total size of my Delta and non-Delta tables, excluding files that belong to operations such as DELETE, VACUUM, etc. I also need to know how many files each Delta versi...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Shiva3, To manage the size of Delta and non-Delta tables on AWS S3, excluding irrelevant files, start by using `DESCRIBE HISTORY` to monitor Delta table metrics and `VACUUM` to clean up old files, setting a retention period as needed. For non-Del...

1 More Reply
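
To complement the reply, a short sketch using commands that report current-snapshot size: DESCRIBE DETAIL returns sizeInBytes and numFiles for the live version of a Delta table (excluding files retained only for old versions), and DESCRIBE HISTORY exposes per-version operation metrics. The table name is a placeholder.

```python
# Size and file count of the current Delta snapshot (placeholder table name).
spark.sql("DESCRIBE DETAIL main.sales.orders") \
     .select("sizeInBytes", "numFiles") \
     .show()

# Per-version operations and their metrics (e.g. numAddedFiles, numRemovedFiles).
spark.sql("DESCRIBE HISTORY main.sales.orders") \
     .select("version", "operation", "operationMetrics") \
     .show(truncate=False)
```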
