Data Engineering

Forum Posts

by anish2102 • New Contributor II

3 weeks ago

356 Views
4 replies
1 kudos

Resolved! PySpark operations slower on cluster 14.3 LTS as compared to 13.3 LTS

In my notebook, I am performing a few join operations that take more than 30s on a 14.3 LTS cluster, whereas the same operations take less than 4s on a 13.3 LTS cluster. Can someone help me optimize PySpark operations like joins and withColum...

Data Engineering

clustr-14.3

spark-3.5

Latest Reply

Lakshay
Esteemed Contributor

a week ago

1 kudos

Thank you for sharing the analysis

3 More Replies

by SG • New Contributor II

07-20-2023 1:01:53 PM

615 Views
3 replies
1 kudos

Customize job run name when running jobs from ADF

Hi guys, I am running my Databricks jobs on a job cluster from Azure Data Factory using a Databricks Python activity. When I monitor my jobs in Workflows -> Job runs, I see that the run name is a concatenation of the ADF pipeline name, Databricks python ac...

Data Engineering

Latest Reply

AmanSehgal
Honored Contributor III

a week ago

1 kudos

I don't think that level of customisation is provided. However, I can suggest some workarounds: REST API: Create a job on the fly with the desired name within ADF and trigger it using the REST API in a Web activity. This way you can track job completion status ...
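To make the REST API workaround a bit more concrete, here is a minimal sketch of a request body for the Jobs API one-time run endpoint (POST /api/2.1/jobs/runs/submit), which accepts a run_name field. The run name, script path, cluster ID and the helper name build_submit_payload are all hypothetical placeholders, and the task is assumed to be a Python file running on an existing cluster. It only builds the JSON; an ADF Web activity would still have to send it.

```python
import json


def build_submit_payload(run_name, python_file, cluster_id, parameters=None):
    """Build a body for POST /api/2.1/jobs/runs/submit with a custom run name."""
    task = {
        "task_key": "main",  # hypothetical task key
        "existing_cluster_id": cluster_id,
        "spark_python_task": {
            "python_file": python_file,
            "parameters": list(parameters or []),
        },
    }
    return {"run_name": run_name, "tasks": [task]}


# All values below are placeholders, not taken from the thread.
payload = build_submit_payload(
    run_name="nightly-sales-load",
    python_file="dbfs:/scripts/load_sales.py",
    cluster_id="0000-000000-example",
)
print(json.dumps(payload, indent=2))
```

The run then shows up in the job runs list under whatever run_name was passed in.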

2 More Replies

by Mohit_m • Valued Contributor II

06-27-2022 5:24:40 AM

1284 Views
2 replies
3 kudos

Resolved! Could not initialize class error

User is running a job triggered from ADF in Databricks. In this job they need to use custom libraries that are in jars. Most of the time the jobs run fine, but sometimes they fail with: java.lang.NoClassDefFoundError: Could not initialize Any s...

Data Engineering

Latest Reply

Mohit_m
Valued Contributor II

06-27-2022 5:25:15 AM

3 kudos

Can you please check whether more than one jar contains this class? If multiple jars of the same type are available on the cluster, there is no guarantee that the JVM picks the proper classes for processing, which results in the intermittent...
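One way to run that check is sketched below. It assumes you can get a local copy of the jars attached to the cluster, and it relies only on the fact that a jar is a zip archive. The function name find_duplicate_classes is made up for this example.

```python
import zipfile
from collections import defaultdict


def find_duplicate_classes(jar_paths):
    """Return {class entry: [jars]} for every .class file found in more than one jar."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            for name in archive.namelist():
                if name.endswith(".class"):
                    owners[name].append(str(jar))
    # Keep only the classes that more than one jar provides.
    return {cls: jars for cls, jars in owners.items() if len(jars) > 1}
```

Calling find_duplicate_classes(["lib_a.jar", "lib_b.jar"]) (placeholder file names) returns an empty dict when no class is shipped twice. Any entry it does return is a candidate for the intermittent NoClassDefFoundError.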

1 More Replies

by Jorge3 • New Contributor III

a week ago

410 Views
3 replies
2 kudos

Resolved! [Databricks Asset Bundles] Workflow trigger on file arrival

Hi everyone! I'm setting up a workflow using Databricks Asset Bundles (DABs), and I want to configure my workflow to be triggered on file arrival. However, all the examples I've found in the documentation use schedule triggers. Does anyone know if it is...

Data Engineering

Latest Reply

Ajay-Pandey
Esteemed Contributor III

a week ago

2 kudos

Hi @Jorge3 Yes, you can also use continuous mode. Please find the syntax below:

    resources:
      jobs:
        dbx_job:
          name: continuous_job_name
          continuous:
            pause_status: UNPAUSED
          queue:
            enabled: true

2 More Replies

by ismaelhenzel • New Contributor II

4 weeks ago

804 Views
2 replies
2 kudos

Resolved! Addressing Pipeline Error Handling in Databricks bundle run with CI/CD when SUCCESS WITH FAILURES

I'm using Databricks Asset Bundles and I have pipelines that contain "if all done" rules. When running on CI/CD, if a task fails, the pipeline returns a message like "the job xxxx SUCCESS_WITH_FAILURES" and it passes, potentially deploying a broken p...

Data Engineering

bunlde

CICD

Databricks

Latest Reply

ismaelhenzel
New Contributor II

a week ago

2 kudos

Awesome answer, I will try the first approach. I think it is a less intrusive solution than changing the rules of my pipeline in development scenarios. This way, I can maintain a general pipeline for deployment across all environments. We plan to imp...
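For anyone hitting the same SUCCESS_WITH_FAILURES behaviour, here is a rough sketch of a CI guard. It is not necessarily the approach referred to above. It assumes the CI step has already fetched the run description as a dict shaped like the Jobs API runs/get response (a "tasks" list whose items carry state.result_state). The set of states treated as failures and the function names are my own choices.

```python
# Result states treated as failures here; adjust to your own rules (assumption).
FAILED_STATES = {"FAILED", "TIMEDOUT", "CANCELED", "UPSTREAM_FAILED"}


def failed_tasks(run):
    """Return the task keys whose result_state counts as a failure."""
    failed = []
    for task in run.get("tasks", []):
        result = task.get("state", {}).get("result_state")
        if result in FAILED_STATES:
            failed.append(task.get("task_key", "<unknown>"))
    return failed


def ci_exit_code(run):
    """Return 1 if any task failed, even when the run itself reports SUCCESS_WITH_FAILURES."""
    return 1 if failed_tasks(run) else 0


# Hand-written example data, not real output.
example_run = {
    "state": {"result_state": "SUCCESS_WITH_FAILURES"},
    "tasks": [
        {"task_key": "ingest", "state": {"result_state": "FAILED"}},
        {"task_key": "cleanup", "state": {"result_state": "SUCCESS"}},
    ],
}
print(failed_tasks(example_run), ci_exit_code(example_run))
```

In a pipeline, the step would end with sys.exit(ci_exit_code(run)) so the deployment stops when any task failed.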

1 More Replies

by smedegaard • New Contributor III

2 weeks ago

175 Views
2 replies
1 kudos

[Delta Live Tables] exception: getPrimaryKeys not implemented for debezium

I've defined a streaming Delta Live Table in a notebook using Python, running on the "preview" channel with delta cache accelerated (Standard_D4ads_v5) compute. It fails with org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = xxx, ru...

Data Engineering

Latest Reply

Kaniz
Community Manager

a week ago

1 kudos

Hi @smedegaard, You’re encountering a StreamingQueryException with the message: “getPrimaryKeys not implemented for debezium SQLSTATE: XXKST.” This error suggests that the getPrimaryKeys operation is not supported for the Debezium connector in your ...

1 More Replies

by Phani1 • Valued Contributor

2 weeks ago

142 Views
1 reply
0 kudos

Boomi integrating with Databricks

Hi Team, Is there any impact when integrating Databricks with Boomi as opposed to Azure Event Hub? Could you offer some insights on the integration of Boomi with Databricks? https://boomi.com/blog/introducing-boomi-event-streams/ Regards, Janga

Data Engineering

delta

Latest Reply

Kaniz
Community Manager

a week ago

0 kudos

Hi @Phani1, Let’s explore the integration of Databricks with Boomi and compare it to Azure Event Hub. Databricks Integration with Boomi: Databricks is a powerful data analytics platform that allows you to process large-scale data and build machin...

by ETLdeveloper • New Contributor II

2 weeks ago

613 Views
1 reply
0 kudos

Resolved! I have to run notebooks concurrently using ProcessPoolExecutor in Python

Hello All, My scenario requires me to write code that reads tables from the source catalog and writes them to the destination catalog using Spark. Doing them one by one is not a good option when there are 300 tables in the catalog. So I am trying the pr...

Data Engineering

Latest Reply

Ajay-Pandey
Esteemed Contributor III

a week ago

0 kudos

Hi @ETLdeveloper You can use multithreading to run notebooks in parallel. Attaching code for your reference:

    from concurrent.futures import ThreadPoolExecutor

    class NotebookData:
        def __init__(self, path, timeout, parameters = Non...
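Since the attached snippet is cut off, here is an independent minimal sketch of the same idea. It is not the original code. run_notebook is a stand-in so the example runs anywhere. On Databricks you would typically replace its body with dbutils.notebook.run(path, timeout, parameters). The notebook path /Shared/copy_table and the table names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_notebook(path, timeout, parameters):
    """Stand-in runner; on Databricks this would call dbutils.notebook.run(path, timeout, parameters)."""
    return f"ran {path} with {parameters}"


def run_in_parallel(notebooks, max_workers=4, runner=run_notebook):
    """Run (path, timeout, parameters) tuples on a thread pool; return results in input order."""
    results = [None] * len(notebooks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(runner, path, timeout, params): index
            for index, (path, timeout, params) in enumerate(notebooks)
        }
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:  # keep going; record the failure for this notebook
                results[index] = exc
    return results


# One notebook call per table (placeholder names).
tables = ["customers", "orders", "items"]
jobs = [("/Shared/copy_table", 3600, {"table": t}) for t in tables]
results = run_in_parallel(jobs, max_workers=3)
```

Threads are enough here because each call mostly waits on the cluster. A failed notebook ends up as an exception object in its slot instead of aborting the other copies.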

by TitaMn • New Contributor

2 weeks ago

146 Views
1 reply
0 kudos

Azure DevOps and Databricks connection using managed identity or service principal

Hi All! I'm on a project where I need to connect Azure DevOps and Databricks using a managed identity to avoid using a service account, PAT, etc. The thing is, I can't move forward with the connection since I cannot take ownership of the files wh...

Data Engineering

Latest Reply

Kaniz
Community Manager

a week ago

0 kudos

Hi @TitaMn, Connecting Azure DevOps and Azure Databricks using managed identity is a great approach to enhance security and avoid using service accounts or personal access tokens (PATs). Let’s explore some options: Azure Managed Identity for Dat...

by Anske • New Contributor II

2 weeks ago

186 Views
4 replies
0 kudos

How to stop a dataframe with a federated table source from being re-evaluated when referenced (cache?)

Hi, Would anyone happen to know whether it's possible to cache in memory a dataframe that is the result of a query on a federated table? I have a notebook that queries a federated table, does some transformations on the dataframe and then writes this data...

Data Engineering

Latest Reply

Anske
New Contributor II

a week ago

0 kudos

@daniel_sahal, this is the code snippet:

    lsn_incr_batch = spark.sql(f"""
    select start_lsn, tran_begin_time, tran_end_time, tran_id, tran_begin_lsn,
           cast('{current_run_ts}' as timestamp) as appended
    from externaldb.cdc.lsn_time_mapping
    where tran_end_time > '...

3 More Replies

by CarstenWeber • New Contributor II

2 weeks ago

283 Views
4 replies
1 kudos

Resolved! Invalid configuration fs.azure.account.key trying to load ML Model with OAuth

Hi Community, I was trying to load an ML model from an Azure storage account (abfss://....) with:

    model = PipelineModel.load(path)

I set the Spark config:

    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provi...

Data Engineering

Latest Reply

CarstenWeber
New Contributor II

a week ago

1 kudos

@daniel_sahal using the settings above did indeed work.

3 More Replies

by amar1995 • New Contributor II

2 weeks ago

421 Views
4 replies
0 kudos

Performance Issue with XML Processing in Spark Databricks

I am reaching out to bring attention to a performance issue we are encountering while processing XML files using Spark-XML, particularly with the configuration spark.read().format("com.databricks.spark.xml"). Currently, we are experiencing significant...

Data Engineering

Latest Reply

shan_chandra
Honored Contributor III

2 weeks ago

0 kudos

@amar1995 - Can you try this streaming approach and see if it works for your use case (using autoloader) - https://kb.databricks.com/streaming/stream-xml-auto-loader

3 More Replies

by johnp • New Contributor II

2 weeks ago

151 Views
1 reply
0 kudos

Call Databricks notebook from Azure Flask app

I have an Azure web app running a Flask web server. From the Flask server, I want to run some queries on the data stored in ADLS Gen2 storage. I have already created Databricks notebooks that run these queries. The Flask server will pass some parameters in ...

Data Engineering

Latest Reply

feiyun0112
Contributor III

a week ago

0 kudos

You can use the Databricks SDK: https://docs.databricks.com/en/dev-tools/sdk-python.html#create-a-job
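If adding the SDK to the web app is not an option, the same thing can be done against the Jobs REST API directly. The sketch below is an alternative to the SDK route, not the SDK itself. It builds (but does not send) a run-now request using only the standard library, and it assumes the notebook is already wrapped in a job. The workspace URL, token, job ID and parameter names are placeholders.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, notebook_params):
    """Build a POST request for /api/2.1/jobs/run-now without sending it."""
    body = json.dumps({"job_id": job_id, "notebook_params": notebook_params})
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; a Flask view would fill these from config and the incoming request.
request = build_run_now_request(
    host="https://adb-0000000000000000.0.azuredatabricks.net",
    token="<access-token>",
    job_id=123,
    notebook_params={"start_date": "2024-01-01"},
)
# urllib.request.urlopen(request) would submit it; the JSON response includes a run_id.
```

The notebook can read the values passed in notebook_params through its widgets.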

by Kanti1989 • New Contributor II

2 weeks ago

440 Views
4 replies
0 kudos

PySpark execution error

I am getting an error message when executing simple PySpark code. Can anyone help me with this?

Data Engineering

Latest Reply

AmanSehgal
Honored Contributor III

2 weeks ago

0 kudos

Could you please share the entire error message? Are you running the code locally or on Databricks?

3 More Replies

by data-grassroots • New Contributor II

2 weeks ago

558 Views
6 replies
1 kudos

Resolved! Ingesting Files - Same file name, modified content

We have a data feed with files whose filenames stay the same but whose contents change over time (brand_a.csv, brand_b.csv, brand_c.csv ....). COPY INTO seems to ignore the files when they change. If we set the force flag to true and run it, we end up w...

Data Engineering

Latest Reply

data-grassroots
New Contributor II

2 weeks ago

1 kudos

Thanks for the validation, Werners! That's the path we've been heading down (copy + merge). I still have some DLT experiments planned but - at least for this situation - copy + merge works just fine.

5 More Replies
