Data Engineering

Forum Posts

by Dipesh • New Contributor II

01-31-2023 6:19:58 AM

2435 Views
1 reply
1 kudos

Resolved! Bulk updating Delta tables in Databricks

Hi All, I have some data in a Delta table with multiple columns, and each record has a unique identifier. I want to update some columns with the new values coming in for each of these unique records. However, updating one record at a time is taking a lot...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-31-2023 11:12:31 AM

1 kudos

Yes, by using the MERGE statement.
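
A minimal sketch of what such a MERGE might look like, assuming the new values sit in a staging table that shares the unique identifier with the target Delta table. The table names (`customers`, `customer_updates`), the key (`id`), and the column names are hypothetical and not taken from the thread. The helper only builds the SQL text; on Databricks the resulting string would typically be passed to `spark.sql(...)`.

```python
def build_merge_sql(target, source, key, update_cols):
    """Build a Delta Lake MERGE statement that updates the given columns
    for every target row whose key matches a row in the source.

    All names passed in are hypothetical placeholders for illustration.
    """
    if not update_cols:
        raise ValueError("update_cols must contain at least one column")
    # One "t.col = s.col" assignment per column to be overwritten.
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in update_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}"
    )


merge_sql = build_merge_sql(
    target="customers",          # hypothetical target Delta table
    source="customer_updates",   # hypothetical staging table with new values
    key="id",                    # hypothetical unique identifier column
    update_cols=["email", "status"],
)
print(merge_sql)
```

Because a single MERGE applies every matching update in one operation, it should avoid the per-record overhead described in the question.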

by venkat09 • New Contributor III

01-31-2023 10:01:51 AM

1349 Views
1 reply
1 kudos

Schema Evolution - Auto Loader for Avro format is not working as expected

* Reading Avro files from S3 and then writing to the Delta table
* Ingested sample data of 10 files, which contain four columns, and it infers the schema automatically as expected
* Introducing a new file which contains a new column [foo] along wi...

Latest Reply

venkat09
New Contributor III

01-31-2023 11:06:32 AM

1 kudos

I am attaching the sample code notebook that helps to reproduce the issue.

by KuldeepChitraka • New Contributor III

01-31-2023 8:08:58 AM

1856 Views
3 replies
6 kudos

Performance Issue: Create DELTA table from 2 TB PARQUET file

We are trying to create a DELTA table (CTAS statement) from a 2 TB PARQUET file, and it is taking a huge amount of time, around 12 hrs. Is this normal? What are the options to tune/optimize this? Are we doing anything wrong? Cluster: Interactive / 30 Cores / 320 GB ...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

01-31-2023 10:58:05 AM

6 kudos

Please use COPY INTO (first create an empty Delta table) or CONVERT TO DELTA instead of CTAS. It will be much faster, and the process will be auto-optimized.
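
As a rough illustration of the two statements named in this reply, the sketch below builds the SQL text for each. The path `/mnt/raw/events_parquet` and the table name `events_delta` are hypothetical, and the helpers only return strings; on Databricks each string would typically be run through `spark.sql(...)` or a SQL cell.

```python
def convert_to_delta_sql(parquet_path):
    """Return a CONVERT TO DELTA statement for an existing Parquet directory.

    The conversion works in place: it adds a Delta transaction log next to
    the existing Parquet files instead of rewriting the data.
    """
    return f"CONVERT TO DELTA parquet.`{parquet_path}`"


def copy_into_sql(target_table, parquet_path):
    """Return a COPY INTO statement that loads Parquet files into a Delta
    table that already exists (the reply suggests creating it empty first)."""
    return (
        f"COPY INTO {target_table}\n"
        f"FROM '{parquet_path}'\n"
        f"FILEFORMAT = PARQUET"
    )


# Hypothetical source path and target table, for illustration only.
convert_sql = convert_to_delta_sql("/mnt/raw/events_parquet")
copy_sql = copy_into_sql("events_delta", "/mnt/raw/events_parquet")
print(convert_sql)
print(copy_sql)
```

Which of the two fits better probably depends on whether the Parquet files may stay where they are (CONVERT TO DELTA) or should be loaded into a separate table (COPY INTO).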

2 More Replies

by mriccardi • New Contributor II

12-01-2022 11:12:26 AM

3590 Views
1 reply
0 kudos

Structured Streaming Checkpoint corrupted.

Hello, We are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue with the execution. I've checked the errors and this happens when it triggers an autocompact,...

Latest Reply

jose_gonzalez
Databricks Employee

01-31-2023 9:14:11 AM

0 kudos

Hi @Martin Riccardi, Could you share the following, please:
1) What's your source?
2) What's your sink?
3) Could you share your readStream() and writeStream() code?
4) Full error stack trace
5) Did you stop and re-run your query after weeks of not being acti...

by Sameer_876675 • New Contributor III

12-07-2022 4:22:17 AM

5483 Views
3 replies
2 kudos

How to efficiently process a 100GiB JSON nested file and store it in Delta?

Hi, I'm a fairly new user and I am using Azure Databricks to process a ~1000GiB JSON nested file containing insurance policy data. I uploaded the JSON file to Azure Data Lake Gen2 storage and read the JSON file into a dataframe. df=spark.read.option("...

Latest Reply

Annapurna_Hiriy
Databricks Employee

01-31-2023 8:20:49 AM

2 kudos

Hi Sameer, please refer to the following documents on how to work with nested JSON:
https://docs.databricks.com/optimizations/semi-structured.html
https://learn.microsoft.com/en-us/azure/databricks/kb/_static/notebooks/scala/nested-json-to-dataframe.html

2 More Replies

by pramalin • New Contributor

01-30-2023 10:33:47 AM

3344 Views
3 replies
2 kudos

How to perform Inner join using withcolumn

Latest Reply

shan_chandra
Databricks Employee

01-31-2023 7:55:15 AM

2 kudos

@prudhvi ramalingam - Please refer to the example code below.
import org.apache.spark.sql.functions.expr val person = Seq( (0, "Bill Chambers", 0, Seq(100)), (1, "Matei Zaharia", 1, Seq(500, 250, 100)), (2, "Michael Armbrust", 1, Seq(250,...

2 More Replies

by KVNARK • Honored Contributor II

01-31-2023 3:35:58 AM

1726 Views
2 replies
2 kudos

Encrypt in Azure SQL DB and decrypt in Power BI

Some columns are encrypted in Azure SQL DB, and I need to decrypt them in Power BI. Are there any prerequisites for implementing this?

Latest Reply

Nhan_Nguyen
Valued Contributor

01-31-2023 5:11:54 AM

2 kudos

Could you describe your case in more detail?

1 More Replies

by LidorAbo • New Contributor II

01-31-2023 1:09:02 AM

2333 Views
1 reply
0 kudos

Databricks can write to S3 bucket through pandas but not from Spark

Hey, I have a problem with access to an S3 bucket using cross-account bucket permissions; I got the following error: Steps to reproduce: Checking the role that is associated with the EC2 instance: { "Version": "2012-10-17", "Statement": [ { ...

Latest Reply

Nhan_Nguyen
Valued Contributor

01-31-2023 5:17:32 AM

0 kudos

Could you try mapping the S3 bucket location to the Databricks File System, then writing the output to this new location instead of writing directly to the S3 location?

by sedat • New Contributor II

01-30-2023 8:08:52 AM

2189 Views
2 replies
2 kudos

Hi, is there any document for Databricks about performance tuning and reporting?

Hi, I need to analyse performance issues in Databricks. Is there any document or monitoring tool I can run to see what is happening in Databricks? I am very new to Databricks. Best

Latest Reply

Nhan_Nguyen
Valued Contributor

01-31-2023 5:14:30 AM

2 kudos

You could try some courses at "https://customer-academy.databricks.com/":
What's New In Apache Spark 3.0
Optimizing Apache Spark on Databricks

1 More Replies

by Callum • New Contributor II

12-01-2022 7:05:53 AM

13036 Views
3 replies
2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

Latest Reply

Serlal
New Contributor III

01-31-2023 3:01:12 AM

2 kudos

Hi! I tried debugging your code and I think that the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function: and after executing that...

2 More Replies

by sonalitotade • New Contributor II

01-18-2023 7:58:25 AM

2048 Views
2 replies
0 kudos

Capture events such as Start, Stop and Terminate of cluster.

Hi, I am using Databricks with AWS. I need to capture events such as Start, Stop and Terminate of a cluster and perform some other action based on the events that happened on the cluster. Is there a way I can achieve this in Databricks?

Latest Reply

sonalitotade
New Contributor II

01-31-2023 1:06:33 AM

0 kudos

Hi Daniel, thanks for the response. I would like to know if we can capture the event logs as shown in the image below when an event occurs on the cluster.

1 More Replies

by KVNARK • Honored Contributor II

01-30-2023 7:56:46 PM

15588 Views
2 replies
5 kudos

Resolved! PySpark optimizations and best practices

What can we implement to achieve the best optimization, and what are the best practices for using PySpark end to end?

Latest Reply

daniel_sahal
Esteemed Contributor

01-30-2023 11:55:41 PM

5 kudos

@KVNARK, this video is cool: https://www.youtube.com/watch?v=daXEp4HmS-E

1 More Replies

by Gandham • New Contributor II

01-28-2023 10:30:02 AM

4208 Views
3 replies
2 kudos

Maven Libraries are failing on restarting the cluster.

I have installed the "com.databricks:spark-xml_2.12:0.16.0" Maven library on a cluster. The installation was successful. But when I restart the cluster, even this previously successful installation shows as failed. This happens with all Maven libraries. Here is th...

Latest Reply

Aviral-Bhardwaj
Esteemed Contributor III

01-30-2023 7:37:26 PM

2 kudos

It is an intermittent issue; we also faced it earlier. Try upgrading the DBR version.

2 More Replies

by Therdpong • New Contributor III

01-18-2023 8:22:41 AM

2068 Views
2 replies
0 kudos

How to check which job clusters have expanded disk.

We would like to know how to check which job clusters have had to expand disk.

Latest Reply

jose_gonzalez
Databricks Employee

01-30-2023 2:40:04 PM

0 kudos

You can check the cluster's event logs. Type "disk" in the search box and you will see all the related events there.
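
For checking many job clusters at once, the same filtering could be done programmatically on exported event records. The sketch below is only an illustration: it assumes each event is a dict with `cluster_id` and `type` fields, as in the Clusters API events response, and that disk-related events use the type names `EXPANDED_DISK`, `DID_NOT_EXPAND_DISK` and `FAILED_TO_EXPAND_DISK`; verify both against your workspace. The sample events and cluster IDs are made up.

```python
# Event type names assumed to mark disk expansion activity (an assumption,
# check them against the event log of your own workspace).
DISK_EVENT_TYPES = {
    "EXPANDED_DISK",
    "DID_NOT_EXPAND_DISK",
    "FAILED_TO_EXPAND_DISK",
}


def clusters_with_disk_events(events):
    """Return the sorted IDs of clusters that logged at least one
    disk-related event."""
    return sorted(
        {e["cluster_id"] for e in events if e.get("type") in DISK_EVENT_TYPES}
    )


# Made-up sample records, for illustration only.
sample_events = [
    {"cluster_id": "job-cluster-a", "type": "STARTING"},
    {"cluster_id": "job-cluster-a", "type": "EXPANDED_DISK"},
    {"cluster_id": "job-cluster-b", "type": "RUNNING"},
    {"cluster_id": "job-cluster-c", "type": "FAILED_TO_EXPAND_DISK"},
]

disk_clusters = clusters_with_disk_events(sample_events)
print(disk_clusters)  # ['job-cluster-a', 'job-cluster-c']
```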

1 More Replies

by SS2 • Valued Contributor

11-29-2022 12:06:54 PM

2066 Views
2 replies
1 kudos

Spark out of memory error.

Sometimes in Databricks you may see an out-of-memory error. In that case, you can change the cluster size as required to resolve the issue.

Latest Reply

jose_gonzalez
Databricks Employee

01-30-2023 4:38:22 PM

1 kudos

Hi @S S, Could you provide more details on your issue? For example, error stack traces, code snippets, etc. We will be able to help you if you share more details.

1 More Replies
