Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Dipesh
by New Contributor II
  • 2435 Views
  • 1 replies
  • 1 kudos

Resolved! Bulk updating Delta tables in Databricks

Hi All, I have some data in a Delta table with multiple columns, and each record has a unique identifier. I want to update some columns as per the new values coming in for each of these unique records. However, updating one record at a time is taking a lot...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Yes, by using a MERGE statement.
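
As a rough illustration of the MERGE approach suggested above, the sketch below builds a single Delta Lake `MERGE INTO` statement that updates many records in one pass instead of one at a time. The table names (`policies`, `policy_updates`), the key column `id`, the updated columns, and the helper `build_merge_sql` are all hypothetical and not taken from the original post.

```python
def build_merge_sql(target, source, key, columns):
    """Return a Delta Lake MERGE statement that bulk-updates `columns`.

    All table and column names are supplied by the caller; the ones used
    in the example below are hypothetical.
    """
    if not columns:
        raise ValueError("at least one column to update is required")
    # One assignment per updated column: t.col = s.col
    assignments = ", ".join(f"t.{col} = s.{col}" for col in columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {assignments}"
    )


# Hypothetical names: `policies` is the Delta table, `policy_updates`
# holds the incoming values (for example, a temp view over a DataFrame).
merge_sql = build_merge_sql(
    "policies", "policy_updates", "id", ["status", "premium"]
)
print(merge_sql)
```

In a notebook with an active SparkSession, the resulting string would typically be run with `spark.sql(merge_sql)`, so that all matching rows are updated by one statement.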

  • 1 kudos
venkat09
by New Contributor III
  • 1349 Views
  • 1 replies
  • 1 kudos

Schema Evolution - Auto Loader for Avro format is not working as expected

* Reading Avro files from S3 and then writing to the Delta table
* Ingested sample data of 10 files, which contain four columns, and it infers the schema automatically as expected
* Introducing a new file which contains a new column [foo] along wi...

Latest Reply
venkat09
New Contributor III
  • 1 kudos

I am attaching the sample code notebook that helps to reproduce the issue.

  • 1 kudos
KuldeepChitraka
by New Contributor III
  • 1856 Views
  • 3 replies
  • 6 kudos

Performance Issue: Create DELTA table from 2 TB PARQUET file

We are trying to create a DELTA table (CTAS statement) from a 2 TB PARQUET file, and it is taking a huge amount of time, around 12 hours. Is this normal? What are the options to tune/optimize this? Are we doing anything wrong? Cluster: Interactive / 30 Cores / 320 GB ...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 6 kudos

Please use COPY INTO (first create an empty Delta table) or CONVERT TO DELTA instead of CTAS. It will be much faster, and the process will be auto-optimized.
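
The two alternatives named above can be sketched as SQL statements built in Python. This is only an illustration under stated assumptions: the table name `sales_delta`, the path `/mnt/raw/sales_parquet`, and both helper functions are hypothetical, and creating the table without a schema assumes a Databricks Runtime version that supports schemaless table creation for COPY INTO.

```python
def copy_into_statements(table, source_path):
    """Statements for loading Parquet files into a Delta table with COPY INTO."""
    return [
        # Create the empty Delta table first, as the reply suggests.
        f"CREATE TABLE IF NOT EXISTS {table}",
        # Load the Parquet files; mergeSchema lets COPY INTO set the schema.
        f"COPY INTO {table}\n"
        f"FROM '{source_path}'\n"
        "FILEFORMAT = PARQUET\n"
        "COPY_OPTIONS ('mergeSchema' = 'true')",
    ]


def convert_to_delta_statement(source_path):
    """Statement for converting an existing Parquet directory in place."""
    return f"CONVERT TO DELTA parquet.`{source_path}`"


# Hypothetical table name and path, for illustration only.
copy_statements = copy_into_statements("sales_delta", "/mnt/raw/sales_parquet")
convert_statement = convert_to_delta_statement("/mnt/raw/sales_parquet")
for statement in copy_statements + [convert_statement]:
    print(statement, end="\n\n")
```

COPY INTO reads the source files and writes new Delta files, while CONVERT TO DELTA leaves the existing Parquet files in place and adds a transaction log, so the choice depends on whether the original directory should become the table. Each string would be run with `spark.sql(...)` in a notebook.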

  • 6 kudos
2 More Replies
mriccardi
by New Contributor II
  • 3590 Views
  • 1 replies
  • 0 kudos

Structured Streaming Checkpoint corrupted.

Hello, We are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue with the execution. I've checked the errors and this happens when it triggers an autocompact,...

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

Hi @Martin Riccardi, could you share the following, please: 1) What is your source? 2) What is your sink? 3) Could you share your readStream() and writeStream() code? 4) The full error stack trace. 5) Did you stop and re-run your query after weeks of not being acti...

  • 0 kudos
Sameer_876675
by New Contributor III
  • 5483 Views
  • 3 replies
  • 2 kudos

How to efficiently process a 100GiB JSON nested file and store it in Delta?

Hi, I'm a fairly new user and I am using Azure Databricks to process a ~1000GiB JSON nested file containing insurance policy data. I uploaded the JSON file to Azure Data Lake Gen2 storage and read the JSON file into a dataframe. df=spark.read.option("...

Latest Reply
Annapurna_Hiriy
Databricks Employee
  • 2 kudos

Hi Sameer, please refer to the following documents on how to work with nested JSON:
https://docs.databricks.com/optimizations/semi-structured.html
https://learn.microsoft.com/en-us/azure/databricks/kb/_static/notebooks/scala/nested-json-to-dataframe.html

  • 2 kudos
2 More Replies
pramalin
by New Contributor
  • 3344 Views
  • 3 replies
  • 2 kudos
Latest Reply
shan_chandra
Databricks Employee
  • 2 kudos

@prudhvi ramalingam - Please refer to the example code below.
import org.apache.spark.sql.functions.expr
val person = Seq( (0, "Bill Chambers", 0, Seq(100)), (1, "Matei Zaharia", 1, Seq(500, 250, 100)), (2, "Michael Armbrust", 1, Seq(250,...

  • 2 kudos
2 More Replies
KVNARK
by Honored Contributor II
  • 1726 Views
  • 2 replies
  • 2 kudos

Encrypt in azure SQL DB and decrypt in Power BI

Some columns are encrypted in Azure SQL DB, and I need to decrypt them in Power BI. Are there any prerequisites to implement this?

Latest Reply
Nhan_Nguyen
Valued Contributor
  • 2 kudos

Could you describe your case in more detail?

  • 2 kudos
1 More Replies
LidorAbo
by New Contributor II
  • 2333 Views
  • 1 replies
  • 0 kudos

Databricks can write to S3 bucket through pandas but not from Spark

Hey, I have a problem with access to an S3 bucket using cross-account bucket permissions. I got the following error: Steps to reproduce: Checking the role that is associated with the EC2 instance: { "Version": "2012-10-17", "Statement": [ { ...

Latest Reply
Nhan_Nguyen
Valued Contributor
  • 0 kudos

Could you try mapping the S3 bucket location to the Databricks File System, then writing the output to this new location instead of writing directly to the S3 location?

  • 0 kudos
sedat
by New Contributor II
  • 2189 Views
  • 2 replies
  • 2 kudos

Hi, is there any document for databricks about performance tuning and reporting?

Hi, I need to analyse performance issues in Databricks. Is there any document or monitoring tool I can run to see what is happening in Databricks? I am very new to Databricks. Best

Latest Reply
Nhan_Nguyen
Valued Contributor
  • 2 kudos

You could try some courses at https://customer-academy.databricks.com/:
* What's New In Apache Spark 3.0
* Optimizing Apache Spark on Databricks

  • 2 kudos
1 More Replies
Callum
by New Contributor II
  • 13036 Views
  • 3 replies
  • 2 kudos

Pyspark Pandas column or index name appears to persist after being dropped or removed.

So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...

Latest Reply
Serlal
New Contributor III
  • 2 kudos

Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function: and after executing that...

  • 2 kudos
2 More Replies
sonalitotade
by New Contributor II
  • 2048 Views
  • 2 replies
  • 0 kudos

Capture events such as Start, Stop and Terminate of cluster.

Hi, I am using Databricks with AWS. I need to capture events such as Start, Stop and Terminate of a cluster and perform some other action based on the events that happened on the cluster. Is there a way I can achieve this in Databricks?

Latest Reply
sonalitotade
New Contributor II
  • 0 kudos

Hi Daniel, thanks for the response. I would like to know if we can capture the event logs as shown in the image below when an event occurs on the cluster.

  • 0 kudos
1 More Replies
KVNARK
by Honored Contributor II
  • 15588 Views
  • 2 replies
  • 5 kudos

Resolved! pyspark optimizations and best practices

What can we implement to achieve the best optimization, and what are the best practices for using PySpark end to end?

Latest Reply
daniel_sahal
Esteemed Contributor
  • 5 kudos

@KVNARK This video is cool: https://www.youtube.com/watch?v=daXEp4HmS-E

  • 5 kudos
1 More Replies
Gandham
by New Contributor II
  • 4208 Views
  • 3 replies
  • 2 kudos

Maven Libraries are failing on restarting the cluster.

I have installed the "com.databricks:spark-xml_2.12:0.16.0" Maven library on a cluster. The installation was successful. But when I restart the cluster, even this successful installation shows as failed. This happens with all Maven libraries. Here is th...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 2 kudos

It is an intermittent issue; we also faced it earlier. Try upgrading the DBR version.

  • 2 kudos
2 More Replies
Therdpong
by New Contributor III
  • 2068 Views
  • 2 replies
  • 0 kudos

How to check which job clusters had to expand disk

We would like to know how to check which job clusters had to expand disk.

Latest Reply
jose_gonzalez
Databricks Employee
  • 0 kudos

You can check the cluster's event logs. Type "disk" in the search box and you will see all the related events there.
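
For checking several clusters without opening the UI, the same event log can also be queried programmatically. The sketch below is an assumption-laden illustration, not part of the original reply: it builds (but does not send) a request to the Clusters API events endpoint, filtered to disk-expansion event types. The workspace URL, token placeholder, cluster ID, and the helper `build_disk_events_request` are hypothetical, and the event type names should be checked against the API reference for your workspace.

```python
import json
import urllib.request


def build_disk_events_request(host, token, cluster_id, limit=50):
    """Build a POST request for a cluster's disk-related events."""
    payload = {
        "cluster_id": cluster_id,
        # Event types assumed from the Clusters API event type list.
        "event_types": [
            "EXPANDED_DISK",
            "FAILED_TO_EXPAND_DISK",
            "DID_NOT_EXPAND_DISK",
        ],
        "order": "DESC",  # newest events first
        "limit": limit,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical workspace URL, token placeholder, and cluster ID.
request = build_disk_events_request(
    "https://example.cloud.databricks.com", "<personal-access-token>", "0000-000000-example"
)
print(request.get_method(), request.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(request)`; repeating the call for each job cluster ID would show which clusters reported disk expansion.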

  • 0 kudos
1 More Replies
SS2
by Valued Contributor
  • 2066 Views
  • 2 replies
  • 1 kudos

Spark out of memory error.

Sometimes in Databricks you can see an out-of-memory error. In that case, you can change the cluster size as required to resolve the issue.

Latest Reply
jose_gonzalez
Databricks Employee
  • 1 kudos

Hi @S S, could you provide more details on your issue? For example, error stack traces, code snippets, etc. We will be able to help you if you share more details.

  • 1 kudos
1 More Replies
