Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi all, I have some data in a Delta table with multiple columns, and each record has a unique identifier. I want to update some columns with the new values coming in for each of these unique records. However, updating one record at a time is taking a lot...
* Reading Avro files from S3 and then writing to the Delta table
* Ingested sample data of 10 files, which contain four columns, and it infers the schema automatically as expected
* Introducing a new file which contains a new column [foo] along wi...
We are trying to create a Delta table (CTAS statement) from a 2 TB Parquet file, and it is taking a huge amount of time, around 12 hours. Is this normal? What are the options to tune or optimize this? Are we doing anything wrong? Cluster: Interactive / 30 cores / 320 GB ...
Please use COPY INTO (first create an empty Delta table) or CONVERT TO DELTA instead of CTAS. It will be much faster, and the process will be auto-optimized.
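For anyone who wants to try the two options suggested above, here is a minimal sketch in Python. It only builds the SQL statements and leaves running them to you. The table name, schema, and path in the usage note are hypothetical placeholders, not values from the original question, and the helper function names are made up for this example.

```python
from typing import List


def copy_into_statements(table: str, schema_ddl: str, source_path: str) -> List[str]:
    """Return SQL that creates an empty Delta table and bulk-loads Parquet files into it."""
    return [
        # Step 1: create the empty Delta target table (schema supplied by the caller).
        f"CREATE TABLE IF NOT EXISTS {table} ({schema_ddl}) USING DELTA",
        # Step 2: load the Parquet files; COPY INTO skips files it has already loaded.
        f"COPY INTO {table} FROM '{source_path}' FILEFORMAT = PARQUET",
    ]


def convert_to_delta_statement(source_path: str) -> str:
    """Return SQL that converts an existing Parquet directory to Delta in place."""
    return f"CONVERT TO DELTA parquet.`{source_path}`"


def run_all(spark, statements: List[str]) -> None:
    """Execute each statement with a SparkSession, e.g. the `spark` object in a notebook."""
    for statement in statements:
        spark.sql(statement)
```

In a notebook this could be used as `run_all(spark, copy_into_statements("events", "id BIGINT, name STRING", "/mnt/raw/events"))`. Note that CONVERT TO DELTA on a partitioned Parquet directory also needs a PARTITIONED BY clause, which this sketch leaves out.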
Hello, we are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue the execution. I've checked the errors, and this happens when it triggers an autocompact,...
Hi @Martin Riccardi, could you share the following, please: 1) What is your source? 2) What is your sink? 3) Could you share your readStream() and writeStream() code? 4) The full error stack trace. 5) Did you stop and re-run your query after weeks of not being acti...
Hi, I'm a fairly new user and I am using Azure Databricks to process a ~1000 GiB nested JSON file containing insurance policy data. I uploaded the JSON file to Azure Data Lake Gen2 storage and read the JSON file into a dataframe. df=spark.read.option("...
Hi Sameer, please refer to the following documents on how to work with nested JSON: https://docs.databricks.com/optimizations/semi-structured.html and https://learn.microsoft.com/en-us/azure/databricks/kb/_static/notebooks/scala/nested-json-to-dataframe.html
Hey, I have a problem accessing an S3 bucket using cross-account bucket permissions. I got the following error: Steps to reproduce: Checking the role that is associated with the EC2 instance:
{
"Version": "2012-10-17",
"Statement": [
{
...
Hi, I need to analyse performance issues in Databricks. Is there any documentation or a monitoring tool I can run to see what is happening in Databricks? I am very new to Databricks. Best
So, I have this code for merging dataframes with pyspark pandas. And I want the index of the left dataframe to persist throughout the joins. So following suggestions from others wanting to keep the index after merging, I set the index to a column bef...
Hi! I tried debugging your code, and I think the error you get is simply because the column exists in two instances of your dataframe within your loop. I tried adding some extra debug lines in your merge_dataframes function: and after executing that...
Hi, I am using Databricks with AWS. I need to capture events such as Start, Stop and Terminate of a cluster and perform some other action based on the events that happened on the cluster. Is there a way I can achieve this in Databricks?
Hi Daniel, thanks for the response. I would like to know if we can capture the event logs as shown in the image below when an event occurs on the cluster.
I have installed the "com.databricks:spark-xml_2.12:0.16.0" Maven library on a cluster. The installation was successful. But when I restart the cluster, even this successful installation shows as failed. This happens with all Maven libraries. Here is th...
Sometimes in Databricks you may see an out-of-memory error. In that case, you can change the cluster size as required to resolve the issue.
Hi @S S, could you provide more details on your issue? For example, error stack traces, code snippets, etc. We will be able to help you if you share more details.