Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

soumiknow
by Contributor
  • 510 Views
  • 14 replies
  • 1 kudos

BQ partition data deleted fully even though 'spark.sql.sources.partitionOverwriteMode' is DYNAMIC

We have a date (DD/MM/YYYY) partitioned BQ table. We want to update a specific partition's data in 'overwrite' mode using PySpark. To do this, I set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC' as per the spark bq connector documentat...
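A minimal sketch of the pieces involved in a dynamic partition overwrite to BigQuery from PySpark. The table and bucket names are placeholders, and the exact option set (in particular whether the indirect write method is required) is an assumption that depends on the spark-bigquery connector version in use:

```python
# Hypothetical sketch: collect the write options for an 'overwrite' save
# that should replace only the partitions present in the incoming DataFrame.
def bq_overwrite_options(table, temp_bucket):
    return {
        "table": table,
        "temporaryGcsBucket": temp_bucket,
        # assumption: the indirect write path is the one that honors
        # dynamic partition overwrite in the connector
        "writeMethod": "indirect",
    }

opts = bq_overwrite_options("project.dataset.events", "my-temp-bucket")

# Usage on a cluster with the connector installed:
# spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# df.write.format("bigquery").options(**opts).mode("overwrite").save()
```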

Latest Reply
VZLA
Databricks Employee
  • 1 kudos

@soumiknow This is not the output from -verbose:class; what you see likely comes from importing the library from an external repository, and it is showing the dependency-resolution process, indicating that it has pulled and downloaded the "com.google.cloud.spar...

13 More Replies
Einsatz
by New Contributor
  • 117 Views
  • 4 replies
  • 2 kudos

Resolved! Photon-enabled UC cluster has less executor memory (1/4th) compared to a normal cluster.

I have a Unity Catalog-enabled cluster with node type Standard_DS4_v2 (28 GB Memory, 8 Cores). When the "Use Photon Acceleration" option is disabled, spark.executor.memory is 18409m. But if I enable Photon Acceleration, it shows spark.executor.memory as 46...

Latest Reply
Walter_C
Databricks Employee
  • 2 kudos

The memory allocated to the Photon engine is not fixed; it is based on a percentage of the node’s total memory. To calculate the value of spark.executor.memory based on a specific node type, you can use the following formula: container_size = (vm_si...

3 More Replies
guiferviz
by New Contributor III
  • 265 Views
  • 8 replies
  • 4 kudos

Resolved! How to Determine if Materialized View is Performing Full or Incremental Refresh?

I'm currently testing materialized views and I need some help understanding the refresh behavior. Specifically, I want to know if my materialized view is querying the full table (performing a full refresh) or just doing an incremental refresh. From so...

Latest Reply
TejeshS
New Contributor
  • 4 kudos

To validate the status of your materialized view (MV) refresh, run a DESCRIBE EXTENDED command and check the row corresponding to the "last refresh status type." RECOMPUTE indicates a full load execution was completed. NO_OPERATION means no operation w...
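A small helper makes decoding that status explicit. The two status strings follow the reply above; treating any other value as incremental-or-unknown is an assumption:

```python
def classify_refresh(status_type):
    """Map the 'last refresh status type' value from DESCRIBE EXTENDED
    to a human-readable refresh kind."""
    if status_type == "RECOMPUTE":
        return "full refresh"
    if status_type == "NO_OPERATION":
        return "no operation"
    return "incremental (or other)"

# Usage on Databricks (hypothetical MV name):
# rows = spark.sql("DESCRIBE EXTENDED my_catalog.my_schema.my_mv").collect()
# then feed the 'last refresh status type' value into classify_refresh(...)
```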

7 More Replies
soumiknow
by Contributor
  • 296 Views
  • 10 replies
  • 2 kudos

Resolved! How to resolve a 'connection refused' error while using a google-cloud lib in a Databricks Notebook?

I want to use the google-cloud-bigquery library in my PySpark code, even though I know that the spark-bigquery-connector is available. The reason I want to use it is that Databricks Cluster 15.4 LTS comes with the 0.22.2-SNAPSHOT version of spark-bigquery-connector wh...

Latest Reply
VZLA
Databricks Employee
  • 2 kudos

@soumiknow Sounds good! Please let me know if you need internal assistance with the communication process.

9 More Replies
Edthehead
by Contributor II
  • 209 Views
  • 1 reply
  • 0 kudos

Restoring a table from a Delta live pipeline

I have a DLT pipeline running to ingest files from storage using Auto Loader. We have a Bronze table and a Silver table. A question came up from the team on how to restore DLT tables to a previous version in case of some incorrect transformation. When ...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

The RESTORE command is not supported on streaming tables, which is why you encountered the error. Instead, you can use the time travel feature of Delta Lake to query previous versions of the table. You can use the VERSION AS OF or TIMESTAMP AS OF c...
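The time-travel reads described in the reply can be sketched as a small query builder; the table name in the usage comment is a placeholder:

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta Lake time-travel SELECT using VERSION AS OF or
    TIMESTAMP AS OF. Exactly one of version/timestamp must be given."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"

# Usage on Databricks (hypothetical table name):
# spark.sql(time_travel_query("my_catalog.bronze.events", version=42))
```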

dynia
by New Contributor
  • 79 Views
  • 1 reply
  • 0 kudos

Rest API version 1

How long will REST API version 1 be supported?

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

There is no mention of a support duration for Databricks REST API version 1. I can check internally. Do you have a specific API in mind?

Omri
by New Contributor
  • 182 Views
  • 3 replies
  • 0 kudos

Optimizing a complex pyspark join

I have a complex join that I'm trying to optimize. df1 has cols id, main_key, col1, col1_isnull, col2, col2_isnull, ...col30. df2 has cols id, main_key, col1, col2, ..col_30. I'm trying to run this SQL query in PySpark: select df1.id, df2.id from df1 join df2 on df1.m...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@Omri thanks for your question! To help optimize your complex join further, we need clarification on a few details. Data characteristics: approximate size of df1 and df2 (in rows and/or size); distribution of main_key in both dataframes: are the top...
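One way to answer the skew question the reply raises is to look at the most frequent values of main_key. A plain-Python illustration of that check; on Spark the equivalent is the groupBy/count shown in the comment:

```python
from collections import Counter

def top_keys(keys, n=3):
    """Return the n most frequent join keys with their counts."""
    return Counter(keys).most_common(n)

# On Spark, the same check (hypothetical DataFrame name):
# df1.groupBy("main_key").count().orderBy("count", ascending=False).show(3)
```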

2 More Replies
ls
by New Contributor II
  • 269 Views
  • 5 replies
  • 0 kudos

Py4JJavaError: An error occurred while calling o552.count()

Hey! I'm new to the forums but not to Databricks; trying to get some help with this question: The error is also fickle, since it only appears at what seems to be random. When running the same code it works, then on the next run with a new set of dat...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

@ls thanks for your question! Since this is a PySpark application, the "Connection reset by peer" error seems to mask the actual exception. This type of issue is often linked to memory problems where Python workers are terminated, so the JVM <-> Pyth...

4 More Replies
svm_varma
by New Contributor II
  • 189 Views
  • 1 reply
  • 2 kudos

Resolved! Azure Databricks quota restrictions on compute in Azure for students subscription

Hi All, regarding creating clusters in Databricks: I'm getting a quota error. I have tried to increase quotas in the region where the resource is hosted but am still unable to increase the limit. Is there any workaround, or could you help me select the right cluster ...

Latest Reply
szymon_dybczak
Esteemed Contributor III
  • 2 kudos

Hi @svm_varma, you can try to create a Standard_DS3_v2 cluster. It has 4 cores, and your current subscription limit for the given region is 6 cores. The one you're trying to create needs 8 cores, hence the quota-exceeded exception. You can also...

singhanuj2803
by New Contributor III
  • 98 Views
  • 1 reply
  • 1 kudos

Apache Spark SQL query to get organization hierarchy

I'm currently diving deep into Spark SQL and its capabilities, and I'm facing an interesting challenge. I'm eager to learn how to write recursive CTE queries in Spark SQL, but after thorough research, it seems that Spark doesn't natively support recu...

Latest Reply
Alberto_Umana
Databricks Employee
  • 1 kudos

Hi @singhanuj2803, it is correct that Spark SQL does not natively support recursive Common Table Expressions (CTEs). However, there are some workarounds and alternative methods you can use to achieve similar results. Using the DataFrame API with loops:...
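The loop-based workaround can be illustrated outside Spark. This plain-Python sketch expands an org hierarchy level by level, which is the same iteration a DataFrame-based version would perform with repeated self-joins; the data shape is illustrative, and each employee is assumed to have exactly one manager:

```python
def org_levels(edges, root):
    """edges: list of (employee, manager) pairs; returns {employee: depth}."""
    parent = dict(edges)            # employee -> manager
    levels = {root: 0}
    frontier = {root}
    while frontier:                 # one iteration per hierarchy level
        nxt = {e for e, m in parent.items()
               if m in frontier and e not in levels}
        for e in nxt:
            levels[e] = levels[parent[e]] + 1
        frontier = nxt
    return levels

# With DataFrames, each loop iteration would instead join the current
# frontier back to the (employee, manager) table and union the results.
```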

ossoul
by New Contributor
  • 487 Views
  • 1 reply
  • 0 kudos

Not able to get spark application in Spark History server using cluster eventlogs

I'm encountering an issue with incomplete Spark event logs. When I am running a local Spark History Server using the cluster logs, my application appears as "incomplete". Sometimes I also see a few queries listed as still running, even though the appl...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Thanks for your question! I believe Databricks has its own SHS implementation, so it's not expected to work with the vanilla SHS. Regarding the queries marked as still running, we can also find this when there are event logs which were not properly c...

martindlarsson
by New Contributor III
  • 312 Views
  • 1 reply
  • 0 kudos

Jobs indefinitely pending with libraries install

I think I found a bug where jobs hang in Pending indefinitely when the job has a library requirement and the user of the job does not have Manage permission on the cluster. In my case I was trying to start a dbt job with dbt-databricks=1.8.5 as a library. Th...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Thanks for your feedback! Just checking: is this still an issue for you? Would you share more details, for example so I could try to reproduce it?

ashraf1395
by Valued Contributor
  • 346 Views
  • 1 reply
  • 0 kudos

Schema issue while fetching data from oracle

I don't have the complete context of the issue, but here is what I know; a friend of mine is facing this: "I am fetching data from Oracle in Databricks using Python, but every time I do, the schema changes; so if the column is of type decimal f...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Thanks for your question! To address schema issues when fetching Oracle data in Databricks, use JDBC schema inference to define data types programmatically or batch-cast columns dynamically after loading. For performance, enable predicate pushdown and...
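One way to pin the types up front, rather than casting after load, is Spark's JDBC customSchema option. A small helper to build it; the column names and types below are placeholders:

```python
def custom_schema(columns):
    """Build the value for Spark's JDBC 'customSchema' option from a
    {column name: Spark SQL type} mapping."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())

schema = custom_schema({"ID": "DECIMAL(38,0)", "AMOUNT": "DECIMAL(18,2)"})

# Usage (hypothetical connection details):
# spark.read.format("jdbc").option("url", oracle_url) \
#      .option("dbtable", "MY_TABLE") \
#      .option("customSchema", schema).load()
```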

chris_b
by New Contributor
  • 328 Views
  • 1 reply
  • 0 kudos

Increase Stack Size for Python Subprocess

I need to increase the stack size (from the default of 16384) to run a subprocess that requires a larger stack size. I tried following this: https://community.databricks.com/t5/data-engineering/increase-stack-size-databricks/td-p/71492 And this: https:...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Thanks for your question! Are you referring to a Java stack size (-Xss) or a Python subprocess (ulimit -s)?

dener
by New Contributor
  • 355 Views
  • 1 reply
  • 0 kudos

Infinity load execution

I am experiencing performance issues when loading a table with 50 million rows into Delta Lake on AWS using Databricks. Despite successfully handling other, larger tables, this specific table/process takes hours and doesn't finish. Here's the command...

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Thank you for your question! To optimize your Delta Lake write process:
  • Disable overhead options: avoid overwriteSchema and mergeSchema unless necessary. Use: df.write.format("delta").mode("overwrite").save(sink)
  • Increase parallelism: use repartition...
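A rough way to pick the repartition count mentioned above is to target a fixed amount of data per task. The 128 MB figure below is a common rule of thumb, not a Databricks-documented constant, and the input size is illustrative:

```python
def target_partitions(total_size_mb, partition_mb=128):
    """Roughly one task per ~partition_mb of input (ceiling division)."""
    return max(1, -(-total_size_mb // partition_mb))

n = target_partitions(50_000)   # e.g. ~50 GB of input -> 391 partitions

# Then write without the schema options:
# df.repartition(n).write.format("delta").mode("overwrite").save(sink)
```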
