Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

svrdragon
by New Contributor
  • 3071 Views
  • 0 replies
  • 0 kudos

optimizeWrite takes too long

Hi, we have a Spark job that writes data into a Delta table for the last 90 date partitions. We have enabled spark.databricks.delta.autoCompact.enabled and delta.autoOptimize.optimizeWrite. The job takes 50 mins to complete. Of that, the logic takes 12 mins and optimizeWri...
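As context for the question above, a minimal sketch of how the two settings mentioned are typically enabled, either per session or as Delta table properties. The table name is a placeholder, and this assumes a Databricks runtime with an active `spark` session:

```python
# Hypothetical sketch: enable auto compaction and optimized writes.
# Session-level settings (apply to writes in this session):
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Or persist them as properties of the target table (placeholder name):
spark.sql("""
  ALTER TABLE my_db.my_table SET TBLPROPERTIES (
    'delta.autoOptimize.autoCompact'  = 'true',
    'delta.autoOptimize.optimizeWrite' = 'true'
  )
""")
```

Note that optimized writes add a shuffle before the write, so some extra wall-clock time relative to a plain append is expected; whether it is worth it depends on how small the unoptimized files would be.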

erigaud
by Honored Contributor
  • 11513 Views
  • 3 replies
  • 0 kudos

Merge DLT with Delta Table

Is there any way to accomplish this? I have an existing Delta table and a separate Delta Live Tables pipeline, and I would like to merge data from a DLT into my existing Delta table. Is this doable or completely impossible?

Latest Reply
LeifBruen
New Contributor II
  • 0 kudos

Merging data from a Delta Live Table (DLT) into an existing Delta Table is possible with careful planning. Transition data from DLT to Delta Table through batch processing, data transformation, and ETL processes, ensuring schema compatibility. 
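A minimal sketch of what such a merge could look like, reading the table that the DLT pipeline materializes and upserting it into the pre-existing Delta table. The table names and the `id` key column are assumptions, and this assumes a Databricks runtime with `delta` available:

```python
# Hypothetical sketch: upsert rows from a DLT pipeline's output table
# into an existing Delta table. Names and join key are placeholders.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "main.default.existing_table")
updates = spark.read.table("main.default.dlt_output_table")  # written by the DLT pipeline

(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

The point is that a DLT output table is still a Delta table, so an ordinary batch `MERGE` from outside the pipeline works, provided the schemas are compatible.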

2 More Replies
NotARobot
by New Contributor III
  • 2020 Views
  • 0 replies
  • 2 kudos

Force DBR/Spark Version in Delta Live Tables Cluster Policy

Is there a way to use Compute Policies to force Delta Live Tables to use specific Databricks Runtime and PySpark versions? While trying to leverage some of the functions in PySpark 3.5.0, I don't seem to be able to get Delta Live Tables to use Databr...

Attachments: test_cluster_policy.png, dlt_version.png
Data Engineering
Compute Policies
Delta Live Tables
Graphframes
pyspark
JohnJustus
by New Contributor III
  • 14674 Views
  • 1 reply
  • 0 kudos

Accessing Excel file from Databricks

Hi, I am trying to access an Excel file that is stored in Azure Blob Storage via Databricks. In my understanding, it is not possible to access it using PySpark, so accessing it through pandas is the option. Here is my code: %pip install openpyxl import pandas as p...
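For reference, a minimal sketch of the usual pandas approach once the storage is reachable from the driver (e.g. via a DBFS mount). The path is an assumption, and `openpyxl` must be installed first (`%pip install openpyxl`):

```python
# Hypothetical sketch: read an Excel file from a mounted path with pandas.
# "/dbfs/mnt/mycontainer/myfile.xlsx" is a placeholder path.
import pandas as pd

df = pd.read_excel("/dbfs/mnt/mycontainer/myfile.xlsx", engine="openpyxl")
print(df.head())
```

Note that pandas reads through the local filesystem, so the `/dbfs/...` prefix (not `dbfs:/...`) is what matters here.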

databicky
by Contributor II
  • 6722 Views
  • 3 replies
  • 1 kudos

No handler for udf/udaf/udtf for function

I created a function using a JAR file present in the cluster location, but when executing the Hive query it shows the error "no handler for udf/udaf/udtf". The query runs fine in HDInsight clusters, but when running in Databricks...

Attachment: IMG20231015164650.jpg
dbuser1234
by New Contributor
  • 3334 Views
  • 0 replies
  • 0 kudos

How to readstream from multiple sources?

Hi, I am trying to readStream from two sources and join them into a target table. How can I do this in PySpark? E.g., t1 + t2 as my bronze tables: I want to readStream from t1 and t2 and merge the changes into t3 (the silver table).
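One common shape for this is two streaming reads, a join, and a `foreachBatch` sink that merges each micro-batch into the silver table. A minimal sketch, where the table names, the `id` key, and the checkpoint path are all assumptions (and a real stream-stream join may also need watermarks):

```python
# Hypothetical sketch: stream from two bronze Delta tables, join the
# micro-batches, and MERGE the result into a silver table.
from delta.tables import DeltaTable

t1 = spark.readStream.table("bronze.t1")
t2 = spark.readStream.table("bronze.t2")
joined = t1.join(t2, "id")  # stream-stream join; watermarks may be required

def upsert_to_silver(batch_df, batch_id):
    silver = DeltaTable.forName(spark, "silver.t3")
    (silver.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(joined.writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/chk/silver_t3")  # placeholder path
    .start())
```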

anmol_hans_de
by Databricks Partner
  • 9088 Views
  • 0 replies
  • 0 kudos

Exam suspended by proctor

Hi Team, I need urgent support: I was about to submit my exam and was just reviewing my responses, but the proctor suspended it because I did not satisfy the proctoring conditions, even though I was sitting in a room with a clear background and well li...

BST
by New Contributor
  • 1707 Views
  • 0 replies
  • 0 kudos

Spark - Cluster Mode - Driver

When running a Spark job in cluster mode, how does Spark decide on which worker node to place the driver?

anirudh_a
by New Contributor II
  • 22871 Views
  • 8 replies
  • 5 kudos

Resolved! 'No file or Directory' error when using pandas.read_excel in Databricks

I am baffled by the behaviour of Databricks: below you can see the contents of the directory using dbutils in Databricks. It shows the `test.xlsx` file clearly in the directory (and I can even open it using `dbutils.fs.head`). But when I go to use panda.re...

Data Engineering
dbfs
panda
spark
spark config
Latest Reply
DamnKush
New Contributor II
  • 5 kudos

Hey, I encountered this recently. I can see you are using a shared cluster; try switching to a single-user cluster and it will fix it. Can someone let me know why it wasn't working with a shared cluster? Thanks.

7 More Replies
Joe1912
by New Contributor III
  • 1708 Views
  • 0 replies
  • 0 kudos

Strategy to add new table base on silver data

I have a merge function for streaming foreachBatch, something like: mergedf(df, i): merge_func_1(df, i); merge_func_2(df, i). Then I want to add a new merge_func_3 into it. Are there any best practices for this case? When the stream is always running, how can I process...
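One pattern that keeps this extensible is holding the per-batch merge steps in a list, so adding `merge_func_3` is a one-line change (the running stream still has to be restarted to pick up the new code). A plain-Python sketch of the pattern, with stub merge functions standing in for the poster's real ones:

```python
# Hypothetical sketch (plain Python, stubs in place of real merge logic):
# keep per-batch merge steps in a list; the foreachBatch function just
# iterates over it, so new steps are appended in one place.
applied = []  # records which stubs ran, for illustration only

def merge_func_1(df, batch_id): applied.append(("f1", batch_id))
def merge_func_2(df, batch_id): applied.append(("f2", batch_id))
def merge_func_3(df, batch_id): applied.append(("f3", batch_id))

merge_funcs = [merge_func_1, merge_func_2, merge_func_3]  # append new steps here

def mergedf(df, batch_id):
    # in Spark you would df.persist() here so every step reuses the micro-batch
    for fn in merge_funcs:
        fn(df, batch_id)

mergedf(None, 0)
```

In the real stream, persisting the micro-batch DataFrame before looping avoids recomputing the source for each merge step.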

UtkarshTrehan
by New Contributor
  • 16531 Views
  • 1 reply
  • 1 kudos

Inconsistent Results When Writing to Oracle DB with Spark's dropDuplicates and foreachPartition

It's more a Spark question than a Databricks question. I'm encountering an issue when writing data to an Oracle database using Apache Spark. My workflow involves removing duplicate rows from a DataFrame and then writing the deduplicated DataFrame to ...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

The difference in behaviour between using foreachPartition and data.write.jdbc(...) after dropDuplicates() could be due to how Spark handles data partitioning and operations on partitions. When you use foreachPartition, you are manually handling the ...
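Following the reply above, a minimal sketch of the built-in JDBC write path after deduplication, which lets Spark manage per-partition connections and inserts instead of hand-rolled `foreachPartition` code. The URL, table, and credentials are placeholders:

```python
# Hypothetical sketch: dedupe, then use Spark's JDBC writer directly.
# Connection details below are placeholders, not real values.
deduped = df.dropDuplicates(["id"])  # key column is an assumption

(deduped.write
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
    .option("dbtable", "TARGET_TABLE")
    .option("user", "app_user")
    .option("password", "app_password")
    .mode("append")
    .save())
```

If duplicates still appear, it is worth checking whether a stage retry re-executed a partition's inserts; an idempotent target (unique constraint plus MERGE, or a staging table) guards against that.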
