Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mehalrathod
by New Contributor II
  • 1525 Views
  • 2 replies
  • 0 kudos

Overwrite to a table taking 12+ hours

One of our Databricks notebooks (using Python, PySpark) has been running long, 12+ hours, specifically on the overwrite command into a table. This notebook, along with the overwrite step, completed within 10 mins in the past. But suddenly the ov...

Latest Reply
lingareddy_Alva
Honored Contributor III
  • 0 kudos

Hi @mehalrathod This sort of performance regression in Databricks (especially for overwrite) is usually caused by one or more of the following.
Common Causes of Overwrite Slowness
1. Delta Table History or File Explosion - If the target table is a Delta...
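A quick first check for the file-explosion cause named above can be sketched with the standard library (an assumption-laden sketch: it only works where the table's storage is reachable as a filesystem path from the driver, e.g. a /dbfs mount; /tmp/delta/my_table is a hypothetical path):

```python
from pathlib import Path

# hypothetical Delta table location; substitute your table's storage path
table_path = Path("/tmp/delta/my_table")
table_path.mkdir(parents=True, exist_ok=True)

# an overwrite that suddenly takes hours often correlates with the table
# directory having accumulated a very large number of small data files
n_files = sum(1 for _ in table_path.rglob("*.parquet"))
print(f"{n_files} parquet files under {table_path}")
```

If the count is in the tens of thousands, compacting the table (e.g. with OPTIMIZE) before the overwrite is usually the first thing to try.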

Linda22
by New Contributor II
  • 6655 Views
  • 7 replies
  • 5 kudos

Can we execute a single task in isolation from a multi-task Databricks job?

A task may be used to process some data. If we have 10 such tasks in a job and we want to process only a couple of datasets, through only a couple of tasks, is that possible?

Latest Reply
slimbnsalah
New Contributor II
  • 5 kudos

Generally available!

Livingstone
by New Contributor II
  • 4252 Views
  • 5 replies
  • 3 kudos

Install maven package to serverless cluster

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a Serverless cluster. Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 3 kudos

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities. However, there are alternative approaches to export data to Excel with minimal latency.
Solutions to Export Excel Fi...

krishnakmr512
by New Contributor
  • 1104 Views
  • 1 reply
  • 1 kudos

Resolved! Missed my certification exam; a reschedule is required

Hi Team, @data_help @helpdesk @Cert-Team @Cert-TeamOPS I missed my scheduled certification exam due to an emergency yesterday. Is there a possibility it can be rescheduled to anytime today or tomorrow? I am not able to reschedule opti...

Data Engineering
@Cert-Team
Latest Reply
Cert-Team
Databricks Employee
  • 1 kudos

@krishnakmr512 Usually the fastest way to get assistance is filing a ticket with our support team. I was able to reschedule your exam to a future date. Please log into your account and reschedule to a date and time that suits you.

db_eswar
by New Contributor
  • 2096 Views
  • 2 replies
  • 1 kudos

What is iowait, and will it impact the performance of my job?

One job is taking more than 7 hrs; when I added the configuration below, it takes <2:30, but after deployment with the same parameters it is again taking 7+ hrs. 1) spark.conf.set("spark.sql.shuffle.partitions", 500) --> spark.conf.set("spark.sql.shuffle.parti...

Latest Reply
SP_6721
Honored Contributor II
  • 1 kudos

Hi @db_eswar High iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage. To troubleshoot, you can use t...

AlessandroM
by New Contributor II
  • 1929 Views
  • 1 reply
  • 1 kudos

PySpark Structured Streaming job doesn't unpersist DataFrames

Hi community, I am currently developing a PySpark job (running on Runtime 14.3 LTS) using Structured Streaming. Our streaming job uses foreachBatch, and inside it we are calling persist (and a subsequent unpersist) on two DataFrames. We are noticing fro...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

The issue you’re encountering—where unpersist() does not seem to release memory for persisted DataFrames in your Structured Streaming job—likely relates to nuances of the Spark caching mechanism and how it interacts with the lifecycle of micro-batch ...

RameshChejarla
by New Contributor III
  • 1024 Views
  • 2 replies
  • 1 kudos

Tracking Auto Loader files in Snowflake

Hi everyone, I have implemented Auto Loader and it is working as expected. I need to track the files which are loaded into the stage table. Here is the issue: the file-tracking table needs to be created in Snowflake, and from there I need to track the files. How to connect data...

Latest Reply
RameshChejarla
New Contributor III
  • 1 kudos

Thanks for your reply, will try and let you know

ankit001mittal
by New Contributor III
  • 2832 Views
  • 8 replies
  • 3 kudos

DLT Query History

Hi guys, I can see that DLT pipelines have a query history section where we can see the duration of each table and the number of rows read. Is this information stored somewhere in the system catalog? Can I query this information?

Screenshot 2025-04-22 145638.png
Latest Reply
RiyazAliM
Honored Contributor
  • 3 kudos

Hey @ankit001mittal - if any of the above responses answered your questions, kindly mark it as a solution. Thanks,

Prashanth24
by New Contributor III
  • 1833 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader processing old files

I have implemented Databricks Auto Loader and found that every time I execute the code, it still reads all the old existing files plus the new files. As per the concept of Auto Loader, it should read and process only new files. Below is the code. Please hel...

Latest Reply
RameshChejarla
New Contributor III
  • 0 kudos

Hi Prashanth, Auto Loader for me is reading only new files; can you please go through the below script?
df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv").option("cloudFiles.schemaLocation", "path").option("recursiveFileLooku...

kaushalshelat
by New Contributor II
  • 1145 Views
  • 2 replies
  • 4 kudos

Resolved! I cannot see the output when using pandas_api() on spark dataframe

Hi all, I started learning Spark and Databricks recently, along with Python. While running the below lines of code, it did not throw any error and seemed to run OK, but didn't show me any output either:
test = cust_an_inc1.pandas_api()
test.show()
where cust_an_inc1 is...

Latest Reply
RiyazAliM
Honored Contributor
  • 4 kudos

Hi @kaushalshelat Ideally, `test.show()` should've thrown an error, as test is a pandas dataframe now. `.show()` is a Spark DataFrame method and wouldn't work with pandas. If you want to see a subset of the data, try `.head()` or `.tail(n)` rather than `.show...
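The distinction the reply draws can be checked with plain pandas, whose API pandas-on-Spark mirrors (a small sketch; the column names are made up):

```python
import pandas as pd

# after pandas_api(), the object follows the pandas API, not the Spark one
df = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

print(hasattr(df, "show"))  # False: .show() belongs to Spark DataFrames
print(df.head(2))           # pandas-style preview of the first 2 rows
```

The same `.head()` / `.tail(n)` calls work on the pandas-on-Spark frame returned by pandas_api().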

jeremy98
by Honored Contributor
  • 1644 Views
  • 4 replies
  • 0 kudos

How to fall back the entire job in case of cluster failure?

Hi community, My team and I are using a job that is triggered based on dynamic scheduling, with the schedule defined within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated. I understand th...

Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hey @jeremy98 Have you had a chance to experiment with the Databricks Serverless offering? Serverless spin-up times are around ~1 min. It has built-in autoscaling based on the workload, which seems a good fit for your use case. Check out more info f...

suja
by New Contributor
  • 1199 Views
  • 1 reply
  • 0 kudos

Exploring parallelism for multiple tables

I am new to Databricks. The app we need to build reads from Hive tables, goes through bronze, silver, and gold layers, and stores in relational DB tables. There are multiple Hive tables with no dependencies. What is the best way to achieve parallelism? Do w...

Latest Reply
lingareddy_Alva
Honored Contributor III
  • 0 kudos

Hi @suja Use Databricks Workflows (Jobs) with Task Parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for a table. Tasks can:
1. Run in parallel ...
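For completeness, the thread-based alternative the question mentions can be sketched with the standard library if the loads must stay inside one notebook (process_table is a hypothetical placeholder for one table's bronze/silver/gold load; Spark actions submitted from separate driver threads are scheduled concurrently on the cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def process_table(table_name: str) -> str:
    # hypothetical placeholder: read the Hive table, run the bronze/silver/
    # gold transforms, and write the result to the relational target
    return f"{table_name}: processed"

# independent tables with no dependencies, per the question
tables = ["orders", "customers", "payments"]

# one worker per table (capped); each load runs in its own driver thread,
# so the cluster can work on several tables at once
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_table, tables))

print(results)
```

Separate job tasks remain the more robust option, since each table then gets its own retries, monitoring, and failure isolation.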

ABINASH
by New Contributor
  • 975 Views
  • 1 reply
  • 0 kudos

Flattening VARIANT column.

Hi Team, I am facing an issue: I have a JSON file which is around 700 KB and contains only 1 record, but after reading the data and flattening the file, the record count is now 620 million. Now, while I am writing the DataFrame into Delta Lake, it is taking ...

Latest Reply
samshifflett46
New Contributor III
  • 0 kudos

Hey @ABINASH, The JSON file being flattened to 620 million records suggests the area of optimization is restructuring the JSON file. My initial thought is that the JSON file is extremely nested, which is causing a large amount of redundant...
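The multiplication behind that blow-up is easy to see without Spark: flattening independent nested arrays takes their cross product, so one record fans out combinatorially (a stdlib sketch with made-up field names and sizes):

```python
from itertools import product

# a single nested record with three independent arrays
record = {
    "id": 1,
    "orders": [{"order_id": i} for i in range(100)],
    "items": [{"sku": j} for j in range(50)],
    "tags": [f"t{k}" for k in range(20)],
}

# fully flattening pairs every order with every item and every tag:
# 100 * 50 * 20 = 100,000 rows from one input record
flattened = [
    {"id": record["id"], **o, **i, "tag": t}
    for o, i, t in product(record["orders"], record["items"], record["tags"])
]
print(len(flattened))  # 100000
```

Flattening only the arrays that are actually related (or splitting unrelated arrays into separate tables) avoids the cross product entirely.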

sondergaard
by New Contributor II
  • 1754 Views
  • 2 replies
  • 0 kudos

Simba ODBC driver // .Net Core

Hi, I have been looking into the Simba Spark ODBC driver to see if it can simplify our integration with .NET Core. The first results were promising, but when I started to process larger queries I started to notice out-of-memory exceptions in the conta...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Something we're considering for a similar purpose (.NET Core service pulling data from Databricks) is the ADO.NET connector from CData: Databricks Driver: ADO.NET Provider | Create & integrate .NET apps

ashraf1395
by Honored Contributor
  • 1355 Views
  • 1 reply
  • 0 kudos

Fetching the catalog and schema which are set in the DLT pipeline configuration

I have a DLT pipeline, and the notebook which runs in the DLT pipeline has some requirements. I want to get the catalog and schema which are set by my DLT pipeline. Reason for it: I have to specify my volume file paths etc., and my volume is on the sa...

Latest Reply
SP_6721
Honored Contributor II
  • 0 kudos

Hi @ashraf1395 Can you try this to get the catalog and schema set by your DLT pipeline in the notebook:
catalog = spark.conf.get("pipelines.catalog")
schema = spark.conf.get("pipelines.schema")

Labels