Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Forum Posts

Livingstone
by New Contributor II
  • 3844 Views
  • 5 replies
  • 3 kudos

Install maven package to serverless cluster

My task is to export data from CSV/SQL into Excel format with minimal latency. To achieve this, I used a serverless cluster. Since PySpark does not support saving in XLSX format, it is necessary to install the Maven package spark-excel_2.12. However, ...

Latest Reply
Louis_Frolio
Databricks Employee
  • 3 kudos

As you stated, you cannot install Maven packages on Databricks serverless clusters due to restricted library management capabilities. However, there are alternative approaches to export data to Excel with minimal latency. Solutions to Export Excel Fi...
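A minimal sketch of one such alternative, assuming pandas plus the pip-installable openpyxl package instead of a Maven library; the table name and Volume path are illustrative:

import pandas as pd  # openpyxl must also be installed for .to_excel()

df = spark.read.table("my_catalog.my_schema.my_table")  # hypothetical source table
pdf = df.toPandas()  # collects to the driver; suitable for modest result sizes
pdf.to_excel("/Volumes/my_catalog/my_schema/exports/report.xlsx",
             index=False, engine="openpyxl")  # illustrative Volume path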

4 More Replies
krishnakmr512
by New Contributor
  • 988 Views
  • 1 reply
  • 1 kudos

Resolved! Missed my certification exam; a reschedule is required

Hi Team, @data_help @helpdesk @Cert-Team @Cert-TeamOPS I missed my certification exam yesterday due to an emergency. Is there a possibility it can be rescheduled to anytime today or tomorrow? I am not able to reschedule opti...

Data Engineering
@Cert-Team
Latest Reply
Cert-Team
Databricks Employee
  • 1 kudos

@krishnakmr512 usually the fastest way to get assistance is filing a ticket with our support team. I was able to reschedule your exam to a future date. Please log into your account and reschedule to a date and time that suits you.

db_eswar
by New Contributor
  • 1721 Views
  • 2 replies
  • 1 kudos

What is iowait, and will it impact the performance of my job?

One job was taking more than 7 hrs; when I added the configuration below, it took <2:30 mins, but after deployment with the same parameters it is again taking 7+ hrs. 1) spark.conf.set("spark.sql.shuffle.partitions", 500) --> spark.conf.set("spark.sql.shuffle.parti...

Latest Reply
SP_6721
Honored Contributor II
  • 1 kudos

Hi @db_eswar, high iowait in your Spark jobs is probably caused by storage or disk bottlenecks, not CPU or memory issues. The slowdown you're seeing could be due to a cold cache, slower disks, or increased resource usage. To troubleshoot, you can use t...
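For reference, a minimal sketch of the settings involved; the partition count comes from the original post, and the AQE options are a common complement rather than a guaranteed fix:

spark.conf.set("spark.sql.shuffle.partitions", "500")  # value from the original post
spark.conf.set("spark.sql.adaptive.enabled", "true")  # let AQE right-size shuffle partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")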

1 More Replies
AlessandroM
by New Contributor II
  • 1583 Views
  • 1 reply
  • 1 kudos

PySpark Structured Streaming job doesn't unpersist DataFrames

Hi community, I am currently developing a PySpark job (running on runtime 14.3 LTS) using Structured Streaming. Our streaming job uses foreachBatch, and inside it we call persist (and a subsequent unpersist) on two DataFrames. We are noticing fro...

Latest Reply
Louis_Frolio
Databricks Employee
  • 1 kudos

The issue you're encountering, where unpersist() does not seem to release memory for persisted DataFrames in your Structured Streaming job, likely relates to nuances of the Spark caching mechanism and how it interacts with the lifecycle of micro-batch ...
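A minimal sketch of the pattern under discussion, using a blocking unpersist so each micro-batch waits for the cached blocks to actually be freed; all table and variable names are hypothetical:

def process_batch(batch_df, batch_id):
    # dim_df is an illustrative lookup DataFrame defined outside the handler
    enriched = batch_df.join(dim_df, "key").persist()
    filtered = enriched.filter("amount > 0").persist()
    filtered.write.mode("append").saveAsTable("silver.events")  # hypothetical sink
    # blocking=True waits until the blocks are removed instead of unpersisting lazily
    filtered.unpersist(blocking=True)
    enriched.unpersist(blocking=True)

(spark.readStream.table("bronze.events")  # hypothetical source
    .writeStream.foreachBatch(process_batch)
    .option("checkpointLocation", "/Volumes/demo/checkpoints/events")
    .start())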

RameshChejarla
by New Contributor III
  • 947 Views
  • 2 replies
  • 1 kudos

Tracking Auto Loader file loads in Snowflake

Hi everyone, I have implemented Auto Loader and it is working as expected. I need to track the files which are loaded into the stage table. Here is the issue: the file tracking table needs to be created in Snowflake, and from there I need to track the files. How to connect data...
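A minimal sketch of one way to approach this, assuming the Spark Snowflake connector available on the Databricks runtime; all paths, table names, and connection options are placeholders:

from pyspark.sql import functions as F

# 1) Tag each ingested row with its source file during the Auto Loader read
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/Volumes/demo/schema")  # hypothetical path
      .load("/Volumes/demo/landing")
      .withColumn("source_file", F.col("_metadata.file_path")))

# 2) After loading the stage table, push the distinct file names to Snowflake
sf_options = {  # placeholder Snowflake connection details
    "sfUrl": "<account>.snowflakecomputing.com", "sfUser": "<user>",
    "sfPassword": "<password>", "sfDatabase": "<db>",
    "sfSchema": "<schema>", "sfWarehouse": "<wh>",
}
(spark.read.table("stage_table").select("source_file").distinct()
    .write.format("snowflake").options(**sf_options)
    .option("dbtable", "FILE_TRACKING").mode("append").save())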

Latest Reply
RameshChejarla
New Contributor III
  • 1 kudos

Thanks for your reply, will try and let you know.

1 More Replies
ankit001mittal
by New Contributor III
  • 2577 Views
  • 8 replies
  • 3 kudos

DLT Query History

Hi guys, I can see that DLT pipelines have a query history section where we can see the duration of each table and the number of rows read. Is this information stored somewhere in the system catalog? Can I query this information?

Latest Reply
RiyazAliM
Honored Contributor
  • 3 kudos

Hey @ankit001mittal, if any of the above responses answered your question, kindly mark it as a solution. Thanks,

7 More Replies
Prashanth24
by New Contributor III
  • 1765 Views
  • 2 replies
  • 0 kudos

Databricks Autoloader processing old files

I have implemented Databricks Autoloader and found that every time I execute the code, it is still reading all old existing files + new files. As per the concept of Autoloader, it should read and process only new files. Below is the code. Please hel...

Latest Reply
RameshChejarla
New Contributor III
  • 0 kudos

Hi Prashanth, Auto Loader for me is reading only new files; can you please go through the script below. df = (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv").option("cloudFiles.schemaLocation", "path").option("recursiveFileLooku...
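For completeness, a minimal sketch of a stream that ignores files already present when the stream first starts; paths are illustrative, and the checkpoint location is what lets Auto Loader remember which files it has already processed between runs:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "/Volumes/demo/schema")  # hypothetical path
      .option("cloudFiles.includeExistingFiles", "false")  # skip files present before the first run
      .load("/Volumes/demo/landing"))

(df.writeStream
    .option("checkpointLocation", "/Volumes/demo/checkpoint")  # must stay stable across runs
    .trigger(availableNow=True)
    .toTable("demo.stage_table"))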

1 More Replies
kaushalshelat
by New Contributor II
  • 1056 Views
  • 2 replies
  • 4 kudos

Resolved! I cannot see the output when using pandas_api() on spark dataframe

Hi all, I started learning Spark and Databricks recently, along with Python. While running the line of code below, it did not throw any error and seemed to run OK, but it didn't show me output either: test = cust_an_inc1.pandas_api(); test.show(), where cust_an_inc1 is...

Latest Reply
RiyazAliM
Honored Contributor
  • 4 kudos

Hi @kaushalshelat Ideally, `test.show()` should've thrown an error, as test is a pandas DataFrame now. `.show()` is a Spark DataFrame method and won't work with pandas. If you want to see a subset of the data, try `.head()` or `.tail(n)` rather than `.show...
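A minimal sketch of both options, reusing the poster's variable names:

test = cust_an_inc1.pandas_api()  # pandas-on-Spark DataFrame: no .show()
print(test.head(5))               # pandas-style preview
test.to_spark().show(5)           # or convert back to Spark to use .show()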

1 More Replies
jeremy98
by Honored Contributor
  • 1448 Views
  • 4 replies
  • 0 kudos

How to fall back the entire job in case of cluster failure?

Hi community, my team and I are using a job that is triggered based on dynamic scheduling, with the schedule defined within some of the job's tasks. However, this job is attached to a cluster that is always running and never terminated. I understand th...

Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hey @jeremy98, have you had a chance to experiment with the Databricks serverless offering? Serverless spin-up times are around ~1 min, and it has built-in autoscaling based on the workload, so it seems a good fit for your use case. Check out more info f...

3 More Replies
suja
by New Contributor
  • 1084 Views
  • 1 reply
  • 0 kudos

Exploring parallelism for multiple tables

I am new to Databricks. The app we need to build reads from Hive tables, goes through bronze, silver, and gold layers, and stores the results in relational DB tables. There are multiple Hive tables with no dependencies. What is the best way to achieve parallelism? Do w...

Latest Reply
lingareddy_Alva
Honored Contributor III
  • 0 kudos

Hi @suja, use Databricks Workflows (Jobs) with task parallelism. Instead of using threads within a single notebook, leverage Databricks Jobs to define multiple tasks, each responsible for a table. Tasks can: 1. Run in parallel ...
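A minimal sketch of that idea using the databricks-sdk Python package (an assumption; the same job can equally be defined in the UI or with asset bundles). Tasks with no depends_on run in parallel; all names and paths are illustrative:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
tables = ["orders", "customers", "products"]  # hypothetical independent Hive tables

job = w.jobs.create(
    name="bronze-silver-gold-load",
    tasks=[
        jobs.Task(
            task_key=f"load_{t}",  # no depends_on, so the tasks run concurrently
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/etl/process_table",  # hypothetical notebook
                base_parameters={"table": t},
            ),
        )
        for t in tables
    ],
)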

ABINASH
by New Contributor
  • 885 Views
  • 1 reply
  • 0 kudos

Flattening VARIANT column.

Hi Team, I am facing an issue: I have a JSON file which is around 700 KB and contains only 1 record, so after reading the data and flattening the file, the record count is now 620 million. Now, while I am writing the dataframe into Delta Lake, it is taking ...

Latest Reply
samshifflett46
New Contributor III
  • 0 kudos

Hey @ABINASH, since the JSON file flattens to 620 million records, the area of optimization seems to be restructuring the JSON file. My initial thought is that the JSON file is extremely nested, which is causing a large amount of redundant...
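A minimal sketch of that idea: explode only the array you actually need and project a few fields, instead of fully flattening every nesting level, which multiplies rows. All column names and paths are made up for illustration:

from pyspark.sql import functions as F

raw = spark.read.option("multiLine", "true").json("/Volumes/demo/input.json")  # hypothetical path
slim = (raw.select(F.col("order.id").alias("order_id"),
                   F.explode("order.items").alias("item"))  # one explode, not one per level
           .select("order_id", "item.sku", "item.qty"))
slim.repartition(64).write.mode("overwrite").saveAsTable("demo.flat")  # spread the write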

sondergaard
by New Contributor II
  • 1578 Views
  • 2 replies
  • 0 kudos

Simba ODBC driver // .Net Core

Hi, I have been looking into the Simba Spark ODBC driver to see if it can simplify our integration with .NET Core. The first results were promising, but when I started to process larger queries I noticed out-of-memory exceptions in the conta...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Something we're considering for a similar purpose (.NET Core service pulling data from Databricks) is the ADO.NET connector from CData: Databricks Driver: ADO.NET Provider | Create & integrate .NET apps

1 More Replies
ashraf1395
by Honored Contributor
  • 1287 Views
  • 1 reply
  • 0 kudos

Fetching the catalog and schema set in the DLT pipeline configuration

I have a DLT pipeline, and the notebook running on the DLT pipeline has some requirements. I want to get the catalog and schema which are set by my DLT pipeline. Reason for it: I have to specify my volume file paths etc., and my volume is on the sa...

Latest Reply
SP_6721
Honored Contributor II
  • 0 kudos

Hi @ashraf1395 Can you try this to get the catalog and schema set by your DLT pipeline in the notebook:
catalog = spark.conf.get("pipelines.catalog")
schema = spark.conf.get("pipelines.schema")
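A small usage sketch tying this back to the volume-path requirement; the fallback defaults are assumptions for running the notebook outside the pipeline, and the volume layout is illustrative:

catalog = spark.conf.get("pipelines.catalog", "dev_catalog")  # fallback is illustrative
schema = spark.conf.get("pipelines.schema", "dev_schema")
volume_path = f"/Volumes/{catalog}/{schema}/landing"  # hypothetical volume layout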

ankit001mittal
by New Contributor III
  • 860 Views
  • 1 reply
  • 0 kudos

DLT Pipeline Stats on Object level

Hi guys, I want to create a table to store information about each DLT pipeline at the object/table ID level: how much time it spent waiting for resources, how much time each object took to run, and the number of records...

Data Engineering
dlt
system
Latest Reply
RiyazAliM
Honored Contributor
  • 0 kudos

Hi @ankit001mittal DLT event logs help you gather most of the information you've mentioned above. Here is the documentation for DLT event logs: https://docs.databricks.com/aws/en/dlt/observability. Let me know if you have any questions. Best,
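A minimal sketch of querying it, assuming the event_log() table-valued function with a placeholder pipeline ID; flow_progress events carry the per-table row counts:

spark.sql("""
    SELECT timestamp,
           origin.flow_name,
           details:flow_progress.metrics.num_output_rows AS num_output_rows
    FROM event_log('<your-pipeline-id>')
    WHERE event_type = 'flow_progress'
""").show(truncate=False)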

Ekaterina_Paste
by New Contributor III
  • 21375 Views
  • 12 replies
  • 2 kudos

Resolved! Can't login to databricks community edition

I enter my valid login and password here https://community.cloud.databricks.com/login.html but it says "Invalid email address or password"

Latest Reply
Venkat124488
New Contributor II
  • 2 kudos

My Databricks cluster in Community Edition is terminating every 15 sec. Could you please help me with this issue?

11 More Replies
