Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Code Optimization

baert23
New Contributor II

1. I have a lot of data transformations that result in df1, and df1 is then the starting point for 10 different transformation paths. At first I tried .cache() and .count() on df1, but it was very slow. I switched from caching to saving df1 as a Delta table and performance improved significantly. I thought cache was the better approach, so what could be the issue there? I checked and I have IO cache enabled in the Spark conf.

2. What is the most efficient way to read a small (less than 10 MB) Excel file? Is it better to use pandas and convert to a Spark DataFrame, or to use a library like crealytics?

 

5 REPLIES

filipniziol
Contributor III

Hi @baert23 ,

To read a small Excel file, just use pandas and convert the result to a Spark DataFrame. That is the most straightforward way to work with Excel:

import pandas as pd

# Requires an Excel engine such as openpyxl to be installed on the cluster
pandas_df = pd.read_excel("path_to_your_excel_file.xlsx")
spark_df = spark.createDataFrame(pandas_df)

If you still have any issues, share the transformations you are doing and let's see whether we can optimize further.

 

baert23
New Contributor II

As far as I know, pandas does not support abfss, so I would have to move the Excel file to file:/tmp on local storage, which I am not sure is a good idea. In this case, wouldn't it be better to use polars instead of pandas? I also noticed that after reading one Excel file of about 10 MB, just trying to display it took almost 1 minute, and all subsequent transformations were slow from that point on.

filipniziol
Contributor III

Hi @baert23 ,

here are guidelines on how to read from ADLS using pandas:
https://stackoverflow.com/questions/62670991/read-csv-from-azure-blob-storage-and-store-in-a-datafra...

It seems there are libraries that allow this connection: fsspec, adlfs, Azure Storage Blob.
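To illustrate the fsspec pattern, here is a sketch. The memory:// URL keeps it runnable anywhere; for ADLS you would swap in an abfss:// URL of the form abfss://container@account.dfs.core.windows.net/... (a placeholder, not a real account), with adlfs installed and credentials passed via storage_options:

```python
import fsspec
import pandas as pd

# Write a small demo file to fsspec's in-memory filesystem. With adlfs
# installed, the same fsspec.open() call accepts an abfss:// URL instead.
with fsspec.open("memory://demo.csv", "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# pandas can read directly from the fsspec file handle.
with fsspec.open("memory://demo.csv", "r") as f:
    pdf = pd.read_csv(f)

print(pdf["a"].sum())  # 4
```

The same handle-based read works with pd.read_excel, given an Excel engine such as openpyxl.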

Will that work for you?

baert23
New Contributor II

These libraries only work with the abfs protocol, right? I need to use the abfss protocol and I am not allowed to use DBFS. I think crealytics is not that bad here? I am more concerned about why, after the Excel file was read, further transformations are so slow when the resulting df is not that big (about 10k rows). Also, what about my 1st question?

filipniziol
Contributor III

It is hard to tell why the transformations are slow without more details.

You would need to share the file you are trying to process and the code you are running.
