09-11-2024 02:05 AM
1. I have a lot of data transformation logic that results in df1, and df1 is then the starting point for 10 different transformation paths. At first I tried .cache() and .count() on df1, but it was very slow. I switched from caching to saving df1 as a Delta table (roughly the pattern sketched below) and performance improved significantly. I thought cache was the better approach, so what could be the issue there? I checked and I have IO cache enabled in the Spark conf.
2. What is the most efficient way to read a small (less than 10 MB) Excel file? Is it better to use pandas and convert it to a Spark DataFrame, or to use a library like crealytics?
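For reference, the checkpoint pattern I ended up with for point 1 looks roughly like this (a minimal sketch; the table path is a placeholder):
# Materialize df1 once as a Delta table instead of relying on .cache()
df1.write.format("delta").mode("overwrite").save("/tmp/checkpoints/df1")  # placeholder path
# Read it back; all 10 downstream paths now start from a fully computed source
df1_checkpointed = spark.read.format("delta").load("/tmp/checkpoints/df1")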
09-11-2024 02:11 AM
Hi @baert23 ,
To read Excel, just use pandas and convert the result to a Spark DataFrame.
That's the most straightforward way to work with Excel:
import pandas as pd

# Read the workbook on the driver with pandas, then hand it to Spark
pandas_df = pd.read_excel("path_to_your_excel_file.xlsx")
spark_df = spark.createDataFrame(pandas_df)
If you still have any issues, share the transformations you are doing and let's see whether we can optimize further.
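One caveat worth adding: spark.createDataFrame infers the schema from the pandas dtypes, which can be slow or wrong for object columns. Passing an explicit schema avoids that (a sketch; the column names here are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- replace with the actual ones from your sheet
schema = StructType([
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])
spark_df = spark.createDataFrame(pandas_df, schema=schema)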
09-11-2024 04:34 AM
As far as I know, pandas does not support abfss, so I would have to move the Excel file to file:/tmp on local storage, which I am not sure is a good idea. Anyway, in this case wouldn't it be better to use polars instead of pandas? I noticed that after reading one Excel file of about 10 MB, just trying to display it took almost 1 minute, and all subsequent transformations were slow from that point.
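What I had in mind with polars is roughly this (a sketch, assuming polars plus an Excel engine such as xlsx2csv are installed; I haven't verified polars handles abfss either):
import polars as pl

# polars reads the workbook eagerly on the driver
pl_df = pl.read_excel("path_to_your_excel_file.xlsx")  # placeholder path
# to_pandas() bridges to Spark (needs pyarrow)
spark_df = spark.createDataFrame(pl_df.to_pandas())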
09-11-2024 04:47 AM - edited 09-11-2024 04:47 AM
Hi @baert23 ,
here are the guidelines on how to read from ADLS using pandas:
https://stackoverflow.com/questions/62670991/read-csv-from-azure-blob-storage-and-store-in-a-datafra...
It seems there are libraries that allow this connection: fsspec, adlfs, Azure Storage Blob.
Will that work for you?
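A minimal sketch of that approach, assuming adlfs is installed on the cluster (account name, container, and key below are placeholders):
import pandas as pd

# fsspec routes the abfs:// URL through adlfs; credentials go in storage_options
pandas_df = pd.read_excel(
    "abfs://mycontainer@myaccount.dfs.core.windows.net/file.xlsx",  # placeholder path
    storage_options={"account_name": "myaccount", "account_key": "<key>"},  # placeholder creds
)
spark_df = spark.createDataFrame(pandas_df)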
09-11-2024 05:17 AM
These libraries only work with the abfs protocol, right? I need to use the abfss protocol, and I am not allowed to use DBFS. I think crealytics is not that bad here (the read would look like the sketch below)? I am more concerned about why, after the Excel file was read, further transformations are so slow when the resulting df is not that big (about 10k rows). Also, what about the 1st question?
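If I go with crealytics, the read would look something like this (a sketch, assuming the com.crealytics:spark-excel Maven package is installed on the cluster; the abfss path is a placeholder):
spark_df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let the reader guess column types
    .load("abfss://container@account.dfs.core.windows.net/file.xlsx")  # placeholder path
)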
09-11-2024 05:56 AM
It is hard to tell why the transformations are slow without any details.
You would need to share the file you are trying to process and the code you are running.
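As a first diagnostic step, checking the plan and the partition count after the conversion usually tells a lot (a generic sketch, not specific to your job):
# Inspect the physical plan for anything unexpected
spark_df.explain(True)

# A DataFrame built from pandas often lands in very few partitions
print(spark_df.rdd.getNumPartitions())

# Materialize once so later actions don't recompute the conversion
spark_df.cache()
spark_df.count()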