09-11-2024 02:05 AM
1. I have a lot of data transformation logic that results in df1, and df1 is then the starting point for 10 different transformation paths. At first I tried .cache() and .count() on df1, but it was very slow. I switched from caching to saving df1 as a Delta table (roughly the pattern sketched below) and performance improved significantly. I thought cache was the better approach, so what could be the issue there? I checked and I have IO cache enabled in the Spark conf.
2. What is the most efficient way to read a small (less than 10 MB) Excel file? Is it better to use pandas and convert it to a Spark DataFrame, or to use a library like crealytics?
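For reference, the checkpoint pattern I ended up with for point 1 looks roughly like this (a minimal sketch; the table path is a placeholder):
# Materialize df1 once as a Delta table instead of relying on .cache()
df1.write.format("delta").mode("overwrite").save("/tmp/checkpoints/df1")  # placeholder path
# Read it back; all 10 downstream paths now start from a fully computed source
df1_checkpointed = spark.read.format("delta").load("/tmp/checkpoints/df1")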
09-11-2024 02:11 AM
Hi @baert23 ,
To read Excel, just use pandas and convert the result to a Spark DataFrame.
That's the most straightforward way to work with Excel:
import pandas as pd

# Read the workbook on the driver with pandas, then hand it to Spark
pandas_df = pd.read_excel("path_to_your_excel_file.xlsx")
spark_df = spark.createDataFrame(pandas_df)
If you still have any issues, share the transformations you are doing and let's see whether we can optimize further.
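One caveat worth adding: spark.createDataFrame infers the schema from the pandas dtypes, which can be slow or wrong for object columns. Passing an explicit schema avoids that (a sketch; the column names here are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical columns -- replace with the actual ones from your sheet
schema = StructType([
    StructField("name", StringType(), True),
    StructField("amount", DoubleType(), True),
])
spark_df = spark.createDataFrame(pandas_df, schema=schema)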
09-11-2024 04:34 AM
As far as I know, pandas does not support abfss, so I would have to move the Excel file to file:/tmp on local storage, which I am not sure is a good idea. Anyway, in this case wouldn't it be better to use polars instead of pandas? I noticed that after reading one Excel file of about 10 MB, just trying to display it took almost 1 minute, and all subsequent transformations were slow from that point.
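What I had in mind with polars is roughly this (a sketch, assuming polars plus an Excel engine such as xlsx2csv are installed; I haven't verified polars handles abfss either):
import polars as pl

# polars reads the workbook eagerly on the driver
pl_df = pl.read_excel("path_to_your_excel_file.xlsx")  # placeholder path
# to_pandas() bridges to Spark (needs pyarrow)
spark_df = spark.createDataFrame(pl_df.to_pandas())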
09-11-2024 04:47 AM - edited 09-11-2024 04:47 AM
Hi @baert23 ,
here are the guidelines on how to read from ADLS using pandas:
https://stackoverflow.com/questions/62670991/read-csv-from-azure-blob-storage-and-store-in-a-datafra...
It seems there are libraries that allow this connection: fsspec, adlfs, Azure Storage Blob.
Will that work for you?
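A minimal sketch of that approach, assuming adlfs is installed on the cluster (account name, container, and key below are placeholders):
import pandas as pd

# fsspec routes the abfs:// URL through adlfs; credentials go in storage_options
pandas_df = pd.read_excel(
    "abfs://mycontainer@myaccount.dfs.core.windows.net/file.xlsx",  # placeholder path
    storage_options={"account_name": "myaccount", "account_key": "<key>"},  # placeholder creds
)
spark_df = spark.createDataFrame(pandas_df)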
09-11-2024 05:17 AM
These libraries only work with the abfs protocol, right? I need to use the abfss protocol, and I am not allowed to use DBFS. I think crealytics is not that bad here (the read would look like the sketch below)? I am more concerned about why, after the Excel file was read, further transformations are so slow when the resulting df is not that big (about 10k rows). Also, what about the 1st question?
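If I go with crealytics, the read would look something like this (a sketch, assuming the com.crealytics:spark-excel Maven package is installed on the cluster; the abfss path is a placeholder):
spark_df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")       # first row contains column names
    .option("inferSchema", "true")  # let the reader guess column types
    .load("abfss://container@account.dfs.core.windows.net/file.xlsx")  # placeholder path
)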
09-11-2024 05:56 AM
It is hard to tell why the transformations are slow without any details.
You would need to share the file you are trying to process and the code you are running.
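As a first diagnostic step, checking the plan and the partition count after the conversion usually tells a lot (a generic sketch, not specific to your job):
# Inspect the physical plan for anything unexpected
spark_df.explain(True)

# A DataFrame built from pandas often lands in very few partitions
print(spark_df.rdd.getNumPartitions())

# Materialize once so later actions don't recompute the conversion
spark_df.cache()
spark_df.count()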