pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error
07-21-2023 06:41 AM
Not sure if this is the right place to ask this question, so let me know if it is not.
I am trying to read an .xls file containing #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".
I assume the error is caused by the #REF values in the file, so my question is: is there any way to ignore the error and convert #REF to None? I need to read the data into PySpark.
I tried pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but I get the error "read_excel() got an unexpected keyword argument 'errors'".
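To illustrate what I mean by converting #REF to None, here is a rough plain-Python sketch of the per-cell cleanup I am after (the helper name is hypothetical, just to show the intent):

```python
def ref_to_none(value):
    # Treat Excel error literals such as "#REF!" as missing values
    if isinstance(value, str) and value.startswith("#REF"):
        return None
    return value

row = ["1.5", "#REF!", "yes"]
print([ref_to_none(v) for v in row])  # ['1.5', None, 'yes']
```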
Thank you!
08-01-2023 12:30 PM
Hi,
Thanks for your message.
You might want to load the file directly into a Spark DataFrame. A couple of examples can be found in this Stack Overflow thread: https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe.
If this doesn't help, let me know and I will dig further into it.
Kind regards,
08-02-2023 01:05 AM
Hi,
Thank you for your reply 🙂
I have already tried spark.read.format("com.crealytics.spark.excel"), but the Excel file version I am using is too old; I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)".
I also tried spark.read.format("excel"), but that command never finishes.
I have been looking further into the error. Apparently the file cannot be loaded because of a warning that pops up when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I save the Excel file, it reports the error in cells with formulas like =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol in these cells, I am able to read the file.
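For reference, removing the @ markers programmatically would amount to a simple string substitution on the formula text (a plain-Python sketch; in the workbook itself I made the edit by hand):

```python
# The formula Excel flagged, with the implicit-intersection "@" marker
formula = '=IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO"))'

# Removing the "@" gives the version the file accepts
fixed = formula.replace("@", "")
print(fixed)  # =IF(B9="n.a";"string";IF(COUNTIF(E9:E10;"NO")=0;"YES";"NO"))
```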
Would this help you in the investigation?
08-02-2023 04:02 AM
Hi Kaniz,
Thank you for your reply.
I have been looking further into the error. Apparently the file cannot be loaded because of a warning that pops up when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I save the Excel file, it reports the error in cells with formulas like =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol in these cells, I am able to read the file.
I tried the following code:
08-08-2023 09:03 AM
It sounds like you're trying to open an Excel file that has some invalid references, which is causing an error when you try to read it with pyspark.pandas.read_excel().
One way to handle invalid references is to use the openpyxl engine instead of xlrd. openpyxl can handle invalid references and replace them with a None value.
Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import pandas as pd
import pyspark.pandas as ps
# Set up the file path and sheet name
file_path = "/path/to/your/file.xlsx"
sheet_name = "sheet1"
# Set up the options and read the file with the openpyxl engine
options = dict(header=1, keep_default_na=False, engine="openpyxl")
df_pandas = pd.read_excel(file_path, sheet_name=sheet_name, **options)
# Convert the pandas DataFrame to a PySpark DataFrame
df_spark = ps.DataFrame(df_pandas).to_spark()
# Cast the problematic column to string and then to double;
# non-numeric values such as "#REF!" become null.
# "invalid_column_name" is a placeholder for the affected column.
df_spark = df_spark.withColumn(
    "_tmp",
    col("invalid_column_name").cast(StringType()).cast("double")
).drop("invalid_column_name")
# Show the resulting DataFrame
df_spark.show()
In this example, read_excel() is configured to use the openpyxl engine instead of xlrd using the engine="openpyxl" option. This allows you to read the Excel file and handle invalid references.
After reading the file, the resulting pandas DataFrame is converted to a PySpark DataFrame using pyspark.pandas.DataFrame(df_pandas).to_spark(). A temporary column ("_tmp") is then created by casting the problematic column to string and then to double; any value that cannot be parsed as a number, such as #REF!, becomes null (None).
This approach should allow you to read your Excel file into PySpark and handle invalid references.
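To see why the cast nulls out the bad values, here is a minimal plain-Python analogue of Spark's string-to-double cast (the helper is hypothetical; Spark performs this conversion natively):

```python
def to_double_or_none(value):
    # Mimic Spark's string -> double cast: values that are not valid
    # numbers, such as "#REF!", come back as null (None)
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print([to_double_or_none(v) for v in ["1.5", "#REF!", "2"]])  # [1.5, None, 2.0]
```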

