<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</link>
    <description>&lt;P&gt;Hi Kaniz,&lt;/P&gt;&lt;P&gt;Thank you for your reply.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;I tried the following code:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')&lt;/PRE&gt;&lt;P&gt;but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;</description>
    <pubDate>Wed, 02 Aug 2023 11:02:46 GMT</pubDate>
    <dc:creator>Christine</dc:creator>
    <dc:date>2023-08-02T11:02:46Z</dc:date>
    <item>
      <title>pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38115#M26564</link>
      <description>&lt;P&gt;Not sure if this is the right place to ask this question, so let me know if it is not.&lt;/P&gt;&lt;P&gt;I am trying to read an xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".&lt;/P&gt;&lt;P&gt;I assume the error occurs because of the #REF values in the file, so my question is whether there is any way to ignore the error and convert #REF to None? I need to read the file into pyspark.&lt;/P&gt;&lt;P&gt;I tried pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but got the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 21 Jul 2023 13:41:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38115#M26564</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-07-21T13:41:36Z</dc:date>
    </item>
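Since the question above asks how to turn #REF cells into None rather than fail the whole read, one workaround (a minimal sketch, not from the thread; the dataframe below only simulates what `pd.read_excel(..., engine="xlrd", dtype="str")` would return, since the actual file and column names are not available) is to read everything as strings and map Excel's error marker to a missing value before converting to Spark:

```python
import numpy as np
import pandas as pd

# Simulated result of reading the sheet with dtype="str";
# "#REF!" marks a broken reference cell (column names are hypothetical).
pdf = pd.DataFrame({"a": ["1", "#REF!", "3"], "b": ["x", "y", "#REF!"]})

# Replace the Excel error marker with NaN, pandas' missing value,
# which becomes null once the dataframe is handed to Spark.
cleaned = pdf.replace("#REF!", np.nan)
print(int(cleaned.isna().sum().sum()))  # 2 error cells became missing
```

Note this only cleans up #REF strings that made it into cell values; it does not help if xlrd aborts while parsing the workbook's defined names, as in the "Token 0x2d (AreaN)" error above.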
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38866#M26795</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;&lt;BR /&gt;Thanks for your message.&lt;BR /&gt;You might want to load the data directly into a Spark dataframe. A couple of examples can be found in this Stack Overflow link: &lt;A href="https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe" target="_blank"&gt;https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;If this doesn't help, please let me know and I will dig further into it.&lt;BR /&gt;&lt;BR /&gt;Kind regards,&lt;/P&gt;</description>
      <pubDate>Tue, 01 Aug 2023 19:30:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38866#M26795</guid>
      <dc:creator>Siebert_Looije</dc:creator>
      <dc:date>2023-08-01T19:30:31Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38899#M26808</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thank you for your reply &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I have already tried spark.read.format("com.crealytics.spark.excel"), but the Excel file version I am using is too old. I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)".&lt;/P&gt;&lt;P&gt;I also tried spark.read.format("excel"), but this command never finishes.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;Would this help you in the investigation?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2023 08:05:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38899#M26808</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-08-02T08:05:42Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</link>
      <description>&lt;P&gt;Hi Kaniz,&lt;/P&gt;&lt;P&gt;Thank you for your reply.&lt;/P&gt;&lt;P&gt;I have looked further into the error: the file apparently cannot be loaded because of a warning that appears when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it reports the error in cells with formulas such as: =IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO")). When I remove the @ symbol from these cells, I am able to read the file.&lt;/P&gt;&lt;P&gt;I tried the following code:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')&lt;/PRE&gt;&lt;P&gt;but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".&lt;/P&gt;</description>
      <pubDate>Wed, 02 Aug 2023 11:02:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/38912#M26815</guid>
      <dc:creator>Christine</dc:creator>
      <dc:date>2023-08-02T11:02:46Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/39386#M26952</link>
      <description>&lt;P&gt;It sounds like you're trying to open an Excel file that has some invalid references, which causes an error when you try to read it with pyspark.pandas.read_excel().&lt;/P&gt;&lt;P&gt;One way to handle invalid references is to use the openpyxl engine instead of xlrd. openpyxl can handle invalid references and replace them with a None value.&lt;/P&gt;&lt;P&gt;Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:&lt;/P&gt;&lt;PRE&gt;import pandas as pd
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import pyspark.pandas as ps

# Set up the file path and sheet name
file_path = "/path/to/your/file.xlsx"
sheet_name = "sheet1"

# Set up the options and read the file
options = dict(header=1, keep_default_na=False, engine="openpyxl")
df_pandas = pd.read_excel(file_path, sheet_name=sheet_name, **options)

# Convert the pandas dataframe to a PySpark DataFrame
df_spark = ps.DataFrame(df_pandas).to_spark()

# Replace #REF values with None
df_spark = df_spark.withColumn(
    "_tmp",
    col("invalid_column_name").cast(StringType()).cast("double")
).drop("invalid_column_name")

# Show the resulting dataframe
df_spark.show()&lt;/PRE&gt;&lt;P&gt;In this example, read_excel() is configured to use the openpyxl engine instead of xlrd via the engine="openpyxl" option. This allows you to read the Excel file and handle invalid references.&lt;/P&gt;&lt;P&gt;After reading the file, the resulting pandas dataframe is converted to a PySpark dataframe using pyspark.pandas.DataFrame(df_pandas).to_spark(). A temporary column ("_tmp") is then created by casting the problematic column to string and then to double, which turns non-numeric values such as #REF! into null; the original column is dropped.&lt;/P&gt;&lt;P&gt;This approach should allow you to read your Excel file into PySpark and handle invalid references.&lt;/P&gt;</description>
      <pubDate>Tue, 08 Aug 2023 16:03:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-read-excel-engine-xlrd-reading-xls-file-with-ref/m-p/39386#M26952</guid>
      <dc:creator>youssefmrini</dc:creator>
      <dc:date>2023-08-08T16:03:19Z</dc:date>
    </item>
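The double-cast in the reply above works because casting an unparseable string to double yields null in Spark. The same coercion can be seen without a Spark cluster: in plain pandas the analogue (a sketch, not part of the original reply; the sample values are made up) is pd.to_numeric with errors="coerce", which is essentially the errors='coerce' behavior the original poster was looking for:

```python
import pandas as pd

# A string column as read with dtype="str"; "#REF!" marks a broken reference.
s = pd.Series(["1.5", "#REF!", "2"])

# Coerce to numbers: anything unparseable becomes NaN, mirroring
# Spark's cast-to-double turning bad strings into null.
nums = pd.to_numeric(s, errors="coerce")
print(nums.isna().tolist())  # [False, True, False]
```

The caveat from earlier in the thread still applies: this only helps once the file has actually been read, so the openpyxl/crealytics route still fails for old BIFF5 .xls files.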
  </channel>
</rss>

