pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error

Christine
Contributor

Not sure if this is the right place to ask this question, so let me know if it is not. 

I am trying to read an xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".

I assume the error is occurring due to the #REF values in the file, so my question is: is there any way I can ignore the error and convert #REF to None? I need to read it into pyspark.

I tried pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but got the error "read_excel() got an unexpected keyword argument 'errors'".

Thank you!

5 REPLIES

Siebert_Looije
Contributor

Hi,

Thanks for your message.
You might want to load the data directly into a Spark DataFrame. A couple of examples can be found in this Stack Overflow link: https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe.
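For example, here is a minimal sketch using the com.crealytics spark-excel reader (the library must be installed on the cluster; the file path, sheet name, and option values below are placeholders, not taken from this thread):

# Read the Excel sheet directly into a Spark DataFrame with spark-excel.
# Assumes the com.crealytics:spark-excel library is installed on the cluster.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'sheet_name'!A1")  # sheet and start cell to read from
    .option("header", "true")                  # treat the first row as column names
    .option("inferSchema", "false")            # keep all columns as strings
    .load("/path/to/your/file.xls")            # placeholder path
)
df.show()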

If this doesn't help, please ask and I will dig further into it.

Kind regards,

Hi, 

Thank you for your reply 🙂 

I have already tried spark.read.format("com.crealytics.spark.excel"), but the excel file version I am using is too old. I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)" 

I also tried spark.read.format("excel"), but this command never finishes.

I have been looking more into the error. Apparently the file cannot be loaded because of an error that pops up when I open the Excel sheet: "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the Excel file, it says the error is found in cells with formulas like "=IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO"))". When I remove the @ symbol in these cells, I am able to read the file.

Would this help you in the investigation? 

Kaniz
Community Manager

Hi @Christine, based on the information provided, you are trying to read an xls file containing #REF values in Databricks with pyspark.pandas.

When you try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()", you get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula". It is assumed that the error is occurring due to the #REF values in the file, and you want to know if there is any way to ignore the error and convert #REF to None.

You also tried pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() but got the error "read_excel() got an unexpected keyword argument 'errors'". Unfortunately, the errors parameter is not supported in pyspark.pandas.read_excel().

However, you can try the following workaround to convert #REF values to None:

1. Read the Excel file with pandas.read_excel(), which supports the errors parameter. Set errors='coerce' to convert #REF values to NaN.

   import pandas as pd
   df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')

2. Replace NaN values with None. Note that df.fillna(value=None) raises a ValueError in pandas, so use where() instead:

   df = df.where(df.notna(), None)
 

3. Convert the resulting pandas DataFrame to a PySpark DataFrame using spark.createDataFrame() (see the combined sketch after this list).

   from pyspark.sql import SparkSession
   spark = SparkSession.builder.appName("example").getOrCreate()
   spark_df = spark.createDataFrame(df)
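Putting the three steps together, here is a minimal end-to-end sketch (file_path and sheet_name are placeholders; the errors keyword is omitted since some pandas versions reject it, as noted above):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

file_path = "/path/to/your/file.xls"   # placeholder path
sheet_name = "sheet_name"              # placeholder sheet name

# Step 1: read with pandas; add errors='coerce' only if your pandas version accepts it
df = pd.read_excel(file_path, sheet_name=sheet_name, engine='xlrd',
                   convert_float=False, dtype='str')

# Step 2: replace NaN with None so Spark stores the cells as nulls
df = df.where(df.notna(), None)

# Step 3: convert the pandas DataFrame to a PySpark DataFrame
spark_df = spark.createDataFrame(df)
spark_df.show()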
 

Hi Kaniz, 

Thank you for your reply. 

As mentioned in my reply above, the file cannot be loaded because Excel reports "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". The error comes from cells with formulas containing an @ symbol, and when I remove the @ symbol in these cells, I am able to read the file.

I tried the following code: 

import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')

but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".

 

youssefmrini
Honored Contributor III

 

It sounds like you're trying to open an Excel file that has some invalid references, which is causing an error when you try to read it with pyspark.pandas.read_excel().

One way to handle invalid references is to use the openpyxl engine instead of xlrd. openpyxl can handle invalid references and replace them with a None value.

Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:

 

import pandas as pd
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
import pyspark.pandas as ps

# Set up the file path and sheet name
file_path = "/path/to/your/file.xlsx"
sheet_name = "sheet1"

# Set up the options and read the file
options = dict(header=1, keep_default_na=False, engine="openpyxl")
df_pandas = pd.read_excel(file_path, sheet_name=sheet_name, **options)

# Convert the pandas dataframe to a PySpark DataFrame
df_spark = ps.DataFrame(df_pandas).to_spark()

# Replace #REF values with None: casting the string column to double turns
# non-numeric values such as "#REF!" into null ("invalid_column_name" is a
# placeholder for the affected column)
df_spark = df_spark.withColumn(
    "_tmp",
    col("invalid_column_name").cast(StringType()).cast("double")
).drop("invalid_column_name")

# Show the resulting dataframe
df_spark.show()
 

In this example, read_excel() is configured to use the openpyxl engine instead of xlrd using the engine="openpyxl" option. This allows you to read the Excel file and handle invalid references.

After reading the file, the resulting pandas DataFrame is converted to a PySpark DataFrame using pyspark.pandas.DataFrame(df_pandas).to_spark(). A temporary column ("_tmp") is then created by casting the problematic column to string and then to double; the double cast turns non-numeric values such as "#REF!" into null, and the original column is dropped.

This approach should allow you to read your Excel file into PySpark and handle invalid references.
