Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
pyspark.pandas.read_excel(engine = xlrd) reading xls file with #REF error

Christine
Contributor II

Not sure if this is the right place to ask this question, so let me know if it is not. 

I am trying to read an xls file which contains #REF values in Databricks with pyspark.pandas. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR *** Token 0x2d (AreaN) found in NAME formula".

I assume the error occurs because of the #REF values in the file, so my question is: is there any way to ignore the error and convert #REF to None? I need to read it into pyspark.

I tried pyspark.pandas.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce').to_spark() and expected it to read the file, but got the error "read_excel() got an unexpected keyword argument 'errors'".

Thank you!

4 REPLIES

Siebert_Looije
Contributor

Hi,

Thanks for your message.
You might want to load the DataFrame directly into a Spark DataFrame. A couple of examples can be found in this Stack Overflow link: https://stackoverflow.com/questions/56426069/how-to-read-xlsx-or-xls-files-as-spark-dataframe.

If this doesn't help, please let me know and I will dig further into it.
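For reference, the spark-excel reader from that link is usually configured along these lines (a sketch only: the sheet name, path, and option values are placeholders, and the com.crealytics:spark-excel library must be attached to the cluster):

```python
# Options for the com.crealytics spark-excel data source; the values here
# are placeholders for the asker's actual sheet name and layout.
options = {
    "header": "true",                  # first row holds column names
    "dataAddress": "'sheet_name'!A1",  # which sheet/cell to start reading at
    "inferSchema": "false",            # keep all columns as strings
}

# On a cluster with spark-excel attached, the read itself would be:
# df = (spark.read.format("com.crealytics.spark.excel")
#           .options(**options)
#           .load("/path/to/file.xls"))
```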

Kind regards,

Hi, 

Thank you for your reply 🙂 

I have already tried spark.read.format("com.crealytics.spark.excel"), but the Excel file version I am using is too old. I get the error "The supplied spreadsheet seems to be Excel 5.0/7.0 (BIFF5) format. POI only supports BIFF8 format (from Excel versions 97/2000/XP/2003)".

I also tried spark.read.format("excel"), but this command never finishes.

I have been looking more into the error. Apparently the file cannot be loaded because of invalid names: when I open the Excel sheet, it warns "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!". When I try to save the file, it points to cells with formulas like "=IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO"))". When I remove the @ symbol in these cells, I am able to read the file.

Would this help you in the investigation? 

Hi Kaniz, 

Thank you for your reply. 

As mentioned in my reply above, the file apparently cannot be loaded because of invalid names: when I open the sheet, Excel warns "One or more invalid names were detected in this workbook. Those invalid names have been changed to #REF!", and the affected cells contain formulas like "=IF(B9="n.a";"string";IF(COUNTIF(@E9:E10;"NO")=0;"YES";"NO"))". When I remove the @ symbol in these cells, I am able to read the file.

I tried the following code: 

import pandas as pd
df = pd.read_excel(file_path, sheet_name='sheet_name', engine='xlrd', convert_float=False, dtype='str', errors='coerce')

but this resulted in the error "read_excel() got an unexpected keyword argument 'errors'".
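That failure is expected: pandas' read_excel() does not accept an errors= keyword. If the sheet can be loaded at all, one workaround is to map the #REF! error tokens to None after reading. A minimal sketch (the columns and values below are made up, since the real sheet isn't shown in the thread):

```python
import pandas as pd

# Stand-in for the frame that pd.read_excel(...) would return; the real
# columns are not shown in the thread, so this data is illustrative only.
df = pd.DataFrame({"a": ["1", "#REF!", "3"], "b": ["x", "y", "#REF!"]})

# read_excel() has no errors= keyword; instead, replace the #REF! error
# tokens after reading so they become proper nulls (NaN/None).
cleaned = df.replace({"#REF!": None})
print(cleaned)
```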

 

youssefmrini
Honored Contributor III

 

It sounds like you're trying to open an Excel file that has some invalid references, which is causing an error when you try to read it with pyspark.pandas.read_excel().

One way to handle invalid references is to use the openpyxl engine instead of xlrd. openpyxl can handle invalid references and replace them with a None value.

Here's an example of how you can read your Excel file using pyspark.pandas and the openpyxl engine:

 

```python
import pandas as pd
import pyspark.pandas as ps
from pyspark.sql.functions import col, when

# Set up the file path and sheet name
file_path = "/path/to/your/file.xlsx"
sheet_name = "sheet1"

# Set up the options and read the file; keep_default_na=False keeps error
# tokens such as "#REF!" as literal strings instead of turning them into NaN
options = dict(header=1, keep_default_na=False, engine="openpyxl")
df_pandas = pd.read_excel(file_path, sheet_name=sheet_name, **options)

# Convert the pandas DataFrame to a PySpark DataFrame
df_spark = ps.DataFrame(df_pandas).to_spark()

# Replace "#REF!" values with None (null) in every column
for c in df_spark.columns:
    df_spark = df_spark.withColumn(
        c, when(col(c) == "#REF!", None).otherwise(col(c))
    )

# Show the resulting DataFrame
df_spark.show()
```

In this example, read_excel() is configured to use the openpyxl engine instead of xlrd via the engine="openpyxl" option, and keep_default_na=False so that error tokens arrive as plain strings rather than NaN.

After reading the file, the resulting pandas DataFrame is converted to a PySpark DataFrame with pyspark.pandas.DataFrame(df_pandas).to_spark(). Each column is then rewritten with when(col(c) == "#REF!", None).otherwise(col(c)), which replaces any "#REF!" strings with None (null in Spark) while leaving other values untouched.

This approach should allow you to read your Excel file into PySpark and handle the invalid references.
