
Reading Excel files from a folder

AhmedAlnaqa
Contributor

Dears,

One common task for Data Engineers is ingesting data from files, for example Excel files.

Thanks to OnerFusion-AI for the thread below, which gives the steps for reading from a single file:

https://community.databricks.com/t5/get-started-discussions/how-to-import-excel-on-databricks/td-p/4...
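
For reference, reading a single Excel file with the spark-excel connector looks roughly like this (a minimal sketch, assuming the library is installed on the cluster; the file path is a hypothetical example):

# A minimal sketch: read one Excel file with spark-excel.
# Assumes the com.crealytics:spark-excel library is installed on the cluster;
# the path below is a hypothetical placeholder.
df_single = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/sample.xlsx")

df_single.show(5)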

In addition, the code below extends this to reading all the Excel files in a folder:

Important note:

- All files must have the same structure.

Steps:

1- Upload the Excel files to a DBFS folder.

2- Use the code below to read each file and combine them into a single CSV file:

from pyspark.sql import SparkSession

# Get (or create) a SparkSession. On Databricks, `spark` is already provided,
# and setting spark.jars.packages on an existing session has no effect, so the
# spark-excel library should be installed on the cluster instead.
spark = SparkSession.builder \
    .appName("ReadExcelWithHeader") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
    .getOrCreate()

# Define the directory containing the Excel files
excel_dir_path = "/FileStore/tables"

# List all files in the directory using dbutils.fs.ls
all_files = dbutils.fs.ls(excel_dir_path)

# Filter to keep only the .xlsx files
excel_files = [file.path for file in all_files if file.path.endswith(".xlsx")]

# Initialize an empty DataFrame
df_combined = None

# Loop through each Excel file and read it into a DataFrame
for excel_file in excel_files:
    df = spark.read \
        .format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(excel_file)

    # Combine the DataFrames (union matches columns by position,
    # which is why all files must have the same structure)
    if df_combined is None:
        df_combined = df
    else:
        df_combined = df_combined.union(df)

# Check that at least one file was read before writing to CSV
if df_combined is not None:
    # Define the output CSV path. Spark writes a directory of part files;
    # coalesce(1) keeps the output to a single part file.
    csv_file_path = "/FileStore/tables/output_file.csv"

    # Save the combined DataFrame as a CSV file
    df_combined.coalesce(1).write.csv(csv_file_path, header=True, mode='overwrite')

    print(f"Excel files in {excel_dir_path} have been successfully converted to {csv_file_path}")
else:
    print(f"No Excel files found in {excel_dir_path}")

# Do not stop the Spark session on Databricks; it is shared by the notebook
#spark.stop()
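
One caveat worth noting: union combines DataFrames by column position, not by name. If the files share the same columns but their order might differ, unionByName is safer. A minimal sketch of that alternative (not part of the original code):

# unionByName matches columns by name rather than position;
# allowMissingColumns=True (Spark 3.1+) fills absent columns with nulls.
if df_combined is None:
    df_combined = df
else:
    df_combined = df_combined.unionByName(df, allowMissingColumns=True)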

Thanks

 

3 REPLIES

Rishabh_Tiwari
Databricks Employee

Hi @AhmedAlnaqa ,

Thank you for sharing this. I am sure it will help other community members.

Thanks,

Rishabh

maddy08
New Contributor II

Hi @AhmedAlnaqa ,

Can we read from an ADLS location too, using abfss?

Thanks

Hi @maddy08 ,

You can read from abfss using com.crealytics:spark-excel. You can refer to the video below as an example:

Read excel file in databricks using python and scala #spark (youtube.com)
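
In short, only the path changes: point the same reader at the abfss URI. A minimal sketch, assuming the spark-excel library is installed on the cluster and access to the storage account is already configured (the account, container, and file names below are hypothetical placeholders):

# Read an Excel file directly from ADLS Gen2 over abfss.
# Assumes the cluster can access the storage account
# (e.g., via a service principal or an account key).
abfss_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/report.xlsx"

df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(abfss_path)

df.show(5)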
