cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Reading Excel files folder

AhmedAlnaqa
Contributor

Dears,

One of the tasks needed by DE is to ingest data from files, for example, Excel file.

Thanks for OnerFusion-AI for the below thread that give us the steps of reading from one file 

https://community.databricks.com/t5/get-started-discussions/how-to-import-excel-on-databricks/td-p/4...

in addition, I provide the below code in case of reading all the Excel files in a folder:

IMP Note:

- All files must have the same structure.

Steps:

1- You need to upload the Excel files under a DBFS folder.

2- Use the below code to read each file and combine them to a single CSV file

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("ReadExcelWithHeader") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
    .getOrCreate()

# Define the directory containing Excel files
excel_dir_path = "/FileStore/tables"

# List all files in the directory using dbutils.fs.ls
all_files = dbutils.fs.ls(excel_dir_path)

# Filter to get only the .xlsx files
excel_files = [file.path for file in all_files if file.path.endswith(".xlsx")]

# Initialize an empty DataFrame
df_combined = None

# Loop through each Excel file and read it into a DataFrame
for excel_file in excel_files:
    df = spark.read \
        .format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(excel_file)
    
    # Combine the DataFrames
    if df_combined is None:
        df_combined = df
    else:
        df_combined = df_combined.union(df)

# Check if df_combined is not None before writing to CSV
if df_combined is not None:
    # Define the output CSV file path
    csv_file_path = "/FileStore/tables/output_file.csv"
    
    # Save the combined DataFrame as a CSV file
    df_combined.write.csv(csv_file_path, header=True, mode='overwrite')
    
    print(f"Excel files in {excel_dir_path} have been successfully converted to {csv_file_path}")
else:
    print(f"No Excel files found in {excel_dir_path}")

# Stop the Spark session
#spark.stop()

Thanks

 

3 REPLIES 3

Rishabh_Tiwari
Databricks Employee
Databricks Employee

Hi @AhmedAlnaqa ,

Thank you for sharing this. I am sure it will help other community members.

Thanks,

Rishabh

maddy08
New Contributor II

Hi @AhmedAlnaqa ,

Can we read from ADLS location too by using abfss ?

Thanks

szymon_dybczak
Esteemed Contributor III

Hi @maddy08 ,

You can read from abfss using com.crealytics:spark-excel. You can refer to the below video as an example:

Read excel file in databricks using python and scala #spark (youtube.com)

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group