Reading Excel files from a folder
07-08-2024 02:58 AM
Dears,
One of the tasks a Data Engineer (DE) often needs to handle is ingesting data from files, for example Excel files.
Thanks to OnerFusion-AI for the thread below, which gives the steps for reading from a single file.
In addition, I provide the code below for reading all the Excel files in a folder:
IMP Note:
- All files must have the same structure.
Steps:
1- Upload the Excel files to a DBFS folder.
2- Use the code below to read each file and combine them into a single CSV file.
from pyspark.sql import SparkSession

# Create a SparkSession (in a Databricks notebook, `spark` already exists;
# the spark-excel package can also be attached as a cluster library)
spark = SparkSession.builder \
    .appName("ReadExcelWithHeader") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
    .getOrCreate()

# Define the directory containing the Excel files
excel_dir_path = "/FileStore/tables"

# List all files in the directory using dbutils.fs.ls
all_files = dbutils.fs.ls(excel_dir_path)

# Filter to get only the .xlsx files
excel_files = [file.path for file in all_files if file.path.endswith(".xlsx")]

# Initialize an empty DataFrame
df_combined = None

# Loop through each Excel file and read it into a DataFrame
for excel_file in excel_files:
    df = spark.read \
        .format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(excel_file)

    # Combine the DataFrames (union matches columns by position,
    # which is why all files must share the same structure)
    if df_combined is None:
        df_combined = df
    else:
        df_combined = df_combined.union(df)

# Check if df_combined is not None before writing to CSV
if df_combined is not None:
    # Define the output CSV file path
    csv_file_path = "/FileStore/tables/output_file.csv"

    # Save the combined DataFrame as CSV. Note: Spark writes a directory
    # of part files at this path; coalesce(1) forces a single part file.
    df_combined.coalesce(1).write.csv(csv_file_path, header=True, mode='overwrite')
    print(f"Excel files in {excel_dir_path} have been successfully converted to {csv_file_path}")
else:
    print(f"No Excel files found in {excel_dir_path}")

# Stop the Spark session (usually not needed in a Databricks notebook)
# spark.stop()
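As a side note on the IMP Note above: since `union` matches columns by position, each file's schema must agree with the first file's in both column names and order. A minimal pure-Python sketch of that check (the helper name `schemas_match` is mine, not from the thread; in PySpark the pairs would come from `df.dtypes`):

```python
def schemas_match(reference, candidate):
    """Return True when two schemas, given as lists of (column, type)
    pairs, are identical in names, types, and column order."""
    return list(reference) == list(candidate)


# Example schemas, shaped like the output of PySpark's df.dtypes
base = [("id", "int"), ("name", "string")]
ok = [("id", "int"), ("name", "string")]
bad = [("name", "string"), ("id", "int")]  # same columns, wrong order

print(schemas_match(base, ok))   # True
print(schemas_match(base, bad))  # False
```

Running such a check on each file before the union turns a confusing positional mismatch into an early, explicit error.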
Thanks
07-08-2024 06:30 AM
Hi @AhmedAlnaqa ,
Thank you for sharing this. I am sure it will help other community members.
Thanks,
Rishabh
09-16-2024 06:33 AM
09-16-2024 07:08 AM
Hi @maddy08 ,
You can read from abfss using com.crealytics:spark-excel. See the video below for an example:
Read excel file in databricks using python and scala #spark (youtube.com)
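For reference, a minimal sketch of what reading from abfss can look like. The container, storage account, and file path below are placeholders, and the snippet assumes the cluster is already authenticated to the storage account (e.g. via Spark config or a secret scope) and has the spark-excel package attached:

```python
# Placeholders: replace <container>, <storage-account>, and the path.
# Storage credentials are assumed to be configured on the cluster already.
abfss_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/data/report.xlsx"

df = (
    spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(abfss_path)
)
df.show()
```

The read options are the same as for DBFS paths; only the URI scheme and the authentication setup differ.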

