Re: Read excel files and append to make one data f...

Ryan_Chynoweth · ‎06-21-2021

If you are attempting to read all the files in a directory you should be able to use a wild card and filter using the extension. For example:

df = (spark
.read
.format("com.crealytics.spark.excel")
.option("header", "True")
.option("inferSchema", "true")
.option("dataAddress", "'Usage Dataset'!A2")
.load('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx')
 )

Should read all the .xlsx files in that directory.

If you want to read a subset of files then you can loop through, get all the file paths in a python list, and then provide that list when reading. For example:

files = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
files_list = []
 
for f in files:
    # use an if statement to determine if you want to append the path to the list
    files_list.append(f.path)
 
df = (spark
.read
.format("com.crealytics.spark.excel")
.option("header", "True")
.option("inferSchema", "true")
.option("dataAddress", "'Usage Dataset'!A2")
.load(files_list)
 )