I am storing Excel files in Azure Data Lake (Gen1). The filenames all follow the same pattern, e.g. "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read all the files in a folder in Azure Data Lake into Databricks without having to name each specific file, so that in the future new files are picked up and appended to make one big dataset. The files all have the same schema, the columns are in the same order, etc. So far I have tried for loops with glob-style filename patterns:
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
The output prints all the paths and the count of each dataset being read, but it only displays the last one. I understand why, since I'm not storing or appending the DataFrames inside the for loop, but when I add an append it breaks:
appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
    print(fi)
    read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
    display(read)
    print(read.count())
    appended_data.append(read)
But I get this error:

FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781)
TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>
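Conceptually, this is the shape of what I am trying to build: read each matching file, collect the DataFrames, and combine them into one. A rough sketch of my intent (the filename filter and the reduce/unionByName at the end are just my guesses at how the pieces should fit together, given that all the files share the same schema):

from functools import reduce

dfs = []
for fi in dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/'):
    # only pick up the usage workbooks, skipping files like Initialization_DSS.xlsx
    if fi.name.endswith('_Usage_Dataset.xlsx'):
        df = (spark.read.format("com.crealytics.spark.excel")
                  .option("header", "True")
                  .option("inferSchema", "true")
                  .option("dataAddress", "'Usage Dataset'!A2")
                  .load(fi.path))
        dfs.append(df)

# combine all of the per-file DataFrames into one big DataFrame
combined = reduce(lambda a, b: a.unionByName(b), dfs)
display(combined)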
The final way I tried:
import glob
import pandas as pd

li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_excel(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
This says that there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.
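For reference, this is roughly what I would expect a working pandas version to look like. The /dbfs/ prefix (my understanding is that local file APIs like glob and pandas see mounts under /dbfs/...) and the sheet_name/skiprows arguments are my own assumptions, and I am not sure they are correct:

import glob
import pandas as pd

li = []
# local file APIs such as glob and pandas see the DBFS mount under /dbfs/... (assumption on my part)
for f in glob.glob('/dbfs/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    # read the 'Usage Dataset' sheet; skiprows=1 is my guess at matching the dataAddress A2 used above
    li.append(pd.read_excel(f, sheet_name='Usage Dataset', skiprows=1))

frame = pd.concat(li, axis=0, ignore_index=True)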