Data Engineering
Read Excel files and append to make one data frame in Databricks from Azure Data Lake without specific file names

User16765131552
Contributor III

I am storing Excel files in Azure Data Lake (Gen1). The filenames all follow the same pattern, e.g. "2021-06-18T09_00_07ONR_Usage_Dataset", "2021-06-18T09_00_07DSS_Usage_Dataset", etc., depending on the date and time. I want to read every file in the folder in Azure Data Lake into Databricks without naming each specific file, so that in the future new files are read and appended into one big dataset. The files all share the same schema, the columns are in the same order, and so on. So far I have tried for loops with regex expressions:

path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
  print(fi)
  read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
  display(read)
  print(read.count())

The output prints all the paths and the count of each dataset being read, but only the last one is displayed. I understand that this is because I'm not storing or appending the results inside the for loop, but when I add an append it breaks:

appended_data = []
path = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
for fi in path:
  print(fi)
  read = spark.read.format("com.crealytics.spark.excel").option("header", "True").option("inferSchema", "true").option("dataAddress", "'Usage Dataset'!A2").load(fi.path)
  display(read)
  print(read.count())
  appended_data.append(read)

But I get this error: FileInfo(path='dbfs:/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/Initialization_DSS.xlsx', name='Initialization_DSS.xlsx', size=39781) TypeError: not supported type: <class 'py4j.java_gateway.JavaObject'>

The final way I tried:

import glob
import pandas as pd

li = []
for f in glob.glob('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx'):
    df = pd.read_excel(f)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)

This says that there are no objects to concatenate. I have been researching everywhere and trying everything. Please help.

1 REPLY

Ryan_Chynoweth
Esteemed Contributor

If you are attempting to read all the files in a directory, you should be able to use a wildcard and filter on the file extension. For example:

df = (spark
      .read
      .format("com.crealytics.spark.excel")
      .option("header", "True")
      .option("inferSchema", "true")
      .option("dataAddress", "'Usage Dataset'!A2")
      .load('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/*_Usage_Dataset.xlsx')
)

This should read all the .xlsx files in that directory.
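
If the goal is a single dataset that keeps growing as new files arrive, one possible follow-up (a sketch only; the Delta target path below is hypothetical, not from this thread) is to overwrite a target table on each run, since the wildcard load re-reads every file in the folder:

# Sketch only: persist the combined DataFrame so it can be queried as one dataset.
# The target path is a hypothetical example, not taken from the original thread.
(df.write
   .format("delta")
   .mode("overwrite")  # the wildcard re-reads every file each run, so overwrite avoids duplicates
   .save("/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_combined"))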

If you want to read only a subset of the files, you can loop through the directory, collect the file paths you want in a Python list, and then pass that list to the reader. For example:

files = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
files_list = []

for f in files:
    # use an if statement here to decide whether to append the path to the list
    files_list.append(f.path)

df = (spark
      .read
      .format("com.crealytics.spark.excel")
      .option("header", "True")
      .option("inferSchema", "true")
      .option("dataAddress", "'Usage Dataset'!A2")
      .load(files_list)
)
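
For instance, the if statement mentioned in the comment above could keep only the usage files, assuming the "_Usage_Dataset.xlsx" naming pattern from the question:

# Sketch of the filter mentioned above, assuming file names end in "_Usage_Dataset.xlsx"
# as in the question's glob pattern.
files = dbutils.fs.ls('/mnt/adls/40_project/UBC/WIP/Mercury/UUR_PS_raw_temp/')
files_list = [f.path for f in files if f.path.endswith('_Usage_Dataset.xlsx')]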
