I can read all CSVs under an S3 URI by doing:
files = dbutils.fs.ls('s3://example-path')
df = spark.read.options(header='true',
                        encoding='iso-8859-1',
                        dateFormat='yyyyMMdd',
                        ignoreLeadingWhiteSpace='true',
                        ignoreTrailingWhiteSpace='true')\
          .csv([f.path for f in files])
However, the columns of these CSVs are all different.
For example, file1 has columns A, B, C; file2 has columns C, A, D, E; and file3 has columns B, F. I want to read all of these files so that the resulting dataframe has columns A, B, C, D, E, F, with every column read correctly.
I could iterate through every file, read it individually, and then union the results into one bigger dataframe (roughly as sketched below), but is there a better way to do this?
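For reference, the per-file loop I have in mind looks roughly like this (a sketch assuming Spark 3.1+, where unionByName accepts allowMissingColumns; 's3://example-path' is a placeholder):

from functools import reduce

# Read each file separately, then union by column name so that columns
# missing from one file come back as nulls rather than being misaligned.
files = dbutils.fs.ls('s3://example-path')
dfs = [spark.read.options(header='true',
                          encoding='iso-8859-1',
                          dateFormat='yyyyMMdd',
                          ignoreLeadingWhiteSpace='true',
                          ignoreTrailingWhiteSpace='true')
            .csv(f.path)
       for f in files]
df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

This works, but it issues one read per file, so I'm hoping there is a single-pass option.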