How to read CSVs from an S3 directory with different columns

547284
New Contributor II

I can read all CSVs under an S3 URI by doing:

files = dbutils.fs.ls('s3://example-path')

df = spark.read.options(header='true',
                        encoding='iso-8859-1',
                        dateFormat='yyyyMMdd',
                        ignoreLeadingWhiteSpace='true',
                        ignoreTrailingWhiteSpace='true') \
          .csv([f.path for f in files])

However, the columns of these CSVs are all different.

e.g. File1 has columns A, B, C; File2 has columns C, A, D, E; File3 has columns B, F. I want to read all these files so that the resulting dataframe has columns A, B, C, D, E, F with all columns being read correctly.

I could iterate through every file, read each one individually, and then union them to create a bigger dataframe, but is there a better way to do this?

Debayan
Databricks Employee

Hi @Anthony Wang, as of now I think that's the only way. Please refer to https://docs.databricks.com/external-data/csv.html#pitfalls-of-reading-a-subset-of-columns. Please let us know if this helps.
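
For reference, a minimal sketch of that read-and-union loop, assuming Spark 3.1+ so unionByName can fill columns missing from a given file with nulls. The path, read options, and the read_one helper are placeholders taken from the question, not a fixed API:

from functools import reduce

# List every CSV under the directory (placeholder path).
files = [f.path for f in dbutils.fs.ls('s3://example-path') if f.path.endswith('.csv')]

def read_one(path):
    # Read a single file with its own header and the options from the question.
    return (spark.read
            .options(header='true',
                     encoding='iso-8859-1',
                     dateFormat='yyyyMMdd',
                     ignoreLeadingWhiteSpace='true',
                     ignoreTrailingWhiteSpace='true')
            .csv(path))

# Union by column name; allowMissingColumns=True (Spark 3.1+) fills absent
# columns with nulls, so the result has the superset of columns A..F.
dfs = [read_one(p) for p in files]
df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)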