Data Engineering

How to read CSVs from an S3 directory with different columns

547284
New Contributor II

I can read all CSVs under an S3 URI by doing:

files = dbutils.fs.ls('s3://example-path')

df = spark.read.options(header='true',
                        encoding='iso-8859-1',
                        dateFormat='yyyyMMdd',
                        ignoreLeadingWhiteSpace='true',
                        ignoreTrailingWhiteSpace='true')\
               .csv('s3://example-path')

However, the columns of these csvs are all different.

e.g. file1 has columns A, B, C; file2 has columns C, A, D, E; file3 has columns B, F. I want to read all of these files so that the resulting dataframe has columns A, B, C, D, E, F, with every column read correctly.

I could iterate through every file, read it individually, and then union them to create a bigger dataframe, but is there a better way to do this?

1 REPLY 1

Debayan
Esteemed Contributor III

Hi @Anthony Wang, as of now I think that's the only way. Please refer to: https://docs.databricks.com/external-data/csv.html#pitfalls-of-reading-a-subset-of-columns. Please let us know if this helps.
