Data Engineering

How to read CSVs from an S3 directory with different columns

547284
New Contributor II

I can read all CSVs under an S3 URI by doing:

files = dbutils.fs.ls('s3://example-path')

df = spark.read.options(header='true',
                        encoding='iso-8859-1',
                        dateFormat='yyyyMMdd',
                        ignoreLeadingWhiteSpace='true',
                        ignoreTrailingWhiteSpace='true')\
               .csv('s3://example-path')

However, the columns of these csvs are all different.

e.g. file1 has columns A, B, C; file2 has columns C, A, D, E; file3 has columns B, F. I want to read all of these files so that the resulting dataframe has columns A, B, C, D, E, F, with every column read correctly.

I could iterate through every file, read it individually, and then union them to create a bigger dataframe, but is there a better way to do this?

1 REPLY 1

Debayan
Esteemed Contributor III

Hi @Anthony Wang, as of now I think that's the only way. Please refer to: https://docs.databricks.com/external-data/csv.html#pitfalls-of-reading-a-subset-of-columns. Please let us know if this helps.
