Databricks Community

irfanaziz · ‎01-13-2022

I have observed some very strange behavior in some of our integration pipelines. This week one of the CSV files was getting broken when read with the read function given below.

def ReadCSV(files,schema_struct,header,delimiter,timestampformat,encode="utf8",multiLine="true"):
  # The schema is applied through the reader's .schema() method
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine=multiLine) \
      .schema(schema=schema_struct).load(files)
  return deltas_df

I made changes and moved the schema into the options. This worked, and I was able to read the file for that object, but it started failing for the other objects. So I am wondering why it would behave so differently.

def ReadCSV2(files,schema_struct,header,delimiter,timestampformat,encode="utf8"):
  # Here the schema is passed as an entry in .options() instead of .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine="true",schema=schema_struct) \
      .load(files)
  return deltas_df

I would like to keep one function and solve this issue. For now I have to use two functions.

Hubert-Dudek · ‎01-14-2022

How exactly is it failing?

Maybe there are differences in the CSV header, including case sensitivity, so enforceSchema = False might help.

Regarding the schema, under the hood it points to the same Scala function.
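
For what it's worth, here is a minimal sketch of a single reader that keeps the schema in `.schema()` and exposes `enforceSchema` as a parameter, so one function can be tried against all objects. The names `build_csv_options` and `read_csv` are hypothetical, the `spark` session is passed in explicitly, and defaulting `enforceSchema` to "false" is an assumption taken from the suggestion above, not a verified fix for this pipeline.

```python
def build_csv_options(header, delimiter, timestampformat,
                      encode="utf8", multi_line="true", enforce_schema="false"):
    """Collect the CSV reader options as plain key/value pairs.

    The schema is deliberately not part of this dict; it is handed to
    .schema() in read_csv below.
    """
    return {
        "header": header,
        "delimiter": delimiter,
        "timestampFormat": timestampformat,
        "encoding": encode,
        "multiLine": multi_line,
        # Assumption: "false" asks Spark to validate the schema's column
        # names against the CSV header instead of forcing the schema on.
        "enforceSchema": enforce_schema,
    }


def read_csv(spark, files, schema_struct, **option_kwargs):
    """Read CSV files with an explicit schema (hypothetical unified helper)."""
    options = build_csv_options(**option_kwargs)
    return (
        spark.read.format("csv")
        .options(**options)
        .schema(schema_struct)  # StructType or DDL string
        .load(files)
    )
```

It would be called along the lines of `read_csv(spark, files, schema_struct, header="true", delimiter=";", timestampformat="yyyy-MM-dd HH:mm:ss")`, where the argument values are placeholders.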

Anonymous · ‎01-13-2022

Hello @nafri A - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thank you for your question. I'm sorry to hear you're having trouble. We'll give the community a chance to respond before we circle back around to this. Thanks in advance for your patience.

jose_gonzalez · ‎02-08-2022

Hi @nafri A,

What is the error you are getting? Can you share it, please? As @Hubert Dudek mentioned, both will call the same APIs.
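
Until the error is shared, one way to check whether the two variants really behave the same is to compare the column types each one returns for the same file. Below is a minimal sketch, assuming you collect `df.dtypes` (a list of `(column, type)` pairs) from both `ReadCSV` and `ReadCSV2`; `dtypes_diff` is a hypothetical helper and the sample values are made up for illustration. As far as I know `.options()` forwards its values as strings, so it is worth verifying that a `schema` key passed there is applied at all.

```python
def dtypes_diff(expected, actual):
    """Return (column, expected_type, actual_type) for every mismatch.

    Both arguments are lists of (column_name, type_string) pairs, which is
    the shape that DataFrame.dtypes returns.
    """
    expected_map = dict(expected)
    actual_map = dict(actual)
    # Keep the expected column order, then append columns only seen in actual.
    columns = list(expected_map) + [c for c in actual_map if c not in expected_map]
    return [
        (col, expected_map.get(col), actual_map.get(col))
        for col in columns
        if expected_map.get(col) != actual_map.get(col)
    ]


# Made-up example values standing in for ReadCSV(...).dtypes and ReadCSV2(...).dtypes
from_schema_method = [("id", "int"), ("created", "timestamp")]
from_options = [("id", "string"), ("created", "string")]

for column, want, got in dtypes_diff(from_schema_method, from_options):
    print(f"{column}: expected {want}, got {got}")
```

An empty result would suggest both calls applied the same schema; any mismatch points at the column where the two readers diverge.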

What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?
