What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?

irfanaziz — Thu, 13 Jan 2022 12:39:22 GMT

I have observed some very strange behavior in some of our integration pipelines. This week, one of the CSV files came out broken when read with the function below.

def ReadCSV(files, schema_struct, header, delimiter, timestampformat, encode="utf8", multiLine="true"):
  # Schema is applied through DataFrameReader.schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat, encoding=encode, multiLine=multiLine) \
      .schema(schema=schema_struct).load(files)
  return deltas_df

I changed the function and moved the schema into the options. This worked, and I was able to read the file for that object, but it started failing for the other objects. So I am wondering why it would behave so differently.

def ReadCSV2(files, schema_struct, header, delimiter, timestampformat, encode="utf8"):
  # Schema is passed as a reader option instead of through .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat, encoding=encode, multiLine="true", schema=schema_struct) \
      .load(files)
  return deltas_df

I would like to keep one function and solve this issue. For now, I have to use two functions.

Re: What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?

Anonymous — Thu, 13 Jan 2022 17:49:49 GMT

Hello @nafri A - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thank you for your question. I'm sorry to hear you're having trouble. We'll give the community a chance to respond before we circle back around to this. Thanks in advance for your patience.

Re: What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?

Hubert-Dudek — Fri, 14 Jan 2022 14:32:20 GMT

How exactly is it failing?

Maybe there are differences in the CSV header, including case sensitivity, so setting enforceSchema=False might help.

Regarding the schema: under the hood, both approaches point to the same Scala function.
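
To make that suggestion concrete, here is a minimal sketch of a single reader function that keeps the schema in `.schema()` and adds `enforceSchema=False`. According to the Spark CSV documentation, with `header` enabled this makes Spark validate the schema's field names against the header of each file by position (respecting `spark.sql.caseSensitive`) instead of ignoring the header. The name `read_csv`, its parameter names, and passing `spark` in as an argument are assumptions for illustration, not something from the original post.

```python
def read_csv(spark, files, schema_struct, header, delimiter,
             timestamp_format, encoding="utf8", multi_line=True,
             enforce_schema=False):
    """Read CSV files with an explicit schema through one code path.

    `spark` is an existing SparkSession; `schema_struct` may be a
    StructType or a DDL string such as "id INT, name STRING".
    """
    # Hypothetical helper: the option names are documented Spark CSV
    # options, the function and parameter names are made up here.
    options = {
        "header": header,
        "delimiter": delimiter,
        "timestampFormat": timestamp_format,
        "encoding": encoding,
        "multiLine": multi_line,
        # False: validate the schema against the CSV header
        # rather than forcing it onto the data.
        "enforceSchema": enforce_schema,
    }
    return (
        spark.read.format("csv")
        .options(**options)
        .schema(schema_struct)
        .load(files)
    )
```

Whether this fixes the broken file depends on the actual error, which has not been shared in the thread yet.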

Re: What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?

jose_gonzalez — Wed, 09 Feb 2022 00:41:55 GMT

Hi @nafri A ,

What error are you getting? Can you share it, please? As @Hubert Dudek mentioned, both will call the same APIs.
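
While waiting for the error message, one thing that may be worth checking is the option keys themselves. As far as I know, the CSV reader does not complain about option names it does not recognize, so a misspelled key (for example `enoding` instead of `encoding`) or a `schema` key passed through `.options()` would not raise an error, and the read could fall back to default behavior. The sketch below flags keys that are not in a list of documented CSV read options. The helper name `unknown_csv_options` is hypothetical, and the list is deliberately not exhaustive.

```python
# Subset of the CSV read options listed in the Spark documentation.
# Not exhaustive: extend it for the Spark version you run.
KNOWN_CSV_READ_OPTIONS = {
    "sep", "delimiter", "encoding", "charset", "quote", "escape",
    "comment", "header", "inferSchema", "enforceSchema",
    "ignoreLeadingWhiteSpace", "ignoreTrailingWhiteSpace",
    "nullValue", "nanValue", "positiveInf", "negativeInf",
    "dateFormat", "timestampFormat", "maxColumns",
    "maxCharsPerColumn", "mode", "columnNameOfCorruptRecord",
    "multiLine", "charToEscapeQuoteEscaping", "samplingRatio",
    "emptyValue", "locale", "lineSep", "unescapedQuoteHandling",
}


def unknown_csv_options(options):
    """Return the option keys that are not in the known list, sorted.

    Spark treats option names case-insensitively, so the comparison
    is done on lower-cased names.
    """
    known = {name.lower() for name in KNOWN_CSV_READ_OPTIONS}
    return sorted(key for key in options if key.lower() not in known)


# Example values only; the keys are the interesting part.
suspect = unknown_csv_options({
    "header": "true",
    "delimiter": ";",
    "timestampFormat": "yyyy-MM-dd HH:mm:ss",
    "enoding": "utf8",          # misspelled "encoding"
    "multiLine": "true",
    "schema": "<StructType>",   # not a documented CSV option key
})
print(suspect)  # ['enoding', 'schema']
```

If a key shows up here, comparing `df.schema` from both read variants should show whether the schema was actually applied.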