Data Engineering

What is the difference between passing the schema in the options and using the .schema() function in PySpark for a CSV file?

irfanaziz
Contributor II

I have observed some very strange behavior in some of our integration pipelines. This week, one of the CSV files came out broken when read with the function below.

def ReadCSV(files, schema_struct, header, delimiter, timestampformat, encode="utf8", multiLine="true"):
  # Read the CSV files with the schema applied through .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat, encoding=encode, multiLine=multiLine) \
      .schema(schema=schema_struct).load(files)
  return deltas_df

I then changed the function and moved the schema into the options. This worked and it was able to read the file for that object, but it started failing for the other objects. So I am wondering why the two approaches behave so differently.

def ReadCSV2(files, schema_struct, header, delimiter, timestampformat, encode="utf8"):
  # Read the CSV files with the schema passed as one of the options
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat, encoding=encode, multiLine="true", schema=schema_struct) \
      .load(files)
  return deltas_df

I would like to keep a single function and solve this issue. For now I have to use two functions.

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

How exactly is it failing?

Maybe there are differences in the CSV header, including case sensitivity, so enforceSchema=False might help.

Regarding the schema, under the hood both approaches point to the same Scala function.
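The suggestion above could be folded into a single reader function along these lines. This is only a sketch under assumptions: `build_csv_options` and `read_csv` are hypothetical names (not from the original post), the schema is kept on `.schema()` rather than in the options, and `enforceSchema` is exposed as a parameter. With `enforceSchema` set to false and `header` enabled, Spark checks the header column names against the supplied schema by position, taking `spark.sql.caseSensitive` into account, instead of ignoring the header.

```python
def build_csv_options(header, delimiter, timestamp_format,
                      encoding="utf8", multi_line=True, enforce_schema=True):
    """Collect the CSV reader options (hypothetical helper).

    The schema is deliberately not part of this dict; it is applied
    separately through .schema() in read_csv below.
    """
    return {
        "header": header,
        "delimiter": delimiter,
        "timestampFormat": timestamp_format,
        "encoding": encoding,
        "multiLine": multi_line,
        "enforceSchema": enforce_schema,
    }


def read_csv(spark, files, schema_struct, **option_kwargs):
    """Read CSV files with an explicit schema (hypothetical helper).

    `spark` is an existing SparkSession; `option_kwargs` are forwarded
    to build_csv_options.
    """
    options = build_csv_options(**option_kwargs)
    return (
        spark.read.format("csv")
        .options(**options)
        .schema(schema_struct)
        .load(files)
    )


# Example call, assuming `spark`, `files` and `schema_struct` already exist:
# df = read_csv(spark, files, schema_struct,
#               header="true", delimiter=";",
#               timestamp_format="yyyy-MM-dd HH:mm:ss",
#               enforce_schema=False)
```

Keeping the option dictionary in its own function makes it easy to inspect which options are actually sent to the reader for each object.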

3 REPLIES

Anonymous
Not applicable

Hello @nafri A, my name is Piper, and I'm a moderator for Databricks. Welcome to the community and thank you for your question. I'm sorry to hear you're having trouble. We'll give the community a chance to respond before we circle back around to this. Thanks in advance for your patience.

jose_gonzalez
Moderator

Hi @nafri A,

What is the error you are getting? Can you share it, please? As @Hubert Dudek mentioned, both approaches will call the same APIs.
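One way to narrow this down before sharing the error is to compare the schema each function actually produces with the schema that was requested. The helper below is a sketch; `schema_mismatches` is a hypothetical name, and it only relies on `df.schema.fields`, `field.name` and `field.dataType.simpleString()`. A guess worth checking with it: I am not aware of a `schema` key among the documented CSV options, so the schema passed through `.options()` may not be applied at all, in which case the second function would return every column as a string.

```python
def schema_mismatches(df, expected_schema):
    """List columns whose type differs from the expected schema.

    Returns tuples of (column name, expected type, actual type);
    the actual type is None when the column is missing from df.
    """
    actual = {f.name: f.dataType.simpleString() for f in df.schema.fields}
    mismatches = []
    for field in expected_schema.fields:
        expected_type = field.dataType.simpleString()
        actual_type = actual.get(field.name)
        if actual_type != expected_type:
            mismatches.append((field.name, expected_type, actual_type))
    return mismatches


# Example, assuming the two functions from the question are defined:
# print(schema_mismatches(ReadCSV(...), schema_struct))
# print(schema_mismatches(ReadCSV2(...), schema_struct))
```

An empty list from one function and a non-empty list from the other would show exactly which columns are read differently.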
