Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?

irfanaziz
Contributor II

I have observed some very strange behavior in some of our integration pipelines. This week, one of the CSV files came out broken when read with the function given below.

def ReadCSV(files,schema_struct,header,delimiter,timestampformat,encode="utf8",multiLine="true"):
  # Schema is applied through the reader's .schema() method
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine=multiLine) \
      .schema(schema=schema_struct).load(files)
  return deltas_df

I changed the function and moved the schema into the options. This worked, and I was able to read the file for that object, but it started failing for the other objects. So I am wondering why it would behave so differently.

def ReadCSV2(files,schema_struct,header,delimiter,timestampformat,encode="utf8"):
  # Schema is passed as an entry in .options() instead of through .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine="true",schema=schema_struct) \
      .load(files)
  return deltas_df

I would like to keep one function and solve this issue. For now I have to use two functions.

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

How exactly is it failing?

Maybe there are differences in the CSV headers, including case sensitivity, so enforceSchema = False might help.

Regarding the schema, under the hood both approaches point to the same Scala function.
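
A minimal sketch of how the two functions from the question might be merged into one, assuming the enforceSchema CSV option suggested above is what makes the difference. The name `read_csv` and the `enforce_schema` parameter are hypothetical, and `spark` is passed in explicitly rather than taken from the notebook's global session. The schema goes through `.schema()`, and the encoding option key is spelled `encoding`.

```python
def read_csv(spark, files, schema_struct, header, delimiter, timestamp_format,
             encoding="utf8", multi_line=True, enforce_schema=True):
    """Read CSV files with an explicit schema (hypothetical helper).

    spark: an active SparkSession.
    enforce_schema: forwarded to the CSV reader's enforceSchema option.
        When False (and header is enabled), Spark checks the header names
        against the schema instead of ignoring the header.
    """
    reader = (
        spark.read
        .format("csv")
        .options(
            header=header,
            delimiter=delimiter,
            timestampFormat=timestamp_format,
            encoding=encoding,
            multiLine=multi_line,
            enforceSchema=enforce_schema,
        )
        # Apply the schema through the reader's dedicated method.
        .schema(schema_struct)
    )
    return reader.load(files)
```

Calling it with `enforce_schema=False` on the objects that fail should make a header that does not line up with the schema easier to spot, since the mismatch is then checked at read time rather than silently ignored.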


3 REPLIES

Anonymous
Not applicable

Hello @nafri A - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thank you for your question. I'm sorry to hear you're having trouble. We'll give the community a chance to respond before we circle back around to this. Thanks in advance for your patience.

jose_gonzalez
Databricks Employee

Hi @nafri A,

What is the error you are getting? Can you share it, please? As @Hubert Dudek mentioned, both approaches will call the same APIs.
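
While gathering the error details, one way to test the header-mismatch idea from this thread is to compare a file's header row with the expected column names before handing it to Spark. This is only a sketch in plain Python: `header_mismatches` is a hypothetical helper, it assumes the header is the first row of a locally readable file, and it compares names by position.

```python
import csv


def header_mismatches(csv_path, expected_names, delimiter=",", encoding="utf8"):
    """Compare a CSV file's header row with expected column names, by position.

    Returns a list of (position, header_name, expected_name, kind) tuples,
    where kind is "case" when the names differ only by letter case and
    "name" otherwise. A missing entry on either side is reported as None.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        # Only the first row is read; an empty file yields an empty header.
        header = next(csv.reader(f, delimiter=delimiter), [])

    mismatches = []
    for i in range(max(len(header), len(expected_names))):
        actual = header[i] if i < len(header) else None
        expected = expected_names[i] if i < len(expected_names) else None
        if actual == expected:
            continue
        same_ignoring_case = (
            actual is not None
            and expected is not None
            and actual.lower() == expected.lower()
        )
        mismatches.append(
            (i, actual, expected, "case" if same_ignoring_case else "name")
        )
    return mismatches
```

With a PySpark `StructType`, the expected names could be taken from `schema_struct.fieldNames()`, for example `header_mismatches(path, schema_struct.fieldNames(), delimiter=delimiter)`. An empty list means the header and schema agree on names and order.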
