Databricks

irfanaziz · ‎01-13-2022

I have observed some very strange behavior in some of our integration pipelines. This week, one of the CSV files came out broken when read with the function below.

def ReadCSV(files,schema_struct,header,delimiter,timestampformat,encode="utf8",multiLine="true"):
  # Read CSV files, applying the explicit schema via .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine=multiLine) \
      .schema(schema=schema_struct).load(files)
  return deltas_df

I then changed the function and moved the schema into the options. This worked, and I was able to read the file for that object, but the read started failing for other objects. So I am wondering why the two approaches behave so differently.

def ReadCSV2(files,schema_struct,header,delimiter,timestampformat,encode="utf8"):
  # Variant in question: the schema is passed as an entry in .options() instead of .schema()
  deltas_df = spark.read \
      .format('csv') \
      .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,encoding=encode,multiLine="true",schema=schema_struct) \
      .load(files)
  return deltas_df

I would like to keep a single function and solve this issue. For now I have to use two functions.

Hubert-Dudek · ‎01-14-2022

How exactly is it failing?

Maybe the CSV headers differ between files, including in case sensitivity, so setting enforceSchema=False might help.

Regarding the schema: under the hood, both approaches point to the same Scala function.
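
The enforceSchema suggestion could be tried with a single reader function along these lines. This is only a minimal sketch: the names build_csv_options and read_csv and their parameter names are hypothetical, the spark session is passed in explicitly, and the schema goes through the documented .schema() call. With enforceSchema set to False, Spark checks the supplied schema against the CSV header rather than applying it blindly, which may surface a header mismatch as an explicit error.

```python
def build_csv_options(header, delimiter, timestamp_format,
                      encoding="utf8", multi_line=True, enforce_schema=True):
    """Hypothetical helper: collect the CSV reader options in one place."""
    return {
        "header": header,
        "delimiter": delimiter,
        "timestampFormat": timestamp_format,
        "encoding": encoding,  # spelled 'encoding'; the posted code had 'enoding'
        "multiLine": multi_line,
        # False asks Spark to validate the schema against the CSV header
        "enforceSchema": enforce_schema,
    }


def read_csv(spark, files, schema_struct, header, delimiter, timestamp_format,
             encoding="utf8", multi_line=True, enforce_schema=True):
    """Hypothetical single reader: options via .options(), schema via .schema()."""
    options = build_csv_options(header, delimiter, timestamp_format,
                                encoding, multi_line, enforce_schema)
    return (
        spark.read.format("csv")
        .options(**options)
        .schema(schema_struct)  # StructType or DDL string
        .load(files)
    )
```

Calling read_csv(spark, files, schema_struct, header=True, delimiter=";", timestamp_format="yyyy-MM-dd HH:mm:ss", enforce_schema=False) on the object that fails would be one way to see whether its header disagrees with the schema; the delimiter and timestamp format shown here are placeholders.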

Anonymous · ‎01-13-2022

Hello @nafri A - My name is Piper, and I'm a moderator for Databricks. Welcome to the community and thank you for your question. I'm sorry to hear you're having trouble. We'll give the community a chance to respond before we circle back around to this. Thanks in advance for your patience.

jose_gonzalez · ‎02-08-2022

Hi @nafri A ,

What is the error you are getting? Can you share it, please? As @Hubert Dudek mentioned, both approaches call the same APIs.

What is the difference between passing the schema in the options or using the .schema() function in pyspark for a csv file?