I have observed some very strange behavior in one of our integration pipelines. This week one of the CSV files failed to parse when read with the function given below.
def ReadCSV(files, schema_struct, header, delimiter, timestampformat, encode="utf8", multiLine="true"):
    deltas_df = spark.read \
        .format('csv') \
        .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,
                 encoding=encode, multiLine=multiLine) \
        .schema(schema_struct) \
        .load(files)
    return deltas_df
I made a change and moved the schema into the options. This worked, and I was able to read the file for that object, but then reads started failing for the other objects. So I am wondering why it would behave so differently.
def ReadCSV2(files, schema_struct, header, delimiter, timestampformat, encode="utf8"):
    deltas_df = spark.read \
        .format('csv') \
        .options(header=header, delimiter=delimiter, timestampFormat=timestampformat,
                 encoding=encode, multiLine="true", schema=schema_struct) \
        .load(files)
    return deltas_df
I would like to keep a single function and solve this issue; for now I have to maintain two.