reading a tab separated CSV quietly drops empty rows

Martinitus — Fri, 26 Jan 2024 11:03:03 GMT

I have already reported this as a bug in the official Spark bug tracker: https://issues.apache.org/jira/browse/SPARK-46876

A short summary: when reading a tab-separated file that has lines consisting only of tabs, those lines do not show up in the parsed DataFrame. The data is silently dropped.

I am not sure whether this is a Spark issue or only occurs in Databricks 😞
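
To check whether a given file contains the kind of line described above, a minimal sketch in plain Python (not Spark) could list the lines that consist only of tabs. The function name `find_tab_only_lines` and the sample text are made up for illustration and are not part of the bug report.

```python
def find_tab_only_lines(text):
    """Return the 1-based numbers of lines that consist only of tabs."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A tab-only line is non-empty, but empty once its tabs are stripped.
        if line and not line.strip("\t"):
            hits.append(number)
    return hits


# Hypothetical sample: a header, one tab-only line, one data line.
sample = "a\tb\tc\n\t\t\n1\t2\t3\n"
tab_only = find_tab_only_lines(sample)
print(tab_only)  # [2]
```

Any line number reported here is a candidate for a row that goes missing from the parsed DataFrame.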

Re: reading a tab separated CSV quietly drops empty rows

Lakshay — Fri, 26 Jan 2024 13:41:36 GMT

Hi @Martinitus, thank you for reporting this. It looks like a potential bug. It should be addressed via the JIRA ticket.

Re: reading a tab separated CSV quietly drops empty rows

Martinitus — Mon, 29 Jan 2024 08:58:08 GMT

Yes. I did not find an official way to report bugs directly to Databricks, but it would be nice if a Databricks engineer could open a corresponding ticket in the internal Jira.
In our case this is pretty much a showstopper for reading data exported from client SAP systems, because the exported data contains 8 header rows, some of which are empty (they contain only tabs).
What we planned to do was:

1. read the first header row
2. skip the remaining 7 header rows

But if one of the header rows is empty, Spark has already dropped it, so skipping 7 rows also skips the first row of real data.
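
The plan above can be sketched outside the Spark CSV reader, where every physical line still counts as a row. This is only an illustration using Python's standard `csv` module: the function name `read_export`, the column names, and the sample content are assumptions, and the real SAP export may need different quoting or encoding settings.

```python
import csv
import io

HEADER_ROWS = 8  # per the post: 8 header rows, some containing only tabs


def read_export(text, header_rows=HEADER_ROWS):
    """Return (column_names, data_rows) from tab-separated text.

    Tab-only lines are kept as rows of empty strings, so the header
    block is always exactly `header_rows` lines long.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    rows = list(reader)
    if len(rows) < header_rows:
        raise ValueError("file has fewer lines than the expected header block")
    column_names = rows[0]          # step 1: read the first header row
    data_rows = rows[header_rows:]  # step 2: skip the remaining header rows
    return column_names, data_rows


# Hypothetical export: 8 header lines (4 of them tab-only), then 2 data lines.
sample = "\n".join([
    "id\tname\tamount",
    "\t\t",
    "exported\t\t",
    "\t\t",
    "system\tSAP\t",
    "\t\t",
    "\t\t",
    "unit\t\tEUR",
    "1\talice\t10",
    "2\tbob\t20",
]) + "\n"

columns, data = read_export(sample)
print(columns)  # ['id', 'name', 'amount']
print(data)     # [['1', 'alice', '10'], ['2', 'bob', '20']]
```

Because the tab-only lines are not dropped here, skipping a fixed number of header rows lands on the first row of real data.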

Re: reading a tab separated CSV quietly drops empty rows

Martinitus — Wed, 28 Feb 2024 14:42:04 GMT

Hi Databricks team. Someone fixed the bug and opened a PR on GitHub a couple of weeks ago. Maybe someone can have a look at it and merge it. It's just a minor fix, but as soon as it is rolled out it will resolve some major obstacles on our end caused by this bug. We can then drop a manual PowerShell script that we currently have to run on user notebooks before the data gets uploaded to the workspace (which is a big PITA, to be honest).

https://github.com/apache/spark/pull/44946

Re: reading a tab separated CSV quietly drops empty rows

Martinitus — Tue, 12 Mar 2024 13:54:14 GMT

@Lakshay Do you know of any way to speed up the GitHub review/merge process? The issue has had a proposed fix for more than 4 weeks now, but no one seems to care...