
reading a tab separated CSV quietly drops empty rows

Martinitus
New Contributor III

I already reported this as a bug to the official Spark bug tracker: https://issues.apache.org/jira/browse/SPARK-46876

A short summary: when reading a tab-separated file that has lines consisting only of tabs, those lines do not show up in the parsed dataframe. The data is silently dropped.
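To check whether a given export is affected, it can help to count the tab-only lines in the raw file and compare that with the number of rows missing from the dataframe. The following is a minimal plain-Python sketch, not part of the original report; the function name `tab_only_line_numbers` and the sample text are made up for illustration.

```python
from typing import List


def tab_only_line_numbers(text: str) -> List[int]:
    """Return the 1-based numbers of lines that consist solely of tab characters."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A completely empty line is not counted; only lines made of tabs are.
        if line != "" and line.strip("\t") == "":
            hits.append(number)
    return hits


# Hypothetical sample: a header, one tab-only line, one data line.
sample = "a\tb\tc\n\t\t\n1\t2\t3\n"
print(tab_only_line_numbers(sample))  # [2]
```

If the list is non-empty and the dataframe has that many fewer rows than the file has lines, the file is likely hitting the behaviour described above.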

I am not sure if this is a spark issue or only occurs in databricks 😞

4 REPLIES

Lakshay
Esteemed Contributor

Hi @Martinitus , Thank you for reporting this. It looks like a potential bug. It should be addressed via the JIRA ticket.

Martinitus
New Contributor III

Yes. I did not find an official way to report bugs directly to Databricks, but it would be nice if a Databricks engineer could open a corresponding ticket in an internal Jira.
In our case this is pretty much a show stopper for reading data exported from client SAP systems, as the exported data contains 8 header rows, some of which are empty (they only contain tabs).
What we planned to do was:

  • read the first header row
  • skip the remaining 7 header rows

But if one of the header rows is empty, Spark has already dropped it, so skipping 7 rows also skips the first row of real data.
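One way to sidestep the problem for small files is to split header and data by physical line position before Spark ever sees the file. This is a rough sketch using only Python's standard `csv` module, which keeps tab-only lines as rows of empty strings. The function name `split_header_and_data` and the sample content are hypothetical, and it assumes the whole file fits in memory and uses default CSV quoting.

```python
import csv
import io
from typing import List, Tuple

HEADER_ROWS = 8  # per the description above: 8 header rows, some tab-only


def split_header_and_data(
    text: str, header_rows: int = HEADER_ROWS
) -> Tuple[List[str], List[List[str]]]:
    """Take column names from the first row and drop the remaining header rows."""
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    if len(rows) < header_rows:
        raise ValueError("file has fewer rows than the expected header block")
    header = rows[0]           # first header row holds the column names
    data = rows[header_rows:]  # skip by position, so tab-only rows still count
    return header, data


# Hypothetical export: column names, 3 tab-only rows, 4 other header rows, 2 data rows.
lines = ["id\tname"] + ["\t"] * 3 + ["meta\tinfo"] * 4 + ["1\tx", "2\ty"]
header, data = split_header_and_data("\n".join(lines) + "\n")
print(header)  # ['id', 'name']
print(data)    # [['1', 'x'], ['2', 'y']]
```

The resulting rows could then be handed to `spark.createDataFrame(data, schema=header)`. That runs entirely on the driver, so it is only a stopgap for modest file sizes.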

Martinitus
New Contributor III

Hi Databricks team. Someone fixed the bug and opened a PR on GitHub a couple of weeks ago. Maybe someone can have a look at it and merge it. It's just a minor fix, but as soon as it is rolled out it will resolve some major obstacles this bug causes on our end. We can then drop a manual PowerShell script that we currently have to run on user notebooks before the data gets uploaded to the workspace (which is a big PITA, to be honest).

https://github.com/apache/spark/pull/44946

Martinitus
New Contributor III

@Lakshay Do you know of any way to speed up the GitHub review/merge process? The issue has had a proposed fix for more than 4 weeks now, but no one seems to care...
