Data Engineering

reading a tab separated CSV quietly drops empty rows

Martinitus
New Contributor III

I already reported this as a bug on the official Spark bug tracker: https://issues.apache.org/jira/browse/SPARK-46876

A short summary: when reading a tab-separated file that has lines consisting only of tabs, those lines do not show up in the parsed DataFrame. The data is silently dropped.
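A minimal sketch for reproducing the behaviour described above, assuming a local PySpark installation. The sample file contents and both helper names (`write_sample_tsv`, `count_rows_with_spark`) are made up for illustration and are not taken from the JIRA ticket.

```python
import os
import tempfile


def write_sample_tsv(path):
    """Write a 3-column tab-separated file whose second line contains only tabs.

    Returns the number of lines written.
    """
    lines = ["a\tb\tc", "\t\t", "1\t2\t3"]
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)


def count_rows_with_spark(path):
    """Count the rows Spark's CSV reader returns for the file (requires pyspark)."""
    # Imported lazily so that write_sample_tsv can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    return spark.read.csv(path, sep="\t", header=False).count()
```

To compare, call `write_sample_tsv` on a temporary path (for example `os.path.join(tempfile.mkdtemp(), "sample.tsv")`) and then `count_rows_with_spark` on the same path. If the tab-only line is dropped as reported, the second number will be smaller than the first.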

I am not sure if this is a Spark issue or only occurs in Databricks 😞

4 REPLIES

Lakshay
Databricks Employee

Hi @Martinitus, thank you for reporting this. It looks like a potential bug. It should be addressed via the JIRA ticket.

Martinitus
New Contributor III

Yes, I did not find an official way to report bugs directly to Databricks, but it would be nice if a Databricks engineer could open a corresponding ticket in an internal Jira.
In our case this is pretty much a showstopper for reading data exported from client SAP systems, as the exported data contains 8 header rows, some of which are empty (they contain only tabs).
What we planned to do was:

  • read the first header row
  • skip the remaining 7 header rows

But if one of the header rows is empty, Spark has already dropped it, so skipping 7 rows also skips the first row of real data.
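One way to sidestep the header offset is to trim the extra header rows by physical line position before Spark sees the file. The following is only a sketch of that idea in plain Python, not the PowerShell script we actually use; the function name `strip_extra_header_rows` and the default of 8 header rows are assumptions based on our export format. It does not help with tab-only rows inside the real data, which would still be affected when the result is read by Spark.

```python
import itertools


def strip_extra_header_rows(src_path, dst_path, header_rows=8):
    """Copy a tab-separated export, keeping header row 1 and dropping rows 2..header_rows.

    Lines are counted by physical position and never parsed, so header rows
    that contain only tabs still count. Returns the number of data lines written.
    """
    # newline="" keeps the original line endings untouched on read and write.
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        header = next(src, None)  # first header row: the column names
        if header is None:        # empty input file, nothing to copy
            return 0
        dst.write(header)

        count = 0
        # Skip the remaining header rows, whether or not they are empty.
        for line in itertools.islice(src, header_rows - 1, None):
            dst.write(line)
            count += 1
        return count
```

The output file then has a single header row followed by the data, so it can be read with the header option enabled and without skipping any further rows.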

Martinitus
New Contributor III

Hi Databricks team. Someone has already fixed the bug and opened a PR on GitHub a couple of weeks ago. Maybe someone can have a look at it and merge it. It's just a minor fix, but as soon as it is rolled out it will resolve some major obstacles on our end caused by this bug. We can then drop a manual PowerShell script that we currently have to run on user notebooks before the data gets uploaded to the workspace (which is a big PITA, to be honest).

https://github.com/apache/spark/pull/44946

Martinitus
New Contributor III

@Lakshay Do you know of any way to speed up the GitHub review/merge process? The issue has had a proposed fix for more than 4 weeks now, but no one seems to care...
