Reading a tab-separated CSV quietly drops empty rows
01-26-2024 03:03 AM
I already reported this as a bug on the official Spark issue tracker: https://issues.apache.org/jira/browse/SPARK-46876
A short summary: when reading a tab-separated file that contains lines consisting only of tabs, those lines do not show up in the parsed DataFrame; the data is silently dropped.
I am not sure if this is a Spark issue or only occurs in Databricks 😞
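The failing input can be illustrated with a small stdlib-only sketch (the inline data is an assumption, not the actual SAP export): Python's own `csv` reader keeps a tab-only line as a row of empty fields, whereas, per the report above, Spark's CSV reader with `sep="\t"` silently drops it.

```python
import csv
import io

# A tab-separated file whose middle line consists only of tabs.
data = "a\tb\tc\n\t\t\nd\te\tf\n"

# Python's csv module preserves the tab-only line as a row of empty strings.
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))
print(rows)
# [['a', 'b', 'c'], ['', '', ''], ['d', 'e', 'f']]
```

According to the bug report, reading the same file via `spark.read` with the `sep` option set to a tab would return only two rows, with the middle row missing.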
01-26-2024 05:41 AM
Hi @Martinitus, thank you for reporting this. It looks like a potential bug; it should be addressed via the JIRA ticket.
01-29-2024 12:58 AM
Yes, I did not find an official way to report bugs directly to Databricks, but it would be nice if some Databricks engineer could open a corresponding ticket in an internal Jira.
In our case this is pretty much a showstopper for reading data exported from client SAP systems, as the exported data contains 8 header rows, some of which are empty (they contain only tabs).
What we planned to do was:
- read the first header row
- skip the remaining 7 header rows
But if one of the header rows is empty, Spark has already dropped it, so we end up skipping the first row of real data.
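One possible workaround for the skip-count problem above (a sketch, not the poster's actual script; the function name and 8-header layout are assumptions) is to pre-scan the raw file in plain Python and count how many of the leading header lines consist only of tabs, so the number of rows to skip can be reduced accordingly:

```python
def count_tab_only_lines(lines, n_header_rows=8):
    """Count leading header lines that consist only of tabs
    (and so would be silently dropped by the affected CSV reader)."""
    return sum(
        1
        for line in lines[:n_header_rows]
        if line.rstrip("\r\n") and set(line.rstrip("\r\n")) == {"\t"}
    )

# Hypothetical 4-line header: two lines are tab-only.
headers = ["col1\tcol2\tcol3\n", "\t\t\n", "meta\t\t\n", "\t\n"]
print(count_tab_only_lines(headers, n_header_rows=4))  # 2
```

Skipping `7 - count_tab_only_lines(...)` rows after the first header row would then land on the first row of real data even when some headers are dropped.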
02-28-2024 06:41 AM - edited 02-28-2024 06:42 AM
Hi Databricks team. Someone fixed the bug and opened a PR on GitHub a couple of weeks ago. Maybe someone can have a look at it and merge it. It's just a minor fix, but as soon as it is rolled out it will resolve some major issues/obstacles on our end caused by this bug. We can then drop a manual PowerShell script that we currently have to run on user notebooks before the data gets uploaded to the workspace (which is a big PITA, to be honest).
03-12-2024 06:54 AM
@Lakshay Do you know any way to speed up the GitHub review/merge process? The issue has had a proposed fix for more than 4 weeks now, but no one seems to care...

