Databricks Community

Martinitus · ‎01-26-2024

I already reported this as a bug in the official Spark bug tracker: https://issues.apache.org/jira/browse/SPARK-46876

A short summary: when reading a tab-separated file that has lines consisting only of tabs, those lines do not show up in the parsed DataFrame; the data is silently dropped.
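To check whether a given export is affected, one option is to count the tab-only lines before handing the file to Spark. The following is a minimal plain-Python sketch, not part of the bug report; the function name `count_tab_only_lines` and the sample data are made up for illustration.

```python
import io


def count_tab_only_lines(lines):
    """Count lines that consist of one or more tabs and nothing else.

    `lines` is any iterable of strings, e.g. an open text file.
    Completely empty lines (no tabs at all) are not counted.
    """
    count = 0
    for line in lines:
        content = line.rstrip("\r\n")  # drop the line terminator only
        if content and content.strip("\t") == "":
            count += 1
    return count


# Hypothetical sample: a header, one tab-only line, one data line.
sample = io.StringIO("a\tb\tc\n\t\t\n1\t2\t3\n")
print(count_tab_only_lines(sample))
```

If the count is non-zero, the number of rows in the parsed DataFrame can be expected to be lower than the number of lines in the file by at least that amount.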

I am not sure whether this is a Spark issue or only occurs in Databricks 😞

Lakshay · ‎01-26-2024

Hi @Martinitus, thank you for reporting this. It looks like a potential bug. It should be addressed via the JIRA ticket.

Martinitus · ‎01-29-2024

Yes, I did not find an official way to report bugs directly to Databricks, but it would be nice if a Databricks engineer could open a corresponding ticket in an internal Jira.
In our case this is pretty much a showstopper for reading data exported from client SAP systems, as the exported data contains 8 header rows, some of which are empty (they contain only tabs).
What we planned to do was:

1. Read the first header row.
2. Skip the remaining 7 header rows.

But if one of the header rows is empty, Spark has already dropped it, so skipping 7 rows also skips the first row of real data.
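For reference, the two-step plan above can be sketched outside Spark with Python's standard `csv` module, which keeps tab-only lines as rows of empty strings, so the row positions stay intact. This is only an illustration under assumptions: the function name `split_header_and_data` and the sample text are invented, and it works on text already loaded into memory rather than on a Spark DataFrame.

```python
import csv
import io

HEADER_ROWS = 8  # from the post: the SAP export has 8 header rows


def split_header_and_data(text, header_rows=HEADER_ROWS):
    """Return (column names, data rows) from tab-separated text.

    Tab-only lines are kept as rows of empty strings, so they still
    count towards the header rows that have to be skipped.
    """
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))
    if len(rows) < header_rows:
        raise ValueError("input has fewer rows than the expected header rows")
    columns = rows[0]           # step 1: read the first header row
    data = rows[header_rows:]   # step 2: skip the remaining header rows
    return columns, data


# Hypothetical export: 1 named header row, 3 tab-only header rows,
# 4 further header rows, then 2 rows of real data.
text = "h1\th2\n" + "\t\n" * 3 + "x\ty\n" * 4 + "1\t2\n3\t4\n"
columns, data = split_header_and_data(text)
print(columns, data)
```

Because the empty header rows are counted here, the first row of real data is not lost.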

Martinitus · ‎02-28-2024

Hi Databricks team. Someone fixed the bug and opened a PR on GitHub a couple of weeks ago. Could someone have a look at it and merge it? It's just a minor fix, but as soon as it is rolled out it will resolve some major obstacles that this bug causes on our end. We could then drop a manual PowerShell script that we currently have to run on user notebooks before the data gets uploaded to the workspace (which is a big PITA, to be honest).

https://github.com/apache/spark/pull/44946

Martinitus · ‎03-12-2024

@Lakshay Do you know of any way to speed up the GitHub review/merge process? The issue has had a proposed fix for more than 4 weeks now, but no one seems to care...


reading a tab separated CSV quietly drops empty rows
