Vectorized reading of parquet file containing decimal type column(s)

alm
New Contributor III

I was trying to read a Parquet file containing decimal type columns and write it to a Delta table. I ran into a problem that is pretty neatly described by this kb.databricks article, and which I solved by disabling vectorized reading as suggested.
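
For reference, the workaround looks roughly like this in Scala, assuming the session-level spark.sql.parquet.enableVectorizedReader option is the switch the article refers to (the path and table name are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Assumption: spark.sql.parquet.enableVectorizedReader is the relevant setting;
// the path and table name below are purely illustrative.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("/mnt/raw/decimals.parquet")

df.write
  .format("delta")
  .mode("append")
  .saveAsTable("bronze.decimals")

// Restore the default so other reads keep the faster vectorized code path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
```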

So, my problem is solved; what do I really have to complain about? I'm left wondering, though, whether it's intentional that you have to check the files for decimal types yourself. It seems a bit awkward, so I was wondering if anything is being done about it? If there is an open issue, I would love a link.

Also, if there is an active architectural decision behind this, I would be very interested in hearing the motivation - out of professional curiosity :)

As a final note, I'm using Scala 2.12 and Spark 3.3.2.

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Alberte Mørk:

The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for the Parquet file(s) in question.

Regarding whether anything is being done about this, I would suggest checking the Apache Spark JIRA for any open issues related to this problem. You can also post a question on the Spark user mailing list to see if there are any updates on this issue.

As for the architectural decision behind this behavior, it is likely related to how Parquet stores decimals: they are a logical type layered on top of several possible physical encodings (int32, int64, binary, or fixed-length byte arrays), and the vectorized reader does not handle every representation a writer might produce, whereas the non-vectorized reader does, at some performance cost. Vectorized reading is enabled for Parquet by default most likely because it is significantly faster for the common case, and most Parquet files read fine with it.
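
If it helps, here is a rough sketch (not from the knowledge base article) of how you could limit the fallback to files that actually contain decimal columns, so everything else keeps the vectorized reader; the path is hypothetical and the check only looks at top-level columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().getOrCreate()
val path = "/mnt/raw/decimals.parquet" // hypothetical path

// Inspecting the schema only reads the Parquet footers, not the data.
val hasDecimals = spark.read.parquet(path).schema
  .exists(field => field.dataType.isInstanceOf[DecimalType])

// Fall back to the row-based reader only when decimals are present
// (note: this check ignores decimals nested inside structs or arrays).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", (!hasDecimals).toString)

val df = spark.read.parquet(path)
```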

I hope this helps!

2 REPLIES

alm
New Contributor III

Thank you!
