Data Engineering

Vectorized reading of a Parquet file containing decimal type column(s)

alm
New Contributor III

I was trying to read a Parquet file that contains decimal type columns and write it to a Delta table. I ran into a problem that is neatly described by this kb.databricks article, and I solved it by disabling vectorized reading as suggested there.
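For reference, here is roughly what my fix looks like. This is just a sketch, and the paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("decimal-parquet-to-delta")
  .getOrCreate()

// Fall back to the row-based Parquet reader, which handles the decimal
// encoding that tripped up the vectorized reader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Placeholder paths
val df = spark.read.parquet("/path/to/input.parquet")

df.write
  .format("delta")
  .mode("overwrite")
  .save("/path/to/delta-table")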

So my problem is solved, and I don't really have anything to complain about. Still, I'm left wondering whether it is intentional that you have to check the files for decimal types yourself. It seems a bit awkward, so I was wondering if anything is being done about it. If there is an open issue, I would love a link.

Also, if there is an active architectural decision behind this behavior, I would be very interested in hearing the motivation, purely out of professional curiosity. :)

As a final note, I'm using Scala 2.12 and Spark 3.3.2.

Accepted Solution

Anonymous
Not applicable

@Alberte Mørk:

The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for the Parquet file(s) in question.

Regarding whether anything is being done about this, I would suggest checking the Apache Spark JIRA for any open issues related to this problem. You can also post a question on the Spark user mailing list to see if there are any updates on this issue.

As for the architectural decision behind this behavior, it is related to how decimals are represented in Parquet. Parquet supports DECIMAL only as a logical type layered over several physical encodings (INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY), and the vectorized reader has historically supported only a subset of those encodings, whereas the row-based reader handles them all. Vectorized reading is enabled by default because it is considerably faster for the common cases, so files whose decimal encoding the vectorized reader cannot decode require falling back to the non-vectorized path.
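If you would rather not disable the vectorized reader unconditionally, one option is to inspect the schema first and fall back to the row-based reader only when decimal columns are present. This is a rough sketch with placeholder paths, and it only checks top-level columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().getOrCreate()

val inputPath = "/path/to/input.parquet"  // placeholder path

// Schema inference only reads Parquet footers, so this check is relatively cheap.
val hasDecimals = spark.read.parquet(inputPath)
  .schema.fields
  .exists(_.dataType.isInstanceOf[DecimalType])

// Disable vectorized reading only when decimal columns are present.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", (!hasDecimals).toString)

val df = spark.read.parquet(inputPath)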

I hope this helps!


alm
New Contributor III

Thank you!
