Vectorized reading of parquet file containing decimal type column(s)

alm
New Contributor III

I was trying to read a Parquet file containing decimal type columns and write it to a Delta table. I ran into a problem that is pretty neatly described by this kb.databricks article, and which I solved by disabling vectorized reading as suggested.
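
For reference, the workaround looks roughly like this in Scala, assuming the session-level spark.sql.parquet.enableVectorizedReader option is the switch the article refers to (the path and table name are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Assumption: spark.sql.parquet.enableVectorizedReader is the relevant setting;
// the path and table name below are purely illustrative.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet("/mnt/raw/decimals.parquet")

df.write
  .format("delta")
  .mode("append")
  .saveAsTable("bronze.decimals")

// Restore the default so other reads keep the faster vectorized code path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
```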

So, my problem is solved; what do I really have to complain about? I'm left wondering, though, whether it's intentional that you have to check the files for decimal types yourself. It seems a bit awkward, so I was wondering if anything is being done about it? If there is an open issue, I would love a link.

Also, if there is an active architectural decision behind this, I would be very interested in hearing the motivation - out of professional curiosity :)

As a final note, I'm using Scala 2.12 and Spark 3.3.2.

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Alberte Mørk:

The behavior you observed is due to a known issue in Apache Spark when vectorized reading is used with Parquet files that contain decimal type columns. As you mentioned, the issue can be resolved by disabling vectorized reading for the Parquet file(s) in question.

Regarding whether anything is being done about this, I would suggest checking the Apache Spark JIRA for any open issues related to this problem. You can also post a question on the Spark user mailing list to see if there are any updates on this issue.

As for the architectural decision behind this behavior, it is likely related to how Parquet stores decimals: they are a logical type layered on top of several possible physical encodings (int32, int64, binary, or fixed-length byte arrays), and the vectorized reader does not handle every representation a writer might produce, whereas the non-vectorized reader does, at some performance cost. Vectorized reading is enabled for Parquet by default most likely because it is significantly faster for the common case, and most Parquet files read fine with it.
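
If it helps, here is a rough sketch (not from the knowledge base article) of how you could limit the fallback to files that actually contain decimal columns, so everything else keeps the vectorized reader; the path is hypothetical and the check only looks at top-level columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().getOrCreate()
val path = "/mnt/raw/decimals.parquet" // hypothetical path

// Inspecting the schema only reads the Parquet footers, not the data.
val hasDecimals = spark.read.parquet(path).schema
  .exists(field => field.dataType.isInstanceOf[DecimalType])

// Fall back to the row-based reader only when decimals are present
// (note: this check ignores decimals nested inside structs or arrays).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", (!hasDecimals).toString)

val df = spark.read.parquet(path)
```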

I hope this helps!

2 REPLIES

alm
New Contributor III

Thank you!
