Databricks Free Edition Help
Engage in discussions about the Databricks Free Edition within the Databricks Community. Share insights, tips, and best practices for getting started, troubleshooting issues, and maximizing the value of your trial experience to explore Databricks' capabilities effectively.

Decimal Precision Loss While Reading Parquet Files in Databricks

SantiNath_Dey
Contributor

We are extracting data from SQL Server/Oracle using ADF and storing it in Parquet format. When reading the files in Databricks using spark.read.parquet, decimal values are getting truncated; for example, 1245.1111111189979 becomes 1245.111111118, resulting in a loss of precision. Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @SantiNath_Dey,

This behaviour is almost certainly coming from how ADF writes the Parquet file, not from spark.read.parquet truncating values. Databricks just reads whatever is stored; there is no way for spark.read.parquet to recreate the lost digits.

Parquet stores decimals as DECIMAL(precision, scale) with a fixed number of digits after the decimal point. If your source value is 1245.1111111189979 but the Parquet column is, for example, DECIMAL(38, 12), then ADF will truncate/round to 12 fractional digits and physically write 1245.111111118 into the file. When Databricks reads that file, it just returns what's stored; there's no way for Spark to "recover" the discarded digits.
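To see the arithmetic, here is a small illustration using Python's stdlib decimal module (not ADF itself). Assuming the sink column ended up with scale 9, truncating the example value reproduces exactly what you observed (note that rounding, rather than truncating, would have produced ...119 instead):

```python
from decimal import Decimal, ROUND_DOWN

source = Decimal("1245.1111111189979")  # value as stored in SQL Server/Oracle
# Truncate to 9 fractional digits, as a DECIMAL(38, 9) sink column would
stored = source.quantize(Decimal("1e-9"), rounding=ROUND_DOWN)
print(stored)  # 1245.111111118 -- the digits beyond scale 9 are gone for good
```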

You can confirm this in Databricks:

df = spark.read.parquet("path")
df.select("your_column").printSchema()         # check the decimal(precision, scale) actually stored
df.select("your_column").show(truncate=False)  # inspect the raw values as written

If you see a decimal type with a smaller scale than the source DB, the precision loss has already happened in ADF.

To fix this in your ADF Copy activity, go to Sink --> Mapping and explicitly set the target column to a DECIMAL with sufficient precision/scale (for example, DECIMAL(38, 18) if supported). If that's not possible, have ADF write the value as a STRING to Parquet, then cast it to a suitable decimal in Databricks:

from pyspark.sql.functions import col

df = (spark.read.parquet("path")
      .withColumn("your_column",
                  col("your_column").cast("decimal(38,18)")))
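As a quick sanity check (plain Python with the stdlib decimal module, independent of Spark), you can confirm that scale 18 is wide enough to hold the example value without loss, since it only has 13 fractional digits:

```python
from decimal import Decimal

source = Decimal("1245.1111111189979")   # 13 fractional digits
# Widening to 18 fractional digits only appends zeros; no digits are lost
widened = source.quantize(Decimal("1e-18"))
assert widened == source                 # numerically identical
print(widened)  # 1245.111111118997900000
```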

Hope this helps.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


2 REPLIES


SantiNath_Dey
Contributor

Thanks for your response.