How to deal with Decimal data type arithmetic operations?

berserkersap
Contributor

I am dealing with values ranging from 10^-9 to 10^9; the sum of these values can go up to 10^20, and I need accuracy.

So I wanted to use the Decimal data type [using SQL in the Data Science & Engineering workspace].

However, I got to know about a peculiar behavior of the Decimal data type: Spark adjusts the precision and scale of an arithmetic result to avoid overflow, which can sacrifice accuracy.

(See: Decimal Precision Scale, Decimal Precision Loss)

Due to this behavior, I am facing issues while multiplying or dividing numbers. [My use case involves a series of multiplications and divisions to transform data.]

When I multiply two numbers, the precision changes, and when I multiply again, there is a loss of accuracy.

For example:

SELECT CAST(1234.456 AS decimal(20,8)) * CAST(23478.89076 AS decimal(20,8)) * CAST(345678.125567 AS decimal(20,8))

The answer I am getting: 10019016421573.927554

Real answer: 10019016421573.92755433905952
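
To see where the accuracy goes, you can inspect the intermediate result types. This is a sketch assuming Spark SQL 3.0+ (where the typeof function is available) and Spark's default decimal promotion rules:

SELECT typeof(CAST(1234.456 AS decimal(20,8)) * CAST(23478.89076 AS decimal(20,8)));
-- decimal(38,13): the exact result type would be decimal(41,16), which exceeds
-- the 38-digit limit, so Spark caps the precision at 38 and trims the scale to 13

SELECT typeof(CAST(1234.456 AS decimal(20,8)) * CAST(23478.89076 AS decimal(20,8)) * CAST(345678.125567 AS decimal(20,8)));
-- decimal(38,6): the second multiplication overflows again and the scale drops
-- to the minimum of 6, which is why only 6 decimal places survive above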

I am looking for a scale of around 8.

Is there any specific way to handle this kind of scenario?

Note: All the numeric columns in my case will have the same data type.

1 REPLY

berserkersap
Contributor

Hello Everyone,

I understand that there is no single best answer to this question.

So, I can only share the workaround I found while searching online.

The method I found works when:

  1. You know the range of the values you deal with (not just the input data, but also the values produced by transformations like multiplication, addition, and division).
  2. You know the precision you want.
  3. The precision plus range fits within the decimal(38,6) datatype.

The method is simple: if you want p digits of decimal accuracy, multiply all the required numeric columns by 10^p and keep the datatype as decimal(38,6). Since decimal(38,6) is the default type Spark falls back to when a mathematical operation exceeds the precision limit, the datatype will not change as operations are applied.

For example, if you need 6 digits of decimal precision, multiply the column by 10^6 and then do the operations. But remember that if you multiply two scaled columns, the resulting column carries a 10^12 multiplier. So, use ((C1/10^6)*C2) for such operations, as sketched below.
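
A minimal sketch of the scaled multiplication, assuming a hypothetical table t whose columns c1 and c2 have already been multiplied by 10^6 and stored as decimal(38,6):

-- c1 and c2 each carry a 10^6 multiplier; dividing one operand by 10^6 first
-- keeps a single 10^6 multiplier in the product
SELECT CAST(c1 / 1000000 AS decimal(38,6)) * c2 AS product_scaled
FROM t;
-- to read the true value, divide product_scaled by 10^6 at the end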

However, it is crucial to ensure that there is no overflow or loss of precision by adhering to the three points above. This can be difficult, since we may not know the resulting sum of a group in advance, so use the method judiciously; a rough check is sketched below.
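
As one such check (again on the hypothetical table t), you can look at the magnitudes before trusting grouped results; in Spark, an overflowing decimal aggregate returns NULL by default (or raises an error when spark.sql.ansi.enabled is on):

SELECT MAX(ABS(c1)) AS max_abs_value, SUM(c1) AS total FROM t;
-- a NULL total signals that the sum overflowed decimal(38,6)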

Python's decimal module offers great flexibility in this regard (C++ has even more robust libraries), but it cannot be used in PySpark, since PySpark runs on the Spark engine, which doesn't support that type.

Please feel free to correct me if I am wrong.
