Hello Everyone,
I understand that there is no single best answer to this question, so I can only share the approach I found while searching online.
The method I found works when:
- You know the range of values you will deal with (not just the input data, but also the values produced by transformations such as multiplication, addition, and division).
- You know the precision you want.
- The required precision plus range fits within the decimal(38,6) datatype.
The method is simple: if you want p decimal places of accuracy, multiply all the required numeric columns by 10^p and keep them as decimal(38,6). Since (38,6) is also the default precision and scale Spark falls back to when a mathematical operation would exceed the precision limit, the datatype will not change after the operations are applied.
For example, if you need 6 decimal places of precision, multiply the column by 10^6 and then do the operations. But remember that if you multiply two scaled columns, the result carries a 10^12 factor, so use ((C1/10^6)*C2) for such operations.
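To make this concrete, here is a minimal PySpark sketch of the scaling idea, assuming two hypothetical columns c1 and c2 and p = 6; it illustrates the approach described above, not production code.

```python
from decimal import Decimal

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

SCALE = 10 ** 6  # p = 6 decimal places of accuracy

df = spark.createDataFrame(
    [(Decimal("1.234567"), Decimal("2.5"))],
    ["c1", "c2"],
)

# Pre-scale both columns by 10^6 and store them as decimal(38,6).
scaled = df.select(
    (F.col("c1") * SCALE).cast(DecimalType(38, 6)).alias("c1_s"),
    (F.col("c2") * SCALE).cast(DecimalType(38, 6)).alias("c2_s"),
)

# Addition keeps a single 10^6 factor, so scaled values add directly.
sums = scaled.select((F.col("c1_s") + F.col("c2_s")).alias("sum_s"))

# Multiplying two scaled columns would carry a 10^12 factor, so divide
# one operand by 10^6 first, as in ((C1/10^6) * C2).
prods = scaled.select(
    ((F.col("c1_s") / SCALE) * F.col("c2_s")).alias("prod_s")
)

# Divide by 10^6 at the very end to return to the original units.
prods.select((F.col("prod_s") / SCALE).alias("prod")).show()
```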
However, it is crucial to ensure that there is no overflow or loss of precision by sticking to the three points above. This can be difficult, because you may not know in advance how large, say, a grouped sum will get, so use the method judiciously.
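As a small illustration of why this caveat matters, the sketch below squares a value that already uses all 32 integer digits of decimal(38,6). With Spark's default (non-ANSI) settings, such an overflow is expected to come back as NULL rather than raise an error, so it is easy to miss.

```python
from decimal import Decimal

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A value that already fills all 32 integer digits of decimal(38,6).
big = spark.createDataFrame([(Decimal("9" * 32),)], "v decimal(38,6)")

# Squaring it can no longer fit in 38 digits. With ANSI mode disabled
# (the default), the overflowing result should show up as NULL.
big.select((F.col("v") * F.col("v")).alias("squared")).show()
```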
Python's decimal module is far more flexible in this regard (C++ has even more robust libraries), but it cannot be used in PySpark column operations, because the computation runs on the Spark engine, which does not support that type.
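For comparison only, this is roughly the kind of control the decimal module gives you in plain Python, outside of Spark:

```python
from decimal import Decimal, getcontext

# The decimal module lets you choose the working precision explicitly
# (the default is 28 significant digits).
getcontext().prec = 50
print(Decimal(1) / Decimal(7))               # 1/7 to 50 significant digits
print(Decimal("1.234567") * Decimal("2.5"))  # exact result: 3.0864175
```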
Please feel free to correct me if I am wrong.