topic Re: Not able to retain precision while reading data from source file in Data Engineering

Not able to retain precision while reading data from source file

nikhil_kumawat — Thu, 02 Jan 2025 04:17:07 GMT

Hi,

I am trying to read a csv file located in S3 bucket folder. The csv file contains around 50 columns out of which one of the column is "litre_val" which contains values like "60211.952", "59164.608'. Upto 3 decimal points.

Now to read this csv we are using spark API like below:

spark_df = spark.read.format("csv").option("header", "true").load(f"s3://{s3_bucket}/{folder}/{filename}.csv")

After reading the file and reading all the columns as string, just to retain the precision, when we try to display the data, it reads only upto 2 decimal points like below

All the values are being read upto 2 decimal points only. The data type is string and it should retain the precision. Even there is one more column which contains upto 4 decimal points. That is also being read upto two decimal point as string data type.

Can someone suggest how to retain precision in this case ?

Any help would highly be appreciated.

Re: Not able to retain precision while reading data from source file

Walter_C — Thu, 02 Jan 2025 12:13:38 GMT

Can you please try with something like:

from pyspark.sql.types import StructType, StructField, StringType, DecimalType # Define the schema with the appropriate precision and scale for decimal columns schema = StructType([ StructField("column1", StringType(), True), StructField("column2", StringType(), True), # Add other columns as needed StructField("litre_val", DecimalType(precision=10, scale=3), True), StructField("another_decimal_column", DecimalType(precision=10, scale=4), True) # Add other columns as needed ]) # Read the CSV file using the defined schema spark_df = spark.read.format("csv") \ .option("header", "true") \ .schema(schema) \ .load(f"s3://{s3_bucket}/{folder}/{filename}.csv") # Display the DataFrame to verify the precision spark_df.show()

In this example, replace "column1", "column2", etc., with the actual column names from your CSV file. The DecimalType(precision=10, scale=3) specifies that the litre_val column should be read as a decimal with a precision of 10 and a scale of 3. Adjust the precision and scale values as needed for your specific use case.

Re: Not able to retain precision while reading data from source file

nikhil_kumawat — Thu, 02 Jan 2025 12:29:44 GMT

Hi @Walter_C

Thanks for the reply. I can specify the schema. But that would not be an ideal scenario in our case. The reason being:

1. We have around 20 different sources to read data from. Each source has different column having such values. And the precision can be different as well. For some columns it can be upto 3 decimal points and while for others it can have upto 5 decimal points.

2. The schema can change as per time.

So, I need a generic solution where I can read the data, retaining the precision, in the bronze layer and then can apply the schema in the silver layer. That's why I thought of reading it as a string.

Is there any other way where we can retain precision ?

Re: Not able to retain precision while reading data from source file

NandiniN — Thu, 02 Jan 2025 13:48:32 GMT

eg:

from pyspark.sql.functions import format_number, col

df = spark.read.parquet("<your-parquet-file-path>")

df = df.withColumn("<formatted-column>", format_number(col("<your-decimal-column>"), 2))

Doc - https://docs.databricks.com/en/sql/language-manual/functions/format_number.html

Re: Not able to retain precision while reading data from source file

NandiniN — Thu, 02 Jan 2025 13:52:56 GMT

But I agree with @Walter_C on specifying the schema and making sure String type does not cause any truncation.

Re: Not able to retain precision while reading data from source file

szymon_dybczak — Thu, 02 Jan 2025 16:44:51 GMT

I wonder if this is not strange display behavior. Can you use the show method on the dataframe instead of display and see the result? Or save the dataframe as a parquet and see what the column looks like after saving.

Re: Not able to retain precision while reading data from source file

VZLA — Thu, 02 Jan 2025 17:17:23 GMT

Agree with @szymon_dybczak, If the columns are read as strings, Spark doesn’t lose any precision or decimal places. It might simply be how data is shown in the console or Databricks UI. To see all decimals, disable truncation in .show() or select the column directly in the UI. As szymon suggested an easy check is to store it somewhere else and then check the persisted data directly.

E.g.: df.select("litre_val").show(truncate=False) or display(df.select("litre_val"))

You’ll then see the full value. If you later need proper numeric types with guaranteed precision, apply a DecimalType in the silver layer or cast them at that stage

Re: Not able to retain precision while reading data from source file

nikhil_kumawat — Fri, 03 Jan 2025 02:13:52 GMT

Hi @VZLA and @szymon_dybczak

Yes I did that already. Stored the dataframe as table in databricks and then displayed the content like below:

And the datatype is string:

Re: Not able to retain precision while reading data from source file

VZLA — Fri, 03 Jan 2025 10:22:01 GMT

@nikhil_kumawat can you provide more details to reproduce this and better help you? e.g.: sample data set, dbr version, reproducer code, etc.

I'm having this sample data:

csv_content = """column1,column2,litre_val,another_decimal_column 1,TypeA,60211.952,12.3459 2,TypeB,59164.608,45.6789 3,TypeC,12345.678,78.9012 """

Which I'm then storing as csv file in my dbfs temp location. Then I'm reading it back without a schema, but simple inference:

# Reading the CSV file without using any schema df = spark.read.format("csv").option("header", "true").load("/some/path/to/test_data.csv")

And when displaying it using:

df.show(truncate=False) df.printSchema()

I'm seeing the results as:

+-------+-------+---------+----------------------+ |column1|column2|litre_val|another_decimal_column| +-------+-------+---------+----------------------+ |1 |TypeA |60211.952|12.3459 | |2 |TypeB |59164.608|45.6789 | |3 |TypeC |12345.678|78.9012 | +-------+-------+---------+----------------------+ root |-- column1: string (nullable = true) |-- column2: string (nullable = true) |-- litre_val: string (nullable = true) |-- another_decimal_column: string (nullable = true)

Using display(), does not alter the results: