cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Not able to retain precision while reading data from source file

nikhil_kumawat
New Contributor

Hi, 

I am trying to read a csv file located in S3 bucket folder. The csv file contains around 50 columns out of which one of the column is "litre_val" which contains values like "60211.952", "59164.608'. Upto 3 decimal points. 

Now to read this csv we are using spark API like below:

spark_df = spark.read.format("csv").option("header", "true").load(f"s3://{s3_bucket}/{folder}/{filename}.csv")

After reading the file and reading all the columns as string, just to retain the precision, when we try to display the data, it reads only upto 2 decimal points like belowprecision.png

All the values are being read upto 2 decimal points only. The data type is string and it should retain the precision. Even there is one more column which contains upto 4 decimal points. That is also being read upto two decimal point as string data type.

Can someone suggest how to retain precision in this case ?

Any help would highly be appreciated.

8 REPLIES 8

Walter_C
Databricks Employee
Databricks Employee

Can you please try with something like:

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Define the schema with the appropriate precision and scale for decimal columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
    # Add other columns as needed
    StructField("litre_val", DecimalType(precision=10, scale=3), True),
    StructField("another_decimal_column", DecimalType(precision=10, scale=4), True)
    # Add other columns as needed
])

# Read the CSV file using the defined schema
spark_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load(f"s3://{s3_bucket}/{folder}/{filename}.csv")

# Display the DataFrame to verify the precision
spark_df.show()

In this example, replace "column1", "column2", etc., with the actual column names from your CSV file. The DecimalType(precision=10, scale=3) specifies that the litre_val column should be read as a decimal with a precision of 10 and a scale of 3. Adjust the precision and scale values as needed for your specific use case.

Hi @Walter_C 

Thanks for the reply. I can specify the schema. But that would not be an ideal scenario in our case. The reason being:

1. We have around 20 different sources to read data from. Each source has different column having such values. And the precision can be different as well. For some columns it can be upto 3 decimal points and while for others it can have upto 5 decimal points. 

2. The schema can change as per time. 

 

So, I need a generic solution where I can read the data, retaining the precision, in the bronze layer and then can apply the schema in the silver layer. That's why I thought of reading it as a string.

Is there any other way where we can retain precision ?

NandiniN
Databricks Employee
Databricks Employee

eg:

from pyspark.sql.functions import format_number, col

df = spark.read.parquet("<your-parquet-file-path>")

df = df.withColumn("<formatted-column>", format_number(col("<your-decimal-column>"), 2))

 

Doc - https://docs.databricks.com/en/sql/language-manual/functions/format_number.html

NandiniN
Databricks Employee
Databricks Employee

But I agree with @Walter_C on specifying the schema and making sure String type does not cause any truncation.

szymon_dybczak
Esteemed Contributor III

I wonder if this is not strange display behavior. Can you use the show method on the dataframe instead of display and see the result? Or save the dataframe as a parquet and see what the column looks like after saving.

VZLA
Databricks Employee
Databricks Employee

Agree with @szymon_dybczak, If the columns are read as strings, Spark doesn’t lose any precision or decimal places. It might simply be how data is shown in the console or Databricks UI. To see all decimals, disable truncation in .show() or select the column directly in the UI. As szymon suggested an easy check is to store it somewhere else and then check the persisted data directly.

E.g.: df.select("litre_val").show(truncate=False) or display(df.select("litre_val"))

You’ll then see the full value. If you later need proper numeric types with guaranteed precision, apply a DecimalType in the silver layer or cast them at that stage

 

nikhil_kumawat
New Contributor

Hi @VZLA and @szymon_dybczak 

Yes I did that already. Stored the dataframe as table in databricks and then displayed the content like below:

nikhil_kumawat_0-1735870355437.png

And the datatype is string:

nikhil_kumawat_1-1735870409146.png

 

VZLA
Databricks Employee
Databricks Employee

@nikhil_kumawat can you provide more details to reproduce this and better help you? e.g.: sample data set, dbr version, reproducer code, etc.

I'm having this sample data:

csv_content = """column1,column2,litre_val,another_decimal_column
1,TypeA,60211.952,12.3459
2,TypeB,59164.608,45.6789
3,TypeC,12345.678,78.9012
"""

Which I'm then storing as csv file in my dbfs temp location. Then I'm reading it back without a schema, but simple inference:

# Reading the CSV file without using any schema
df = spark.read.format("csv").option("header", "true").load("/some/path/to/test_data.csv")

And when displaying it using:

df.show(truncate=False)
df.printSchema()

I'm seeing the results as:

+-------+-------+---------+----------------------+
|column1|column2|litre_val|another_decimal_column|
+-------+-------+---------+----------------------+
|1      |TypeA  |60211.952|12.3459               |
|2      |TypeB  |59164.608|45.6789               |
|3      |TypeC  |12345.678|78.9012               |
+-------+-------+---------+----------------------+

root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)
 |-- litre_val: string (nullable = true)
 |-- another_decimal_column: string (nullable = true)

Using display(), does not alter the results:

Screenshot 2025-01-03 at 11.20.27.png

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group